Humanities Data Analysis

Document information

Basic information

Pages	3
File size	113.66 KB

Content


“125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:01 — page 162 — #37

Figure 4.10. Visualization of the n most unisex names in the data, showing the usage ratio between boys and girls throughout the twentieth century.

This visualization employs a rolling average to smooth out some of the noise in the curves:

```python
import matplotlib.pyplot as plt

# `d` and `names` are defined earlier in the chapter (not shown in this excerpt).

# Create a figure and subplots
fig, axes = plt.subplots(
    nrows=2, ncols=4, sharey=True, sharex=True, figsize=(12, 6))

# Plot the time series into the subplots
d[names].rolling(window=10).mean().plot(
    color='C0', subplots=True, ax=axes, legend=False, title=names)

# Clean up some redundant labels and adjust spacing
for ax in axes.flatten():
    ax.xaxis.label.set_visible(False)
    # Dashed reference line at a ratio of 0.5
    ax.axhline(0.5, ls='--', color="grey", lw=1)
fig.text(0.5, 0.04, "year", ha="center", va="center", fontsize="x-large")
fig.subplots_adjust(hspace=0.5)
```

4.4 Conclusions and Further Reading

In what precedes, we have introduced the Pandas library for doing data analysis with Python. On the basis of a case study on naming practices in the United States of America, we have shown how Pandas can be put to use to manipulate and analyze tabular data. We have also demonstrated how the time series and plotting functionality of the Pandas library can be employed to effectively analyze, visualize, and report long-term diachronic shifts in historical data. Efficiently manipulating and analyzing tabular data is a skill required in many quantitative data analyses, and this skill will be called on extensively in the remaining chapters. Needless to say, this chapter’s introduction to the library only scratches the surface of what Pandas can do. For a more thorough introduction, we refer the reader to the books Python for Data Analysis (McKinney 2012b) and Python Data Science Handbook (Vanderplas 2016). These texts describe in greater detail the underlying data types used by Pandas and offer more examples of common calculations involving tabular datasets.

Exercises

Easy

1. Reload the names dataset (data/names.csv) with the year column as index. What is the total number of rows?
2. Print the number of rows for boy names and the number of rows for girl names.
3. The method Series.value_counts() computes a count of the unique values in a Series object. Use this function to find out whether there are more distinct male or more distinct female names in the dataset.
4. Find out how many distinct female names have a cumulative frequency higher than 100.

Moderate

1. In section 4.3.2, we analyzed a bias in boys’ names ending in the letter n. Repeat that analysis for girls’ names. Do you observe any noteworthy trends?
2. Some names have been used for a long time. In this exercise we investigate which names were used both now and in the past. Write a function called timespan(), which takes a Series object as argument and returns the difference between the Series’ maximum and minimum values (i.e., max(x) − min(x)). Apply this function to each unique name in the dataset, and print the five names with the longest timespan. (Hint: use groupby(), apply(), and sort_values().)
3. Compute the mean and maximum female and male name length (in characters) per year. Plot your results and comment on your findings.

Challenging

1. Write a function which counts the number of vowel characters in a string. Which names have eight vowel characters in them? (Hint: use the function you wrote in combination with the apply() method.) For the sake of simplicity, you may assume that the following characters unambiguously represent vowel characters: {'e', 'i', 'a', 'o', 'u', 'y'}.
2. Calculate the mean usage in vowel characters in male names and female names in the entire dataset. Which gender is associated with higher average vowel usage? Do you think the difference counts as considerable? Try plotting the mean vowel usage over time. For this exercise, you do not have to worry about unisex names: if a name, for instance, occurs once as a female name and once as a male name, you may simply count it twice, i.e., once in each category.
3. Some initials are more productive than others and have generated a large number of distinct first names. In this exercise, we will visualize the differences in name-generating productivity between initials (see note 3). Create a scatter plot with dots for each distinct initial in the data (give points of girl names a different color than points of boy names). The Y axis represents the total number of distinct names, and the X axis represents the number of people carrying a name with a particular initial. In addition to simple dots, we would like to label each point with its corresponding initial. Use the function plt.annotate() to do that. Next, create two subplots using plt.subplots() and draw a similar scatter plot for the period between 1900 and 1920, and one for the period between 1980 and 2000. Do you observe any differences?

Note 3. This exercise was inspired by a blog post by Gerrit Bloothooft and David Onland (see https://www.neerlandistiek.nl/2018/05/productieve-beginletters-van-voornamen/).
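
As an aside on the rolling average used to smooth the curves in Figure 4.10, the following is a minimal sketch of what rolling(...).mean() computes. It assumes only that pandas is installed. The series ratio and its values are made up for illustration and are not taken from the names dataset, and the window of 3 is chosen to keep the toy example short (the plotting code in the chapter uses window=10).

```python
import pandas as pd

# Hypothetical yearly usage ratios for one name (toy values, not real data).
ratio = pd.Series(
    [0.40, 0.60, 0.35, 0.65, 0.45, 0.55],
    index=pd.Index(range(1900, 1906), name="year"),
)

# Each smoothed value is the mean of the current year and the two years
# before it. The first two entries are NaN because the window is not full.
smoothed = ratio.rolling(window=3).mean()

print(smoothed)
```

The smoothed series varies less from year to year than the raw one, which is the effect the figure relies on. The price is that the first window - 1 values are missing unless a smaller min_periods is passed to rolling().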

Date posted: 20/11/2022, 11:30