Humanities Data Analysis



“125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:01 — page 113 — #36

Exploring Texts Using the Vector Space Model

undertones,[2] while newer examples put greater emphasis on superstitious beliefs. VanArsdale (2019) points to an interesting development of the postscript “It works!” The first attestation of this phrase is in 1979, but in a few years’ time, all succeeding letters end with this statement. Extract and print the summed frequency of the words Jesus and works in letters written before and written after 1950.

Challenging

1. Compute the cosine distance between the oldest and the youngest letter in the corpus. Subsequently, compute the distance between two of the oldest letters (any two letters from 1906 will do). Finally, compute the distance between the youngest two letters. Describe your results.

2. Use SciPy’s pdist() function to compute the cosine distances between all letters in the corpus. Subsequently, transform the resulting condensed distance matrix into a regular square-form distance matrix. Compute the average distance between letters. Do the same for letters written before 1950, and compare their mean distance to letters written after 1950. Describe your results.

3. The function pyplot.matshow() in Matplotlib takes a matrix or an array as argument and plots it as an image. Use this function to plot a square-form distance matrix for the entire letter collection. To enhance your visualization, add a color bar using the function pyplot.colorbar(), which provides a mapping between the colors and the cosine distances. Describe the resulting plot. How many clusters do you observe?
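The mechanics behind the second exercise can be sketched on toy data before turning to the real corpus. The example below is only an illustration: the matrix `counts` is a hypothetical document-term matrix with made-up values, not data from the chain letter collection, and the variable names are assumptions. It shows what pdist() returns, how squareform() expands it, and why the mean is best taken over the condensed form.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical document-term matrix: 4 toy "letters" (rows) by 3 words (columns).
# These counts are invented for illustration and are not taken from the corpus.
counts = np.array([
    [2, 0, 1],
    [4, 0, 2],  # same direction as row 0, so its cosine distance to row 0 is ~0
    [0, 3, 0],  # orthogonal to row 0, so its cosine distance to row 0 is 1
    [1, 1, 1],
], dtype='float64')

# pdist returns a condensed vector holding the n * (n - 1) / 2 pairwise distances.
condensed = pdist(counts, metric='cosine')

# squareform expands the condensed vector into a symmetric n x n matrix
# with zeros on the diagonal.
square = squareform(condensed)

# Averaging the condensed vector counts every distinct pair exactly once
# and leaves out the zero diagonal.
mean_distance = condensed.mean()

print(condensed.shape)
print(square.shape)
print(mean_distance)
```

Averaging the square matrix directly would include the zero diagonal and so understate the mean distance, which is why the sketch averages the condensed vector instead. A matrix like `square` is also the kind of input that pyplot.matshow() expects.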
[2] The luck chain letter is generally believed to stem from the “Himmelsbrief” (Letter from Heaven), which might explain these religious undertones.

3.5 Appendix: Vectorizing Texts with NumPy

Readers familiar with NumPy may safely skip this section. NumPy (short for Numerical Python) is the de facto standard library for scientific computing and data analysis in Python. Anyone interested in large-scale data analyses with Python is strongly encouraged to (at least) master the essentials of the library. This section introduces the essentials of constructing arrays (section 3.5.1), manipulating arrays (section 3.5.2), and computing with arrays (section 3.5.3). A complete account of NumPy’s functionalities is available in NumPy’s online documentation.

3.5.1 Constructing arrays

NumPy’s main workhorse is the N-dimensional array object ndarray, which has much in common with Python’s list type, but allows arrays of numerical data to be stored and manipulated much more efficiently. NumPy is conventionally imported using the alias np:

    import numpy as np

NumPy arrays can be constructed either by converting a list object into an array or by employing routines provided by NumPy. For example, to initialize an array of floating-point numbers on the basis of a list, we write:

    a = np.array([1.0, 0.5, 0.33, 0.25, 0.2])

Similarly, an array of integers can be created with:

    a = np.array([1, 3, 6, 10, 15])

A crucial difference between NumPy arrays and Python’s built-in list is that all items of a NumPy array have a specific and fixed type, whereas Python’s list allows for mixed types that can be freely changed (e.g., a mixture of str and int types). While Python’s dynamically typed list provides programmers with great flexibility, NumPy’s fixed-type arrays are much more efficient in terms of both storage and manipulation.

The data type of an array can be controlled explicitly by setting the dtype argument during initialization. For example, to explicitly set the data type for array elements to be 32-bit integers (sufficient for counting words in virtually all human-produced texts), we write the following:

    a = np.array([0, 1, 1, 2, 3, 5], dtype='int32')
    print(a.dtype)

    int32

The trailing number 32 in int32 specifies the number of bits available for storing the numbers in an array. An array with type int8, for example, is only capable of expressing integers within the range of -128 to 127. int64 allows integers to fall within the range -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. (Python’s native int has no fixed bounds.) The advantage of specifying the data type is that doing so saves memory. The memory needed to store an integer of type int8 amounts to a single byte, whereas integers of type int64 need 8 bytes. Such a difference might seem negligible, but once we start working with arrays which record millions or billions of term frequencies, the difference will be significant. As with integers, we can specify a type for floating-point numbers, such as float32 and float64. Besides having a smaller memory footprint, numbers of type float32 have a smaller precision than float64 numbers. To change the data type of an existing array, we use the method ndarray.astype():

    a = a.astype('float32')
    print(a.dtype)

    float32

NumPy arrays are explicit about their dimensions, which is another important difference between NumPy’s array and Python’s list object. The number of dimensions of an array is accessed through the attribute ndarray.ndim:

    a = np.array([0, 1, 1, 2, 3, 5])
    print(a.ndim)

    1

To construct a two-dimensional array, we pass a sequence of ordered sequences (e.g., lists or tuples) to np.array:

    a = np.array([[0, 1, 2], [1, 0, 2], [2, 1, 0]])
    print(a.ndim)

    2

Likewise, a sequence of sequences of sequences produces a three-dimensional array:

    a = np.array([[[1, 3, 3], [2, 5, 2]], [[2, 3, 7], [4, 5, 9]]])
    print(a.ndim)

    3

In addition to an array’s number of dimensions, we can retrieve the size of an array in each dimension using the attribute ndarray.shape:

    a = np.array([[0, 1, 2, 3], [1, 0, 2, 6], [2, 1, 0, 5]])
    print(a.shape)

    (3, 4)

As can be observed, for an array with 3 rows and 4 columns, the shape will be (3, 4). Note that the length of the shape tuple corresponds to the number of dimensions, ndim, of an array. The shape of an array can be used to compute the total number of items in an array, by multiplying the elements returned by shape (i.e., 3 rows times 4 columns yields 12 items).

Having demonstrated how to create NumPy arrays on the basis of Python’s list objects, let us now illustrate a number of ways in which arrays can be constructed from scratch using procedures provided by NumPy. These procedures are particularly useful when the shape (and type) of an array is already known, but its actual contents are not yet known. In contrast with Python’s list, NumPy arrays are not intended to be resized, because growing and shrinking arrays is an expensive operation. Fortunately, NumPy provides a number of functions to construct arrays of a predetermined size with initial placeholder content. First, we will have a look at the function numpy.zeros(), which creates arrays filled with zeros (of type float64 by default):

    print(np.zeros((3, 5)))

    array([[0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0.]])

The shape parameter of numpy.zeros() determines the shape of the constructed array. When shape is a single integer, a one-dimensional array is constructed:

    print(np.zeros(10))

    array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

The functions numpy.ones() and numpy.empty() behave in a similar manner, with numpy.ones() creating arrays full of ones and numpy.empty() creating arrays as quickly as possible with no guarantee about their content:
    print(np.ones((3, 4), dtype='int64'))

    array([[1, 1, 1, 1],
           [1, 1, 1, 1],
           [1, 1, 1, 1]])

    print(np.empty((3, 2)))

    array([[0.0e+000, 4.9e-324],
           [4.9e-324, 9.9e-324],
           [1.5e-323, 2.5e-323]])

The values returned by numpy.empty() are whatever happened to be in memory, so they are arbitrary and will differ from run to run.

Should an array filled with randomly generated values be desired, NumPy’s submodule numpy.random implements a rich variety of functions for producing random contents. Here, we demonstrate a function to sample random floating-point numbers in the half-open interval from 0 (inclusive) to 1 (exclusive). The function works in the same way as the constructors above, and produces either one-dimensional or multidimensional arrays depending on the size parameter:

    print(np.random.random_sample(5))

    array([0.59069989, 0.66125295, 0.84899624, 0.66321875, 0.62405594])

    print(np.random.random_sample((2, 3)))

    array([[0.38960553, 0.93494862, 0.34722036],
           [0.31784036, 0.3871856 , 0.36851059]])

NumPy’s counterpart of Python’s range function is numpy.arange(), which produces sequences of numbers as array objects. An interesting difference between range and numpy.arange() is that the latter accepts floats as arguments, which enables us to easily create floating-point sequences like the following:

    a = np.arange(0, 2, 0.25)
    print(a)

    array([0.  , 0.25, 0.5 , 0.75, 1.  , 1.25, 1.5 , 1.75])
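The claims about memory and integer ranges made in the discussion of dtype can be checked directly. The following is a minimal sketch, assuming an array of one million hypothetical term frequencies. The names `n_items`, `counts_int8`, `counts_int64`, and `grid` are illustrative and do not come from the text. It relies on the attributes ndarray.itemsize, ndarray.nbytes, and ndarray.size, and on numpy.iinfo() for integer bounds.

```python
import numpy as np

# One million hypothetical term frequencies, stored with two integer widths.
n_items = 1_000_000
counts_int8 = np.zeros(n_items, dtype='int8')
counts_int64 = np.zeros(n_items, dtype='int64')

# itemsize is the number of bytes per element;
# nbytes is the total size of the array's data buffer in bytes.
print(counts_int8.itemsize, counts_int8.nbytes)
print(counts_int64.itemsize, counts_int64.nbytes)

# np.iinfo reports the smallest and largest value an integer type can hold.
print(np.iinfo('int8').min, np.iinfo('int8').max)
print(np.iinfo('int64').min, np.iinfo('int64').max)

# The total number of items equals the product of the entries in shape.
grid = np.zeros((3, 4))
print(grid.size == grid.shape[0] * grid.shape[1])
```

On this toy array the int64 version takes eight times the memory of the int8 version, which is the difference the text describes. Note that int8 would be too narrow for real word counts above 127, so the choice of type has to balance memory against the largest value that must be stored.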
