Humanities Data Analysis “125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:01 — page 120 — #43 120 • Chapter 3
    numbers = np.arange(1000000)
    %timeit numbers * 10

Number comparisons (e.g., 5 < 10) can also be vectorized. Say we have a list of numbers, and we want to select all numbers smaller than 10. In Python, a solution to this problem could be implemented as follows:

    numbers = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
    print([number for number in numbers if number < 10])

    [0, 1, 1, 2, 3, 5, 8]

Employing NumPy’s vectorized number comparison operation, we can rewrite this to the following:

    numbers = np.array(numbers)
    print(numbers[numbers < 10])

    array([0, 1, 1, 2, 3, 5, 8])

How does this work? The part within square brackets (numbers < 10) performs a vectorized comparison operation, which returns a new array with boolean values representing the outcome (i.e., True or False) of the number comparison:

    print(numbers < 10)

    array([ True, True, True, True, True, True, True, False, False, False, False])

We can use such a boolean array (a mask) to select from the original array of numbers all elements associated with a True value. In other words, using a boolean array, we keep only the numbers that pass the conditional expression.

Let us now return to the problem of filtering the document-term matrix to include only texts in which the word de occurs at least once. The boolean indexing mechanism can be employed to retrieve these texts as follows:

    print(document_term_matrix[
        document_term_matrix[:, vocabulary.index('de')] > 0])

    array([[0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           ...,
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0]])

3.5.3 Aggregating functions

We now proceed with a brief overview of some of the most important functions in NumPy used to aggregate data, including functions for summing over values and finding the maximum value in an array.
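As a small illustration, the following minimal sketch applies boolean indexing and two aggregating methods to a toy matrix. The vocabulary, the counts, and the variable names contains_de, filtered, total_count, and highest_count are made up for this example and do not come from the corpus used in this chapter.

```python
import numpy as np

# Hypothetical toy data: 4 documents (rows) by 3 vocabulary words (columns).
vocabulary = ['de', 'la', 'et']
document_term_matrix = np.array([
    [2, 0, 1],
    [0, 3, 0],
    [1, 1, 4],
    [0, 0, 2],
])

# Boolean mask: True for every document in which 'de' occurs at least once.
contains_de = document_term_matrix[:, vocabulary.index('de')] > 0

# Boolean indexing keeps only the rows for which the mask is True.
filtered = document_term_matrix[contains_de]

# Two aggregating methods applied to the remaining rows.
total_count = filtered.sum()    # sum over all remaining counts
highest_count = filtered.max()  # largest single count

print(contains_de)
print(filtered)
print(total_count, highest_count)
```

Here sum() and max() are two examples of the aggregating functions in question.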
Many of these are also provided as built-in functions in Python. However, just as with the vectorized operations discussed above, their NumPy counterparts are highly optimized and executed in compiled code, which allows for fast aggregating computations. To illustrate the performance gain of utilizing NumPy’s optimized aggregation functions, let us start by computing the sum of all numbers in an array. This can be achieved in Python by means of the built-in function sum():

    numbers = np.random.random_sample(100000)
    print(sum(numbers))

    49954.070754395325

Summing over the values in numbers using NumPy is done using the function numpy.sum() or the method ndarray.sum():

    print(numbers.sum())  # equivalent to np.sum(numbers)

    49954.070754395

While syntactically similar, NumPy’s summing function is orders of magnitude faster than Python’s built-in function:

    %timeit sum(numbers)
    %timeit numbers.sum()

In addition to being faster, numpy.sum() is designed to work with multidimensional arrays, and, as such, provides a convenient and flexible mechanism to compute sums along a given axis. First, we need to explain the concept of “axis.” A two-dimensional array, such as the document-term matrix, has two axes: the first axis (axis=0) runs vertically down the rows, and the second axis (axis=1) runs horizontally across the columns of an array. This is illustrated by figure 3.11. Under this definition, computing the sum of each row happens along the second axis: for each row we take the sum across its columns. Likewise, computing the sum of each column happens along the first axis, which involves running down its rows.

Let us illustrate this with an example. To compute the sum of each row in the document-term matrix, or, in other words, the document lengths, we sum along the column axis (axis=1):

    sums = document_term_matrix.sum(axis=1)

Similarly, computing the corpus-wide frequency of each word (i.e., the sum of each column) is done by setting the parameter axis to 0:
    print(document_term_matrix.sum(axis=0))

    array([2, 2, 2, ..., 4, 3, 2])

Finally, if no value for axis is specified, numpy.sum() will sum over all elements in an array. Thus, to compute the total word count in the document-term matrix, we write: