Humanities Data Analysis



“125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:01 — page 117 — #40

Exploring Texts Using the Vector Space Model

3.5.2 Indexing and slicing arrays

Indexing and slicing NumPy arrays works much like accessing elements in Python’s list. A single element of a one-dimensional array is accessed by specifying its index within square brackets:

    a = np.arange(10)
    print(a[5])

    5

Similarly, an array can be sliced to retrieve a sub-array, just as with Python’s list:

    print(a[3:8])

    array([3, 4, 5, 6, 7])

The strength of NumPy arrays becomes more evident in the context of multidimensional arrays. While Python’s list and NumPy’s one-dimensional arrays allow for only a single index (or slice), multidimensional arrays allow for one index (or slice) per dimension (sometimes called axis), separated by commas. This syntax provides a powerful mechanism to index and manipulate arrays. Let us start with a simple example. In the following code block, we retrieve the frequency of the word monsieur (sir) from the third document [3]. This is done by providing two indexes separated by a comma, of which the first corresponds to the row index of the third document, and the second points to the column of the word monsieur:

    word_index = vocabulary.index('monsieur')
    document_term_matrix = np.array(document_term_matrix)
    print(document_term_matrix[2, word_index])

    17

[3] Here, we assume that you have executed all code in the chapter above up until (and including) the first code block under ‘Exploring the corpus,’ so that you have the object document_term_matrix available.

Note that the order of these indexes corresponds to the shape of the document_term_matrix, in which the value at the first index indicates the number of documents, and the value in the second position counts the size of the vocabulary. To retrieve the frequency of a given word for a sequence of documents, we use the Python slice convention in the first position. The following line retrieves an array consisting of the frequencies of monsieur in the first five documents of the document-term matrix:

    print(document_term_matrix[:5, word_index])

    array([ 9,  0, 17,  9, 11])

Here, the left-hand side of the comma specifies a slice (i.e., the first five rows), and the index to the right of the comma indicates the column index (corresponding to monsieur). Similarly, to construct an array with frequencies for a number of specific columns, we can also use a slice index. Consider the following indexing operation, which constructs an array with counts corresponding to the words in columns 10 up to (but not including) 40 for the sixth document:

    print(document_term_matrix[5, 10:40])

    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0])

To access all rows of a particular column (or collection of columns), we write the following:

    column_values = document_term_matrix[:, word_index]

The same mechanism can be used to access all columns of a particular row (or collection of rows), as shown by the following:

    print(document_term_matrix[5, :])

    array([0, 0, 0, ..., 0, 0, 0])

When an array is indexed with fewer indexes than it has dimensions, NumPy treats each missing index as a complete slice (:). This is why the following less verbose (and common) notation is equivalent to the previous example:

    print(document_term_matrix[5])

    array([0, 0, 0, ..., 0, 0, 0])

In addition to indexing by integers and slices, NumPy offers a number of “fancy” indexing techniques (“fancy” is, indeed, the common term for this form of indexing). We will demonstrate two of them: (i) sequence indexing, and (ii) boolean indexing. Sequence indexing is particularly useful when accessing non-contiguous elements of an array. For example, to construct an array with word counts for a few non-adjacent documents, a sequence of integers is given as the row index:

    print(document_term_matrix[(1, 8, 3), :])

    array([[0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0]])

In the same way, we can create a reduced array consisting of only a few columns. The following example shows how to construct a reduced array with word counts for the words monsieur, madame, and amour:

    words = 'monsieur', 'madame', 'amour'
    word_indexes = [vocabulary.index(word) for word in words]
    print(document_term_matrix[:, word_indexes])

    array([[ 9,  3,  1],
           [ 0,  4,  3],
           [17,  0,  0],
           ...,
           [ 4, 35,  7],
           [ 0,  1, 11],
           [ 0, 31, 15]])

We conclude this section with one final fancy indexing technique, boolean indexing. Say we are interested in all plays in which the word de occurs. Using pure Python, we could solve this problem by iterating over all rows in document_term_matrix (see above) using a for loop, and checking for each row whether the column corresponding to de has a frequency higher than zero. Unfortunately, this strategy is rather inefficient and slow, especially for large lists of numbers. NumPy provides a much more efficient solution through its use of so-called “vectorized operations.” But before we explain this solution, we first need to discuss the concept of vectorized operations. Consider the following list of numbers:

    numbers = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

Imagine we want to update this list by multiplying each number by 10. In pure Python, a simple way to accomplish this is by means of a list comprehension, as shown in the following code block:

    print([number * 10 for number in numbers])

    [0, 10, 10, 20, 30, 50, 80, 130, 210, 340, 550]

Using NumPy’s optimized vectorization mechanism, this can be rewritten to:

    numbers = np.array(numbers)
    print(numbers * 10)

    array([  0,  10,  10,  20,  30,  50,  80, 130, 210, 340, 550])

With this notation, Python’s for-loop is replaced with an optimized operation written in a lower-level programming language such as C. The performance difference between pure Python and NumPy for this specific example may be barely noticeable. However, the difference becomes increasingly important for larger lists of numbers. IPython’s “magic command” %timeit enables us to conveniently time the execution of a particular piece of code. Let us time the execution of multiplying a list of a million numbers by 10:

    numbers = list(range(1000000))
    %timeit [number * 10 for number in numbers]

The exact execution times may fluctuate from machine to machine, but execution times of the above example typically fall in the range of milliseconds. The timing for the same computation with NumPy’s vectorized operations returns a much smaller number, best described using microseconds.
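The indexing styles discussed in this section, including the boolean indexing approach to the de question, can be tried out without the chapter's corpus. The following is a minimal sketch under stated assumptions: the five-word vocabulary and all counts are invented for illustration and are not values from the corpus, and the standard-library timeit module is used as a stand-in for IPython's %timeit so that the code also runs as a plain script.

```python
import timeit

import numpy as np

# Hypothetical toy data: 4 documents (rows) x 5 vocabulary words (columns).
# These counts are invented for illustration; they are NOT from the corpus.
vocabulary = ['amour', 'de', 'madame', 'monsieur', 'roi']
document_term_matrix = np.array([
    [1, 12, 3, 9, 0],
    [3, 0, 4, 0, 2],
    [0, 7, 0, 17, 1],
    [7, 0, 35, 4, 0],
])

word_index = vocabulary.index('monsieur')

# One integer index per axis: count of 'monsieur' in the third document.
single_count = document_term_matrix[2, word_index]

# Slice on the row axis: 'monsieur' counts in the first two documents.
first_two = document_term_matrix[:2, word_index]

# Sequence ("fancy") indexing: two non-adjacent rows, all columns.
picked_rows = document_term_matrix[[1, 3], :]

# Boolean indexing: the comparison is vectorized and yields one
# True/False value per document; the mask then selects matching rows.
de_index = vocabulary.index('de')
contains_de = document_term_matrix[:, de_index] > 0
plays_with_de = document_term_matrix[contains_de]

print(single_count)
print(first_two)
print(picked_rows)
print(contains_de)
print(plays_with_de.shape)

# Timing outside IPython: run each statement a few times and report totals.
numbers = list(range(1_000_000))
numbers_array = np.arange(1_000_000)
list_time = timeit.timeit(lambda: [n * 10 for n in numbers], number=3)
numpy_time = timeit.timeit(lambda: numbers_array * 10, number=3)
print(f'list comprehension: {list_time:.4f} s, NumPy: {numpy_time:.4f} s')
```

The boolean mask contains_de has one entry per row, so indexing the matrix with it keeps only the documents in which de occurs at least once. The measured times depend on the machine, so no particular figures should be expected from the last line.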

Date posted: 20/11/2022, 11:26