Humanities Data Analysis, Chapter 2 (“125-85018_Karsdrop_Humanities_ch01_3p”, 2020/8/19, 11:00, page 36)
(e.g., open(..., encoding='utf8')). Additionally, files which do not use UTF-8 encoding can also be opened by specifying another encoding parameter. This is demonstrated in the following code block, in which we read the opening line of Anna Karenina, encoded in KOI8-R [7]:

    with open('data/anna-karenina.txt', encoding='koi8-r') as stream:
        # Use stream.readline() to retrieve the next line from a file,
        # in this case the 1st one:
        line = stream.readline()

    print(line)

    Все счастливые семьи похожи друг на друга, каждая несчастливая семья несчастлива по-своему

Having discussed the very basics of plain text files and file encodings, we now move on to other, more structured forms of digital data.

2.3 CSV

The plain text format is a human-readable, non-binary format. However, this does not necessarily imply that the content of such files is always just “raw data,” i.e., unstructured text. In fact, there exist many simple data formats used to help structure the data contained in plain text files. The CSV format we briefly touched upon in chapter 1, for instance, is a very common choice to store data in files that often take the .csv extension. CSV stands for Comma-Separated Values. It is used to store tabular information in a spreadsheet-like manner. In its simplest form, each line in a CSV file represents an individual data entry, where attributes of that entry are listed in a series of fields separated using a delimiter (e.g., a comma):

    csv_file = 'data/folger_shakespeare_collection.csv'
    with open(csv_file) as stream:
        # call stream.readlines() to read all lines in the CSV file as a list
        lines = stream.readlines()

    print(lines[:3])

    [
        'fname,author,title,editor,publisher,pubplace,date',
        '1H4,William Shakespeare,"Henry IV, Part I",Barbara A [...]',
        '1H6,William Shakespeare,"Henry VI, Part 1",Barbara A [...]',
    ]

This example file contains bibliographic information about the Folger Shakespeare collection, in which
each line represents a particular work. Each of these lines records a series of fields, holding the work’s filename, author, title, editor, publisher, publication place, and date of publication. As one can see, the first line in this file contains a so-called “header,” which lists the names of the respective fields in each line. All fields in this file, header and records alike, are separated by a delimiter, in this case a comma. The comma delimiter is just a convention, and in principle any character can be used as a delimiter. The tab-separated format (extension .tsv), for instance, is another widely used file format in this respect, where the delimiter between adjacent fields on a line is the tab character (\t).

[7] https://en.wikipedia.org/wiki/KOI8-R

Loading and parsing data from CSV or TSV files would typically entail parsing the contents of the file into a list of lists:

    entries = []
    for line in open(csv_file):
        entries.append(line.strip().split(','))

    for entry in entries[:3]:
        print(entry)

    ['fname', 'author', 'title', 'editor', 'publisher', 'pubplace', 'date']
    [
        '1H4',
        'William Shakespeare',
        '"Henry IV',
        ' Part I"',
        'Barbara A Mowat',
        'Washington Square Press',
        'New York',
        '1994',
    ]
    [
        '1H6',
        'William Shakespeare',
        '"Henry VI',
        ' Part 1"',
        'Barbara A Mowat',
        'Washington Square Press',
        'New York',
        '2008',
    ]

In this code block, we iterate over all lines in the CSV file. After removing any leading and trailing whitespace characters (with strip()), each line is transformed into a list of strings by calling split(','), and subsequently added to the entries list. Note that such an ad hoc approach to parsing structured files, while attractively simple, is both naive and dangerous: for instance, we do not protect ourselves against empty or corrupt lines lacking entries. String variables stored in the file, such as a text’s title, might also contain commas, causing parsing errors.
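The failure modes just described can be reproduced without touching the file. The following is only a minimal sketch: the header and record strings are illustrative values modelled on the lines printed above, and naive_split is a hypothetical helper name that merely wraps the strip-and-split step.

```python
# Illustrative header and record, modelled on the lines printed above.
header = 'fname,author,title,editor,publisher,pubplace,date'
record = ('1H4,William Shakespeare,"Henry IV, Part I",Barbara A Mowat,'
          'Washington Square Press,New York,1994')


def naive_split(line):
    """Split a line on every comma, ignoring quotation marks (hypothetical helper)."""
    return line.strip().split(',')


names = naive_split(header)
fields = naive_split(record)

# The quoted title is cut in two, so the record has one field too many.
print(len(names), len(fields))
print(fields[2:4])

# Pairing names with fields now silently mislabels everything after the title.
mislabelled = dict(zip(names, fields))
print(mislabelled['editor'])

# An empty line raises no error either; it simply becomes a one-element list.
print(naive_split('\n'))
```

Nothing in this sketch raises an exception, which is arguably the real danger: the damage only surfaces later, when a field turns out to hold the wrong value.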
Additionally, the header is neither automatically detected nor properly handled. Therefore, it is recommended to employ packages specifically suited to the task of reading and parsing CSV files, which offer well-tested, flexible, and more robust parsing procedures. Python’s standard library, for example, ships with the csv module, which can help us parse such files in a much safer way. Have a look at the following code block. Note that we explicitly set the delimiter parameter to a comma (',') for demonstration purposes, although this is in fact already the parameter’s default value in the reader function’s signature.

    import csv

    entries = []
    with open(csv_file) as stream:
        reader = csv.reader(stream, delimiter=',')
        for fname, author, title, editor, publisher, pubplace, date in reader:
            entries.append((fname, title))

    for entry in entries[:5]:
        print(entry)

    ('fname', 'title')
    ('1H4', 'Henry IV, Part I')
    ('1H6', 'Henry VI, Part 1')
    ('2H4', 'Henry IV, Part 2')
    ('2H6', 'Henry VI, Part 2')

The code is very similar to our ad hoc approach, the crucial difference being that we leave the error-prone parsing of commas to the csv.reader. Note that each line returned by the reader immediately gets “unpacked” into a long list of seven variables, corresponding to the fields in the file’s header. However, most of these variables are not actually used in the subsequent code. To shorten such lines and improve their readability, one could also rewrite the unpacking statement as follows:

    entries = []
    with open(csv_file) as stream:
        reader = csv.reader(stream, delimiter=',')
        for fname, _, title, *_ in reader:
            entries.append((fname, title))

    for entry in entries[:5]:
        print(entry)

    ('fname', 'title')
    ('1H4', 'Henry IV, Part I')
    ('1H6', 'Henry VI, Part 1')
    ('2H4', 'Henry IV, Part 2')
    ('2H6', 'Henry VI, Part 2')

The for-statement in this code block adds a bit of syntactic sugar to conveniently extract the variables
of interest (and can be useful to process other sorts of sequences too). First, it combines regular variable names with underscores to unpack a list of variables. These underscores allow us to ignore the variables we do not need. Below, we exemplify this convention by showing how to indicate interest only in the first and third element of a collection:

    a, _, c, _, _ = range(5)
    print(a, c)

    0 2

Next, what does this *_ mean? The use of these asterisks is exemplified by the following lines of code:

    a, *l = range(5)
    print(a, l)

    0 [1, 2, 3, 4]

Using this method of “tuple unpacking,” we unpack an iterable by splitting it into a “first, rest” tuple, which is roughly equivalent to [8]:

    seq = range(5)
    a, l = seq[0], seq[1:]
    print(a, l)

    0 range(1, 5)

To further demonstrate the usefulness of such “starred” variables, consider the following example in which an iterable is segmented into a “first, middle, last” triplet:

    a, *l, b = range(5)
    print(a, l, b)

    0 [1, 2, 3] 4

It will be clear that this syntax offers interesting functionality to quickly unpack iterables, such as the lines in a CSV file. In addition to the CSV reader employed above (i.e., csv.reader), the csv module provides another reader object, csv.DictReader, which transforms each row of a CSV file into a dictionary. In these dictionaries, keys represent the column names of the CSV file, and values point to the corresponding cells:

    entries = []
    with open(csv_file) as stream:
        reader = csv.DictReader(stream, delimiter=',')
        for row in reader:
            entries.append(row)

    for entry in entries[:5]:
        print(entry['fname'], entry['title'])

    1H4 Henry IV, Part I
    1H6 Henry VI, Part 1
    2H4 Henry IV, Part 2
    2H6 Henry VI, Part 2
    3H6 Henry VI, Part 3

[8] Readers familiar with programming languages like Lisp or Scheme will feel right at home here, as these “first, rest” pairs are reminiscent of Lisp’s car and cdr operations.
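To experiment with both readers without access to the data file, one can feed them an in-memory text stream instead. The sketch below makes a few assumptions: the sample string is a hypothetical two-record excerpt modelled on the output shown earlier, io.StringIO stands in for the opened file, and the header is consumed explicitly with next() so that it does not end up among the records.

```python
import csv
import io

# Hypothetical in-memory sample mimicking the first lines of the file.
sample = (
    'fname,author,title,editor,publisher,pubplace,date\n'
    '1H4,William Shakespeare,"Henry IV, Part I",Barbara A Mowat,'
    'Washington Square Press,New York,1994\n'
    '1H6,William Shakespeare,"Henry VI, Part 1",Barbara A Mowat,'
    'Washington Square Press,New York,2008\n'
)

# csv.reader: consume the header explicitly, then use starred unpacking.
reader = csv.reader(io.StringIO(sample))
header = next(reader)
pairs = [(fname, title) for fname, _, title, *_ in reader]
print(header[:3])
print(pairs)

# csv.DictReader: the header row supplies the keys of each row dictionary.
dict_reader = csv.DictReader(io.StringIO(sample))
rows = list(dict_reader)
print(dict_reader.fieldnames)
print(rows[0]['title'], rows[0]['date'])
```

Unlike the csv.reader examples above, where the header row appears as the first entry, both variants here keep the header apart from the records: once through next(), once through the DictReader itself.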