Humanities Data Analysis

“125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:00 — page 33 — #2

Parsing and Manipulating Structured Data

Figure 2.1: Network of Hamlet characters. Characters must interact at least ten times to be included.

2.2 Plain Text

Enormous amounts of data are now available in a machine-readable format. Much of this data is of interest to researchers in the humanities and interpretive social sciences. Major resources include Project Gutenberg [2], Internet Archive [3], and Europeana [4]. Such resources present data in a bewildering array of file formats, ranging from plain, unstructured text files to complex, intricate databases. Additionally, repositories differ in how they organize access to their collections: organizations such as Wikipedia provide nightly dumps [5] of their databases, which users can download to their own machines. Other institutions, such as Europeana [6], provide access to their data through an Application Programming Interface (API), which allows interested parties to search collections using specific queries. Accessing and dealing with pre-existing data, instead of creating it yourself, is an important skill for doing data analyses in the humanities and allied social sciences.

[2] https://www.gutenberg.org/
[3] https://archive.org
[4] http://www.europeana.eu/portal/en
[5] https://dumps.wikimedia.org/backup-index.html
[6] http://www.europeana.eu/

Digital data are stored in file formats reflecting conventions which enable us to exchange data. One of the most common file formats is the “plain text” format, where data take the form of a series of human-readable characters. In Python, we can read such plain text files into objects of type str. The chapter’s data are stored as a compressed tar archive, which can be decompressed using Python’s standard library tarfile:

    import tarfile

    # Open the compressed archive and unpack it into the data directory
    tf = tarfile.open('data/folger.tar.gz', 'r')
    tf.extractall('data')

Subsequently, we read a single plain text file
into memory, using:

    file_path = 'data/folger/txt/1H4.txt'
    stream = open(file_path)
    contents = stream.read()
    stream.close()
    print(contents[:300])

    Henry IV, Part I
    by William Shakespeare
    Edited by Barbara A. Mowat and Paul Werstine
    with Michael Poston and Rebecca Niles
    Folger Shakespeare Library
    http://www.folgerdigitaltexts.org/?chapter=5&play=1H4
    Created on Jul 31, 2015, from FDT version 0.9.2

    Characters in the Play
    ======================

Here, we open a file object (a so-called stream) to access the contents of a plain text version of one of Shakespeare’s plays (Henry IV, Part 1), which we assign to stream. The location of this file is specified using a path as the single argument to the function open(). Note that the path is a so-called “relative path,” which indicates where to find the desired file relative to Python’s current position in the computer’s file system. By convention, plain text files take the .txt extension to indicate that they contain plain text. This, however, is not obligatory. After opening a file object, the actual contents of the file are read as a string object by calling the method read(). Printing the first 300 characters shows that we have indeed obtained a human-readable series of characters. Crucially, file connections should be closed as soon as they are no longer needed: calling close() ensures that the data stream to the original file is cut off. A common and safer shortcut for opening, reading, and closing a file is the following:

    with open(file_path) as stream:
        contents = stream.read()
    print(contents[:300])

    Henry IV, Part I
    by William Shakespeare
    Edited by Barbara A. Mowat and Paul Werstine
    with Michael Poston and Rebecca Niles
    Folger Shakespeare Library
    http://www.folgerdigitaltexts.org/?chapter=5&play=1H4
    Created on Jul 31, 2015, from FDT version 0.9.2

    Characters in the Play
    ======================

The use of the with statement
in this code block ensures that stream will be automatically closed after the indented block has been executed. The rationale of such a with block is that it will execute all code under its scope; once done, however, it will close the file no matter what has happened, i.e., even if an error was raised while reading the file. At a more abstract level, this use of with is an example of a so-called “context manager” in Python, which allows us to allocate and release resources exactly when and how we want to. The code example above is therefore both very safe and the preferred way to open and read files: without it, a reading error will abort the execution of our code before the file has been closed, and no strict guarantees can be given as to whether the file object will be closed.

Without further specification, stream.read() loads the contents of a file object in its entirety; in other words, it reads all characters until it reaches the end of the file (EOF).

It is important to realize that even the seemingly simple plain text format relies on a good many conventions: internally, computers can only store binary information, i.e., arrays of zeros and ones. To store characters, then, we need some sort of “map” specifying how characters in plain text files are to be encoded using numbers. Such a map is called a “character encoding standard.” The oldest character encoding standard is the ASCII standard (short for American Standard Code for Information Interchange). This standard has been dominant in the world of computing and specifies an influential mapping for a restricted set of 128 basic characters drawn from the English-language alphabet, including some numbers, whitespace characters, and punctuation. (128, or 2^7, distinct characters is the maximum number of symbols which can be encoded using seven bits per character.)
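To make the idea of a character "map" concrete, the following minimal sketch shows how Python exposes the numbers behind ASCII characters. The sample strings and the helper is_ascii() are hypothetical and serve only as illustration; ord(), format(), and str.encode() are standard Python built-ins.

```python
# Illustrative sketch: the sample strings and is_ascii() are assumptions,
# not part of the chapter's own code.
text = "Hamlet"

# ord() returns a character's Unicode code point; for the 128 ASCII
# characters this number is identical to the ASCII code.
codes = [ord(ch) for ch in text]

# Every ASCII code is smaller than 2**7, so it fits in seven binary digits.
bits = [format(code, "07b") for code in codes]

# Encoding the string as ASCII yields one byte per character.
ascii_bytes = text.encode("ascii")


def is_ascii(s):
    """Return True if every character of s can be encoded as ASCII."""
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


for ch, code, pattern in zip(text, codes, bits):
    print(ch, code, pattern)

print(is_ascii("Hamlet"))   # plain English letters only
print(is_ascii("Brontë"))   # contains a character outside the ASCII range
```

A name such as "Brontë" cannot be represented in ASCII at all, which is one way to see why the restricted 128-character set became a limitation.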
ASCII proved very important in the early days of computing, but in recent decades it has gradually been replaced by more inclusive encoding standards that also cover the characters used in other, non-Western languages. Nowadays, the world of computing increasingly relies on the so-called Unicode standard, which covers over 128,000 characters. The Unicode standard is implemented in a variety of actual encoding standards, such as UTF-8 and UTF-16. Fortunately for everyone (dealing with different encodings is very frustrating), UTF-8 has emerged as the standard for text encoding. As UTF-8 is a cleverly constructed superset of ASCII, all valid ASCII text files are valid UTF-8 files.

Python nowadays assumes that any files opened for reading or writing in text mode use the default encoding on a computer’s system; on macOS and Linux distributions, this is typically UTF-8, but this is not necessarily the case on Windows. In the latter case, you might want to supply an extra encoding argument to open() and make sure that you load a file using the proper encoding.
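As a minimal sketch of supplying the encoding explicitly, the example below writes a small file as UTF-8 and reads it back twice, once with the matching encoding and once with a deliberately wrong one. The sample text, the temporary file name, and the choice of latin-1 as the mismatched encoding are assumptions made purely for illustration; they do not come from the chapter's data.

```python
import os
import tempfile

# Hypothetical sample text containing characters outside the ASCII range.
sample = "Brontë, Dvořák, Ibsen"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")

    # Write the text with an explicit encoding instead of the system default.
    with open(path, "w", encoding="utf-8") as stream:
        stream.write(sample)

    # Reading with the same encoding recovers the original text.
    with open(path, encoding="utf-8") as stream:
        contents = stream.read()

    # Reading with a mismatched encoding raises no error here,
    # but silently produces garbled characters.
    with open(path, encoding="latin-1") as stream:
        garbled = stream.read()

# For ASCII-only text, ASCII and UTF-8 produce exactly the same bytes.
ascii_only = "Henry IV, Part I"
same_bytes = ascii_only.encode("ascii") == ascii_only.encode("utf-8")

print(contents == sample)
print(garbled == sample)
print(same_bytes)
```

Passing encoding='utf-8' on every call to open() is slightly more verbose, but it makes the behavior of a script independent of the machine it happens to run on.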

Date posted: 20/11/2022, 11:27