Humanities Data Analysis

DOCUMENT INFORMATION

Basic information

Pages	3
File size	65.37 KB

Content


“125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:00 — page 48 — #17
48 • Chapter 2

Researchers in the humanities nowadays put a lot of time and effort into creating digital data sets for their research, such as scholarly editions with rich markup encoded in XML. Nevertheless, once data have been annotated, it can be challenging to subsequently extract the textual contents and to fully exploit the information painstakingly encoded. It is therefore crucial to be able to parse XML efficiently. Luckily, Python provides the necessary functionality for this. In this section, we will make use of some of the functionality included in the lxml library, which is commonly used for XML parsing in the Python ecosystem, although a number of alternative packages exist.

It should be noted that there exist languages such as XSLT (Extensible Stylesheet Language Transformations [20]) which are particularly well equipped to manipulate XML documents. Depending on the sort of task you wish to achieve, these languages might make certain transformations and manipulations of XML documents easier than Python does. Languages such as XSLT, on the other hand, are less general programming languages and might lack support for more generic functionality.

2.6.1 Parsing XML

We first import lxml's central module, etree:

    import lxml.etree

After importing etree, we can start parsing the XML data that represents our sonnet:

    tree = lxml.etree.parse('data/sonnets/18.xml')
    print(tree)

We have now read and parsed our sonnet via the lxml.etree.parse() function, which accepts the path to a file as a parameter. We have also assigned the XML tree structure returned by the parse function to the tree variable, thus enabling subsequent processing. If we print the variable tree as such, we do not get to see the raw text from our file, but rather an indication of tree's object type, i.e., the lxml.etree._ElementTree type. To have a closer look at the original XML as printable text, we transform the tree into a string object using lxml.etree.tostring(tree) before printing it (note that the initial line from our file, containing the XML metadata, is not included anymore):

    # decoding is needed to transform the bytes object into an actual string
    print(lxml.etree.tostring(tree).decode())

[...]

    [...] day
    element: rhyme -> temperate
    element: rhyme -> May
    element: rhyme -> date
    element: rhyme -> shines
    element: rhyme -> dimm'd
    element: rhyme -> declines
    element: rhyme -> untrimm'd
    element: rhyme -> fade
    element: rhyme -> ow'st
    element: rhyme -> shade
    element: rhyme -> grow'st
    element: rhyme -> see
    element: rhyme -> thee

Until now, we have been iterating over the elements in their simple order of appearance: we haven't really been exploiting the hierarchy of the XML tree yet. Let us see now how to actually navigate and traverse the XML tree.

[21] http://www.w3schools.com/xml/xml_xpath.asp

First, we select the root node or top node, which forms the beginning of the entire tree:

    root = tree.getroot()
    print(root.tag)

    sonnet

As explained above, the root element in our XML file has two additional attributes. The values of an element's attributes can be accessed via its attrib attribute, which exposes the attribute information in a dictionary-like fashion, that is, via key-based indexing:

    print(root.attrib['year'])

    1609

Now that we have selected the root element, we can start drilling down the tree's structure. Let us first find out how many child nodes the root element has. The number of children of an element can be retrieved with the function len():

    print(len(root))

    15

The root element has fifteen children, that is: fourteen [...] elements and one [...] element. Elements with children function like iterable collections, and thus their children can be iterated over as follows:

    children = [child.tag for child in root]

How could we now extract the actual text in our poem while iterating over the tree? Could we simply call the text property on each element?

    print('\n'.join(child.text or '' for child in root))

    Shall I compare thee to a summer's
    Thou art more lovely and more
    Rough winds shake the darling buds of
    And summer's lease hath all too short a
    Sometime too hot the eye of heaven
    And often is his gold complexion
    And every fair from fair sometime
    By chance, or nature's changing course,
    But thy eternal summer shall not
    Nor lose possession of that fair thou
    Nor shall Death brag thou wander'st in his
    When in eternal lines to time thou
    So long as men can breathe or eyes can
    So long lives this, and this gives life to

[...]
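The printed lines above stop just before each rhyme word. That is consistent with how the text property behaves: it only holds the text that precedes an element's first child. The following is a minimal sketch of one way to recover the full lines, not necessarily the approach the chapter goes on to take. It parses a small in-memory document instead of data/sonnets/18.xml, and the element names line and volta, the n attributes, and the punctuation after each rhyme word are assumptions made for illustration. The fallback to the standard library's xml.etree.ElementTree is also an addition, so that the sketch runs where lxml is not installed.

```python
# Minimal sketch; the sample markup below is a hypothetical stand-in for the
# real file, and <line>, <volta/> and the n attribute are assumed names.
try:
    import lxml.etree as etree  # the library used in this section
except ImportError:
    # Assumption: the standard library offers the same calls used below.
    import xml.etree.ElementTree as etree

SAMPLE = """<sonnet year="1609">
  <line n="1">Shall I compare thee to a summer's <rhyme>day</rhyme>?</line>
  <line n="2">Thou art more lovely and more <rhyme>temperate</rhyme>:</line>
  <volta/>
</sonnet>"""

# fromstring() parses a string and returns the root element directly
root = etree.fromstring(SAMPLE)

print(root.tag)             # name of the root element
print(root.attrib["year"])  # attributes are accessed like dictionary keys
print(len(root))            # number of direct children

# .text holds only the text before the first child element,
# so each line is cut off just before its rhyme word
partial = [child.text or "" for child in root]

# itertext() yields every text fragment inside an element, including the
# text of nested children and the text that follows them
full = ["".join(child.itertext()) for child in root]

# iter() walks the whole subtree, so nested rhyme elements are found too
rhymes = [rhyme.text for rhyme in root.iter("rhyme")]

print(partial)
print(full)
print(rhymes)
```

Joining the fragments from itertext() keeps the sketch close to the list comprehension used above, at the cost of producing an empty string for elements without any text, such as the empty third child in the sample.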

Date posted: 20 November 2022, 11:27