
Humanities Data Analysis





“125-85018_Karsdrop_Humanities_ch01_3p” — 2020/8/19 — 11:00 — page 46 — #15

[...] elements. The method returns the keys and their counts in the form of (key, value) tuples.

2.6 XML

In digital applications across the humanities, XML, or the eXtensible Markup Language [18], is the dominant format for modeling texts, especially in the field of Digital Scholarly Editing, where scholars are concerned with the electronic editions of texts (Pierazzo 2015). XML is a powerful and very common format for enriching (textual) data. It is a so-called “markup language”: it specifies a syntax for “semantic” data annotations, which provide a means to add layers of meaningful, descriptive metadata on top of the original, raw data in a plain text file. XML, for instance, allows making explicit the function or meaning of the words in documents. Reading a play as plain text, to give but one example, does not provide any formal cues as to which scene or act a particular utterance belongs to, or by which character the utterance was made. XML allows us to keep track of such information by making it explicit.

The syntax of XML is best explained through an example, since it is very intuitive. Let us consider the following short, yet illustrative example using the well-known “Sonnet 18” by Shakespeare:

    # Read the XML file as plain text and display it unchanged.
    with open('data/sonnets/18.xml') as stream:
        xml = stream.read()

    print(xml)

    <?xml version="1.0"?>
    <sonnet author="William Shakespeare" year="1609">
      <line n="1">Shall I compare thee to a summer's <rhyme>day</rhyme>?</line>
      <line n="2">Thou art more lovely and more <rhyme>temperate</rhyme>:</line>
      <line n="3">Rough winds shake the darling buds of <rhyme>May</rhyme>,</line>
      <line n="4">And summer's lease hath all too short a <rhyme>date</rhyme>:</line>
      <line n="5">Sometime too hot the eye of heaven <rhyme>shines</rhyme>,</line>
      <line n="6">And often is his gold complexion <rhyme>dimm'd</rhyme>;</line>
      <line n="7">And every fair from fair sometime <rhyme>declines</rhyme>,</line>
      <line n="8">By chance, or nature's changing course, <rhyme>untrimm'd</rhyme>;</line>
      <volta/>
      <line n="9">But thy eternal summer shall not <rhyme>fade</rhyme></line>
      <line n="10">Nor lose possession of that fair thou <rhyme>ow'st</rhyme>;</line>
      <line n="11">Nor shall Death brag thou wander'st in his <rhyme>shade</rhyme>,</line>
      <line n="12">When in eternal lines to time thou <rhyme>grow'st</rhyme>;</line>
      <line n="13">So long as men can breathe or eyes can <rhyme>see</rhyme>,</line>
      <line n="14">So long lives this, and this gives life to <rhyme>thee</rhyme>.</line>
    </sonnet>

The first line (<?xml version="1.0"?>) is a sort of “prolog” declaring the exact version of XML we are using; in our case, that is simply version 1.0. Including a prolog is optional according to the XML syntax, but it is a good place to specify additional information about a file, such as its encoding (<?xml version="1.0" encoding="utf-8"?>). When provided, the prolog should always be on the first line of an XML document. It is only after the prolog that the actual content comes into play.

As can be seen at a glance, XML encodes pieces of text in a similar way as HTML (see section 2.7), using start tags (e.g., <line>, <rhyme>) and corresponding end tags (</line>, </rhyme>), which are enclosed by angle brackets. Each start tag must normally correspond to exactly one end tag, or you will run into parsing errors when processing the file. Nevertheless, XML does allow for “solo” elements, such as the <volta/> tag after line 8 in this example, which specifies the classical “turning point” in sonnets. Such tags are “self-closing,” so to speak, and they are also called “empty” tags.

Importantly, XML tags are not allowed to overlap. The following line would therefore not constitute valid XML:

    <line><rhyme>Nor shall Death brag thou wander'st in his shade,</line></rhyme>

The problem here is that the <rhyme> element should have been closed by the corresponding end tag (</rhyme>) before we can close the parent element using </line>. This limitation results from the fact that XML is a hierarchical markup language: it assumes that we can and should model a text document as a tree of branching nodes. In this tree, elements cannot have more than one direct parent element, because otherwise the hierarchy would be ambiguous. The one exception is the so-called root element, which is the highest node in a tree. Hence, it does not have a parent element itself, and thus cannot have siblings. All non-root elements can have as many siblings and children as needed. All the <line> elements in our sonnet, for example, are siblings, in the sense that they have a direct parent element in common, i.e., the <sonnet> tag. The fact that elements cannot overlap in XML is a constant source of frustration, and people often come up with creative workarounds for the limitation imposed by this hierarchical format.

XML does not come with predefined tags; it only defines a syntax to define those tags. Users can therefore invent and use their own tag set and markup conventions, as long as the documents formally adhere to the XML standard syntax. We say that documents are “well-formed” when they conform completely to the XML standard, which is something that can be checked using validation applications (see, e.g., the W3Schools validator [19]).

For even more descriptive precision, XML tags can take so-called “attributes,” which consist of a name and a value. The sonnet element, for instance, has two attributes: the attribute names author and year are mapped to the values "William Shakespeare" and "1609" respectively. Names do not take surrounding double quotes but values do; they are linked by an equal sign (=). The name and value pairs inside a single tag are separated by a space character. Only start tags and standalone tags can take attributes (e.g., <volta n="1"/>); closing tags cannot. According to the XML standard, the order in which attributes are listed is insignificant.

[18] https://www.w3.org/XML/
[19] http://www.w3schools.com/xml/xml_validator.asp
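The rule that tags may not overlap can be checked mechanically. The following is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree module as the parser (the text above does not prescribe one) and assuming the tag names line and rhyme; the helper is_well_formed is hypothetical and only wraps the parser's error handling.

```python
import xml.etree.ElementTree as ET

# Properly nested: <rhyme> is closed before its parent <line> is closed.
# (The tag names "line" and "rhyme" are assumptions for illustration.)
nested = "<line>Nor shall Death brag thou wander'st in his <rhyme>shade</rhyme>,</line>"

# Overlapping: <line> is closed while <rhyme> is still open.
overlapping = "<line>Nor shall Death brag thou wander'st in his <rhyme>shade,</line></rhyme>"


def is_well_formed(snippet):
    """Hypothetical helper: True if the parser accepts the snippet."""
    try:
        ET.fromstring(snippet)
    except ET.ParseError:
        # Raised for mismatched or overlapping tags, among other syntax errors.
        return False
    return True


print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```

A parser of this kind only reports whether the syntax rules are respected; it says nothing about whether the chosen tag names are sensible for the text being modeled.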
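The tree structure and the attribute rules described above can be explored in a few lines of Python. This is a sketch under stated assumptions, not a definitive implementation: it uses the standard-library xml.etree.ElementTree module, and it parses an abbreviated, inline stand-in for the sonnet file in which the tag names (sonnet, line, rhyme, volta) and the n attribute are assumed for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the sonnet file. The tag names and the "n"
# attribute are assumptions made for this illustration.
xml = """<?xml version="1.0"?>
<sonnet author="William Shakespeare" year="1609">
  <line n="8">By chance, or nature's changing course, <rhyme>untrimm'd</rhyme>;</line>
  <volta/>
  <line n="9">But thy eternal summer shall not <rhyme>fade</rhyme></line>
</sonnet>"""

root = ET.fromstring(xml)

# The root element has no parent; its attributes behave like a dictionary.
print(root.tag)               # sonnet
print(root.attrib["author"])  # William Shakespeare
print(root.attrib["year"])    # 1609

# The direct children of the root are siblings of one another.
children = [child.tag for child in root]
print(children)               # ['line', 'volta', 'line']

# An empty (self-closing) element has neither text nor child elements.
volta = root.find("volta")
print(volta.text, len(volta))  # None 0

# itertext() descends into nested elements, so the words wrapped in
# <rhyme> are included in the reconstructed line.
line_texts = {line.get("n"): "".join(line.itertext()) for line in root.iter("line")}
print(line_texts["9"])        # But thy eternal summer shall not fade

# Attribute order is insignificant: swapping the order yields equal mappings.
swapped = ET.fromstring('<sonnet year="1609" author="William Shakespeare"/>')
print(swapped.attrib == root.attrib)  # True
```

Reading attributes through a dictionary-like mapping reflects the point that attribute order carries no meaning, whereas iterating over child elements preserves document order.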

Posted: 20/11/2022, 11:27
