CHAPTER 2 METHODOLOGY, DATA, AND GENERAL FINDINGS
2.5.2. Corpus Annotation and Data Processing
2.5.2.2. The process of annotating and processing the data
The clause complexes are grouped according to the type of magazine, so for the 10 types of magazines there are 10 groups of English clause complexes and 10 groups of Vietnamese clause complexes. Each group is then coded with a unique three-digit ID: the first digit represents the language and takes one of two values, 1 for English or 2 for Vietnamese; the other two digits represent the group and range from 01 to 10.
The group IDs are as below:
01. Natural sciences: English group: 101, Vietnamese group: 201
02. Healthcare and medicine: English group: 102, Vietnamese group: 202
03. Law: English group: 103, Vietnamese group: 203
04. Economics, Banking, and Finance: English group: 104, Vietnamese group: 204
05. Social Sciences: English group: 105, Vietnamese group: 205
06. ICT: English group: 106, Vietnamese group: 206
07. Art and the Mass Media: English group: 107, Vietnamese group: 207
08. Policies and Political Affairs: English group: 108, Vietnamese group: 208
09. Sports and Entertainment: English group: 109, Vietnamese group: 209
10. Linguistics and Literature: English group: 110, Vietnamese group: 210
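The three-digit coding scheme can be illustrated with a minimal sketch. The function names `make_group_id` and `parse_group_id` are hypothetical and are not part of SysFan; they only restate the rule that the first digit encodes the language and the last two digits encode the group.

```python
# Hypothetical helpers illustrating the group ID scheme; not part of SysFan.
LANGUAGES = {1: "English", 2: "Vietnamese"}
GROUPS = {
    1: "Natural sciences",
    2: "Healthcare and medicine",
    3: "Law",
    4: "Economics, Banking, and Finance",
    5: "Social Sciences",
    6: "ICT",
    7: "Art and the Mass Media",
    8: "Policies and Political Affairs",
    9: "Sports and Entertainment",
    10: "Linguistics and Literature",
}


def make_group_id(language: int, group: int) -> str:
    """Build a three-digit ID: one language digit plus a two-digit group number."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language code: {language}")
    if group not in GROUPS:
        raise ValueError(f"unknown group number: {group}")
    return f"{language}{group:02d}"


def parse_group_id(group_id: str) -> tuple:
    """Return (language name, group name) for a three-digit group ID."""
    if len(group_id) != 3 or not group_id.isdecimal():
        raise ValueError(f"not a three-digit ID: {group_id!r}")
    language, group = int(group_id[0]), int(group_id[1:])
    if language not in LANGUAGES or group not in GROUPS:
        raise ValueError(f"ID out of range: {group_id!r}")
    return LANGUAGES[language], GROUPS[group]
```

For instance, `make_group_id(1, 6)` gives the ID of the English ICT group, and `parse_group_id("210")` recovers the Vietnamese Linguistics and Literature group.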
Before being input from the corpus into SysFan for annotation and processing, the clause complexes are chunked into their constituent clauses. By default, SysFan uses “|||” as the clause complex divider and “||” as the clause divider. Accordingly, before the text is entered as a source, each clause complex in a group is delimited with “|||” at both ends to mark its grammatical boundary, and its clauses are separated with “||”.
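The effect of the two dividers can be sketched as follows. This is only an illustration of the delimiter convention described above, not SysFan's own implementation; the function name `chunk_text` and the sample sentences are invented for the example.

```python
# Illustrative only: mimics the "|||" / "||" convention, not SysFan itself.
COMPLEX_DIVIDER = "|||"  # marks clause complex boundaries
CLAUSE_DIVIDER = "||"    # marks clause boundaries inside a complex


def chunk_text(text: str) -> list:
    """Split delimited text into clause complexes, each a list of clauses."""
    complexes = []
    # Split on the longer divider first so "|||" is not read as "||" + "|".
    for raw_complex in text.split(COMPLEX_DIVIDER):
        clauses = [c.strip() for c in raw_complex.split(CLAUSE_DIVIDER) if c.strip()]
        if clauses:
            complexes.append(clauses)
    return complexes


# Invented sample text, not taken from the corpus.
sample = "||| He said || that he would come ||| She left || because it was late |||"
for clauses in chunk_text(sample):
    print(clauses)
```

Splitting on “|||” before “||” matters because the shorter divider is a substring of the longer one.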
With the assistance of SysFan, these 20 groups of clause complexes are input into the database as Text for coding and analysis. Once all the clause complexes and clauses have been separated, SysFan creates the clause complex records in one click on “Create all records”, and each clause is then automatically assigned an ID with reference to the group ID (cf. Fig. 34).
Fig. 34: Chunking a group into clause complexes
The more delicate chunking of clause complexes into clauses is done by clicking the image to the left of each clause complex ID. Once the image is clicked, the clause complex opens in a new window, which shows the clause complex in the main text window and its element clauses in the analysis (cf. Fig. 35).
Fig. 35: Chunking a clause complex into clauses
When the clause complexes are chunked into element clauses, the element clauses are also automatically assigned IDs by the software and can be analyzed at up to seven levels of dependency between clauses, as can be seen in Fig. 36 below:
Fig. 36: Seven possible levels of chunking clause complexes
The clauses are then further analyzed and annotated with two systems of symbols for their two properties: their dependency status and their logico-semantic function in the clause complex. The symbols representing these properties are already set in the software; the user simply goes to the appropriate cell for the clause to be labeled, right-clicks to see the list of values, and then left-clicks to select the appropriate value.
As regards the labels for dependency relations, the clauses are labeled 1, 2, 3, … for paratactically related clauses and α, β, γ, … for hypotactically related clauses.
As regards the labels for logico-semantic relationships, the clauses are labeled differently in expansion and in projection. In expansion, the expanded clauses are not labeled, while the expanding clauses are marked “=” for elaborating, “+” for extending, and “x” for enhancing. In projection, the projecting clauses are not labeled, while the projected clauses are labeled with a double quotation mark (") for locution and a single quotation mark (') for idea.
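The two labeling systems can be summarized in a small lookup sketch. It assumes, for illustration only, that a clause label is written as an optional logico-semantic marker followed by a dependency symbol (for example “+2” or “xβ”), and that a double quotation mark stands for locution and a single one for idea; the function name `describe_label` is hypothetical, and this is not how SysFan stores its annotations.

```python
# Hypothetical decoder for combined labels such as "+2" or "xβ".
# The combined-string format is an assumption made for this sketch.
LOGICO_SEMANTIC = {
    "=": ("expansion", "elaborating"),
    "+": ("expansion", "extending"),
    "x": ("expansion", "enhancing"),
    '"': ("projection", "locution"),
    "'": ("projection", "idea"),
}


def describe_label(label: str) -> dict:
    """Decode a clause label into its taxis and logico-semantic relation."""
    marker = label[:1].lower()  # accept "X" as well as "x" for enhancing
    if marker in LOGICO_SEMANTIC:
        rel_type, subtype = LOGICO_SEMANTIC[marker]
        rest = label[1:]
    else:
        # Expanded and projecting clauses carry no logico-semantic marker.
        rel_type, subtype = None, None
        rest = label
    if rest.isdecimal():
        taxis = "parataxis"   # 1, 2, 3, ...
    elif len(rest) == 1 and "α" <= rest <= "ω":
        taxis = "hypotaxis"   # α, β, γ, ...
    else:
        raise ValueError(f"unrecognized label: {label!r}")
    return {"taxis": taxis, "type": rel_type, "subtype": subtype}
```

Under these assumptions, `describe_label("xβ")` describes a hypotactic enhancing clause, while `describe_label("1")` describes an unmarked initiating clause in a paratactic complex.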
The process of labeling the element clauses of the clause complexes is shown below:
Fig. 37: Labeling the clauses in analysis
The data can also be processed into features of the logico-semantic relationships in clause complexes: the frequency is counted, the name of the logico-semantic relationship is attached to the complex, and a chart illustrates the relations and their distribution:
Fig. 38: Distribution maps of clause complex relation types (Wu, 2000)
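The frequency count behind such a distribution can be approximated outside the software with a short sketch. The function name `relation_frequencies` is hypothetical and the sample labels are invented for illustration; they are not results from the corpus.

```python
from collections import Counter


def relation_frequencies(labels: list) -> dict:
    """Map each relation type to (count, percentage), most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        relation: (n, round(100 * n / total, 1))
        for relation, n in counts.most_common()
    }


# Invented sample labels, not data from the study.
sample_labels = [
    "enhancing", "extending", "enhancing", "locution",
    "elaborating", "enhancing", "idea", "extending",
]
for relation, (n, pct) in relation_frequencies(sample_labels).items():
    print(f"{relation}: {n} ({pct}%)")
```

The same tallies could then be charted or compared across the English and Vietnamese groups.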
The results of data processing are stored in the database for searching and sorting. They can also be exported into 20 Excel-formatted files for quick and accurate sorting and searching. These functions are of great use in that they assist the researcher in classifying, locating, and extracting the data needed for observing patterns and trends.