Extraction of Disease Events for a Real-time Monitoring System

Minh-Tien Nguyen
Hung Yen University of Technology and Education (UTEHY)
Knowledge Technology Laboratory (KT-Lab)
tiennm@utehy.edu.vn

Tri-Thanh Nguyen
Vietnam National University, Hanoi (VNUH), University of Engineering and Technology (UET)
Knowledge Technology Laboratory (KT-Lab)
ntthanh@vnu.edu.vn

ABSTRACT

In this paper, we propose a method that uses both semantic rules and machine learning to extract infectious disease events from Vietnamese electronic news; the extracted events can be used in a real-time system that monitors the spread of diseases. Our method contains two steps: detecting disease events in unstructured data and extracting the information of the detected events. The detection step uses semantic rules and machine learning to detect a disease event; the extraction step uses Named Entity Recognition (NER), rules, and dictionaries to capture the event's information. The F-score of the detection step is ≈77.33% and the F-score of the extraction step is ≈91.89%. These results are better than those of the experiments in which rules were not used, which indicates that our method is suitable for extracting disease events from Vietnamese text.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data Mining

General Terms
Data Mining; Information Extraction

Keywords
Data Mining; Information Extraction; Event Extraction; Disease Event Extraction; Monitoring Systems

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
SoICT’13, December 05-06, 2013, Danang, Viet Nam
Copyright 2013 ACM 978-1-4503-2454-0/13/12 $15.00
http://dx.doi.org/10.1145/2542050.2542084

1. INTRODUCTION

Information from electronic newspapers provides valuable input for public health surveillance, early outbreak detection, and disease monitoring systems. When the presence of a disease is announced by the government and published on a webpage, it is typically called a disease event or an infectious disease outbreak. Unfortunately, electronic resources on infectious diseases are multidimensional, chaotic, and poorly organized, so extracting useful patterns from these sources is challenging. This paper focuses on two questions: "How can an infectious disease event be detected?" and "How can the information of an infectious disease event be extracted?"

Disease detection and outbreak monitoring matter greatly to society, especially when the diseases are dangerous and highly infectious. Because an infectious disease normally breaks out in a short time and spreads very quickly over a large area, it can create emergencies not only for citizens but also for the government and the economy. Therefore, monitoring infectious disease outbreaks is crucial for preventing and handling diseases and for helping the authorities make suitable decisions.

In this paper, we propose a model that automatically detects and extracts information about human infectious disease events from Vietnamese webpages based on semantic rules and machine learning. The model includes two components: disease event detection and disease event extraction. The first component detects an infectious disease event in free text; the second component then extracts the information of the event (time, disease name, and locations). Subsequently, we combine the extracted information to form an infectious disease event. These
infectious disease events can be the input to our monitoring system for visualization.

Our paper is organized as follows: related work is in Section 2; our method is discussed in Section 3, in which event detection is presented in Section 3.3 and event extraction in Section 3.4; Section 4 gives the experiments and results and explains the sources of some errors; the last section concludes.

2. RELATED WORK

Event extraction was first introduced as an important topic in 1987 at the Message Understanding Conference (MUC) [11]. In MUC, an event is defined as: "an event must have actor, time, place and impact on the surrounding environment". Later, in the Automatic Content Extraction (ACE) program, Doddington George R., et al. gave an event definition: "an event is an activity that was created by participants" and divided events into eight types: Life, Movement, Transaction, Business, Conflict, Contact, Personnel, and Justice [7]. Moreover, as Allan J., et al. stated, an event includes four attributes: modality, polarity (Positive, Negative), genericity (Specific, Generic), and tense (Past, Present, Future, Unspecified) [10]. Grishman R., et al. defined a disease event as a template: Disease Name, Date, Location, Victim Number, Victim Descriptor, Victim Status, Victim Type, Parent Event [9].

Hogenboom F., et al. provided a general guideline on how to select a suitable method for event extraction [2]. The guideline classifies event extraction approaches as data-driven, knowledge-driven, or hybrid. Each approach has both advantages and disadvantages; after comparing their benefits and drawbacks, the authors concluded that the hybrid approach prevails.

Event extraction from unstructured text can be applied in many fields, especially in the disease domain. Grishman R., et al. used 120 linguistic event patterns to analyze sentences and capture the information of a disease event [9]. These
linguistic patterns were built on word classes and the relations among them. For example, the pattern "np (DISEASE) vg (KILL) np (VICTIM)" matches a clause like "Cholera killed 23 inhabitants". An event is recognized based on the trigger of two noun phrases: "outbreak of " and "people died from ". These patterns were applied to extract disease events and achieved an F-score of ≈53.98%. Linguistic patterns can achieve high results if they cover the whole dataset, but preparing them is time-consuming and requires domain experts. Moreover, the patterns must be changed when the data change. Finally, because the patterns were built on word classes, the authors had to identify word classes (e.g., noun phrase, verb phrase), which is more challenging in some other languages (e.g., Vietnamese or Chinese). Because of these drawbacks, we do not follow this approach.

Volkova S., et al. combined entity recognition and sentence classification to extract animal disease events [4]. Their event recognition consists of three main steps: first, entities are recognized in unstructured text; second, sentences are classified based on these entities; finally, the entities within an event sentence are combined into a structured tuple. In this scheme, a true event should contain a disease name and a disease-related verb. The authors obtained a precision of 75% in event tuple recognition and 65% in sentence classification, with features from WordNet and the Google-Set corpus. However, using a list of verbs to confirm an event can badly affect event extraction in Vietnamese, because resources for Natural Language Processing (NLP) are lacking (such as a Vietnamese WordNet or a Google-Set-like corpus for Vietnamese) or the performance of the parsing utility is not high enough. Thus, we do not use this method.

Doan S., et al. built the Global Health Monitor system, which shows the state of disease spread around the world [5]. The
system includes three main steps: topic classification, Named Entity Recognition (NER), and disease/location detection. A Naïve Bayes classifier is used in topic classification with a precision of ≈88.10%; the entity recognition step, which uses a Support Vector Machine (SVM), achieved an F-score of ≈76.97%; and the final step achieved a precision of ≈93.40% with the BioCaster Ontology. However, this system has some limitations. The first is location ambiguity: because some locations are not mentioned clearly in the input data (only provinces or cities are given, without the country name), the system cannot recognize the location exactly. Furthermore, the BioCaster system cannot detect new diseases or locations that are not in the ontology.

Our approach uses the advantages of both the semantic rule-based method and machine learning in two main components: event detection and event extraction. In event detection, the semantic rules play the role of a data filter, while the classification model decides whether a news article contains an event or not. Because our rules are used as a filter, they are simpler than those of Grishman R., et al. [9]. A rule in our study is a short phrase composed of a noun phrase and a verb phrase instead of a complete sentence. Moreover, we do not use a list of verbs to confirm events as Volkova S., et al. did [4], because that method depends on the coverage of the verbs, and building the verb list takes much time. In event extraction, our approach is similar to the method of Doan S., et al. [5]. We use rules, a disease dictionary, an NER tool, and a location dictionary to extract the information of a disease event.

In addition, several systems extract events from online news. Grishman R., et al. built the Proteus-BIO system, where users can follow infectious diseases [8]. Data in this system are collected from webpages and from disease reports of the World Health Organization and ProMed. Collier N., et al. built the BioCaster system
which follows several event types, especially disease events, around the world. Similarly, HealthMap was built by Freifeld Clark C., et al.; its users can monitor diseases all over the world [6].

3. INFECTIOUS DISEASE EVENT DETECTION AND EXTRACTION

3.1 Infectious Disease Event Characteristics

An investigation of our data domain indicates that an infectious disease event may contain a disease name, time, locations, and victims. In some cases, it may have additional information such as the methods or the environment of infection. Although Grishman R., et al. [9] used a disease name, the time and the location of the outbreak, the number of affected victims, and the type of victims as the information of a disease event, we focus on only three basic pieces of information: the time, the locations of the outbreak, and the name of the infectious disease. We ignore the methods and environment of infection because we collect data from webpages instead of medical reports, so such information is not clearly mentioned in most cases. Moreover, an event in MUC must include an actor [11]; in our study, the actor is equivalent to a disease, so we use the disease name instead of the actor.

In addition, a closer examination of disease news articles showed that a disease name is sometimes similar to a symptom, which is one source of confusion in event extraction. For example, ‘pneumonia’ is a symptom of ‘bird flu’ (A/H5N1), but it was recognized as a disease in some cases.

3.2 Problem Definition

The infectious disease event extraction problem can be defined as follows:

Input: a news article.
Output: whether the news article contains an infectious disease event or not.
If yes, extract the information of the event.

In our research, an infectious disease event E is defined as a tuple that has three elements:

E = <name, time, place>    (1)

where name is the name of the infectious disease mentioned in the disease news article; time is the time when the disease breaks out; and place is a set of locations where the disease appears.

Footnotes: World Health Organization: http://www.who.int/csr/don/en/; ProMed: http://www.promedmail.org/; BioCaster: http://born.nii.ac.jp; HealthMap: http://www.healthmap.org

Figure 1: Steps of disease event extraction

We propose a process to extract the information of a disease event, as illustrated in Figure 1. The extraction process includes five components: the crawler retrieves data from the Internet; the pre-processing component extracts the main content from the web pages returned by the crawler (the details of this module are described in Section 4 and Table 3); the event detector decides whether a news article contains a disease event or not; the event extractor captures the information of the event in a given news article (if any); finally, the visualization component plots the disease events on an online Geographic Information System (GIS) map. In this paper, we focus on two key components, the event detector and the event extractor, which are described in detail in Section 3.3 and Section 3.4.

3.3 Event Detector

The goal of the Event Detector is to judge whether a given news article contains a disease event. Given a news article, it determines whether the article contains a disease event (EVENT) or not (NOT_EVENT) by using rules (for title filtering) and machine learning (for classification). The process is illustrated in Figure 2.

Figure 2: Event detector components

The Event Detector consists of two modules: a data filter and a classifier. The filter module receives data from the pre-processing component, where HTML tags are removed to get the main content. The module then filters disease news articles by checking their titles. Subsequently, the data are passed to the classifier, which decides whether a news article contains an event or not.

3.3.1 Filtering Rules

As mentioned above, the event detection component has two modules, a data filter and a classifier, in which the filter uses semantic rules to reduce the number of news articles passed to classification. We examined the domain data carefully and found that most news titles express the main content of their articles. This means that the title of a news article carries enough evidence to signal the existence of a disease event. Therefore, we use rules on titles to filter disease-related news articles.

We gathered statistics on a large dataset of news articles from the "Sức khỏe" (Health) category of the "Báo mới" news website to find a set of frequent words (and phrases). There are 34 frequent words; some of the most frequent ones are given in Table 1, where the third column counts the number of articles containing the word in the second column. We denote this set the Frequent-words set. We observed that most news articles relating to a disease event contain words in the Frequent-words set. Therefore, our idea is to build semantic rules by combining words in the Frequent-words set in order to filter the input data. As a result, we proposed two patterns, named Pattern 1 and Pattern 2, that represent all our semantic rules. These patterns are shown below:

Pattern 1 = noun phrase # verb phrase    (2)

where noun phrase and verb phrase are in the Frequent-words set. Pattern 1 is illustrated in Example 1.

Example 1:
bệnh nhân tử vong # nhiễm (deceased patient # infected)
dịch tả # bùng phát (cholera # broke out)

Table 1: List of frequent words
No   Word                               Articles
1    Nhiễm (infect)                     10005
2    Dịch (disease)                     10000
3    Dương tính (is positive)           5269
4    Lây lan (spread)                   4133
5    Bùng phát (outbreak)               4039
6    Tái phát (recurrence)              2514
7    Ổ bệnh (source of infection)       2340
8    Ổ dịch (disease source)            1900
9    Dịch tả (cholera)                  1853
10   Khử trùng (disinfection)           1143

Pattern 2 = disease name # verb phrases    (3)

where:
• disease name is retrieved from the
BioCaster Ontology [3] and the circular of the Ministry of Health of Vietnam dated June 24th, 2011;
• verb phrases are in the Frequent-words set.

Examples of rules following Pattern 2 are given in Example 2.

Example 2:
tiêu chảy cấp # nhiễm (acute diarrhea # infected)
tiêu chảy cấp # phát (acute diarrhea # discovered)
tiêu chảy cấp # lây lan (acute diarrhea # spread)
tiêu chảy cấp # bùng phát (acute diarrhea # broke out)
tiêu chảy cấp # chết (tử vong) (acute diarrhea # died)
tiêu chảy cấp # dương tính (acute diarrhea # is positive)

Both patterns have two elements separated by the character "#". We built 43 rules from Pattern 1 by combining 52 noun phrases and 10 verb phrases; these noun phrases and verb phrases are all in the Frequent-words set. Similarly, we used a disease name and a verb phrase to create each rule following Pattern 2. With 186 disease names from the disease dictionary and the verb phrases in the Frequent-words set, the number of rules conforming to Pattern 2 is 186. Some verb phrases in Pattern 1 and Pattern 2 are the same. After building the rule set, we had 229 rules in total. The related articles are retrieved by these rules and passed to the classifier.

Footnotes: "Sức khỏe" category: http://www.baomoi.com/Home/SucKhoe.epi; "Báo mới": http://www.baomoi.com; Ministry of Health of Vietnam: http://www.moh.gov.vn/

Table 2: List of features
No   Feature
1    Dịch tay chân miệng (hand, foot, and mouth disease)
2    Tiêu chảy (diarrhea)
3    Trẻ tử vong (the child died)
4    Ổ dịch (disease source)
5    Dương tính (is positive)
6    Dịch cúm gia cầm (bird flu)
7    Ca tử vong (deaths)
8    Bùng phát dịch (outbreak)
9    Dịch cúm (flu)
10   Bệnh nhân tử vong (the patient died)

Figure 3: Event extractor component

3.3.2 Machine Learning Application

The classification model assigns a news article either the EVENT or the NOT_EVENT label. Our investigation of the input data suggests that the title and abstract of a disease news article carry enough information to represent its content; therefore, these elements are used to create the feature vector. In the data preparation step, articles are
manually tagged with the labels EVENT and NOT_EVENT. After that, features are generated using 2-grams, 3-grams, and 4-grams. As a result, we obtain 4,552 features, which are used for classification. Some features are shown in Table 2. We used the Maximum Entropy Model (http://www.cs.princeton.edu/maxent) as the classifier. The news articles labeled EVENT become the input to the Event Extractor component.

3.4 Event Extractor

The Event Extractor is the second of the two key components; it extracts the information of a disease event. The component is illustrated in Figure 3. Event extraction includes three modules: time extraction, disease extraction, and location extraction. The first module uses rules to extract the time information; the second module uses a disease dictionary to extract the disease information; and the final module combines NER and a location dictionary to capture the place information. Finally, we combine the extracted information to form a disease event and store it in an event database.

3.4.1 Time Extraction

Our investigation of the dataset suggests that time information can be captured by rules and that it is either absolute or relative. In the absolute case, the time has the format DD/MM/YYYY, so we use a Regular Expression (RE) to extract it. In the relative case, it always contains two elements: a prefix and the time. The prefix is a set of words that indicates relative time, and the time is usually in the Vietnamese date form DD/MM/YYYY. Therefore, we use a rule [1] to calculate the absolute time. The time rule is shown in Formula (4):

TIME = <RELATIVE TIME> + <DATE TIME>    (4)

where:
• RELATIVE TIME = vào (on), ngày (date), sáng (morning), hôm (today), sáng hôm (this morning), chiều (afternoon), hôm qua (yesterday), tối qua (yesterday evening), rạng sáng (early morning), tháng (month);
• DATE TIME has the format DD/MM/YYYY and is either the date expressed in the article content or the published date.

Example 3 and Example 4 illustrate the use of the Regular
Expression and the time rule to extract the time information.

Example 3: “Ngày 12/03/2012, Bộ Y tế công bố dịch cúm A H5N1 tái phát Quảng Ngãi.” (On March 12th, 2012, the Ministry of Health announced that the A H5N1 flu had hit Quang Ngai.)

Example 4: “Sáng ngày 15/01/2012, Sở Y tế Hà Nội thông báo bệnh nhân nhiễm cúm A/H5N1 tử vong” (On the morning of January 15th, 2012, the Hanoi Health Department announced that the first patient infected with A/H5N1 flu had died.)

The time information in Example 3 is captured by the Regular Expression, while in Example 4 it is extracted by Formula (4). As a result, the time information in Example 3 is March 12th, 2012, whereas in Example 4 it is the morning of January 15th, 2012.

3.4.2 Disease Extraction

Disease extraction is the second module, which captures the disease name. As shown in Figure 1, the pre-processing component tokenizes and word-segments the content of articles. As a result, each article becomes a list of words, which is the input to this module. The module uses a disease dictionary of 186 disease names.

The extraction process has two steps: finding the longest phrase that can be a name candidate, and matching the candidate against the original article to check whether it is a correct name. The finding step uses the longest matching method to match a word (in an article) with a disease name (from the disease name dictionary). If a disease name contains a given word, it becomes a disease name candidate. In the matching step, the candidate is checked against the article: a correct candidate must appear in the original article. The process is illustrated in Example 5.

Example 5: “Dịch cúm A/H5N1 bùng phát Bến Tre” (A/H5N1 flu breaks out in Ben Tre)

After tokenizing and word-segmenting, we retrieve two words related to disease: cúm (flu) and A/H5N1. The finding step matches these words with
the disease dictionary to find the longest name. As a result, for the word cúm (flu) we retrieve three names: cúm (flu), cúm A/H5N1 (A/H5N1 flu), and cúm gia cầm (bird flu), while for the word A/H5N1 we have only one name: cúm A/H5N1 (A/H5N1 flu). In the next step, the matching process checks these names against the original article to find the correct result. In this example, the longest item is cúm gia cầm (bird flu), but it does not appear in Example 5, so it is ignored. The second longest is cúm A/H5N1 (A/H5N1 flu), and the matching process finds it in the original article. It is therefore the correct disease name, and the value of the disease information is cúm A/H5N1 (A/H5N1 flu).

3.4.3 Location Extraction

Building the final module is more challenging than the two previous ones because of the ambiguity among locations. Several places can have the same proper name (e.g., "Dong Hai" town is a location in both "Tra Vinh" and "Quang Ninh" provinces). Therefore, if a news article does not mention locations clearly, the place information can be confused. To deal with this issue, we combined NER and a location dictionary to improve the performance of location extraction.

The location extraction process has three steps: NER, location extraction, and normalization. First, the NER tool is applied to detect location entities in a given news article; as a result, locations in the article are enclosed in a pair of opening and closing location tags. Second, we extract the locations based on these tags. Finally, each location is normalized by looking it up in the location dictionary, which is described below.

We used a location dictionary organized as a taxonomy, shown in Figure 4, where:
• T is the abbreviation of town;
• C is the abbreviation of commune.

In this taxonomy, the highest level is the root node; level 1 represents the 63 provinces; the 692 districts are in level 2; and the 11,101 towns and communes are represented by nodes in the
represented by nodes in the level If a phrase inside the and tag is matched with the value of a node, then current node http://jvntextpro.sourceforge.net Figure 4: The location dictionary taxonomy is marked and complete location is the path from the current node to the root node Obviously, this organization is efficient to identify the relation between communes, towns, and provinces and helps to avoid the geo-ambiguity The efficiency of the taxonomy is showed in Example Example 6: “Ngày 12/04/2013, Sở y tế Quảng Ngãi thông báo dịch cúm A H5N1 bùng phát thị trấn Sông Vệ (On April 12th , 2013, Department of Health of Quang Ngai announced a A H5N1 flu outbreak in Song Ve town) This example mentions only the town where the A H5N1 flu outbreaks ("Song Ve" town), while the district and the province are absent In the process of location extraction, this sentence is parsed by the NER, and "Song Ve" is labeled by the and tags, while "Quang Ngai" is recognized as the organization entity () As the common way, after retrieving the location (inside and tags), "Song Ve" should be the location information But "Song Ve" does not have enough information to become a real location on a GIS map, since it is not complete In order to solve this problem, we looked up "Song Ve" in the location taxonomy When the node having this value is found, we traversed from this node to the root node in the taxonomy to extract the complete information (i.e "Song Ve" town, "Tu Nghia" district, and "Quang Ngai" province) This step is called location normalization Finally, the extracted time, disease name, and locations from the article are combined to create an infectious disease event in which the set of locations found in this module comprises the place component of the event The event is stored in an event database which is used for the visualization component in a real-time monitoring system 4.1 EXPERIMENTS AND RESULTS Data Preparation Our data is retrieved from "Báo mới" news website 10 because 
"Báo mới" automatically crawls a large number of news articles (per day) from most of famous Vietnamese websites, hence, it is a good data source After crawling, we had a dataset (denoted as raw dataset) of 3,842,137 news articles Elements of a news article (after pre-processing step) are showed in Table After crawling the data, we used Pattern (2) and Pattern (3) to filter and got a set of 1,668 disease related news articles We denotes the set of 1,668 articles as Filtered dataset for later reference In our study, experiments are conducted on two important components: Event Detector and Event Extractor, which are 10 http://www.baomoi.com/ 143 Table 3: News article’s elements Element Description Title The title of the article Abstract The short paragraph what summaries articles’ content Published time Time when the news is published It supports for time extraction process Link The URL of the article Content The content of the article Table 4: The error rate of the data filter module Incorrect articles Total Error Rate (%) 175 486 36 described in detail in Section 4.2 and Section 4.3 4.2 Data Filter Evaluation The data filter is the first module in the event detection which filters articles from the data crawler component As we mentioned above, this module uses Pattern (2) and Pattern (3) to filter articles, so the performance of this module depends on the coverage of rules of the two patterns Normally, we must evaluate the precision of Pattern and Pattern on the whole dataset (about 3,842,137 news articles), but this approach is very costly because we have to label them manually To evaluate the performance of this module, we randomly selected 486 articles from raw dataset to manually check the error rate The error rate was calculated using the Formula (5), and the results are showed in Table The results show that the error rate is high or the accuracy is low due to the fact that it filtered all the articles related to diseases in which a large number of articles 
did not present disease events (the detail of this issue will be discussed in Section 4.4) We accept this to gain high recall, and the overall performance will be improved by subsequent phases ErrorRate = #incorrect total (5) where: • #incorrect is the number of articles which are not related to disease • total is total number of articles 4.2.2 Fold 10 Avg P 80,56 72,13 81,90 79,73 73,94 69,95 73,58 71,33 72,37 75,26 75,07 R 87,88 75,86 84,31 84,29 81,88 73,34 75,73 80,24 76,92 77,15 79,76 F-1 84,06 73,95 83,09 81,95 77,71 71,60 74,64 75,52 74,58 76,19 77,33 P 72,22 73,97 80,00 72,92 75,14 70,89 71,76 70,00 67,27 69,37 72,35 R 76,47 79,41 83,81 78,36 78,98 76,65 75,20 75,51 80,57 73,48 77,84 F-1 74,29 76,59 81,86 75,54 77,01 73,66 73,44 72,65 73,32 71,36 74,97 Event Detection Evaluation As we mentioned above, the Event Detector has two modules named the data filter and the classifier Therefore, we will evaluate performance of this component based on these two modules 4.2.1 Table 5: The comparison of Experiment a and Experiment b Experiment a Experiment b Classification Evaluation We carried two experiments to evaluate performance of the classification, namely, Experiment a which combines rules and machine learning, and Experiment b which uses only machine learning The measures used to evaluate the performance of this modules are precision, recall, and F-score based on the 10-fold cross validation In the Experiment a, we randomly selected 686 articles from the Filtered dataset and tagged them as EVENT or NOT_EVENT We denoted this set as Experiment a dataset In the Experiment b, we selected 50 more articles from the raw dataset, and added them to Experiment a dataset to form Experiment b dataset After preparing the training dataset, we compare the performance of the two experiments The comparison of two classifiers is showed in Table where the results of Experiment b are in three columns on the right, while the results of the Experiment a are showed in three columns 
on the left. The average F-scores indicate that the classifier in Experiment a outperforms that of Experiment b by ≈2.36%. The difference between the two classifiers is not big because we added only 50 articles to the Experiment a dataset; it would likely be larger if more raw articles were added to the Experiment b dataset.

4.3 Event Extraction Evaluation

Because an infectious disease event E is defined as a tuple of name, time, and place, as given in Formula (1), a correct event must contain all elements. When the time of an event is not clearly mentioned in the text, we use the published date of the article as the time of the event. If a disease event lacks either a disease name or locations, it is considered a false event.

To evaluate the event extraction step, we carried out two experiments: Experiment c (abbreviated Expr c), which uses rules, and Experiment d (abbreviated Expr d), which uses both rules and NER. The dataset used in both experiments consists of 152 news articles selected from the set of articles returned by the event detector. We use three measures, Precision (P), Recall (R), and F-score (F), to compare the performance of the two experiments. These measures are defined in Formulas (6), (7), and (8):

P = #correct / (#correct + #incorrect)    (6)

where:
• #correct is the number of correct disease events;
• #incorrect is the number of incorrect disease events.

R = #correct / (#correct + #not_found)    (7)

where:
• #correct is the number of correct disease events;
• #not_found is the number of disease events that the model did not recognize.

F = (2 × P × R) / (P + R)    (8)

Table 6: The comparison of Experiment c and Experiment d
Name     Correct   Incorrect   P (%)   R (%)   F (%)
Expr c   127       25          83.55   92.02   87.58
Expr d   136       16          89.47   94.44   91.89

Based on Formulas (6), (7), and (8), we compare the performance of Experiment c and Experiment d. The comparison is shown
in Table 6, where the second row is the result of Experiment c and the third row is the result of Experiment d. In Experiment c, the F-score is ≈87.58%, while it is ≈91.89% in Experiment d. The result shows that the F-score of Experiment d improves by ≈4.31% in comparison with that of Experiment c. The cause of the difference between the two experiments is explained in the next section.

4.4 Error Analysis and Discussion

In the Event Detector component, the results in Table 4 suggest that there is confusion in the data filter module. To find the cause, we manually checked articles selected from the dataset used in Section 4.2.1. The analysis indicates that, in the error cases, some rules of Pattern (2) and Pattern (3) do not filter articles effectively. The reason is that several topics can share a verb. For instance, the verb phrase "tử vong" (die) may belong to either the disease topic or the treatment topic. If this verb appears in an article, the data filter module considers the article related to a disease event, although it may in fact be about treatment, as illustrated in Example 7.

Example 7: Uống thuốc hạ sốt sau 30 phút bệnh nhân tử vong (The patient died 30 minutes after taking fever medication)

This sentence is captured by the Pattern (2) rule "bệnh nhân # tử vong" (patient # died), but the cause of death is related to the medication rather than a disease. Moreover, some rules of Pattern (3) (which combines a disease name and a verb phrase) confuse a disease event with a topic that is merely related to a disease, as shown in Example 8.

Example 8: "Phát chủng virus gây bệnh tay chân miệng" (A new strain of the virus causing hand, foot, and mouth disease has been discovered)

The Pattern (3) rule "tay chân miệng # phát hiện" (hand, foot, and mouth # detect) captures this sentence, but it reports the discovery of a new virus strain rather than a disease event.

For the Event Extractor component, the results in Table 6
indicate that the precision of Experiment d is ≈5,92% higher than that of Experiment c. At first, we were surprised by this result, because Experiment c uses a rule-based method to capture the information of an event, and rules (a knowledge-driven method) normally achieve high accuracy. To find the source of the errors in event extraction, we manually checked the incorrect articles in the two experiments (mentioned in Section 4.3). The results of this investigation are shown in Table 7 and Table 8, respectively.

The statistics in Table 7 and Table 8 indicate that the errors in both experiments originated from location extraction and, in some cases, from disease extraction. In Experiment c, we found that the rules used to extract locations did not cover all cases. In a few cases, if the location information is abbreviated, the rules cannot recognize it, as illustrated in Example 9.

Example 9: “Phát hiện trường hợp bệnh nhân nhiễm cúm A H5N1 P.7, Q.8, TP HCM.” (We discovered a patient infected with A H5N1 flu in ward 7, district 8, HCM city)

In this example, ward 7, district 8, and Ho Chi Minh city are abbreviated as P.7, Q.8, and TP HCM; therefore, the rules cannot recognize the location information.

In Experiment d, the main cause of the reduced precision of location extraction is the performance of the NER tool. In a few cases, it did not detect locations exactly because place names were abbreviated in the articles (as with the rule-based method). In some other cases, it misrecognized a location as an organization, as shown in Example 10.

Example 10: “Ngày 12/03/2012, dịch tiêu chảy cấp bùng phát Hà Nội, Hải Phòng, Quảng Ninh, Bến Tre, Cần Thơ.” (On March 12, 2012, cholera broke out in Hanoi, Hai Phong, Quang Ninh, Ben Tre, and Can Tho)

In this example, "Hanoi", "Hai Phong", "Quang Ninh", "Ben Tre", and "Can Tho" are recognized as organizations (tagged with organization tags), which would be ignored during processing.

In both
Experiment c and Experiment d, some extracted disease names were incorrect because they are not in the disease dictionary. Moreover, the disease dictionary contains some names that coincide with the symptoms of a disease, which confuses the disease extraction module. For instance, in Table 7, the disease name A/H1N1 flu in the 89th article is detected as pneumonia, while pneumonia is a symptom of the A/H1N1 flu.

In addition, several factors have a negative effect on event extraction. Firstly, misspellings of locations in articles reduce the performance of location extraction. For instance, "Đắk Lắk" is written as "Đắc Lắc", but "Đắc Lắc" does not appear in the location dictionary, so the location information can be missed. Secondly, if locations are not described precisely, as in “các huyện phía Tây tỉnh Bến Tre” (the western districts of "Ben Tre" province), the NER utility cannot recognize them. Finally, another important cause is geo-ambiguity, which reduces the precision of the event extraction component. One proper name can refer to several places; if the disease news articles do not mention the places clearly, the location information can be confused. Geo-ambiguity is shown in Example 11.

Example 11: “Ngày 05/10/2012, Sở Y tế Quảng Ninh thông báo phát hiện vi khuẩn tả thị trấn Đông Hải” (On October 5, 2012, Quang Ninh Department of Health announced the detection of cholera in the Dong Hai town)

In this example, "Dong Hai" town is a location in both "Tra Vinh" and "Quang Ninh" provinces, but the article only mentions Dong Hai town, so the module failed to decide whether the disease outbreak was in "Quang Ninh" or "Tra Vinh".

Another error source was the incomplete recognition of locations, i.e., only some parts of a location were detected, as shown in row 4 of Table 7 (where only the Nam Dinh province was detected) and row 11 of Table 8 (where only Binh Duong province was recognized). The last error source was the case in which a location mentioned in the text was not the outbreak place. This misled the location module into extracting incorrect information, as depicted in Table 7 and Table 8.

Table 7: The errors in Experiment c (15 of 25 errors)
No | Doc ID | Error Detail: Correct Information | Error Detail: Extracted Information
1 |  | Congo | NULL
2 |  | Kon-Plong Ly District, Pray Veng Province | NULL
3 | 13 | Ward 6, District 8, Ward 14, District 5, District 8, Ward 7, Binh Thanh District, Hoc Mon District | Ho Chi Minh City
4 | 17 | zones 1, Ngo Dong town, Giao Thuy District, Nam Đinh | Nam Đinh
5 | 24 | Hand, foot and mouth | Dengue
6 | 26 | Ward 8, District 5, Ho Chi Minh City | Long Bien District
7 | 32 | Ward 7, District 8, HCM City | NULL
8 | 64 | Typhus | Dengue
9 | 65 | group 3, Tran Hung Dao Ward, Kon Tum City | Da Nang
10 | 79 | Hanoi | NULL
11 | 89 | A/H1N1 flu | Pneumonia (Symptom)
12 | 92 | A/H1N1 flu | Tuberculosis
13 | 96 | Ea T’ling towns and communes: Nam Dong, Tam Thang, D’Dak Rong | NULL
14 | 105 | Cholera | Acute diarrhea
15 | 108 | Tam Quan commune, Tam Dao province, Quan Noi, Quan Ngoai, Lang Chanh village, Lang Mau village, and Nhan Ly | Tam Quan commune

CONCLUSION

In this paper, we introduced a method that combines semantic rules and machine learning to extract disease events from Vietnamese webpages. The experimental results show that our method is suitable for extracting disease events in Vietnamese text. Furthermore, we briefly described our system process, with emphasis on its two key components: the Event Detector and the Event Extractor. We plan to integrate the event database into the Vn-Loc system (http://vnloc.com/), where users can follow several event types: FIRE, CRIMINAL, and TRANSPORT ACCIDENT.

However, our method needs some improvements to enhance its performance in the future. Firstly, the coverage of the semantic rules and the performance of the Maximum Entropy classifier must be enhanced by adding useful information. Secondly, the precision of event extraction can be increased by improving the performance of the NER tool.
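As one illustration of the kind of preprocessing that might help both the rules and the NER tool with the abbreviated and misspelled locations discussed in Section 4.4, the sketch below expands the abbreviations of Example 9 and maps the variant spelling "Đắc Lắc" to its dictionary form before location extraction. It is a minimal sketch and not part of the described system: the function name `normalize_location_text` and the three lookup tables are hypothetical, and the expansions assume that "P.", "Q.", and "TP" stand for "Phường" (ward), "Quận" (district), and "Thành phố" (city).

```python
import re

# Hypothetical lookup tables, assumed for illustration only (not from the paper).
UNIT_PREFIXES = {"P": "Phường", "Q": "Quận"}  # ward, district
CITY_ABBREVIATIONS = {
    "TP. HCM": "Thành phố Hồ Chí Minh",
    "TP.HCM": "Thành phố Hồ Chí Minh",
    "TP HCM": "Thành phố Hồ Chí Minh",
}
SPELLING_VARIANTS = {"Đắc Lắc": "Đắk Lắk"}  # variant spelling -> dictionary form


def normalize_location_text(text: str) -> str:
    """Expand abbreviated administrative units and unify variant spellings."""

    def expand_unit(match: re.Match) -> str:
        # "P.7" -> "Phường 7", "Q.8" -> "Quận 8"
        return f"{UNIT_PREFIXES[match.group(1)]} {match.group(2)}"

    # Match a standalone "P" or "Q", a dot, an optional space, and a number.
    text = re.sub(r"\b([PQ])\.\s?(\d+)\b", expand_unit, text)
    for abbreviation, full_name in CITY_ABBREVIATIONS.items():
        text = text.replace(abbreviation, full_name)
    for variant, canonical in SPELLING_VARIANTS.items():
        text = text.replace(variant, canonical)
    return text


if __name__ == "__main__":
    print(normalize_location_text("P.7, Q.8, TP HCM"))
```

Running the normalizer before rule matching or NER would let an unchanged location dictionary match the expanded forms; whether this actually improves precision on the 152-article dataset would have to be measured.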
Besides, the handling of geo-ambiguity and of the confusion between diseases and symptoms should be improved. Finally, relations between disease events should be considered to enhance the quality of the monitoring system.

Table 8: The errors in Experiment d
No | Doc ID | Error Detail: Correct Information | Error Detail: Extracted Information
1 | 16 | Thanh Long village, Phuoc My commune, Quy Nhon City | Binh Dinh
2 | 17 | Giao Thuy district, Nam Dinh, A (H5N1) flu | Nam Đinh, Flu
3 | 21 | Me So, Van Giang, Hung Yen | Hung Yen
4 | 23 | Ba Ria - Vung Tau | NULL
5 | 25 | village, Hoa An commune, Krong Pac district, Dak Lak | Hoa An Commune, Chiem Hoa district, Tuyen Quang
6 | 26 | Ward 8, District 5, Ho Chi Minh City (P.8, Q.5, TP HCM) | NULL
7 | 32 | Ward 7, District 8, HCM City (P.7, Q.8, TP HCM) | NULL
8 | 39 | Mo Cay Nam, Mo Cay Bac, Giong Trom, Thanh Phu, Chau Thanh, Ba Tri, Cho Lach | Ben Tre
9 | 40 | Ward 6, District 8 (P.6, Q.8) | TP HCM
10 | 45 | Hung Yen, Yen Dinh, Thanh Hoa, Vinh Phuc, Ba Dinh, Hanoi | Hanoi, Vinh Phuc
11 | 46 | Thuan An, Di An, Ben Cat District, Thu Dau Mot Town, Binh Duong | Binh Duong
12 | 47 | Kim Long and Huong Long Ward, Hue City | NULL
13 | 69 | Tan An Hoi Village, Cu Chi District, HCM City | NULL
14 | 84 | Thanh Binh Ward, Hai Chau district, Da Nang city, Dak Lak | Ward Thanh Binh, Ninh Binh City, Ninh Binh City, Da Nang, Hai Chau District
15 | 106 | District 7, Tan Binh District | District
16 | 109 | Hoang Mai District, Hai Ba Trung, Thanh Xuan, Hoan Kiem District, Thanh Tri, Dong Da, Quang Ninh, Bac Giang, Nam Dinh, Thai Binh, Ha Nam, Hung Yen | Hanoi

REFERENCES

[1] Mai-Vu Tran, Minh Hoang Nguyen, Sy-Quan Nguyen, Minh-Tien Nguyen, and Xuan-Hieu Phan. "VnLoc: A Real-Time News Event Extraction Framework for Vietnamese". KSE, pp. 161-166, 2012.
[2] Hogenboom Frederik, et al. "An Overview of Event Extraction from Text". Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011) at Tenth International Semantic Web Conference (ISWC 2011). Vol. 779. 2011.
[3] Collier Nigel, et al. "An Ontology-driven System for Detecting Global Health Events". In Proceedings of the 23rd International Conference on Computational Linguistics (pp. 215-222). Association for Computational Linguistics.
[4] Volkova Svitlana, et al. "Animal Disease Event Recognition and Classification". Proceedings of the First International Workshop on Web Science and Information Exchange in the Medical Web (MedEx 2010). 2010.
[5] Doan S., Hung-Ngo Q., Kawazoe A., and Collier N. "Global Health Monitor - a Web-based System for Detecting and Mapping Infectious Diseases". Proc. International Joint Conference on Natural Language Processing (IJCNLP), Companion Volume, Hyderabad, India, January 7-12, pp. 951-956, 2008.
[6] Freifeld Clark C., et al. "HealthMap: Global Infectious Disease Monitoring through Automated Classification and Visualization of Internet Media Reports". Journal of the American Medical Informatics Association 15.2 (2008): 150-157.
[7] Doddington George R., et al. "The Automatic Content Extraction (ACE) Program – Tasks, Data, and Evaluation". LREC 2004.
[8] Grishman Ralph, Silja Huttunen, and Roman Yangarber. "Real-Time Event Extraction for Infectious Disease Outbreaks". Proceedings of the second international conference on Human Language Technology Research. Morgan Kaufmann Publishers Inc., 2002.
[9] Grishman Ralph, Silja Huttunen, and Roman Yangarber. "Information extraction for enhanced access to disease outbreak reports". Journal of Biomedical Informatics (JBI), Vol. 35, No. 4, pp. 236-246, 2002.
[10] Allan James, Ron Papka, and Victor Lavrenko. "On-line new event detection and tracking". Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 1998.
[11] Grishman Ralph, and Beth Sundheim. "Message understanding conference-6: a brief history". COLING, Vol. 1, pp. 466–471, 1996.