Electronic Business: Concepts, Methodologies, Tools, and Applications (4 Volumes)

Automatically Extracting and Tagging Business Information for E-Business Systems

marketplaces in which they compete. The World Wide Web is a rich but unmanageably huge source of human-readable business information: some novel, accurate, and relevant; some repetitive, wrong, or out of date. As the flood of Web documents tops 11.5 billion pages and continues to rise (Gulli & Signorini, 2005), the human task of grasping the business information it bears seems more and more hopeless. Today's Really Simple Syndication (RSS) news syndication and aggregation tools provide only marginal relief to information-hungry, document-weary managers and investors.

In the envisioned Semantic Web, business information will come with handles (semantic tags) that computers can intelligently grab onto, to perform tasks in the business-to-business (B2B), business-to-consumer (B2C), and consumer-to-consumer (C2C) environments. Semantic encoding and decoding is a difficult problem for computers, however, as any very expressive language, for example, English, provides a large number of equally valid ways to represent a given concept. Further, phrases in most natural (i.e., human) languages tend to have a number of different possible meanings (semantics), with the correct meaning determined by context. This is especially challenging for computers. As a standard artificial language emerges, computers will become semantically enabled, but humans will face a monumental encoding task. For e-business applications, it will no longer be sufficient to publish accurate business information on the Web in, say, English or Spanish. Rather, that information will have to be encoded into the artificial language of the Semantic Web, another time-consuming, tedious, and error-prone process.
Pre-standard Semantic Web creation and editing tools are already emerging to assist early adopters with Semantic Web publishing, but even as the tools and technologies stabilize, many businesses will be slow to follow. Furthermore, a great deal of textual data in the pre-Semantic Web contains valuable business information, floating there along with the out-dated debris. However, the new Web vessels (automated agents) cannot navigate this old-style information. If the rising sea of human-readable knowledge on the Web is to be tapped, and streams of it purified for computer consumption, e-business systems must be developed to process this information, package it, and distribute it to decision makers in time for competitive action. Tools that can automatically extract and semantically tag business information from natural language texts will thus comprise an important component of both the e-business systems of tomorrow, and the Semantic Web of the day after.

In this chapter, we give some background on the Semantic Web, ontologies, and the valuable sources of Web information available for e-business applications. We then describe how textual information can be extracted to produce XML files automatically. Finally, we discuss future trends for this research and conclude.

BACKGROUND

The World Wide Web Consortium (W3C) is leading efforts to standardize languages for knowledge representation on the Semantic Web and is developing tools that can verify that a given document is grammatically correct according to those standards. The XML standard, already widely adopted commercially as a data interchange format, forms the syntactic base for this layered framework. XML is semantically neutral, so the resource description framework (RDF) adds a protocol for defining semantic relationships between XML-encoded data components.
The Web ontology language (OWL) adds to RDF tools for defining more sophisticated semantic constructs (classes, relationships, constraints), still using the RDF-constrained XML syntax. Computers can be programmed to parse the XML syntax, find RDF-encoded semantic relationships, and resolve meanings by looking for equivalence relationships as defined by OWL-based vocabularies or ontologies.

Ontologies are virtual dictionaries that formally define the meanings of relevant concepts. Ontologies may be foundational (general), or domain specific, and are often specified hierarchically, relating concepts to one another via their attributes. As ontologies emerge across the Semantic Web, many will overlap, and different terms will come to define any given concept. Semantic maps will be built to relate the same concepts defined differently from one ontology to another (Doan, Madhavan, Domingos, & Halevy, 2002).

Software programs called intelligent agents will be built to navigate the Semantic Web, searching not only for keywords or phrases, but also for concepts semantically encoded into Web documents (Berners-Lee, Hendler, & Lassila, 2001). They may also find semantic content by negotiating with semantically enhanced Web services, which Medjahed, Bouguettaya, and Elmagarmid define as sets "of functionalities that can be programmatically accessed through the Web" (p. 333). Web services may process information from domain-specific knowledge bases, and the facts in these knowledge bases may, in turn, be represented in terms of an ontology from the same domain. An important tool for constructing domain models and knowledge-based applications with ontologies is Protégé (n.d.). Protégé is a free, open-source platform.

Ontologies are somewhat static, and should be created carefully by domain experts. Knowledge bases, while structurally static, should have dynamic content.
That is, to be useful, especially in the competitive realm of business, they should be continually updated with the latest, best-known information in the domain and regularly purged of knowledge that has become stale or been proven wrong. In business domains, the world evolves quickly, and processing the torrents of information describing that evolution is a daunting task.

Much of the emerging information about the business world is published online daily in government reports, financial reports such as those in the electronic data gathering, analysis, and retrieval (EDGAR) system database, and Web articles by such sources as the Wall Street Journal (WSJ), Reuters, and the Associated Press. Such sources contain a great deal of information, but in forms that computers cannot use directly. They therefore need to be processed by people before the facts can be put into a database. It is desirable, but impossible for a person, and expensive for a company, to retrieve, read, and synthesize all of the day's Web news from a given domain and enter the resulting knowledge into a knowledge base to support the company's decision making for that day. While the protocols and information retrieval technologies of the Web make these articles reachable by computer, they are written for human consumption and still lack the semantic tags that would allow computers to process their content easily. It is a difficult proposition to teach a computer to correctly read (syntactically parse) natural language texts and correctly interpret (semantically parse) all that is encoded there. However, automatically learning even some of the daily emerging facts underlying Web news articles could provide enough competitive advantage to justify the effort.
We envision the emergence of e-business services, based on knowledge bases fed from a variety of Web news sources, which serve this knowledge to subscribing customers in a variety of ways, including both semantic and nonsemantic Web services.

One domain of great interest to investors is that dealing with the earnings performance and forecasts of companies. Many firms provide market analyses on a variety of publicly traded corporations. However, profit margins drive their choices of which companies to analyze, leaving over half of the 10,000 or so publicly traded U.S. companies unanalyzed (Berkeley, 2002). Building tools that automatically parse the earnings statements of these thousands of unanalyzed smaller companies, and that convert these statements into XML for Web distribution, would benefit investors and those companies themselves, whose public exposure would increase, and whose disclosures to regulatory agencies would be eased.

A number of XML-based languages and ontologies have been developed and proposed as standards for representing such semantic information in the financial services industry, but most have struggled to achieve wide adoption. Examples include News Markup Language (NewsML) (news), Financial products Markup Language (FpML) (derivatives), Investment Research Markup Language (IRML) (investment research), and the Financial Exchange Framework (FEF) Ontology (FEF: Financial Ontology, 2003; Market Data Markup Language, 2000). However, the Extensible Business Reporting Language (XBRL), an XML-derivative, has been emerging over the last several years as an e-business standard format for electronic financial reporting, having enjoyed early endorsement by such industry giants as NASDAQ, Microsoft, and PricewaterhouseCoopers (Berkeley, 2002). By 2005, the U.S. Securities and Exchange Commission (SEC) had begun accepting voluntary financial filings in XBRL, the Federal Deposit Insurance Corporation (FDIC) was requiring XBRL reporting, and a growing number of publicly traded corporations were producing financial statements in XBRL (XBRL, 2006).

We present a prototype system that uses natural language processing techniques to perform information extraction of specific types of facts from corporate earnings articles of the Wall Street Journal. These facts are represented in template form to demonstrate their structured nature and converted into XBRL for Web portability.

EXTRACTING INFORMATION FROM ONLINE ARTICLES

This section discusses the process of generating XML-formatted files from online documents. Our system, Flexible Information extRaction SysTem (FIRST), analyzes online documents from the WSJ using syntactic and simple semantic analysis (Hale, Conlon, McCready, Lukose, & Vinjamur, 2005; Lukose, Mathew, Conlon, & Lawhead, 2004; Vinjamur, Conlon, Lukose, McCready, & Hale, 2005). Syntactic analysis helps FIRST to detect sentence structure, while semantic analysis helps FIRST to identify the concepts that are represented by different terms. The overall process is shown in Figure 1. This section starts with a discussion of the information extraction literature. Later, we discuss how FIRST extracts information from online documents to produce XML-formatted files.

Information Extraction

The explosion of textual information on the Web requires new technologies that can recognize information originally structured for human consumption rather than for data processing. Research in artificial intelligence (AI) has been trying to find ways to help computers process tasks which would otherwise require human judgment. Natural language processing (NLP), a sub-area of AI, is a research area that deals with spoken and written human languages.
NLP subareas include machine translation, natural language interfaces, language understanding, and text generation. Since NLP tasks are very difficult, few NLP application areas have been developed commercially. Currently, the most successful applications are grammar checking and machine translation programs.

Figure 1. Information extraction and XML tagging process

To deal with textual data, information systems need to be able to understand the documents they read. Information extraction (IE) research has sought automated ways to recognize and convert information from textual data into more structured, computer-friendly formats, such as display templates or database relations (Cardie, 1997; Cowie & Lehnert, 1996).

Many business areas can benefit from IE research, such as underwriting, clustering, and extracting information from financial documents. Some previous IE research prototypes include System for Conceptual Information Summarization, Organization, and Retrieval (SCISOR) (Jacobs & Rau, 1990), EDGAR-Analyzer (Gerdes, 2003), and Edgar2xml (Leinnemann, Schlottmann, Seese, & Stuempert, 2001). Moens, Uyttendaele, and Dumortier (2000) researched the extraction of information from databases of court decisions.

The major research organization promoting information extraction technology is the Message Understanding Conference (MUC). MUC's original goals were to evaluate and support research on the automation and analysis of military messages containing textual information.

IE systems' input documents are normally domain specific (Cardie, 1997; Cowie & Lehnert, 1996). Generally, documents from the same publisher, reporting stories in the same domain, have similar formats and use common vocabularies for expressing certain types of facts, styles that people can detect as patterns.
If knowledge engineers who build computer systems team up with subject matter experts who are fluent in the information types and expression patterns of the domain, computer systems can be built to look for the concepts represented by these familiar patterns. Humans do this now, but computers will be able to do it much faster.

Unfortunately, the extraction process presents many difficulties. One involves the syntactic structure of sentences, and another involves inferring sentence meanings. For example, it is quite easy for a human to recognize that the sentences "The Dow Jones industrial average is down 2.7%" and "The Dow Jones industrial average dipped 2.7%" are semantically synonymous, though slightly different. For a computer to extract the same meaning from the two different representations, it must first be taught to parse the sentences, and then taught which words or phrases are synonyms. Also, just as children learn to recognize which sentences in a paragraph are the topic or key sentences, computers must also be taught how to recognize which sentences in a text are paramount versus which are simply expository. Once these key sentences are found, the computer programs will extract the vital information from them for inclusion in templates or databases.

There are two major approaches to building information extraction systems: the knowledge engineering approach and the automatic training approach (Appelt & Israel, 1999). In the knowledge engineering approach, knowledge engineers employ their own understanding of natural language, along with the domain expertise they extract from subject matter experts, to build rules which allow computer programs to extract information from text documents. With this approach, the grammars are generated manually, and written patterns are discovered by a human expert, analyzing a corpus of text documents from the domain.
This becomes quite labor-intensive as the size, number, and stylistic variety of these training texts grow (Appelt & Israel, 1999).

Unlike the knowledge engineering approach, the automatic training approach does not require computer experts who know how IE systems work or how to write rules. A subject matter expert annotates the training corpus. Corpus statistics or rules are then derived automatically from the training data and used to process novel data. Since this technique requires large volumes of training data, finding enough training data can be difficult (Appelt & Israel, 1999; Manning & Schutze, 2002). Research using this approach includes Neus, Castell, and Martín (2003).

Advanced research in information extraction appears in journals and conferences run by several AI and NLP organizations, such as the MUC, the Association for Computational Linguistics (ACL) (www.aclweb.org/), the International Joint Conference on Artificial Intelligence (IJCAI) (http://ijcai.org/), and the American Association for Artificial Intelligence (AAAI) (http://www.aaai.org/).

FIRST: Flexible Information extRaction SysTem

This section discusses our experimental system FIRST. FIRST extracts information from financial documents to produce XML files for other e-business applications.

According to Appelt and Israel (1999), the knowledge engineering approach performs best when linguistic resources such as lexicons are available, when knowledge engineers who can write rules are available, and when training data is sparse and expensive to find. Based on these constraints, our system, FIRST, employs the knowledge engineering approach. FIRST is an experimental system for extracting semantic facts from online documents. Currently, FIRST works in the domain of finance, extracting primarily from the WSJ.
The inputs to FIRST are news articles, while the output is the information in an explicit form contained in a template. After the extraction process is completed, this information can be put into a database or converted into an XML-formatted file. Figure 2 shows FIRST's system architecture.

Figure 2. System architecture of FIRST

FIRST is built in two phases: the build phase and the functional phase. The build phase uses resources such as the training documents and some tools, such as a KeyWord In Context (KWIC) index builder (Luhn, 1960), the CMU-SLM toolkit (Clarkson & Rosenfeld, 1997; Clarkson & Rosenfeld, 1999), and a part-of-speech tagger, to analyze patterns in the documents from our area of interest. Through the knowledge engineering process, we learn how the authors of the articles write the stories, that is, how they tend to phrase recurring facts of the same type. We employ these recurring patterns to create rules which FIRST uses to extract information from new Web articles.

In addition to detecting recurring patterns, we use lexical semantic relation information from WordNet (Fellbaum, 1998; Miller, Beckwith, Fellbaum, Gross, & Miller, 1990; Miller, 1995) to expand the set of keywords to include additional relevant terms that share semantic relationships with the original keywords. The following subsection describes our corpus, the KWIC index generator, the CMU-SLM toolkit, and the part-of-speech tagger. WordNet, which contains information on lexical semantic relations, is discussed after that.

The Corpus and Rule Extraction Process

To generate rules that enable FIRST to extract information from online documents, we look for written patterns in a number of articles in the same domain. FIRST's current goal is to extract information from the WSJ in the domain of corporate finance. Figure 3 shows a sample WSJ document published in 1987.
We use articles from the WSJ written in 1987 as a training data set to help us find patterns in the articles. Each article is tagged using Standard Generalized Markup Language (SGML). SGML is an international standard for the definition of device-independent, system-independent methods of representing texts in electronic form. These tags include information about, for example, the document number, the headline, the date, the document source, and the text.

Figure 3. A sample WSJ document published in 1987

Since there are many ways to express the same sentence meaning, we have to look at as many patterns as possible. We generate a KWIC index to see the relationships between potential keywords and other words in the corpus sentences.

A KWIC index file is created by putting each word into a field in the database. After that, the first word is removed and each remaining word is shifted one field to the left in the row. The process continues until the last word in the sentence is put into the first position. Figure 4 shows part of the KWIC index, the first words of each row, for the sentence "Patten Corp. said it is negotiating a possible joint venture with Hearst Corp. of New York and Anglo-French investor Sir James Goldsmith to sell land on the East Coast."

We have generated more than 5 million rows of data. When many sentences are generated in the file, we look at the key terms that we believe may be used to express important information, namely the specific types of information we aim to extract. For example, suppose we believe that the word sale will lead to important information about stock prices, but we are not sure how other words relate to the word sale.
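The rotation procedure described above for building a KWIC index can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FIRST's actual implementation: the function name `kwic_rows`, the row width of five fields, and the whitespace tokenization are all hypothetical choices, and the input is just the opening words of the Patten Corp. sentence.

```python
from typing import List


def kwic_rows(sentence: str, width: int = 5) -> List[List[str]]:
    """Build KWIC rows for one sentence by repeated left shifts.

    Row 0 starts at the first word; each later row drops the leading
    word, so every word of the sentence appears once in the first field.
    The default width of 5 fields is an assumption for illustration.
    """
    words = sentence.split()  # naive whitespace tokenization (assumption)
    rows = []
    for start in range(len(words)):
        window = words[start:start + width]
        # Pad short rows so that every row has the same number of fields.
        window += [""] * (width - len(window))
        rows.append(window)
    return rows


# Opening words of the example sentence quoted in the text.
sentence = "Patten Corp. said it is negotiating a possible joint venture"
rows = kwic_rows(sentence)
for row in rows[:3]:
    print(row)
```

Each row could then be stored as one database record, with the fields acting as the word columns that a keyword query filters on.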
We therefore select all the rows in the database that contain the word sale, using the following structured query language (SQL) statement:

    Select W1, W2, W3, W4, W5
    From WSJ_1987
    Where W1 like 'sale%'
    Order by W1, W2;

Many rows are returned from this SQL statement. Some rows are useful and show interesting patterns but some are not. Figure 5 shows some sample rows that have the word sales appearing in column 1. Using this technique, we are able to find several patterns within which the word sales appears.

Figure 4. Sample rows from a KWIC index file

Figure 5. Sample rows from KWIC index file where "sales" appears in column 1

We also look for patterns using n-gram data produced by the Carnegie Mellon Statistical Language Modeling (CMU-SLM) Toolkit (http://www.speech.cs.cmu.edu/SLM_info.html). The CMU-SLM toolkit provides several functions, including word frequency lists and vocabularies, word bigram and trigram counts, bigram- and trigram-related statistics, and various back off bigram and trigram language models. Table 1 shows some 3-gram and 4-gram data. There are two types of n-gram patterns we are interested in. For a word such as sales, for example:

• Patterns where sales is the first term, with n-1 words after it:
    sales declined 42%, to $53.4
    sales declined to $475.6 million
• Patterns where sales is the last word, with n-1 words before it:
    increase of 50% in sales
    10% increase in the sales

These patterns help us to generate rules for information extraction. FIRST uses a simple rule, stated under "Extraction Rule to Identify Financial Status," to extract information about financial items such as sales and their relation to financial status words such as increase or decrease.
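Before the rule is stated, the two n-gram pattern types listed above can be illustrated with a small Python sketch. It does not use the CMU-SLM toolkit; the function name `keyword_ngrams` and the whitespace tokenization are assumptions made for the example, and the four input strings are simply the pattern fragments quoted above rather than real corpus sentences.

```python
from collections import Counter
from typing import Iterable, Tuple


def keyword_ngrams(sentences: Iterable[str], keyword: str, n: int) -> Tuple[Counter, Counter]:
    """Count n-grams in which the keyword is the first or the last term."""
    first, last = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive tokenization (assumption)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram[0] == keyword:
                first[gram] += 1  # keyword followed by n-1 words
            if gram[-1] == keyword:
                last[gram] += 1   # keyword preceded by n-1 words
    return first, last


# The pattern fragments quoted in the text, used here as a toy corpus.
corpus = [
    "sales declined 42%, to $53.4",
    "sales declined to $475.6 million",
    "increase of 50% in sales",
    "10% increase in the sales",
]
first, last = keyword_ngrams(corpus, "sales", 3)
print(first.most_common())
print(last.most_common())
```

On a real corpus, the most frequent entries in each counter would be the recurring phrasings that a knowledge engineer inspects when writing extraction rules.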
Extraction Rule to Identify Financial Status (increase, decrease…)

• Example of a proximity rule for financial status:

    Let n1 be the optimal proximity within which a financial item appears before financial status
    Let n2 be the optimal proximity within which a financial item appears after financial status
    for each financial item status keyword
        if a financial item is present within n1 words before the keyword
        or a financial item is present within n2 words after the keyword
        then consider the keyword as a possible candidate for financial status
        end if
    end for

Table 1. Sample 3-grams and 4-grams

Thus, for the sentence "sales declined 42%, to $53.4," FIRST fills the slots in the template as:

    Financial Item: sales
    Financial Status: decline
    Percentage Change: 42%
    Change Description: to $53.4

Syntactic Analysis

Syntactic analysis helps FIRST to identify the role of each word and phrase in a sentence. It tells whether a word or phrase functions as subject, verb, object, or modifier. The sentence "Jim works for a big bank in NY," for example, has the following sentence structure, or parse tree. (See Exhibit A.)

This parse tree shows that the subject of the sentence is "Jim," the verb is the present tense of "work," and the modifier is the prepositional phrase "for a big bank in NY."

In general, most natural language processing systems require a sophisticated knowledge base, the lexicon (a list of words or lexical entries), and a set of grammar rules. While it would be possible to use a full-fledged parser to incorporate grammar rules into a system like ours, we felt that a part-of-speech tagger would be more robust. Specifically, we use the Lingua-EN-Tagger (n.d.) as a tool for part-of-speech tagging. Lingua-EN-Tagger is a probability-based, corpus-trained tagger.
It assigns part-of-speech tags based on a lookup dictionary and a set of probability values (http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm).

The sentence "Campbell shares were down $1.04, or 3.5 percent, at $28.45 on the New York Stock Exchange on Friday morning," for example, is tagged by Lingua-EN-Tagger (n.d.) as:

    <nnp>Campbell</nnp> <nns>shares</nns> <vbd>were</vbd> <rb>down</rb> <ppd>$</ppd> <cd>1.04</cd> <ppc>,</ppc> <cc>or</cc> <cd>3.5</cd> <nn>percent</nn> <ppc>,</ppc> <in>at</in> <ppd>$</ppd> <cd>28.45</cd> <in>on</in> <det>the</det> <nnp>New</nnp> <nnp>York</nnp> <nnp>Stock</nnp> <nnp>Exchange</nnp> <in>on</in> <nnp>Friday</nnp> <nn>morning</nn> <pp>.</pp>

Here, nnp, for example, indicates a proper noun. The tagged output and knowledge of English grammar help us to identify sentence structure. This helps us to learn "who does what to whom?" from a sentence, which in turn helps us to extract information more accurately.

FIRST uses information from the part-of-speech tagger to identify the sentence structure. This structure is then used in its extraction rules. Some sample FIRST rules using syntactic information, analyzing sentences that contain key terms used in the proximity rule (earlier), are shown next:

Case 1: Examples where financial status appears after financial item.

    <nns>Sales</nns> <vbd>rose</vbd> <cd>5.7</cd> <nn>%</nn> <to>to</to> <ppd>$</ppd> <cd>2.22</cd> <cd>billion</cd> <in>from</in> <ppd>$</ppd> <cd>2.1</cd> <cd>billion</cd> <det>a</det> <nn>year</nn> <in>ago</in> <pp>.</pp>

    <nnp>Campbell</nnp> <nnp>Soup</nnp> <nnp>Co.</nnp> <vbd>said</vbd> <nnp>Friday</nnp> <jj>quarterly</jj> <nns>profits</nns> <vbd>were</vbd> <jj>flat</jj> <pp>.</pp>

The following is an example of a rule to identify financial status for this case:

    for each keyword that is a candidate for denoting a financial item (e.g., sales)
        if the tagger has identified that keyword as a noun or plural noun in the sentence
        then a form of a corresponding financial status keyword (e.g., "increase") should be present in the immediately following verb phrase
        end if
    end for

We maintain a lexicon of candidate keywords for denoting financial items (e.g., sales, revenue, and profits), as well as a lexicon of candidate keywords for denoting financial status (e.g., rise, increase, and decrease). In the previous examples, <vbd>rose</vbd> and <vbd>were</vbd> <jj>flat</jj> form the respective verb phrases. Thus, for the first sentence, FIRST fills the slots as:

    Financial Item: sales
    Financial Status: rose

For the second sentence, FIRST fills the slots as:

    Financial Item: profits
    Financial Status: were flat

Case 2: Examples where status appears before financial item.

    <det>The</det> <nnp>Camden</nnp> <ppc>,</ppc> <nnp>N.J.</nnp> <ppc>,</ppc> <nn>company</nn> <vbd>saw</vbd> <jj>strong</jj> <nns>sales</nns> <pp>.</pp>

    <det>The</det> <nn>company</nn> <vbd>saw</vbd> <det>a</det> <cd>6</cd> <nn>%</nn> <to>to</to> <cd>9</cd> <nn>%</nn> <nn>sequential</nn> <nn>increase</nn> <in>in</in> <nns>sales</nns> <in>from</in> <jj>last</jj> <nn>quarter</nn> <pp>.</pp>

Exhibit A. Parse tree for the sentence "Jim works for a big bank in NY" (node labels: S, N, VP, V, PP, PREP, PNP, DET, ADJ, NP, NPP)
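The proximity rule and the Case 1 syntactic rule described earlier can be combined in a short Python sketch that reads tagger-style markup and fills a two-slot template. It is a hedged illustration, not FIRST's code: the function names, the regular expression, the proximity values n1 = n2 = 3, and the inflected lexicon entries ("rose", "flat") are all assumptions, and the verb phrase is approximated as a run of verb tags plus an optional adjective. The two inputs are shortened versions of the Case 1 examples. Case 2 is not covered.

```python
import re
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (part-of-speech tag, lowercased word)

# Tiny illustrative lexicons. Only some entries are named in the chapter;
# the inflected forms ("rose", "flat") are assumptions for this sketch.
FINANCIAL_ITEMS = {"sales", "revenue", "profits"}
STATUS_WORDS = {"rise", "rose", "increase", "decrease", "flat"}
NOUN_TAGS = {"nn", "nns"}

TAGGED_WORD = re.compile(r"<(\w+)>(.*?)</\1>")


def parse_tagged(text: str) -> List[Token]:
    """Turn '<tag>word</tag>' markup into (tag, word) pairs."""
    return [(tag, word.lower()) for tag, word in TAGGED_WORD.findall(text)]


def proximity_candidates(tokens: List[Token], n1: int = 3, n2: int = 3) -> List[int]:
    """Proximity rule: keep a status word if a financial item occurs
    within n1 words before it or n2 words after it."""
    hits = []
    for i, (_, word) in enumerate(tokens):
        if word not in STATUS_WORDS:
            continue
        nearby = tokens[max(0, i - n1):i] + tokens[i + 1:i + 1 + n2]
        if any(w in FINANCIAL_ITEMS for _, w in nearby):
            hits.append(i)
    return hits


def case1_template(tokens: List[Token]) -> Optional[Dict[str, str]]:
    """Case 1: the item is a noun and a status candidate appears in the
    verb phrase that immediately follows it."""
    candidates = set(proximity_candidates(tokens))
    for i, (tag, word) in enumerate(tokens):
        if word not in FINANCIAL_ITEMS or tag not in NOUN_TAGS:
            continue
        phrase = []
        j = i + 1
        while j < len(tokens) and tokens[j][0].startswith("vb"):
            phrase.append(j)
            j += 1
        if phrase and j < len(tokens) and tokens[j][0] == "jj":
            phrase.append(j)  # e.g. "were flat"
        if any(k in candidates for k in phrase):
            status = " ".join(tokens[k][1] for k in phrase)
            return {"Financial Item": word, "Financial Status": status}
    return None


CASE1_A = "<nns>Sales</nns> <vbd>rose</vbd> <cd>5.7</cd> <nn>%</nn> <pp>.</pp>"
CASE1_B = "<jj>quarterly</jj> <nns>profits</nns> <vbd>were</vbd> <jj>flat</jj> <pp>.</pp>"
result_a = case1_template(parse_tagged(CASE1_A))
result_b = case1_template(parse_tagged(CASE1_B))
print(result_a)  # {'Financial Item': 'sales', 'Financial Status': 'rose'}
print(result_b)  # {'Financial Item': 'profits', 'Financial Status': 'were flat'}
```

Running the proximity check first keeps the syntactic step cheap, since only sentences that already contain a plausible item and status pair reach the tag-based test; the resulting dictionary could then be written to a database row or serialized to XML.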