Information Extraction for Financial Analysis


VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

Entry for the "Student Scientific Research" Award, 2012

Project title: Information Extraction for Financial Analysis
Student: Lê Văn Khánh (male)
Class: K53CA
Faculty: Information Technology
Supervisor: Dr. Phạm Bảo Sơn

Hanoi, 2012

Abstract

Today, much of the useful information on the World Wide Web is formatted for human readers, which makes it difficult to extract relevant data from the many different sources. Information Extraction (IE) was born to solve this problem, and flexible IE systems that transform such resources into program-friendly structures, such as relational databases or XML, are becoming a great necessity. In this report we present the problem of applying Information Extraction to Financial Analysis. The main goal is to extract information from thousands of financial reports written in different formats. We present a systematic approach to building recognition rules, and we evaluate the performance of our system.

Contents

Chapter 1. Introduction
  1.1 Subject Overview
  1.2 Information Extraction
  1.3 Report Structure
Chapter 2. Approaches in Information Extraction
  2.1 Manually Constructed IE Systems
    2.1.1 TSIMMIS tool
    2.1.2 W4F
    2.1.3 XWRAP
  2.2 Supervised IE Systems
    2.2.1 SRV
    2.2.2 RAPIER
    2.2.3 WHISK
  2.3 Semi-Supervised IE Systems
    2.3.1 IEPAD
    2.3.2 OLERA
  2.4 Unsupervised IE Systems
    2.4.1 RoadRunner
    2.4.2 DEPTA
Chapter 3. Our Approach
  3.1 Problem Formalization
    3.1.1 HTML Mode
    3.1.2 Plain-text Mode
  3.2 Approaches & Implementation
    3.2.1 HTML Mode
      3.2.1.1 Preprocessing
      3.2.1.2 Extracting
      3.2.1.3 Finalizing
Chapter 4. Experimental Setup and Evaluations
  4.1 Evaluation Metrics
  4.2 Corpus Development
  4.3 Evaluation Criteria
  4.4 Training Process
  4.5 Testing Process
Chapter 5. Conclusion & Future Work
References

Chapter 1. Introduction

1.1 Subject Overview

Nowadays, reports play an important role in every field: managers can follow the progress of work through them. Our problem is how to deal with financial reports that are written in various formats by different companies. These reports contain a great deal of information, but readers usually need only the brief, essential facts in order to understand quickly what a report says. For example, given a document like Figure 1.1, the system produces output like Figure 1.2; Figure 1.3 shows the general scenario of our work.

Figure 1.1. A report.
Figure 1.2. Excel format.
Figure 1.3. Scenarios.

In this scenario (Figure 1.3), we concentrate only on step 1: processing reports to obtain output like Figure 1.2. The resulting records are then analyzed by finance specialists at step 2. To sum up, our task is to apply Information Extraction to Financial Analysis in order to obtain such output (e.g., Figure 1.2 above).

1.2 Information Extraction

Information Extraction (IE) was originally applied to identify desired information in natural-language text and convert it into a well-defined structure, e.g., a database with particular fields. With the huge and rapidly increasing amount of information sources and electronic documents on the World Wide Web, IE has been extended to identifying information in structured and semi-structured web pages.
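To make that conversion concrete, here is a toy sketch (our illustration, not the report's system): a few regular expressions turn a free-text announcement into one record with fixed fields. The field names and patterns are made-up assumptions.

```python
import re

# Toy sketch: identify desired information in natural-language text and
# store it in a well-defined structure (here, a dict acting as one
# database record). Field names and patterns are illustrative only.
PATTERNS = {
    "date":       r"\b(\d{2}/\d{2}/\d{4})\b",
    "start_time": r"\b(\d{1,2}:\d{2}\s*(?:AM|PM))\b",
    "location":   r"\b(Room\s+\d+)\b",
    "speaker":    r"Speaker:\s*([^.\n]+)",
}

def extract_record(text: str) -> dict:
    """Return one record; a field absent from the text stays None."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        record[field] = match.group(1).strip() if match else None
    return record

announcement = ("The seminar takes place on 12/04/2012 at 3:30 PM "
                "in Room 302. Speaker: Pham Bao Son.")
print(extract_record(announcement))
# {'date': '12/04/2012', 'start_time': '3:30 PM',
#  'location': 'Room 302', 'speaker': 'Pham Bao Son'}
```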
Recently, more and more research groups have concentrated on developing IE systems for applications such as web mining and question answering. Research on IE falls into two subareas: designing extraction patterns that identify the target information in a given text, and using machine learning techniques to build such extraction patterns automatically, avoiding expensive construction by hand. Many IE systems have been implemented successfully, and some of them perform very well. Figure 1.4 shows an example of information extraction: given a seminar announcement, the entities Date, Start-time, Location, Speaker and Topic can be identified.

Figure 1.4. Information Extraction for a Seminar Announcement.

Formally, an IE task is defined by its input and its extraction target. The input can be an unstructured document written in natural language, such as plain text (e.g., Figure 1.4), or one of the semi-structured documents that are common on the Web, such as tables or itemized and enumerated lists (e.g., Figure 1.5).

Figure 1.5. A semi-structured page containing data records (in the rectangular box) to be extracted.

The extraction target can be a relation of k-tuples, where k is the number of fields (attributes) in a record, or a complex object with hierarchically organized data. For some IE tasks, an attribute may have zero (missing) or multiple instantiations in a record. IE systems are also called extractors or wrappers. Traditional IE systems rely on a few main approaches to exploit the information: rule-based methods, machine learning and pattern mining techniques.

1.3 Report Structure

Our report is organized as follows. Chapter 2 introduces IE systems in the information-extraction domain and reviews some of the solutions that have been proposed. Chapter 3 describes our approach and system implementation. Chapter 4 describes the experiments we carried out to evaluate the quality of our approach. Finally, Chapter 5 presents our conclusion and future work.

Chapter 2. Approaches in Information Extraction

Earlier IE systems were designed to help programmers write extraction rules, while later systems use machine learning to generate generalized rules automatically. Such systems differ in their degree of automation and accuracy, so IE systems can be classified into four classes: manually constructed, supervised, semi-supervised and unsupervised IE systems.

2.1 Manually Constructed IE Systems

In manually constructed IE systems, users hand-craft a wrapper for each input source, either in a general programming language such as Java, Python or Perl, or in a specially designed language. These tools therefore require developers with substantial computing and programming backgrounds, which makes them expensive. Such systems include TSIMMIS [1], W4F [2] and XWRAP [3].

2.1.1 TSIMMIS tool

The main component of this tool is a wrapper that takes as input a specification file stating declaratively how to extract the data. Figure 2.1(a) shows an example specification file.

Figure 2.1. (a) A TSIMMIS specification file and (b) the OEM output.

Each command is of the form [variables, source, pattern], where source specifies the input text to be considered, pattern specifies how to find the text of interest within the source, and variables is a list of variables that hold the extracted results. The special symbol "*" in a pattern means discard, and "#" means save into the variables. TSIMMIS then outputs data in the Object Exchange Model (e.g., Figure 2.1(b)), which contains the extracted data together with information about the structure and contents of the result. TSIMMIS provides two important operators: split and case. The split operator divides an input list element into individual elements, and the case operator lets the user handle irregularities in the structure of the input pages.
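The command semantics just described can be mimicked in a few lines. The sketch below is our own illustrative re-encoding of the [variables, source, pattern] idea, not TSIMMIS's actual specification language; the actual language is richer, including the split and case operators mentioned above.

```python
import re

def apply_command(variables, source, pattern):
    """Interpret one [variables, source, pattern] command in the spirit
    of TSIMMIS: '*' in the pattern means match-and-discard, '#' means
    match-and-save into the next variable. Illustrative only."""
    regex = ""
    for chunk in re.split(r"([*#])", pattern):
        if chunk == "*":
            regex += ".*?"      # discard
        elif chunk == "#":
            regex += "(.*?)"    # save
        else:
            regex += re.escape(chunk)
    match = re.search(regex, source, re.S)
    if match is None:
        return {}
    return dict(zip(variables, (g.strip() for g in match.groups())))

page = "<html><b>ACME Corp</b> price: 17.35 USD <i>updated daily</i></html>"
# Discard everything around the two fields; save the name and the price.
print(apply_command(["company", "price"], page, "<b>#</b> price: # USD"))
# {'company': 'ACME Corp', 'price': '17.35'}
```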
2.1.2 W4F

W4F stands for WYSIWYG Web Wrapper Factory, a Java toolkit for generating Web wrappers. The wrapper development process consists of three independent layers: retrieval, extraction and mapping. In the retrieval layer, a document is retrieved from the Web through the HTTP protocol, cleaned up, and parsed into a tree following the Document Object Model (a DOM tree) [5]. In the extraction layer, extraction rules are applied to the DOM tree and the extracted information is stored in an internal format called the Nested String List (NSL). In the mapping layer, the NSL structures are exported to the upper-level application according to mapping rules. Extraction rules are expressed in HEL (HTML Extraction Language), which addresses the data to be located by paths in the HTML parse tree (i.e., the DOM tree); for example, users can apply regular expressions, following the programming-language syntax, to match or split the string obtained by a DOM-tree path.

2.1.3 XWRAP

The wrapper generation process consists of two phases: structure analysis and source-specific XML generation. First, XWRAP fetches and cleans up the page, generates a parse tree, and identifies regions and semantic tokens. In the second phase, the system generates an XML template file based on the content tokens and the nesting-hierarchy specification, and then constructs a source-specific XML generator. XWRAP requires the user to understand the HTML parse tree and to identify, for instance, the tags separating rows and columns in a table. It relies mainly on extraction rules over the DOM tree; no learning algorithm is used.

2.2 Supervised IE Systems

Supervised IE systems take a set of inputs labeled with examples of the data to be extracted and output a wrapper. The user provides an initial set of annotated examples to train the system. For such systems, general users rather than programmers can be trained to use the labeling GUI, which reduces the cost of wrapper generation. Such systems include SRV [4], RAPIER [6] and WHISK [12].

2.2.2 RAPIER

[...] field-level extraction, but uses a bottom-up (compression-based) relational learning algorithm: it begins with the most specific rules and then replaces them with more general ones. RAPIER learns single-slot extraction patterns that make use of syntactic and semantic information, including a part-of-speech tagger and a lexicon (WordNet). It also uses templates to learn extraction patterns. The extraction [...]

2.3.1 IEPAD

[...] occurrence of a repeat and end before the start of the next occurrence.

2.3.2 OLERA

OLERA acquires a rough example from the user for extraction-rule generation, and it can learn extraction rules for pages containing single data records. OLERA consists of three main operations. (1) Enclosing an information block of interest: the user marks an information block containing a record to be extracted, and OLERA discovers other similar blocks and generalizes them to an extraction pattern using a multiple string alignment technique. (2) Drilling down / rolling up an information slot: drilling down allows the user to navigate from a text fragment to more detailed components, whereas rolling up combines several slots to form a meaningful information unit. (3) Designating relevant information slots for schema specification, as in IEPAD. [...]
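The generalization step in operation (1) can be sketched with a pairwise alignment. The code below is a rough stand-in built on Python's difflib, reusing sample values from Figure 3.1 later in this report; OLERA itself aligns many blocks at once with multiple string alignment, so treat this as a two-block approximation.

```python
from difflib import SequenceMatcher

def generalize(block_a: str, block_b: str, wildcard: str = "#") -> str:
    """Rough OLERA-style generalization: align two marked blocks
    token-by-token and replace every stretch that differs with a
    wildcard slot. Real OLERA aligns more than two blocks."""
    a, b = block_a.split(), block_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    pattern = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            pattern.extend(a[i1:i2])    # shared template text
        else:
            pattern.append(wildcard)    # varying data -> one slot
    return " ".join(pattern)

row1 = "<td> Common stocks </td> <td> 85.92% </td>"
row2 = "<td> INFORMATION TECHNOLOGY </td> <td> 19.72% </td>"
print(generalize(row1, row2))
# <td> # </td> <td> # </td>
```

Designating the relevant slots (operation 3) then amounts to choosing which wildcard positions to keep in the schema.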
Chapter 3. Our Approach

3.1 Problem Formalization

[...] will be discussed later on.

3.1.1 HTML Mode

In this mode, the input contains a great deal of information wrapped in HTML tags, but we need only the main information and ignore everything else. The desired information is placed inside table tags (e.g., Figure 3.1):

    Common stocks             85.92%
    INFORMATION TECHNOLOGY    19.72%
    Cisco Systems, [...]

[...] output for this situation? To answer the question, see Figure 3.2: the pieces of information that should be extracted together form a tuple of the form < >.

Figure 3.2. Extracted Information. [...]
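A bare-bones sketch of what extracting tuples from table tags can look like, using only Python's standard library. This is our illustration, not the report's implementation; the actual pipeline wraps preprocessing, extracting and finalizing stages (Section 3.2.1) and rule-based recognition around this idea.

```python
from html.parser import HTMLParser

class TableTupleParser(HTMLParser):
    """Collect each table row as a tuple of its cell texts: a bare-bones
    stand-in for pulling record tuples out of <table> markup."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data       # accumulate text inside the cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(tuple(s.strip() for s in self._row))

report = """<table>
<tr><td>Common stocks</td><td>85.92%</td></tr>
<tr><td>INFORMATION TECHNOLOGY</td><td>19.72%</td></tr>
</table>"""

parser = TableTupleParser()
parser.feed(report)
print(parser.rows)
# [('Common stocks', '85.92%'), ('INFORMATION TECHNOLOGY', '19.72%')]
```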
Chapter 4. Experimental Setup and Evaluations

4.1 Evaluation Metrics

[...] metrics at the present day. Recall in information extraction indicates the proportion of correct extractions out of all possible extractions, i.e., all positive instances in the training set, while precision denotes the proportion of correct extractions out of all extracted instances. Formally, recall and precision are defined as shown in Figure 4.1, i.e.

\[
\text{Recall} = \frac{|\text{correct extractions}|}{|\text{possible extractions}|},
\qquad
\text{Precision} = \frac{|\text{correct extractions}|}{|\text{extracted instances}|}.
\]

Figure 4.1. Precision and Recall formulas.

To compute the average of the results, [...]

Chapter 5. Conclusion & Future Work

We have successfully built a complete system for extracting information from a large collection of financial reports written in HTML. We presented a systematic approach to building a rule-based system, together with the processes used in annotating a corpus. Our system achieves an F-measure of 74.19% for Full Tuple, 81.67% for P tags and 96.26% for Partial Tuple. We believe that the system can be further improved. [...]

References

[...]

[3] Liu, L., Pu, C. and Han, W., XWRAP: An XML-enabled wrapper construction system for Web information sources. Proceedings of the 16th International Conference on Data Engineering (ICDE), San Diego, California, pp. 611-621, 2000.
[4] Freitag, D., Information extraction from HTML: Application of a general learning approach. Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), 1998.
[5] http://en.wikipedia.org/wiki/Document_Object_Model
[6] Califf, M. and Mooney, R., Relational learning of pattern-match rules for information extraction. Proceedings of the AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, Stanford, California, March 1998.
[7] Liu, B., Grossman, R. and Zhai, Y., Mining data records in Web pages. Proceedings of KDD, pp. 601-606, 2003.
[8] Zhai, Y. and Liu, B., Web data extraction based on partial tree alignment. Proceedings of the 14th International Conference on World Wide Web (WWW), Japan, pp. 76-85, 2005.
[9] Eikvil, L., Information Extraction from World Wide Web - A Survey.
[10] [...], Proceedings of the Workshop on Human Language Technology, 1993.
[11] http://en.wikipedia.org/wiki/Information_retrieval#F-measure
[12] Soderland, S., Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233-272, 1999.
[13] Chang, C.-H. and Lui, S.-C., IEPAD: Information extraction based on pattern discovery. Proceedings of the Tenth International Conference [...]
