DSpace at VNU: An upgrading feature-based opinion mining model on Vietnamese product reviews


Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany

Volume 6890

Ning Zhong, Vic Callaghan, Ali A. Ghorbani, Bin Hu (Eds.)

Active Media Technology
7th International Conference, AMT 2011
Lanzhou, China, September 7-9, 2011
Proceedings

Volume Editors
Ning Zhong, Maebashi Institute of Technology, Department of Life Science and Informatics, Maebashi-City 371-0816, Japan. E-mail: zhong@maebashi-it.ac.jp
Vic Callaghan, University of Essex, Department of Computer Science, Colchester, Essex CO4 3SQ, UK. E-mail: vic@essex.ac.uk
Ali A. Ghorbani, University of New Brunswick, Faculty of Computer Science, Fredericton, N.B., E3B 5A3, Canada. E-mail: ghorbani@unb.ca
Bin Hu, Lanzhou University, School of Information Science and Engineering, Lanzhou, Gansu, 730000, China. E-mail: bh@lzu.edu.cn

ISSN 0302-9743
e-ISSN 1611-3349
ISBN 978-3-642-23619-8
e-ISBN 978-3-642-23620-4
DOI 10.1007/978-3-642-23620-4
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011935218
CR Subject Classification (1998): H.4, I.2, H.3, H.5, C.2, J.1, I.2.11, K.4
LNCS Sublibrary: SL – Information Systems and Application, incl. Internet/Web and HCI

© Springer-Verlag Berlin Heidelberg 2011

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services,
Chennai, India. Printed on acid-free paper. Springer is part of Springer Science+Business Media (www.springer.com).

Preface

This volume contains the papers selected for presentation at the 2011 International Conference on Active Media Technology (AMT 2011), jointly held with the 2011 International Conference on Brain Informatics (BI 2011), at Lanzhou University, Lanzhou, China, during September 7–9, 2011. Organized by the Web Intelligence Consortium (WIC) and the IEEE Computational Intelligence Society Task Force on Brain Informatics (IEEE TF-BI), as well as Lanzhou University, this conference was the seventh in the AMT series since its debut at Hong Kong Baptist University in 2001 (followed by AMT 2004 in Chongqing, China, AMT 2005 in Kagawa, Japan, AMT 2006 in Brisbane, Australia, AMT 2009 in Beijing, China, and AMT 2010 in Toronto, Canada).

In the great digital era, we are witnessing many rapid scientific and technological developments in human-centered, seamless computing environments, interfaces, devices, and systems, with applications ranging from business and communication to entertainment and learning. These developments are collectively best characterized as active media technology (AMT), a new area of intelligent information technology and computer science that emphasizes the proactive, seamless roles of interfaces and systems as well as new media in all aspects of digital life. An AMT-based system offers services that enable the rapid design, implementation and support of customized solutions.

AMT research spans a number of mutually supporting fields, and the conference topics aim to explore and present state-of-the-art work in them. These fields include the following research topics: active computer systems and intelligent interfaces; adaptive Web systems and information-foraging agents; agent-based software engineering and multi-agent systems; AMT for the Semantic Web and Web 2.0; cognitive foundations for AMT; conversational informatics; data mining, ontology mining and Web reasoning; digital city and digital interactivity; e-commerce and Web services; e-learning, entertainment and social applications of active media; evaluation of active media and AMT-based systems; human–Web interaction; human factors in AMT; information retrieval; machine learning and human-centered robotics; multi-modal processing, detection, recognition, and expression analysis; network, mobile and wireless security; personalized, pervasive, and ubiquitous systems and their interfaces; semantic computing for active media and AMT-based systems; sensing Web; smart digital media; trust on Web information systems; Web-based social networks; and Web mining, wisdom Web and Web intelligence.

Here we would like to express our gratitude to all members of the Conference Committee for their instrumental and unfailing support. AMT 2011 had a very exciting program with a number of features, including keynote talks, technical sessions, workshops, and social programs. This would not have been possible without the generous dedication of the Program Committee members and the external reviewers in reviewing the papers submitted to AMT 2011, of our keynote speakers, Ali Ghorbani of the University of New Brunswick, Toyoaki Nishida of Kyoto University, Lin Chen of the Chinese Academy of Sciences, Frank Hsu of Fordham University, Zhongtuo Wang of Dalian University of Technology (Xuesen Qian Memoriam Invited Talk), and Yulin Qin of Beijing University of Technology (Herbert Simon Memoriam Invited Talk), and of the Organizing Chairs, Timothy K. Shi, Juerg Gutknecht, and Junzhou Luo, as well as the organizer of the special session, Hanmin Jung. We thank them for their strong support and dedication.

We would also like to thank the sponsors of this conference, ALDEBARAN Robotics Company, ShenZhen Hanix United, Inc., and ISEN TECH & TRADING Co., Ltd. AMT 2011 could not have taken place without the great team effort of the Local Organizing
Committee, the support of the International WIC Institute, Beijing University of Technology, China, and Lanzhou University, China. Our special thanks go to Juzhen Dong, Li Liu, Yi Zeng, and Daniel Tao for organizing and promoting AMT 2011 and coordinating with BI 2011. We are grateful to Springer's Lecture Notes in Computer Science (LNCS/LNAI) team for their generous support. We thank Alfred Hofmann and Christine Reiss of Springer for their help in coordinating the publication of this special volume in an emerging and interdisciplinary research field.

June 2011
Ning Zhong, Vic Callaghan, Ali A. Ghorbani, Bin Hu

Organization

Conference General Chairs
Ali A. Ghorbani, University of New Brunswick, Canada
Bin Hu, Lanzhou University, China, and ETH Zurich, Switzerland

Program Chairs
Ning Zhong, International WIC Institute, Beijing University of Technology, China; Maebashi Institute of Technology, Japan
Vic Callaghan, University of Essex, UK

Organizing Chairs
Timothy K. Shi, National Central University, Taiwan
Juerg Gutknecht, Swiss Federal Institute of Technology Zurich, Switzerland
Junzhou Luo, Southeast University, China

Publicity Chairs
Li Liu, Lanzhou University, China
Daniel Tao, Queensland University of Technology, Australia
Yi Zeng, Beijing University of Technology, China

WIC Chairs/Directors
Ning Zhong, Maebashi Institute of Technology, Japan
Jiming Liu, Hong Kong Baptist University, Hong Kong

IEEE TF-BI Chair
Ning Zhong, Maebashi Institute of Technology, Japan

WIC Advisory Board
Edward A. Feigenbaum, Stanford University, USA
Setsuo Ohsuga, University of Tokyo, Japan
Benjamin Wah, The Chinese University of Hong Kong, Hong Kong
Philip Yu, University of Illinois, Chicago, USA
L.A. Zadeh, University of California, Berkeley, USA

WIC Technical Committee
Jeffrey Bradshaw, UWF/Institute for Human and Machine Cognition, USA
Nick Cercone, York University, Canada
Dieter Fensel, University of Innsbruck, Austria
Georg Gottlob, Oxford University, UK
Lakhmi Jain, University of South Australia, Australia
Jianchang Mao, Yahoo! Inc., USA
Pierre Morizet-Mahoudeaux, Compiegne University of Technology, France
Hiroshi Motoda, Osaka University, Japan
Toyoaki Nishida, Kyoto University, Japan
Andrzej Skowron, Warsaw University, Poland
Jinglong Wu, Okayama University, Japan
Xindong Wu, University of Vermont, USA
Yiyu Yao, University of Regina, Canada

Program Committee
Jiannong Cao, Hong Kong Polytechnic University, Hong Kong
Sharat Chandran, Indian Institute of Technology Bombay, India
Sung-Kwon Choi, Electronics and Telecommunications Research Institute, Korea
Sung-pil Choi, Korea Institute of Science and Technology Information, Korea
Chin-Wan Chung, Korea Advanced Institute of Science and Technology, Korea
Alexander Felfernig, Graz University of Technology, Austria
Xiaoying (Sharon) Gao, Victoria University of Wellington, New Zealand
Joseph A. Giampapa, Carnegie Mellon University, USA
Adrian Giurca, Brandenburg University of Technology at Cottbus, Germany
William Grosky, University of Michigan, USA
Daryl Hepting, University of Regina, Canada
Masahito Hirakawa, Shimane University, Japan
Mark Hoogendoorn, VU University Amsterdam, The Netherlands
Ching-Hsien Hsu, Chung Hua University, Taiwan
Jiajin Huang, Beijing University of Technology, China
Wolfgang Huerst, Utrecht University, The Netherlands
Hiroshi Ishikawa, Kagawa University, Japan
Hanmin Jung, Korea Institute of Science and Technology Information, Korea
Brigitte Kerherve, Université du Québec à Montréal, Canada
Haklae Kim, Samsung Electronics Inc., Korea
Seung Kwon Choi, Electronics and Telecommunications Research Institute, Korea
Yeong Su Lee, Munich University, Germany
Kuan-Ching Li, Providence University, Taiwan
Qing Li, City University of Hong Kong, Hong Kong
Xining Li, University of Guelph, Canada
Li Liu, Lanzhou University, China
Brien Maguire, University of Regina, Canada
Wenji Mao, Institute of Automation, CAS, China
Yoshihiro Okada, Kyushu University, Japan
Felix Ramos, Research and Advanced Studies Center, Mexico
Abdulmotaleb El Saddik, University of Ottawa, Canada
Eugene Santos, University of Connecticut, USA
Gerald Schaefer, Loughborough University, UK
Dominik Slezak, University of Warsaw and Infobright Inc., Poland
Kazunari Sugiyama, National University of Singapore, Singapore
Yuqing Sun, Shandong University, China
Rune Saetre, Norwegian University of Science and Technology, Norway
Xijin Tang, Academy of Mathematics and Systems Science, CAS, China
Haipeng Wang, Northwestern Polytechnical University, China
Wang Wei, Lanzhou University, China
Yue Xu, Queensland University of Technology, Australia
Jian Yang, Beijing University of Technology, China
Zeng Yi, Beijing University of Technology, China
Tetsuya Yoshida, Hokkaido University, Japan
Shichao Zhang, University of Technology, Sydney, Australia
Zili Zhang, Southwest University, China
Zhangbing Zhou, Institut TELECOM and Management SudParis, France
Tingshao Zhu, Graduate University of Chinese Academy of Sciences, China
William Zhu, University of Electronic Science and Technology, China

Table of Contents

Keynote Talks

People's Opinion, People's Nexus, People's Security and Computational Intelligence: The Evolution Continues
  Ali Ghorbani
Towards Conversational Artifacts
  Toyoaki Nishida
The Global-First Topological Definition of Perceptual Objects, and Its Neural Correlation in Anterior Temporal Lobe
  Lin Chen, Ke Zhou, Wenli Qian, and Qianli Meng
Combinatorial Fusion Analysis in Brain Informatics: Gender Variation in Facial Attractiveness Judgment
  D. Frank Hsu, Takehito Ito, Christina Schweikert, Tetsuya Matsuda, and Shinsuke Shimojo
Study of System Intuition by Noetic Science Founded by QIAN Xuesen
  Zhongtuo Wang
Study of Problem Solving Following Herbert Simon
  Yulin Qin and Ning Zhong

Data Mining and Pattern Analysis in Active Media

A Heuristic Classifier Ensemble for Huge Datasets
  Hamid Parvin, Behrouz Minaei, and Hosein Alizadeh
Ontology Extraction and Integration from Semi-structured Data
  Shaobo Wang, Yi Zeng, and Ning Zhong
Effectiveness of Video Ontology in Query by Example Approach
  Kimiaki Shirahama and Kuniaki Uehara
A Survey of Energy Conservation, Routing and Coverage in Wireless Sensor Networks
  Wang Bin, Li Wenxin, and Li Liu
A Multi-type Indexing CBVR System Constructed with MPEG-7 Visual Features
  Yin-Fu Huang and He-Wen Chen

342 S. Lee et al.

The institution report also consists of three parts: institution profile, trends and competition. Profile statements are fetched from OpenCalais. Institution trends are summarized in terms of the institution's major technologies and research direction. The institution report also lists the institutions that compete with it on those major technologies.

Discussion

Several existing tools, such as VantagePoint, Aureka, STN AnaVist and Thomson Data Analyzer, provide functions and services comparable to InSciTe. These tools offer users a variety of complex quantitative-level analyses of the input data. The analyzed results are usually presented in two-dimensional matrices, which users can control flexibly. The tools also provide useful visualization components, such as contour maps, that are especially suitable for quantitative-level analysis.

However, these tools require users to collect the technical literature they are interested in beforehand and to convert it into a system-specific format before importing it. They also demand complicated, skilled techniques before users can get useful results from the imported data. Most of these tools provide useful quantitative-level analyses, such as the global trend, the trend for each research agent or technology, and researchers' networks, but do not provide comparative analyses, such as competitive or cooperative relations between research agents, from multi-faceted viewpoints. That is, they support analyses only from the single viewpoint of year, research agent or technology. Some of the tools even limit the size of the data that can be processed. VantagePoint, of course, can provide some analysis results
in a two-dimensional view chosen by the user, but many parts of a multi-faceted analysis are still left to the user. The complicated usage of these tools makes them more appropriate for skilled analysts than for general users. These tools are also best suited to focused analyses in a narrow domain.

In contrast, InSciTe was designed so that users do not have to learn complicated usage, and so that it provides shallow but horizontal (broad) canned services covering all research and industrial fields. InSciTe is based on Semantic Web technologies as well as text-mining technologies, and so internally processes the data in RDF format [5]. The data include bibliographic metadata as well as named entities and their relations mined from the unstructured text of technical literature. InSciTe extracts significant entities and their relations from technical literature and supports decision-making by combining the extracted data with metadata on a semantic service platform. InSciTe can provide diverse analyses from multi-faceted viewpoints by combining several entities based on their possible relations, although it does not provide highly complicated quantitative-level analysis algorithms or the flexibility to combine analysis conditions as users wish. In addition, InSciTe can link to Semantic Web open sources, verify how the results are semantically inferred, and generate summary reports automatically. As a summary, the table compares InSciTe with VantagePoint, which is the most popular of these tools.

Footnote: OpenCalais, http://www.opencalais.com/

Using Semantic Web Technologies for Technology Intelligence Services 343

Table. The comparison of InSciTe to VantagePoint

Data size
  InSciTe: ~ tens of millions of records
  VantagePoint: ~ 20,000 records
Target users
  InSciTe: Planner, expert, chief officer, …
  VantagePoint: Analyst, consultant
DB
  InSciTe: Metadata, full-text (DB2OWL)
  VantagePoint: Bibliographic database (import filter)
Dimension of analysis
  InSciTe: Multi-dimensional
  VantagePoint: 2-dimensional (co-occurrence matrices, maps and networks)
Text mining level
  InSciTe: Entity/relation extraction
  VantagePoint: Keyword extraction
Service type/method
  InSciTe: Canned services; pull and push services
  VantagePoint: DIY, scripting; pull services
Others
  InSciTe: Ontology model
  VantagePoint: Expectancy value using Bernoulli process

Conclusion

To spread and activate Technology Intelligence across research and industrial fields, we proposed shallow but automated Technology Intelligence services that can reduce the amount of labor required from experts. A technology intelligence service, InSciTe (http://www.ontoframe.kr/InSciTe/), has been developed using Semantic Web technologies as well as text mining. It extracts meaningful technologies and the relations among them from large-scale technical literature and combines them with metadata on a semantic service platform to enhance their analytical value. It analyzes correlations among technologies, research agents and research outcomes, focusing on relations such as competition and cooperation. It targets decision-making researchers who are responsible for establishing R&D strategy, and provides the insights they need to establish that strategy or to decide on their research and business direction. We also explained the Semantic Web technologies applied in InSciTe, such as ontology modeling, the semantic repository, inference and verification, and how they make its services possible. In the future, we plan to evaluate InSciTe through a user study and to extend our approach to give researchers further insights, such as predictions of emerging and promising technologies.

References

1. Mortara, L., Kerr, C., Phaal, R., Probert, D.: Technology Intelligence: Identifying threats and opportunities from new technologies. University of Cambridge Institute for Manufacturing, UK (2007)
2. Lang, H.-C., Mueller, M.: Technology Intelligence: Identifying and Evaluating New Technologies. In: Portland International Conference on Management and Technology, p. 218 (1997)
3. Schuh, G., Grawatsch, M.: TRIZ-based Technology Intelligence. In: European TRIZ Association meeting TRIZFutures (2003)
4. Rowe, G., Wright, G.: The Delphi technique as a forecasting tool: issues and analysis. International Journal of Forecasting 15, 353–375 (1999)
5. Resource Description Framework (RDF): Concepts and Abstract Syntax, http://www.w3.org/TR/rdf-concepts/
6. Jeong, C.-H., Choi, S.-P., Choi, Y.-S.: Introduction of the Scientific Intelligence Discovery Framework using Grid Computing. In: International Conference on Convergence Content (2009)
7. Lee, S., Kim, P., Lee, M., Jung, H., Sung, W.-K.: Efficiency of DBMS-based Ontology Storing. In: International Conference on Convergence Content (2008)
8. SPARQL Query Language for RDF, http://www.w3.org/TR/rdf-sparql-query/
9. Lee, S., Jung, H., Sung, W.-K.: Supporting SPARQL in OntoThink-K, an Inference Service based on R-DBMS. In: 2006 Fall Conference on Korea Information Science Society, pp. 223–227 (2008) (in Korean)
10. Forgy, C.L.: Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem. Artificial Intelligence 19(1), 17–37 (1982)
11. Doorenbos, R.B.: Production Matching for Large Learning Systems. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA (1995)
12. RDF Semantics, http://www.w3.org/TR/rdf-mt/
13. OWL Web Ontology Language Semantics and Abstract Syntax, http://www.w3.org/TR/owl-semantics/
14. Lee, S., Jung, H., Kim, P., You, B.-J.: Dynamically Materializing Wild Pattern Rules Referring to Ontology Schema in Rete Framework. In: 1st Asian Workshop on Scalable Semantic Data Processing (AS2DP), China (2009)
15. Lee, S., Seo, D., Kim, P., Lee, M., Jung, H., Sung, W.-K.: Indexing Triple Dependencies for Inference Verification. In: International Conference on Convergence Content, Japan (2010)

Procedural Knowledge Extraction on MEDLINE Abstracts

Sa-kwang Song (1), Heung-seon Oh (2), Sung Hyon Myaeng (2), Sung-Pil Choi (1), Hong-Woo Chun (1), Yun-Soo Choi (1), and Chang-Hoo Jeong (1)

(1) Korea Institute of Science and Technology Information, Korea
(2) Korea Advanced Institute of Science and Technology, Korea
esmallj@kisti.re.kr, {ohs,myaeng}@kaist.ac.kr, {spchoi,hw.chun,armian,chjeong}@kisti.re.kr

Abstract. Text mining is a popular methodology for building Technology Intelligence, which helps companies or organizations make better decisions by providing knowledge about state-of-the-art technologies obtained from the Internet or from inside the company. In general, the knowledge that text miners aim to capture consists of objects or events (so-called declarative knowledge). In contrast, we propose how to extract procedural knowledge rather than declarative knowledge, using a machine learning method with deep language processing features, as well as how to model it. We show how procedural knowledge is represented in MEDLINE abstracts and report promising experiments: 82% performance on purpose/solution extraction (the two components of the procedural knowledge model) and 63% on unit process identification (the basic unit of purposes and solutions), even though we applied strict guidelines when evaluating performance.

Keywords: Text Mining, Information Extraction, Technology Intelligence, Procedural Knowledge Modeling, Procedural Knowledge Extraction

1 Introduction

Technology Intelligence is an activity that helps companies or organizations make better decisions by gathering and providing information about state-of-the-art technologies [1]. Recently, systems supporting Technology Intelligence have been actively developed to help researchers and practitioners make strategic technology plans [2]. Usually, these systems use text mining methodologies to analyze tacit information inside the company or on the Internet. However, they have focused on extracting declarative knowledge, which describes objects and events by specifying the properties that characterize them; declarative knowledge captures those properties but not the actions needed to obtain a result [3]. Therefore, we propose a methodology that makes it possible to
build procedural knowledge using a text mining technique based on deep language processing.

In general, procedural knowledge has been considered to be knowledge of how to do something, or knowledge of skills [4]. It is contrasted with propositional knowledge or declarative knowledge. Even though the two kinds of knowledge have been defined differently in different domains, Sahdra and Thagard [4] summarized them as shown in Table 1.

N. Zhong et al. (Eds.): AMT 2011, LNCS 6890, pp. 345–354, 2011. © Springer-Verlag Berlin Heidelberg 2011

346 S.-k. Song et al.

Table 1. Different terms used with respect to knowledge-how and knowledge-that

Field                    Knowledge-that                               Knowledge-how
Philosophy               Propositional knowledge                      Procedural knowledge, abilities
Psychology               Explicit knowledge, declarative knowledge    Implicit knowledge, tacit abilities, skills
Artificial Intelligence  Declarative knowledge                        Procedural knowledge

For an intuitive understanding of procedural knowledge, consider a sample snippet from a description of the gastrectomy surgical procedure:

"… In a total gastrectomy, clamps are placed on the end of the esophagus and the end of the small intestine. The stomach is removed and the esophagus is joined to the intestine …"

In this example, the goal of the snippet is to describe how to do gastrectomy surgery. It consists of several unit procedures: ① placing clamps, ② removing the stomach, and ③ joining the esophagus. These three unit procedures are sequential, so (① and ②) and (② and ③) each have a sequential relationship. The example can therefore be simplified into the graphical representation of Fig. 1, which has two main components: a Goal (Purpose) and a Solution to achieve the goal.

Fig. 1. Graphical representation of the example

If target documents could be structured as shown in Fig. 1, many applications could analyze such highly organized knowledge, so-called procedural knowledge. As an example, procedural knowledge in the biomedical domain enables doctors or researchers to find state-of-the-art
technologies and their detailed procedures conveniently, so they can improve the quality of medication services as well as enhance technologies. Moreover, it also benefits policy makers in governments or companies who build new plans in preparation for the upcoming highly diversified world.

We explain related work in Section 2 and describe how to model and extract procedural knowledge in MEDLINE abstracts in Section 3. Section 4 presents two major experiments: purpose/solution sentence classification and unit procedure identification. The results on extracting procedural knowledge with text mining methodologies follow, and we summarize and conclude in the final section.

2 Related Work

A great deal of research has been published on extracting information such as terminology, entities, and concepts using resources such as dictionaries, thesauri, or ontologies [5-7]. Research on extracting relations or events between them has also become popular [8,9]. However, those works have focused on knowledge-that instead of knowledge-how. Jung et al. [10] did extract procedural knowledge and build an ontology from web documents such as eHow and wikiHow, but their target documents are already structured (listed) in a bulleted sequential form. For example, wikiHow has an article titled "How to Celebrate National Egg Month". It contains sequential instructions written as imperative sentences. From such an article, they extracted sequential actions and built an ontology for further use. The sequential instructions were structured by the wiki authors. In addition, parsing the sentences is straightforward, since almost all of them are simple sentences rather than compound or complex sentences.

3 Methodology for Procedural Knowledge Extraction

3.1 Modeling

Based on the conceptual representation of procedural knowledge in Fig. 1, we defined it as a set of unit procedures which are
structured to solve a specific purpose or goal. That is, the units in a set of procedures share a common purpose that the procedures resolve. The target document can therefore be represented as a pair of a purpose and its solution, where the solution consists of a set of graphs of unit procedures. This is depicted in Fig. 2.

As depicted in Fig. 2, we defined procedural knowledge as a combination of a purpose and a corresponding solution. The solution consists of one or more unit procedures that have relationships with each other. A unit procedure is a triple of Target, Method, and Action. Target is defined as diseases, symptoms, objects, organs, and so on. Method is treatments, operations, medications, etc. Action is the predicate part that connects Target with Method, explaining how a Method is applied to treat a Target disease or symptom.

Footnotes: eHow URL: http://www.ehow.com; wikiHow URL: http://www.wikihow.com

This modeling was carried out together with medical doctors, who supported us because of their professional knowledge of the medical domain and because they are among those who stand to benefit most from this research.

Fig. 2. Graphical representation of procedural knowledge in a MEDLINE abstract: T, A, and M stand for Target, Action, and Method respectively

3.2 Extraction Procedures

Following the model constructed in subsection 3.1, we designed how the procedural knowledge could be extracted. Fig. 3 depicts the four major steps for building procedural knowledge.

• Step 1 preprocesses the target documents by extracting possible lexical, syntactic, or semantic features using various natural language processing techniques, such as POS tagging, syntactic parsing, predicate-argument structure tagging, and ontology-based terminology identification.
• In Step 2, the purpose and solution sentences are identified among all the sentences of a document. The features gathered in Step 1 are supplied to machine learning algorithms that actually classify the
sentences into one of the three categories (purpose/solution/other).
• Step 3 identifies the unit procedures in each purpose or solution sentence. A unit procedure consists of three basic entities: Target, Action, and Method. A unit procedure must contain at least two of these entities, except in a purpose sentence. A triple taken from a purpose sentence is regarded as a nominal procedure rather than an actual one, so it is used only to simplify the purpose sentence.
• In Step 4, the relationship between two unit processes is assigned. The relationship can be sequential, parallel, causal, etc.

Fig. 3. Flow diagram of procedural knowledge extraction

In this paper, we describe the first three steps of the flow diagram.

3.3 Target Documents

In this paper, the target documents for procedural knowledge are confined to the areas of Gastric Cancer and Spinal Disease, chosen with the help of the medical doctors working with us. Documents on these diseases are more likely to contain sentences with appropriate procedural knowledge, and the diseases are popular, familiar topics that people are interested in. Most review papers and case study papers, which are common in the biomedical domain, are not appropriate for procedural knowledge extraction because they do not include the experiments or methodologies that carry procedural information.

The following document snippet comes from a MEDLINE abstract. The authors divide it semantically into several blocks when they submit their papers. The blocks are generally labeled OBJECTIVE, BACKGROUND, METHODS, RESULTS, and CONCLUSIONS. Sometimes one or more blocks are omitted or merged.

OBJECTIVE: To examine the impact of malignancy and location of the cerebellar tumor on motor, cognitive, and psychologic outcome.
BACKGROUND: Although many ……
METHODS: Children, aged from […] to 13 years, with a cerebellar malignant tumor (MT; MT group, n=20) or a cerebellar benign tumor (BT; BT group, n=19) were examined at
least […] months after the end of treatment using the international cooperative ataxia rating scale, the Purdue pegboard for manual skill assessment and the age-adapted Wechsler scale.
RESULTS: Parents and teachers reported high rate of learning and academic difficulties, ……
CONCLUSIONS: Dentate nuclei lesions are major risk factors of motor and cognitive impairments in both cerebellar BT and MT.

The meaning of the blocks is as follows:

• OBJECTIVE: describes the purpose of the paper
• BACKGROUND: provides background information
• METHODS: contains the methods (or solutions) used to achieve the purpose
• RESULTS: shows the results of the applied methods
• CONCLUSIONS: summarizes and closes with the consequences

We focus on classifying the purpose and solution parts that include two or more of the three entities. This differs from other research on sentence classification [11,12], because we select only the sentences that possibly contain one or more entities (Target, Action, and Method). In general, the solution part consists of one or more methodological sentences. A detailed example of process identification follows.

Children with a cerebellar malignant tumor or a cerebellar benign tumor were examined at least […] months after the end of ① treatment using the ② international cooperative ataxia rating scale, the ③ Purdue pegboard for manual skill assessment and the ④ age-adapted Wechsler scale.

In this example, the target disease is 'cerebellar malignant tumor' or 'cerebellar benign tumor', and the three methods (②, ③, ④) are applied concurrently (or sequentially). The word 'after' serves as a temporal sequence indicator. Procedural knowledge can be extracted from these kinds of information.

3.4 Training Corpus

We developed a training corpus for extracting procedural knowledge with the help of two medical doctors. In total, 1309 documents were tagged with purpose/solution labels, each of which contains one or more unit
processes (triple: Target, Action, and Method). In addition, the relationship between two unit processes is also marked. After tagging, the two doctors cross-validated the tagged corpus. The statistics of the training corpus with respect to domain and disease are as follows.

Table. Statistics of training corpus with respect to domain and disease

Domain           Disease                      # of documents
Spinal Disease   Neural Tube Defects          242
                 Neurilemmoma                 128
                 Spinal Dysraphism            159
                 Kyphosis                     236
                 Spondylolisthesis            98
                 Lordosis                     86
Gastric Cancer   Stomach Neoplasms Therapy    180
                 Neoplasms Endoscopy          180
Total                                         1309

4 Experiments

The experiments are divided into two parts: purpose/solution classification and unit process identification. The former classifies sentences into one of three classes (purpose, solution, and other), while the latter identifies the triple Target/Action/Method in the purpose/solution sentences. To prepare for the two experiments, several text mining techniques are applied to the target documents, as explained in the next subsection.

4.1 Preprocessing

The target documents are preprocessed with Part-of-Speech (POS) tagging, syntactic parsing, predicate-argument structure tagging, and ontology mapping. POS tagging is performed with the Enju parser. Predicate-argument structure [13], a representation of the meaningful relationships among the words in a sentence according to the relation between a predicate and its arguments, is then applied. Finally, ontology mapping for terminology identification is added: the terms corresponding to ontology items in UMLS, UniProt, or GO (Gene Ontology) are marked. This deeply processed information is used for training and testing the machine learning algorithms explained in subsection 4.2.

4.2 Purpose/Solution Sentence Classification

Extracting purpose/solution sentences from an abstract can be regarded as a classification problem that selects one of three categories, such as
purpose, solution, and other. For this task, we used two machine learning approaches, Support Vector Machines (SVMs) and Conditional Random Fields (CRFs). We applied CRFs, which are frequently used in sequence labeling problems, in addition to SVMs because the semantic blocks in an abstract appear in sequential order. The features for this experiment consist of four kinds of items: content features, position features, neighbor features, and ontological features.

• Content features: unigrams and bigrams in the target sentence. Stemming [14] and stopword elimination are applied.
• Position features: the sentence number of the target sentence in the abstract. The purpose sentence tends to be among the first few sentences, while the solution sentences tend to appear in the later part of the abstract.
• Neighbor features: content features of the previous and next k sentences of the target sentence.
• Ontological features: ontology terms in the UMLS, UniProt, and GO.

Enju Parser, http://www-tsujii.is.s.u-tokyo.ac.jp/enju/
UMLS, http://www.nlm.nih.gov/research/umls/
UniProt, http://www.uniprot.org/
Gene Ontology, http://www.geneontology.org/
LIBSVM v3.0, http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Mallet 2.0 for CRFs, http://mallet.cs.umass.edu/

4.3 Unit Process Identification

This experiment extracts the triple of Target, Action, and Method (abbreviated as TAM) as a unit process, using the CRFs algorithm with the following kinds of features.

• Word features: word, word lemma, POS tag, whether the first character is a capital, and whether all characters are capitals.
• Context features: words and POS tags of the previous and next k words of the target word.
• Predicate-argument structure: predicate type and its argument words and POS tags.
• Ontological features: ontology terms in the UMLS, UniProt, and GO.

The task is to find the boundary of the word or phrase that is recognized as a Target, Action, or Method. Therefore, we used the most widespread representation, the so-called IOB tags, for chunking each entity.
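The IOB encoding mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `spans_to_iob`, the (start, end, type) span format, the whitespace tokenization, and the choice of ‘examined’ as the Action are all assumptions made for the example. The sentence fragment is shortened from the example in subsection 3.3.

```python
from typing import List, Tuple

# Hypothetical span format: (start index, end index exclusive, entity type).
Span = Tuple[int, int, str]


def spans_to_iob(tokens: List[str], spans: List[Span]) -> List[str]:
    """Encode entity spans as one IOB tag per token.

    The first token of an entity gets "B-<type>", the remaining tokens of
    that entity get "I-<type>", and every token outside an entity gets "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Shortened example sentence; the annotation below is illustrative only.
tokens = ("Children with a cerebellar malignant tumor "
          "were examined using the Purdue pegboard").split()
spans = [
    (3, 6, "Target"),    # cerebellar malignant tumor
    (7, 8, "Action"),    # examined (assumed Action for this sketch)
    (10, 12, "Method"),  # Purdue pegboard
]
tags = spans_to_iob(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

The sketch covers only the label encoding. In a CRF setting, each token would additionally carry the word, context, predicate-argument, and ontological features listed above, and the tagger would predict one such tag per token.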
The B and I tags are suffixed with the entity type, e.g., B-Target, I-Target, B-Action, I-Action, B-Method, and I-Method. It is not necessary to specify a chunk type for tokens that appear outside an entity, so these are simply labeled O. An example of this scheme is shown in the following figure.

Fig. Tagging TAM entities of a unit process

5 Results

5.1 Results on Purpose/Solution Sentence Classification

The training and test sets are divided in the ratio 8:2 (leave-two-out method), and the CRFs and SVMs methods are applied to train purpose/solution sentence classification models. With CRFs, the F-1 score of purpose sentence classification reached 85%, while that of solution sentence classification was relatively low (69%). The performance on solution sentences is rather poor because quite a few sentences do not have at least two of the TAM entities, even though the sequence of sentences in the abstract helps in assigning categories to the sentences. The result using SVMs, however, is quite promising: the F-1 scores of the two tasks are 87% and 80%, respectively, as shown in the table below. Recall that the model for this experiment is not only for sentence classification but also for checking whether the sentence contains TAM or not. Indeed, some sentences in the METHODS block could not be assigned to the solution category because they have at most one component of TAM. The task is therefore rather different from general sentence classification [11,12], which achieves F-1 scores over 0.90.

Table. Purpose/Solution Sentence Classification Results

            Precision          Recall             F-1
            CRFs     SVMs      CRFs     SVMs      CRFs     SVMs
Purpose     0.8326   0.8462    0.8578   0.9009    0.8450   0.8727
Solution    0.6923   0.8333    0.6913   0.7610    0.6918   0.7955
Total       0.7279   0.8369    0.7326   0.7957    0.7303   0.8158

Both machine learning methods show that performance on purpose sentences is better than on solution sentences, because purpose sentences are written more consistently.
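The F-1 values in the table above agree, to about three decimal places, with the usual definition of F-1 as the harmonic mean of precision and recall, F-1 = 2PR / (P + R). The following sketch recomputes them from the reported precision and recall. The function name `f1_score` and the dictionary layout are illustrative choices, and the paper does not state how the Total row was aggregated, so treating it like any other row is an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-1) copied from the classification table.
REPORTED = {
    ("Purpose", "CRFs"): (0.8326, 0.8578, 0.8450),
    ("Purpose", "SVMs"): (0.8462, 0.9009, 0.8727),
    ("Solution", "CRFs"): (0.6923, 0.6913, 0.6918),
    ("Solution", "SVMs"): (0.8333, 0.7610, 0.7955),
    ("Total", "CRFs"): (0.7279, 0.7326, 0.7303),
    ("Total", "SVMs"): (0.8369, 0.7957, 0.8158),
}

for (category, method), (p, r, reported) in REPORTED.items():
    recomputed = f1_score(p, r)
    # Print the recomputed value next to the reported one for comparison.
    print(f"{category:8s} {method}: recomputed {recomputed:.4f}, reported {reported:.4f}")
```

Small differences in the last digit are expected, because the published precision and recall values are themselves rounded.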
Usually, ‘to ~’, ‘the aim of this study ~’, and ‘the goal is ~’ are the sentence patterns frequently observed in purpose sentences, while it is hard to find a common pattern in solution sentences.

5.2 Results on Unit Process Identification

For this experiment, the training and test sets are also divided in the ratio 8:2 (leave-two-out method), and only the CRFs method is applied to train the TAM identification model. As mentioned previously, the performance below does not count partial matches in multi-word entities, since according to medical experts most medical terms are semantically very sensitive. For example, a substring such as ‘cooperative ataxia rating scale’, ‘ataxia rating scale’, or ‘rating scale’ is not regarded as correct for the Method term ‘international cooperative ataxia rating scale’ shown in subsection 3.3. The result for the Action entity is high compared to the other two because an Action entity contains at most 2-3 words and its main word is a verb or verb equivalent. In contrast, Target and Method entities are longer and contain relatively more adverbs and adjectives as well as compound nouns.

Table. TAM Identification using CRFs

          Precision   Recall   F-1
Target    0.5212      0.5696   0.5443
Action    0.7878      0.7753   0.7815
Method    0.6014      0.5078   0.5507
Total     0.6401      0.6102   0.6248

6 Conclusion

We proposed a procedural knowledge modeling and extraction method for Technology Intelligence based on machine learning approaches with deep language processing analysis. The experiments showed that the proposed approach is quite promising: it achieves total F-1 scores of about 82% and 62% (0.8158 and 0.6248) in purpose/solution sentence classification and unit process identification, respectively, even though we applied strict guidelines in evaluating the performance. In addition, we built a valuable handcrafted training corpus with two medical doctors, which has 1309 MEDLINE abstracts categorized into diseases from both gastric cancer
and spinal disease. For future work, we plan to identify the relationships between unit processes, as shown in the flow diagram in Fig. 3, using machine learning approaches such as SVMs and CRFs.

References

[1] Mortara, L., Kerr, I.V.C., Phaal, R., Probert, D.: Technology Intelligence Practice in UK Technology-based Companies. International Journal of Technology Management 48, 115–135 (2009)
[2] Yoon, B.: On the development of a technology intelligence tool for identifying technology opportunity. Expert Systems with Applications 35, 124–135 (2008)
[3] Turban, E., Aronson, E.: Decision Support Systems and Intelligent Systems. Prentice Hall, Inc., Upper Saddle River (1988)
[4] Sahdra, B., Thagard, P.: Procedural knowledge in molecular biology. Philosophical Psychology 16, 477–498 (2003)
[5] Kazuhiro, Y., Junichi, T.: Reranking for Biomedical Named-Entity Recognition. Biomedical Natural Language Processing, BioNLP (2007)
[6] Yoshimasa, T., Junichi, T., Sophia, A.: FACTA: a text search engine for finding associated biomedical concepts. Bioinformatics 24, 2559–2560 (2008)
[7] Sophia, A., Carol, F., Junichi, T.: Introduction: named entity recognition in biomedicine. Journal of Biomedical Informatics 37, 393–395 (2004)
[8] Hong-Woo, C., Yoshimasa, T., Jin-Dong, K., Rie, S., Naoki, N., Teruyoshi, H., Junichi, T.: Automatic recognition of topic-classified relations between prostate cancer and genes using MEDLINE abstracts. BMC Bioinformatics (2006)
[9] Toshihide, O., Haretsugu, H., Akira, T., Toshihisa, T.: Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics 12, 155–161 (2001)
[10] Jung, Y., Ryu, J., Kim, K.-m., Myaeng, S.-H.: Automatic construction of a large-scale situation ontology by mining how-to instructions from the web. Web Semantics: Science, Services and Agents on the World Wide Web 8, 110–124 (2010)
[11] Hirohata, K., Okazaki, N., Ananiadou, S., Ishizuka, M.: Identifying sections in scientific
abstracts using conditional random fields. In: Proc. of 3rd International Joint Conference on Natural Language Processing, pp. 381–388 (2008)
[12] Ruch, P., Boyer, C., Chichester, C., Tbahriti, I., Geissbühler, A., Fabry, P., Gobeill, J., Pillet, V., Rebholz-Schuhmann, D., Lovis, C., Veuthey, A.-L.: Using argumentation to extract key sentences from biomedical abstracts. International Journal of Medical Informatics 76, 195–200 (2007)
[13] Yakushiji, A., Miyao, Y., Ohta, T., Tateisi, Y., Tsujii, J.: Automatic construction of predicate-argument structure patterns for biomedical information extraction. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP 2006, p. 284 (2006)
[14] Porter, M.F.: An algorithm for suffix stripping. Program 14, 130–137 (1980)

Author Index

Alizadeh, Hosein
Amghar, Youssef
Bahig, Hatem M.
Bin, Wang
Bose, S.
Chen, Chen
Chen, He-Wen
Chen, Lin
Choi, Sung-Pil
Choi, Yun-Soo
Choochaiwattana, Worasit
Chuang, Cheng-Tao
Chun, Hong-Woo
Dahak, Fouad
Dai, Bin
Daoud, Sameh S.
Dasler, Philip
Drogoul, Alexis
Ghorbani, Ali
Goh, Dion Hoe-Lian
Guan, Zengda
Guo, Jinkai
Guo, Wenqiang
Ha, Quang-Thuy
Himesh, P.H.
Hsu, D. Frank
Hu, Cuiyun
Huang, Ming-Jui
Huang, Yin-Fu
Huynh, Hiep Xuan
Ito, Takehito
Jeong, Chang-Hoo
Jeong, Do-Heon
Jia, Ke-bin
Jiang, Bin
Jomsri, Pijitra
Jung, Hanmin
Kannan, A.
Kichou, Saida
Kim, Pyung
Kim, Tae Hong
Le, Minh Ngoc
Lee, Chei Sian
Lee, Jinhee
Lee, Mikyoung
Lee, Seungwoo
Li, Ang
Li, Lian
Li, Mi
Li, Xin
Li, Xining
Li, Yi-Lin
Li, Yuefeng
Lin, Jiazao
Liou, Cheng-Yuan
Liu, Li
Lu, Shengfu
Luu, Cong-To
Ma, Long
Ma, Yunfei
Mao, Xinjun
Matsuda, Tetsuya
Mellah, Hakima
Meng, Qianli
Minaei, Behrouz
Myaeng, Sung Hyon
Nasr, Dalia B.
Ning, Yue
Nishida, Toyoaki
Niu, Jianwei
Oh, Heung-seon
Parvin, Hamid
Pei, Yu-Xi
Pham, Huyen-Trang
Qian, Wenli
Qin, Linchan
Qin, Yulin
Qu, Guangzhi
Ren, Xu
Sanguansintukul, Siripun
Satoh, Ichiro
Schweikert, Christina
Seo, Dongmin
Shimojo, Shinsuke
Shirahama, Kimiaki
Song, Sa-Kwang
Song, Yangyang
Sun, Yuekun
Sun, Yuqing
Sung, Won-Kyung
Tang, Shan
Tošić, Predrag T.
Truong, Viet Xuan
Uehara, Kuniaki
Vijayakumar, P.
Vu, Tien-Thanh
Wang, Shaobo
Wang, Shu-Juan
Wang, Zhongtuo
Wendt, Jake
Wenxin, Li
Xin, Jiu-Ling
Yang, Kai-Hsiang
Yang, Yi
Zeng, Yi
Zhang, Qi
Zhang, Wei-Chen
Zhao, Lulin
Zhong, Ning
Zhou, Erzhong
Zhou, Huiping
Zhou, Ke
Zhu, Ting-Shao
Zhu, Zhuo-Hong

Date posted: 16/12/2017, 08:02