AUTOMATIC TEXT CLASSIFICATION USING A MULTI-AGENT FRAMEWORK

Yueyu Fu

Submitted to the faculty of the University Graduate School in partial fulfillment of the requirements for the degree Doctor of Philosophy in the School of Library and Information Science, Indiana University

October 2006

UMI Number: 3238501
Copyright 2006 by Fu, Yueyu. All rights reserved.

UMI Microform 3238501. Copyright 2007 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346

Accepted by the Graduate Faculty, Indiana University, in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Doctoral Committee
Javed Mostafa, Ph.D.
Charles Davis, Ph.D.
Kiduk Yang, Ph.D.
David Leake (Computer Science, minor), Ph.D.

Date of Oral Examination: August 2nd, 2006

© 2006 Yueyu Fu
ALL RIGHTS RESERVED

DEDICATION

To my beloved parents Guanghui Fu and Lan Chen, my dear wife Wenjie Sun, and my grandparents, for their unconditional love and encouragement

ACKNOWLEDGMENTS

I am deeply grateful to the many people who generously provided the guidance, support, and encouragement I needed to complete this dissertation. First and foremost, I would like to thank Dr. Javed Mostafa, my committee chair, for professional and personal guidance that went far beyond his responsibilities. His patient guidance, sharp mind, and gentle encouragement led me to what I have achieved today. Special thanks also go to the rest of my committee, Dr. Charles Davis, Dr. Kiduk Yang, and Dr. David Leake, for their insightful comments and enduring support throughout my dissertation research.
I would like to thank my colleagues and the staff at Indiana University, especially Weimao Ke, Kazuhiro Seki, Mary Kennedy, Arlene Merkel, Erica Bodnar, and Rhonda Spencer, for their kind help and support throughout all these memorable years in Bloomington. Finally, I must express my deepest gratitude to my parents, Guanghui Fu and Lan Chen, for opening my eyes to the world and encouraging me to pursue my career abroad, and to my beloved wife, Wenjie Sun, for making our family full of joy, support, and understanding.

ABSTRACT

Automatic text classification is an important operational problem in information systems. Most automatic text classification efforts have so far concentrated on developing centralized solutions. However, centralized classification approaches are often limited by constraints on knowledge and computing resources. To overcome these limitations, an alternative distributed approach based on a multi-agent framework is proposed. Three major challenges associated with distributed text classification are examined: 1) coordinating classification activities in a distributed environment, 2) achieving high-quality classification, and 3) minimizing communication overhead. This study presents solutions to these specific challenges and describes a prototype system implementation. Because agent coordination is the key component of multi-agent text classification, two agent coordination protocols are proposed: the blackboard-bidding protocol and the adaptive-blackboard protocol. To analyze the performance of the distributed approach, a comparative evaluation methodology is described that treats the outcome of a centralized approach as the baseline. A series of experiments was conducted in a simulation environment.
The simulation environment permitted manipulation of independent variables, such as scalability and coordination strategy, and investigation of their impact on two critical dependent variables, efficiency and effectiveness. There were three critical findings. First, for automatic text classification, the multi-agent approach can improve system efficiency while maintaining classification effectiveness comparable to that of a centralized approach. Second, the agent protocols were effective in coordinating the text classification activities of distributed agents. Third, applying content-based adaptive learning to acquire knowledge about the agent community reduced communication cost and improved system efficiency.

TABLE OF CONTENTS

1 INTRODUCTION
1.1 MANUAL CLASSIFICATION
1.2 AUTOMATIC CLASSIFICATION
1.3 MULTI-AGENT PARADIGM
2 PROBLEM STATEMENT
2.1 SPECIFIC CHALLENGES
2.2 VARIABLES
2.3 IMPLICATIONS OF THIS RESEARCH
3 LITERATURE REVIEW
3.1 AUTOMATIC TEXT CLASSIFICATION
3.1.1 Text classification task
3.1.2 Text classification methods
3.1.3 Evaluation metrics for text classification
3.1.4 Test Collections
3.1.5 Centralized Text Classification Procedure
3.2 TEXT CLASSIFICATION USING A MULTI-AGENT FRAMEWORK
3.2.1 Multi-agent paradigm
3.2.2 Differences between multi-agent systems and other concurrent systems
3.2.3 Connections between multi-agent paradigm and peer-to-peer paradigm
3.2.4 Recent applications of the multi-agent paradigm
3.2.5 Centralized vs. Multi-agent text classification
3.3 MULTI-AGENT COORDINATION PROTOCOLS
3.3.1 Definition of coordination
3.3.2 Coordination Protocols
3.3.2.1 Organizational Structuring
3.3.2.2 Multi-agent planning
3.3.2.3 Contract net protocol
3.3.2.4 Negotiation
4 METHODOLOGY
4.1 DATA
4.2 DESIGN METHODOLOGY
4.2.1 Multi-Agent Community for Text Classification
4.2.2 Classification Module
4.2.3 Algorithms of Agent Coordination Protocols
4.2.4 Proposed Agent Coordination Protocols
4.2.4.1 Blackboard-bidding Protocol
4.2.4.2 Adaptive-blackboard Protocol
4.3 IMPLEMENTATION
4.3.1 System Architecture
4.3.2 Alternative approach
4.4 EVALUATION METHODOLOGY
4.4.1 Measurements
4.4.1.1 Effectiveness Measurements
4.4.1.2 Efficiency Measurements
4.4.2 Variables
4.4.2.1 Centralized vs. Distributed
4.4.2.2 Coordination Protocols
4.4.2.3 Number of Agents
4.4.3 Experimental Settings
5 RESULTS
5.1 CENTRALIZED VS. DISTRIBUTED
5.2 COORDINATION PROTOCOLS
5.2.1 Effectiveness
5.2.2 Efficiency Measured by Messages
5.2.3 Efficiency Measured by Time
5.3 NUMBER OF AGENTS
5.3.1 Impact of the number of agents on effectiveness
5.3.2 Impact of the number of agents on efficiency
6 CONCLUSIONS
6.1 SUMMARY
6.2 FUTURE RESEARCH
REFERENCES

1 Introduction

Automatic text classification is an important operational problem in information systems. Many tasks in information systems, such as retrieval, filtering, and indexing, can be considered classification problems. Most text classification efforts have so far concentrated on developing centralized solutions, where data and computation are located on a single computer. However, centralized classification approaches are often limited by constraints on knowledge and computing resources. In addition, centralized approaches are more vulnerable to attacks and system failures, and less robust in dealing with them.
This research presents an alternative classification approach, distributed text classification using a multi-agent framework, in which data and computation are distributed across a network of computers.

1.1 Manual Classification

In library and information science, class/classification and category/categorization are sometimes considered distinct terms (Jacob, 2004). Although both are used to organize related entities, the two terms differ fundamentally. Classification groups entities into mutually exclusive classes based on a set of predefined rules regardless of context, whereas categorization associates entities solely on the basis of their similarities within a given context (Jacob, 2004). This distinction makes categorization more flexible than classification in organizing similar entities. However, for the benefit of a broader audience, this study uses class/classification and category/categorization interchangeably.

[...]

... automatic text classification as an alternative approach. Using machine learning techniques, automatic text classification assigns documents to a set of pre-defined categories. This approach has been applied in many areas, such as patent classification, news delivery, and email spam filtering. In contrast to manual classification, automatic classification offers the advantages of automation, efficiency, and ...

... investigate automatic text classification using a multi-agent framework. Automatic text classification and the multi-agent paradigm have each been extensively studied over the years. Although problems within each area have been investigated, new problems that arise with the introduction of the multi-agent paradigm into automatic text classification remain mostly unexplored. In this section, three major ...
... investigates an alternative classification approach, namely distributed text classification conducted using a multi-agent framework. Three major challenges associated with distributed text classification are examined: 1) coordinating classification activities in a distributed environment, 2) achieving high-quality classification, and 3) minimizing communication overhead. This chapter reviews literature ...

... text classification assigns textual documents to classes using the rules or patterns learned from a set of pre-classified documents. Sebastiani (2002) defines automatic text classification as a process of assigning natural language documents to predefined semantic classes. Generally, the text classification task can be defined as follows: Given a set of pre-classified documents, learn the classification ...

... quality of classification, system efficiency, and agent granularity. One of the main goals is to achieve satisfactory classification performance in a multi-agent environment. Therefore, quality of classification must be taken into consideration throughout the study. The quality of classification refers to the accuracy of a completed classification task. In contrast to centralized text classification, quality ...

... distributed classification and may contribute to the establishment of an evaluation framework for distributed classification. Ultimately, my goal is to make a scholarly contribution to the areas of text classification and multi-agent systems and to produce findings that will be of interest to both practitioners and researchers. At the practical level, the proposed approach can facilitate the sharing of activities among ...
... systems. The advantages of centralized classification stem from the centralized architecture. Because data and computing resources are located in the same place, the classification task is easy to manage and classification is fast. Since communication in centralized classification takes place within the same machine, the communication cost is relatively low. However, as information becomes ...

... centralized classification, each classification agent itself is still a relatively independent classification unit. This section reviews classification tasks, classification methods, test collections, and the measurements for quality of classification, which should contribute to a better understanding of the challenges associated with distributed text classification.

3.1.1 Text classification task

Automatic text ...

... such as the Dewey Decimal Classification, the Universal Decimal Classification, and the Library of Congress Classification. A recent application of this approach on the web is the Yahoo Directory, which organizes web pages into a hierarchical structure. The main challenge of manual classification is its demand on resources. Manual classification is a time-consuming process that relies heavily on domain ...

... task in a distributed computing environment. Distributed classification has several advantages over centralized classification. The distributed architecture offers computational scalability for classification. Mukhopadhyay et al. (2005) demonstrate that classification time decreases dramatically as the number of collaborating classification software systems increases. Also, not completely relying on a single ...

... small document collections.

1.2 Automatic Classification

To address the problems of manual classification, researchers have explored automatic text classification as an alternative approach ...

... validate the approach of automatic text classification using a multi-agent framework.
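The task definition quoted in the excerpts above (learn from a set of pre-classified documents, then assign new documents to predefined classes) can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the classification module used in this study: it assumes a bag-of-words representation and a nearest-centroid rule with cosine similarity, and the function names (`tokenize`, `train`, `cosine`, `classify`) and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def train(labeled_docs):
    """Build one term-count centroid per predefined class from pre-classified documents."""
    centroids = defaultdict(Counter)
    for text, label in labeled_docs:
        centroids[label].update(tokenize(text))
    return dict(centroids)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(centroids, text):
    """Assign the document to the class whose centroid is most similar to it."""
    vec = Counter(tokenize(text))
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical pre-classified training set with two predefined classes.
training = [
    ("stocks fell as markets closed lower", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the final match", "sports"),
    ("the striker scored two goals", "sports"),
]
model = train(training)
print(classify(model, "interest rates and markets"))  # prints: finance
```

In a centralized setting, one process would hold every class centroid, as `model` does here. A distributed variant would spread the centroids across agents, which is where coordination and communication cost become the central concerns.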
2 Problem Statement

The primary purpose of this study is to investigate automatic text classification using ...

... that are beyond their individual capabilities.”

For text classification, a multi-agent paradigm offers several critical advantages. According to Sycara (1998), the multi-agent paradigm