A WORD-BASED SOFT CLUSTERING ALGORITHM FOR DOCUMENTS

King-Ip Lin, Ravikumar Kondadadi
Department of Mathematical Sciences, The University of Memphis, Memphis, TN 38152, USA
linki,kondadir@msci.memphis.edu

Abstract: Document clustering is an important tool for applications such as Web search engines. It enables the user to get a good overall view of the information contained in the documents. However, existing algorithms suffer from various drawbacks: hard clustering algorithms (where each document belongs to exactly one cluster) cannot detect the multiple themes of a document, while soft clustering algorithms (where each document can belong to multiple clusters) are usually inefficient. We propose WBSC (Word-Based Soft Clustering), an efficient soft clustering algorithm based on a given similarity measure. WBSC uses a hierarchical approach to cluster documents having similar words. WBSC is very effective and efficient when compared with existing hard clustering algorithms such as K-means and its variants.

Keywords: document clustering, word-based clustering, soft clustering

1 Introduction

Document clustering is a very useful tool in today's world, where large numbers of documents are stored and retrieved electronically. It enables one to discover hidden similarities and key concepts. Moreover, it enables one to summarize a large number of documents using the key or common attributes of the clusters. Thus clustering can be used to categorize document databases and digital libraries, as well as to provide useful summary information about the categories for browsing purposes. For instance, a typical search on the World Wide Web can return thousands of documents. Automatically clustering the documents gives the user a clear and easy grasp of what kinds of documents were retrieved, which is a tremendous help in locating the right information.

Many clustering techniques have been developed, and they can be applied to clustering documents; [3, 6] contain examples of using such techniques. Most
of these traditional approaches use documents as the basis of clustering. In this paper, we take an alternative approach: word-based soft clustering (WBSC). Instead of comparing documents and clustering them directly, we cluster the words used in the documents. For each word, we form a cluster containing the documents that contain that word. After that, we combine clusters that contain similar sets of documents, using an agglomerative approach [5]. The resulting clusters contain documents that share a similar set of words.

We believe this approach has many advantages. Firstly, it allows a document to reside in multiple clusters, which makes it possible to capture documents that contain multiple topics. Secondly, it lends itself to a natural representation of the clusters, namely the words associated with the clusters themselves. Also, the running time is linear in the number of documents. Even though the running time grows quadratically with the number of clusters, this number is bounded by the number of words, which has a rather fixed upper bound (the number of words in the language). This allows the algorithm to scale well.

The rest of the paper is organized as follows: Section 2 summarizes related work in the field. Section 3 describes our algorithm in more detail. Section 4 provides some preliminary experimental results. Section 5 outlines future directions of this work.

2 Related work

Clustering algorithms have been developed and used in many fields; [4] provides an extensive survey of various clustering techniques. In this section, we highlight work done on document clustering.

Many clustering techniques have been applied to clustering documents. For instance, Willett [9] provided a survey on applying hierarchical clustering algorithms to document clustering. Cutting et al. [1] adapted various partition-based clustering algorithms to clustering documents. Two of their techniques are Buckshot and Fractionation. Buckshot selects a small sample of documents, pre-clusters them using a
standard clustering algorithm, and assigns the rest of the documents to the clusters formed. Fractionation splits the N documents into m buckets, where each bucket contains N/m documents. Fractionation takes an input parameter ρ, which indicates the reduction factor for each bucket: the standard clustering algorithm is applied so that the n documents in each bucket are clustered into n/ρ clusters. Each of these clusters is then treated as if it were an individual document, and the whole process is repeated until only K clusters remain.

A relatively new technique was proposed by Zamir and Etzioni [10]. They introduce the notion of phrase-based document clustering: they use a generalized suffix tree to obtain information about the phrases in the documents and use the phrases to cluster the documents.

3 WBSC

As the name suggests, WBSC uses a word-based approach to build clusters. It first forms initial clusters of the documents, with each cluster representing a single word. For instance, WBSC forms a cluster for the word 'tiger' made up of all the documents that contain the word 'tiger'. After that, WBSC merges similar clusters (clusters are similar if they contain similar sets of documents) using a hierarchical approach, until some stopping criterion is reached. At the end, the clusters are displayed based on the words associated with them.

Cluster Initialization: We represent each cluster as a vector of documents. For each word, we create a bit vector (called a term vector), with each dimension indicating whether a certain document is present or absent (denoted by 1 and 0, respectively). This step requires only one pass through the set of documents. However, we maintain an index (e.g. a hash table) on the list of words in each document to speed up the building process. Notice that the number of clusters is independent of the number of documents; it grows instead with the total number of words in all the documents. Thus, we do not consider stop-words: words that either carry no meaning (like prepositions) or are very common.
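As a concrete illustration, the initialization step can be sketched as follows. This is a minimal sketch in Python; the names and data structures are ours, not from the paper. A hash table from words to document-id sets plays the role of the term (bit) vectors, the stop-word list is illustrative only, and the sketch also includes the pruning of rare and overly common words discussed next.

```python
# One initial cluster per word: the set of ids of documents containing that word.
# A dict of sets stands in for the term (bit) vectors described in the text.

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is"}  # illustrative only

def initial_clusters(documents):
    """documents: list of strings. Returns {word: set of document ids}."""
    clusters = {}
    for doc_id, text in enumerate(documents):
        for word in set(text.lower().split()):   # a single pass over the documents
            if word in STOP_WORDS:
                continue
            clusters.setdefault(word, set()).add(doc_id)
    n = len(documents)
    # Prune useless clusters: words that appear in only one document,
    # or in more than half of all documents.
    return {w: docs for w, docs in clusters.items()
            if 1 < len(docs) <= n // 2}
```

Using a hash table keeps initialization linear in the total number of word occurrences, matching the single-pass construction described above.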
The second criterion can be extended: words that appear in too many documents are not going to help clustering. Similarly, if a word appears in only one document, it is unlikely to merge with other clusters and remains useless for clustering. Thus, we remove all initial clusters that contain only one document, or that contain more than half of the total number of documents.

Cluster Building: Once the initial clusters are formed, we apply a hierarchical algorithm to combine them. Here we need a measure of similarity between any pair of clusters. Assume we have two clusters A and B, with n1 and n2 documents respectively (without loss of generality, n1 ≤ n2), and with c documents in common. Intuitively, we would like to combine the two clusters if they have many common documents relative to the sizes of the clusters. We first tried c / min(n1, n2) as our similarity measure; however, this measure does not perform well. After trying different measures, we settled on the Tanimoto measure [2]: c / (n1 + n2 - c). WBSC iterates over the clusters, merging similar clusters using this similarity measure. We consider every pair of clusters, merging them if their similarity value is higher than a pre-defined threshold. Also, there are cases where one cluster is simply a subset of the other; we merge the two clusters in that case. Currently we use 0.33 as our threshold; we are experimenting with different ways of finding a better threshold.

Displaying results: In the algorithm, we keep track of the words used to represent each cluster. When two clusters merge, the new cluster acquires the words from both clusters. Thus, at the end of the algorithm, each cluster has a set of representative words, which can be used to provide a description of the cluster.
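The cluster-building step described above can be sketched as follows (again a minimal sketch with names of our choosing). When two clusters merge, both their document sets and their representative word sets are united, and merging repeats until no pair exceeds the threshold or is in a subset relation:

```python
def tanimoto(a, b):
    """Tanimoto similarity of two document sets: c / (n1 + n2 - c)."""
    c = len(a & b)
    return c / (len(a) + len(b) - c)

def merge_clusters(clusters, threshold=0.33):
    """clusters: {word: set of doc ids}. Returns a list of (word set, doc set)
    pairs, merging any pair whose Tanimoto similarity exceeds the threshold
    or where one document set is a subset of the other."""
    items = [({w}, set(d)) for w, d in clusters.items()]
    merged = True
    while merged:
        merged = False
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                wi, di = items[i]
                wj, dj = items[j]
                if tanimoto(di, dj) > threshold or di <= dj or dj <= di:
                    items[i] = (wi | wj, di | dj)   # unite words and documents
                    del items[j]
                    merged = True
                    break
            if merged:
                break
    return items
```

The accumulated word set of each final cluster is exactly the representative description mentioned above.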
Also, we found that many clusters have very few words associated with them, as they have either never been merged or have been merged only two or three times. Our results show that discarding such clusters does not affect the results significantly; in fact, it allows the results to be interpreted more clearly.

4 Experiments and Results

This section describes the results of the various experiments we carried out. To evaluate the performance of WBSC, we compared it with other algorithms: K-Means, Fractionation and Buckshot [1].

Experimental Setup: Our test bed consists of two separate datasets:
• Web: 2000 documents downloaded from the Web, from different categories such as Food, Cricket, Astronomy, Clustering, Genetic Algorithms, Baseball, Movies, XML, Salsa (Dance) and Olympics. We used search engines to help us locate the appropriate documents.
• Newsgroups: 2000 documents from the UCI KDD archive (http://kdd.ics.uci.edu/) [8], retrieved from 20 different newsgroups.

All experiments were carried out on a 733 MHz PC with 260 MB of RAM. We ran the algorithm on document sets of different sizes to obtain clusters, and tested the effectiveness of the clustering (the types of clusters formed) both qualitatively and quantitatively. We also compared the execution times of all the algorithms on document sets of different sizes.

Effectiveness of Clustering: We performed many experiments with different numbers of documents taken from the above-mentioned test bed. All the algorithms were run to produce the same number of clusters with the same input parameters. WBSC formed clusters for each of the different categories in the document sets, while the other algorithms (K-Means, Fractionation and Buckshot) did not. In addition, the other algorithms formed clusters containing documents from many categories that are not related to each other. Figure 1 shows sample output from one of the experiments, with 500 documents picked from 10 of the categories of the newsgroups data set. The categories include atheism, graphics,
x-windows, hardware, space, electronics, Christianity, guns, baseball and hockey.

Cluster results: Keywords | Category
BTW, remind, NHL, hockey, Canada, Canucks, hockey, teams, players | Hockey
Human, Christian, bible, background, research, errors, April, President, members, member, private, community | Christian
Cult, MAGIC, baseball, Bob, rules, poor, teams, players, season | Baseball
Analog, digital, straight, board, chip, processor | Hardware
Planes, pilots, ride, signals, offered | Electronics
Distribution, experience, anti, message, Windows, error, week, X Server, Microsoft, problems | Windows-x
America, carry, isn, weapons, science, armed | Guns
Military, fold, NASA, access, backup, systems, station, shuttle | Space
Notion, soul, several, hell, Personally, created, draw, exists, god | Atheism
Drivers, SVGA, conversion, VGA, compression, cpu, tools | Graphics
Keith, Jon, attempt, questions | Atheism
Limit, Police, Illegal, law | Guns
Ohm, impedance, circuit, input | Electronics
Season, Lemieux, Bure, runs, wings | Hockey

Figure 1: Clusters formed by WBSC on a sample from the UCI data archive.

WBSC formed 14 clusters. It formed more than one cluster for some of the categories, like Hockey and Guns. The document set on hockey has different documents related to the Detroit Red Wings and the Vancouver Canucks; our algorithm formed two different clusters for these two sets. We ran the other algorithms with 10 as the required number of clusters to be formed. Many of the resulting clusters have documents that are not at all related to each other (they have documents from different categories).

To show the effectiveness of our algorithm, we ran the various algorithms on our web data set, where some documents are in multiple categories (for example, a document related to both baseball and movies should be in all three clusters: baseball, movies, and baseball-movies). We cannot present the full results due to space limitations. K-Means, Fractionation and Buckshot formed clusters only about baseball and movies, and did not form a cluster for documents related to both categories. WBSC, however, formed a baseball-movies cluster and put the documents related to both categories in that cluster. This shows the effectiveness of our method.

We also measured the effectiveness of WBSC quantitatively. We compared the clusters formed against the documents in the original categories, matching the clusters with the categories one-to-one. For each matching, we counted the number of documents common to the corresponding clusters; the matching with the largest number of common documents is used to measure effectiveness. This matching can be found by a maximum weight bipartite matching algorithm [7]. We return the number of documents in the matching: the more documents that are matched, the more the clusters resemble the original categories. For our algorithm, and for the purpose of this comparison, we assigned each document to the cluster with the largest similarity value. Figure 2 shows the number of matches with the original categories for the different algorithms on different sets of documents.

Figure 2: Comparison of the quality of clusters formed by different clustering algorithms (y-axis: number of matches per 100 documents; x-axis: number of documents, 100-500; series: WBSC, K-Means, Buckshot, Fractionation).

We can clearly see that WBSC outperforms the other algorithms in effectiveness.

Execution Time: We also measured the execution time of the various algorithms. Figure 3 gives the comparison of execution times. As the graph shows, WBSC outperforms almost all the other algorithms in execution time, especially as the number of documents increases.

5 Conclusions

In this paper, we presented WBSC, a word-based document-clustering algorithm. We have shown that it is a promising means of clustering documents, as well as of presenting the clusters in a meaningful way to the user. We have compared WBSC with other existing algorithms and have shown that it performs well, both in terms of running time and the quality of the clusters formed.
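As an aside, the quantitative evaluation used in the experiments (a maximum-weight one-to-one matching between clusters and original categories) can be sketched as follows. This is our own brute-force illustration over all assignments, not the matching algorithm of [7], and it is practical only for a small number of clusters:

```python
from itertools import permutations

def best_matching(clusters, categories):
    """clusters, categories: lists of document-id sets (at most as many
    clusters as categories). Returns the largest total number of common
    documents over all one-to-one assignments of clusters to categories."""
    best = 0
    for perm in permutations(range(len(categories)), len(clusters)):
        total = sum(len(clusters[i] & categories[j])
                    for i, j in enumerate(perm))
        best = max(best, total)
    return best
```

For realistic numbers of clusters, this brute-force search would be replaced by a polynomial-time maximum weight bipartite matching algorithm such as the one cited in [7].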
Future work includes tuning the parameters of the algorithm, as well as adopting new similarity measures and data structures to speed up the algorithm.

Figure 3: Execution times of the various clustering algorithms (y-axis: execution time in minutes; x-axis: number of documents, 200-1400; series: WBSC, K-Means, Buckshot, Fractionation).

Web Search Results: We also tested WBSC on the results from Web search engines. We downloaded documents returned by the Google search engine (www.google.com) and applied WBSC to them. Space limitations prohibit us from showing all the results. Here we show some of the clusters found by WBSC when clustering the top 100 URLs returned for the search term "cardinal". The categories correspond to the common usages of the word "cardinal" in documents on the Web (the name of baseball teams, a nickname for schools, a kind of bird, and a Catholic clergyman). Figure 4 shows some of the clusters formed by WBSC.

Cluster results: Keywords | Related topic | Sample documents
Bishops, Catholic, world, Roman, Church, cardinals, College, Holy | Roman Catholic Church | Cardinals in the Catholic Church; Catholicpages.com Cardinals library; The Cardinals of the Holy Roman Church
Dark, Bird, female, color | Cardinal Bird | Cardinal Page; First Internet Bird Club
Benes, rookie, players, hit, innings, runs, Garrett, teams, league, Saturday | St. Louis Cardinals | St. Louis Cardinals Notes (MLB National League Baseball team); Brewers vs Cardinals

Figure 4: Clusters formed for the search term "cardinals" by WBSC.

This shows that our algorithm is effective even on web search results.

References:
[1] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, John W. Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, in Proceedings of the Fifteenth Annual International ACM SIGIR Conference, pp. 318-329, June 1992.
[2] P. M. Dean (Ed.), Molecular Similarity in Drug Design, Blackie
Academic & Professional, 1995, pp. 111-137.
[3] D. R. Hill, A vector clustering technique, in: Samuelson (Ed.), Mechanized Information Storage, Retrieval and Dissemination, North-Holland, Amsterdam, 1968.
[4] A. K. Jain, M. N. Murty and P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys, 31(3): 264-323, Sept. 1999.
[5] F. Murtagh, A Survey of Recent Advances in Hierarchical Clustering Algorithms, The Computer Journal, 26(4): 354-359, 1983.
[6] J. J. Rocchio, Document retrieval systems - optimization and evaluation, Ph.D. Thesis, Harvard University, 1966.
[7] Robert E. Tarjan, Data Structures and Network Algorithms, Society for Industrial and Applied Mathematics, 1983.
[8] http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html, last visited November 18th, 2000.
[9] P. Willett, Recent trends in hierarchical document clustering: a critical review, Information Processing and Management, 24: 577-597, 1988.
[10] O. Zamir, O. Etzioni, Web document clustering: a feasibility demonstration, in Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98), 1998, pp. 46-54.