
Text clustering using frequent weighted utility itemsets




DOCUMENT INFORMATION

Pages: 18
Size: 0.91 MB

CONTENT

Cybernetics and Systems: An International Journal, 2017, Vol. 48, No. 3, 193–209. ISSN 0196-9722 (Print), 1087-6553 (Online). doi:10.1080/01969722.2016.1276774. Published online: 02 Mar 2017.

Text Clustering Using Frequent Weighted Utility Itemsets

Tram Tran (a), Bay Vo (b, c), Tho Thi Ngoc Le (d), and Ngoc Thanh Nguyen (e)

(a) University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam; (b) Division of Data Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam; (c) Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam; (d) Faculty of Information Technology, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam; (e) Faculty of Computer Science and Management, Wroclaw University of Science and Technology, Wroclaw, Poland

ABSTRACT

Text clustering is an important topic in text mining. One of the most effective methods for text clustering is the approach based on frequent itemsets (FIs), and thus there are many related algorithms that aim to improve the accuracy of text clustering. However, these do not focus on the weights of terms in documents, even though the frequency of each term in each document has a great impact on the results. In this work, we propose a new method for text clustering based on frequent weighted utility itemsets (FWUI). First, we calculate the Term Frequency (TF) of each term in the documents to create a weight matrix for all documents. The weights of terms in documents are based on the Inverse Document Frequency. Next, we use the Modification Weighted Itemset Tidset (MWIT)-FWUI algorithm for mining FWUI from the number matrix and the weights of terms in documents. Finally, based on the frequent utility itemsets, we cluster documents using the MC (Maximum Capturing) algorithm. The proposed method has been evaluated on three data sets consisting of 1,600 documents covering 16 topics. The experimental results show that our method, using FWUI, improves the accuracy of text clustering compared to methods using FIs.

KEYWORDS: Frequent itemsets; frequent weighted utility itemsets; quantitative databases; text clustering; weight of terms

Introduction

Text clustering is widely studied in text mining due to its important roles in many applications such as spam filtering, claims investigation, or monitoring of opinions. Researchers have exploited different ways of text clustering, including applying common clustering algorithms to the text domain, utilizing the nature of word patterns/context, and probabilistic approaches (Aggarwal and Zhai 2012). In this article, we approach the text clustering problem from the patterns of words in documents, specifically utilizing itemsets in text clustering. Beil, Ester, and Xu (2002) introduced the frequent term-based clustering (FTC) approach for text clustering based on frequent terms.
Their experimental results show that, compared to bisecting K-means (Steinbach, Karypis, and Kumar 2000), FTC achieves higher accuracy and faster processing. Their work inspired a new line of approaches for text clustering with some variations, such as Frequent Itemset-based Hierarchical Clustering (FIHC), Clustering based on Maximal Sequences (CMS), and Clustering based on Frequent Word Sequences (CFWS). Zhang et al. (2010) analyzed some disadvantages of these approaches: (1) FTC (Beil, Ester, and Xu 2002) causes isolated documents; (2) FIHC (Fung, Wang, and Ester 2003) cannot solve cluster conflicts; (3) CMS (Hernández-Reyes et al. 2006) depends on the effectiveness of document representation; and (4) CFWS (Li, Chung, and Holt 2008) may produce trivial clustering results. To overcome these issues, Zhang et al. (2010) proposed the Maximum Capturing (MC) approach using frequent itemsets (FIs).

Practically, previous works mainly focus on whether a term occurs in documents and count the frequencies of items in itemsets. In this article, in addition to considering the frequencies of itemsets, as in previous works, we consider the weights of terms in documents to improve the performance of MC. First, we utilize Term Frequency-Inverse Document Frequency (TF-IDF) as weights to mine frequent weighted utility itemsets (FWUIs) from a database of text documents. Then, the resulting FWUIs are used with the MC approach for text clustering. We evaluated our proposed method on three data sets, and the experimental results show that FWUI improves the performance of MC for text clustering. The contributions of our work are as follows:

1. Generating a quantitative matrix for a collection of documents, where TF-IDF (Term Frequency-Inverse Document Frequency) is used as the term weight;
2. Applying the MWIT-FWUI (Modification Weighted Itemset Tidset-Frequent Weighted Utility Itemset) algorithm for mining frequent utility itemsets in a weighted matrix;
3. Applying the MC (Maximum Capturing) approach for text clustering;
4. Evaluating our system on new data sets to measure its performance.

The rest of this article is organized as follows: Section "Related Concepts" reviews some related concepts used in this article, Section "Related Work" outlines related works on frequent itemset mining and frequent itemset-based text clustering, Section "Text Clustering Based on Frequent Weighted Utility Itemsets" describes our proposal, Section "Experiments" presents the experiments and the evaluation of our approach in comparison to previous methods, and Section "Conclusions and Future Work" concludes this article and introduces some directions for future work.

Related Concepts

Quantitative Transaction Databases

A quantitative transaction database QD (Vo, Le, and Jung 2012) is defined as a triple QD = ⟨T, I, W⟩, which contains a set of transactions T = {t1, t2, ..., tm}, a set of items I = {i1, i2, ..., in}, and a set of weights W = {w1, w2, ..., wn} corresponding to the items in I. Each transaction has the form tk = {xk1, xk2, ..., xkn}, where xki is the quantity of the i-th item in transaction tk. Table 1 shows an example of a quantitative transaction database: there are six transactions T = {t1, t2, ..., t6} and five items I = {A, B, C, D, E} with corresponding weights W = {0.4, 0.2, 0.1, 0.9, 0.5} (Table 2). Transaction t1 = {2, 0, 3, 0, 4} is interpreted as follows: in transaction t1, a customer purchases two items A, three items C, and four items E, but no items B or D.

Table 1. An example of a quantitative transaction database (quantities of items A–E in transactions t1–t6; for example, the row for t1 is 2, 0, 3, 0, 4).

Table 2. An example of term weights.
Item:   A    B    C    D    E
Weight: 0.4  0.2  0.1  0.9  0.5
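The triple ⟨T, I, W⟩ maps naturally onto a quantity matrix plus a weight table. The following sketch (Python, not from the paper's implementation; only transaction t1 is spelled out in the text, so the remaining rows are omitted) shows one possible in-memory representation of the example above.

```python
# A minimal sketch of the quantitative transaction database QD = <T, I, W>
# from Tables 1 and 2. Only t1 = {2, 0, 3, 0, 4} is given explicitly in the text.
items = ["A", "B", "C", "D", "E"]                                  # item set I
weights = {"A": 0.4, "B": 0.2, "C": 0.1, "D": 0.9, "E": 0.5}       # weight set W (Table 2)

# Quantity matrix: one row per transaction, one column per item (Table 1).
transactions = [
    [2, 0, 3, 0, 4],   # t1: two A, three C, four E
    # ... transactions t2..t6 omitted here
]

def describe(tid: int) -> str:
    """Render a transaction in the 'customer purchases ...' style used in the text."""
    row = transactions[tid]
    bought = [f"{q} x {it}" for it, q in zip(items, row) if q > 0]
    return f"t{tid + 1}: " + ", ".join(bought)

if __name__ == "__main__":
    print(describe(0))   # -> t1: 2 x A, 3 x C, 4 x E
```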
Term Frequency-Inverse Document Frequency (TF-IDF)

The TF-IDF (Salton and McGill 1986) of a word is a score indicating the importance of that word/term in a document with regard to a collection of documents. This score is the product of Term Frequency (TF) and Inverse Document Frequency (IDF).

Term Frequency, annotated as tf(t, d), is the normalized number of occurrences of a term t in a document d, computed by the following formula:

$tf(t, d) = \frac{n(t, d)}{n(d)}$  (1)

where n(t, d) is the number of occurrences of term t in document d and n(d) is the total number of occurrences of all terms in document d.

IDF, annotated as idf(t, D), measures the informativeness of the term t in a corpus D. It is calculated as the logarithmically scaled inverse fraction of the number of documents in the corpus that contain the term t:

$idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$  (2)

where |D| is the number of documents in D and $df(t, D) = |\{d \in D : t \in d\}|$ is the number of documents in D containing term t. The IDF score of a term indicates its importance in the collection of documents D, i.e., rare terms have high scores and frequent terms have low scores.

TF-IDF is the product of Term Frequency and Inverse Document Frequency:

$\mathit{TF\text{-}IDF}(t, d, D) = tf(t, d) \times idf(t, D)$  (3)
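A small sketch of formulas (1)–(3) is given below (Python; it assumes documents are already tokenized into lists of terms, and it uses base-10 logarithms, which is consistent with the IDF values in the worked example later in the article).

```python
# Minimal TF-IDF sketch for formulas (1)-(3); illustrative, not the authors' code.
import math

def tf(term: str, doc: list[str]) -> float:
    """Formula (1): occurrences of `term` divided by the total term count of the document."""
    return doc.count(term) / len(doc) if doc else 0.0

def idf(term: str, corpus: list[list[str]]) -> float:
    """Formula (2): log10 of |D| over the number of documents containing `term`."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / df) if df else 0.0

def tfidf_matrix(corpus: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    """Formula (3): weight matrix with one row per document and one column per term."""
    vocab = sorted({t for doc in corpus for t in doc})
    matrix = [[tf(t, doc) * idf(t, corpus) for t in vocab] for doc in corpus]
    return vocab, matrix

if __name__ == "__main__":
    docs = [["paint", "place", "match"], ["art", "ball", "match"]]
    vocab, m = tfidf_matrix(docs)
    print(vocab)
    print([[round(v, 3) for v in row] for row in m])
```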
Related Work

Frequent Itemset Mining Approaches

There are many approaches to mining FIs. Agrawal and Srikant (1994) introduced Apriori for mining association rules from a database of sale transactions. Apriori is the most basic join-based algorithm: it identifies the frequent individual items in the database and extends the size of itemsets as long as they remain frequent. Soon after, Park, Chen, and Yu (1995) proposed the Direct Hashing and Pruning (DHP) algorithm to optimize Apriori by pruning candidate itemsets in each iteration and trimming the transactions. Another branch of approaches applies tree-based algorithms and is based on the concept of set enumeration. In this strategy, candidates are explored with the use of a subgraph of a lattice of itemsets; as such, the problem of frequent itemset generation becomes that of constructing an enumeration or lexicographic tree. Agrawal, Imieliński, and Swami (1993) introduced a simple version of a tree-based algorithm for mining association rules of items in large databases, called the AIS algorithm. The AIS algorithm constructs trees in a level-wise fashion, and itemsets at each level are counted using a transaction database. Agarwal, Aggarwal, and Prasad (2001) proposed the TreeProjection algorithm to optimize the counting work at the lower levels of a tree by re-utilizing the counting work at previous levels. Zaki et al. (1997) proposed Eclat (an IT-tree approach) for quickly mining association rules, in which the database is scanned only once and candidates are not generated; Eclat thus achieves better and faster performance than previous algorithms that require the generation of candidates. Similarly, Han, Pei, and Yin (2000) introduced FP-Growth (an FP-tree-based approach) to mine frequent patterns without candidate generation, thus improving processing time and saving memory. Constraint-based approaches for mining have also been developed in recent years (Duong, Truong, and Vo 2014; Truong, Duong, and Ngan 2016).

Tao, Murtagh, and Farid (2003) proposed the WARM (Weighted Association Rule Mining) approach to discover significant relationships in transaction databases, in which the weights of items are integrated into the mining process. More recently, Vo, Tran, and Ngo (2013) proposed FWI (a WIT-tree-based approach) for quickly mining frequent weighted itemsets from weighted item transaction databases. Vo et al. (2013) also introduced the FWCI approach, an IT-tree-based approach for mining frequent weighted closed itemsets. The FWCI approach has since been improved by exploiting diffsets and by developing features for the fast removal of itemsets that are not closed (Vo 2017). Many proposals have been made for the problem of frequent weighted (closed) itemset mining, which is concerned with the weights (or benefit) of items but not their quantities. To address this, Vo, Le, and Jung (2012) introduced FWUI based on the MWIT-tree. FWUI is an extension of FWI based on the weighted utility of items for association rule mining, which is a development of frequent weighted itemsets. Mining FWUI considers both the quantities and the weights of items.

Frequent Itemset-Based Text Clustering Approaches

Beil, Ester, and Xu (2002) proposed the FTC algorithm, which works in a bottom-up way. FTC starts with an empty set and then continuously enrolls one element from the remaining frequent term sets until all items are assigned into clusters. At each step, FTC selects the remaining FIs that have the minimum overlap with the other cluster candidates. Hernández-Reyes et al. (2006) introduced a Maximal Frequent Sequence (MFS)-based approach for document clustering (the CMS approach). Their approach makes use of maximal frequent sequences as features in a vector space model and applies the K-means algorithm with cosine similarity to cluster documents. Li, Chung, and Holt (2008) proposed the CFWS approach, where documents are treated as sequences of meaningful words instead of bags of words, and clustering is based on differences between the documents and the transaction data set. Zhang et al. (2010) proposed the Maximum Capturing (MC) approach for text clustering based on utility itemsets. MC assumes that two documents should be clustered together if they have maximum similarities.

Text Clustering Based on Frequent Weighted Utility Itemsets

Figure 1 shows an overview of our approach for text clustering using FWUI. First, input documents are preprocessed into a set of terms. Second, a weight is assigned to each term to indicate its importance in the document with regard to the corpus; the output of this step is a weight matrix of all terms from all documents. Third, from the weight matrix, we extract utility itemsets that are weighted as the benefit for the contents of documents. Finally, documents are clustered based on the FWUI. The following sections explain each step in detail: Section "Preprocess Documents" describes the preprocessing of text; Section "Algorithm for Mining Frequent Weighted Utility Itemsets" describes how weights are assigned to terms and FWUI are extracted; and Section "Text Clustering Algorithm" explains the text clustering algorithm.

Figure 1. Diagram of text clustering using frequent weighted utility itemsets.

Preprocess Documents

This step transforms each document into a set of words. First, all documents are tokenized into words. Note that tokenization is not the same for all languages due to their different characteristics. For example, English text is usually separated by spaces, while Vietnamese text is not. Figure 2 presents an example of the tokenization of a Vietnamese sentence and an English sentence with the same meaning. We assume that the task of tokenization is done by employing existing tools. When all documents are tokenized, we eliminate stopwords, which serve grammatical roles or are not informative in the sentence. Figure 3 shows the same sentences as used above after stopwords are removed.

Figure 2. An example of tokenization for Vietnamese and English texts.
Figure 3. An example of eliminating stopwords for Vietnamese and English texts.
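The preprocessing step might be sketched as follows (Python; a whitespace tokenizer stands in for a dedicated tool such as vnTokenizer, which the Experiments section uses for Vietnamese, and the stopword list here is a tiny illustrative placeholder rather than the 880-word list mentioned there).

```python
# Minimal preprocessing sketch: tokenize, then drop stopwords.
# Assumptions: whitespace tokenization (adequate for English only) and a toy stopword list.
STOPWORDS = {"the", "is", "a", "an", "of", "and", "in"}   # illustrative only

def tokenize(text: str) -> list[str]:
    # Real Vietnamese tokenization requires a dedicated tool (e.g., vnTokenizer);
    # plain whitespace splitting is used here only to keep the sketch self-contained.
    return text.lower().split()

def preprocess(text: str) -> list[str]:
    """Transform a document into its list of informative terms."""
    return [tok for tok in tokenize(text) if tok not in STOPWORDS]

if __name__ == "__main__":
    print(preprocess("The match is held in a famous place"))
    # -> ['match', 'held', 'famous', 'place']
```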
Algorithm for Mining Frequent Weighted Utility Itemsets

As discussed in the first section, previous approaches consider either the weights of items or the weights of items in transactions. For a quantitative database constructed from documents, we consider only the weights of items in transactions. Hence, we apply the MWIT-FWUI algorithm for mining FWUI. We modify the algorithm to mine FWUI using a matrix of term weights from documents, specifically by (1) changing the matrix of term frequencies into a matrix of term weights, and (2) using a new way to mine frequent utility itemsets based on the matrix of term weights. The pseudocode of the proposed MWIT-FWUI algorithm is presented in Algorithm 1, with the details of the scores explained below.

Algorithm 1. Mining frequent weighted utility itemsets.
Input: quantitative transaction database QD = ⟨T, I, W⟩ and threshold minwus
Output: set U of frequent weighted utility itemsets which satisfy the threshold minwus
Method:
 1  for each document t ∈ T and term i ∈ I
 2      compute tfidf(i, t, T);              // using formulas (1), (2), and (3)
 3  for each document t ∈ T
 4      compute twu(t);                      // using formula (4)
 5  for each term i ∈ I
 6      compute wus(i);                      // using formula (5)
 7  P ← {i | i ∈ I ∧ wus(i) ≥ minwus};
 8  U ← MWIT-FWUI(P, minwus);
 9  function MWIT-FWUI(itemsets P, minwus)
10      for each wi ∈ P
11          U ← U ∪ {wi};
12          Pi ← ∅;
13          for each wj ∈ P with j > i
14              X ← wi ∪ wj;
15              compute wus(X);              // using formula (5)
16              if wus(X) ≥ minwus then
17                  Pi ← Pi ∪ {X};
18          U ← MWIT-FWUI(Pi, minwus);

Vo, Le, and Jung (2012) defined the transaction weighted utility twu of a transaction tk as

$twu(t_k) = \frac{\sum_{i_j \in S(t_k)} w_j \times x_{kj}}{|t_k|}$  (4)

where S(tk) is the set of items occurring in tk, xkj is the quantity of item ij in tk, wj is the weight of item ij, and |tk| is the total number of items in transaction tk. The weighted utility support wus of an itemset X is calculated as

$wus(X) = \frac{\sum_{t_k \in t(X)} twu(t_k)}{\sum_{t_k \in T} twu(t_k)}$  (5)

where t(X) is the set of transactions containing X.
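Taken together, formulas (4) and (5) and Algorithm 1 can be rendered as the sketch below. This is an illustrative re-implementation rather than the authors' code: the weight matrix is stored as one dict of TF-IDF values per document, twu and wus follow formulas (4) and (5), and itemsets are expanded recursively by equivalence class, keeping those whose wus reaches minwus.

```python
# Illustrative sketch of MWIT-FWUI mining (Algorithm 1), not the authors' implementation.
# `docs` is the weight matrix: one dict {term: tf-idf weight} per document.

def twu(doc: dict) -> float:
    """Formula (4): sum of the document's TF-IDF values over its number of distinct terms."""
    return sum(doc.values()) / len(doc) if doc else 0.0

def wus(itemset: frozenset, docs: list[dict], total_twu: float) -> float:
    """Formula (5): share of the total twu carried by documents containing the whole itemset."""
    covering = [d for d in docs if itemset.issubset(d)]
    return sum(twu(d) for d in covering) / total_twu

def mine_fwui(docs: list[dict], minwus: float) -> list[frozenset]:
    total = sum(twu(d) for d in docs)
    result: list[frozenset] = []

    def expand(eq_class: list[frozenset]) -> None:
        # Equivalence-class expansion: join each itemset with the ones after it.
        for i, wi in enumerate(eq_class):
            result.append(wi)
            next_class = [wi | wj for wj in eq_class[i + 1:]
                          if wus(wi | wj, docs, total) >= minwus]
            if next_class:
                expand(next_class)

    terms = sorted({t for d in docs for t in d})
    singles = [frozenset([t]) for t in terms
               if wus(frozenset([t]), docs, total) >= minwus]
    expand(singles)
    return result
```

Applied to the nine-document example developed below with minwus = 0.2, this procedure corresponds to the search tree shown in Figure 4.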
Text Clustering Algorithm

Algorithm 2 describes our clustering algorithm using FWUI. First, we construct a similarity matrix A, where each element of the matrix is the number of itemsets shared by a pair of documents. Second, we find the maximum and minimum similarities that are nonzero.

Algorithm 2. Text clustering using FWUI.
Input: set of documents D = {d1, d2, ..., dn}, frequent weighted utility itemsets W
Output: clusters of documents C = {c1, c2, ..., cm}
Method:
 1  construct the similarity matrix A, where Aij is the number of common itemsets of documents di and dj;
 2  cluster ← 0;                                 // this is the cluster number
 3  DP ← {(di, dj) | i < j ∧ di, dj ∈ D};        // all document pairs
 4  while DP ≠ ∅
 5      min ← min(A);
 6      max ← max(A);
 7      if max = min then
 8          group all unclustered documents into a new cluster;
 9      else                                     // max > min
10          P ← {(di, dj) | i < j ∧ Aij = max};  // pairs whose similarities are max
11          for each p = (di, dj) ∈ P
12              if di, dj ∉ c, ∀ c ∈ C then
13                  cluster ← cluster + 1;
14                  Aij ← cluster;
15                  DP ← DP \ {(di, dj)};
16              if dk ∈ c for some k ∈ {i, j}, c ∈ C then
17                  c ← c ∪ {di, dj};
18                  Aij ← cluster number of c;
19                  DP ← DP \ {(di, dj)};

Third, if the maximum value is equal to the minimum value, all unclustered documents are grouped into a new cluster. Otherwise, i.e., if the maximum value is greater than the minimum value, then for each pair whose similarity equals the maximum: if either document in the pair already belongs to a cluster, the other document is assigned to the same cluster; if neither does, the pair forms a new cluster. The algorithm repeats these three steps until all documents are grouped into clusters.
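A compact Python rendering of this Maximum Capturing procedure might look like the following (an illustrative sketch rather than the authors' code; similarities are kept in a dict keyed by document pairs, and handled pairs are simply removed, which mirrors how the worked example below zeroes out matrix cells).

```python
# Illustrative sketch of Algorithm 2 (Maximum Capturing clustering).
# `itemsets_per_doc` maps a document id to the set of FWUIs it contains;
# the similarity of two documents is the number of itemsets they share.
from itertools import combinations

def cluster_documents(itemsets_per_doc: dict) -> list[set]:
    docs = sorted(itemsets_per_doc)
    sim = {(a, b): len(itemsets_per_doc[a] & itemsets_per_doc[b])
           for a, b in combinations(docs, 2)}
    clusters: list[set] = []
    pending = set(sim)                                   # pairs not yet handled

    def cluster_of(d):
        return next((c for c in clusters if d in c), None)

    while pending:
        values = [sim[p] for p in pending]
        lo, hi = min(values), max(values)
        if hi == lo:
            # No pair stands out any more: group the remaining unclustered documents.
            unclustered = {d for d in docs if cluster_of(d) is None}
            if unclustered:
                clusters.append(unclustered)
            break
        for a, b in sorted(p for p in pending if sim[p] == hi):
            ca, cb = cluster_of(a), cluster_of(b)
            if ca is None and cb is None:
                clusters.append({a, b})                  # the pair starts a new cluster
            elif ca is not None and cb is None:
                ca.add(b)                                # pull b into a's cluster
            elif cb is not None and ca is None:
                cb.add(a)                                # pull a into b's cluster
            pending.discard((a, b))                      # mark the pair as handled
    return clusters

if __name__ == "__main__":
    example = {"d1": {"Place", "Match"}, "d2": {"Art", "Match"}, "d3": {"Art", "Match"}}
    print(cluster_documents(example))   # -> [{'d2', 'd3'}, {'d1'}] (set order may vary)
```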
Illustration Example

Given a quantitative transaction database covering two topics, as in Table 3, where each document is treated as a transaction, we have the transaction/document set T = {d1, d2, ..., d9} and the item set I = {"Paint", "Art", "Place", "Ball", "Match"}. In this database, d1 = {2, 0, 3, 0, 4} means that d1 contains two items "Paint," three items "Place," and four items "Match," and does not contain items "Art" or "Ball."

Table 3. Database of word occurrences in the nine documents d1–d9 over the items "Paint," "Art," "Place," "Ball," and "Match" (for example, the row for d1 is 2, 0, 3, 0, 4).

Mining Frequent Weighted Utility Itemsets

Step 1: Formula (1) is used to calculate the Term Frequencies (TF scores) of the words in each document. For example, the TF scores of the item "Paint" in documents d1, d4, d6, and d7 are tf(Paint, d1) = 2/9 ≈ 0.222, tf(Paint, d4) = 0.375, tf(Paint, d6) = 0.167, and tf(Paint, d7) = 0.286. Similarly, the TF scores of all words are calculated and shown in Table 4.

Table 4. TF scores of all words in each document.

Step 2: Since the Inverse Document Frequency (IDF) is a single score that indicates the importance of a word across the whole database, we use this score as the weight of the word. Formula (2) is used to calculate the IDF scores of words. For example, the IDF scores of the words "Paint," "Art," "Place," "Ball," and "Match" in the database are idf(Paint, D) = log(9/4), idf(Art, D) = log(9/5), idf(Place, D) = log(9/7), idf(Ball, D) = log(9/5), and idf(Match, D) = log(9/6). The weights of all items are calculated and shown in Table 5.

Table 5. IDF scores of all words in the database.
Item:   Paint  Art   Place  Ball  Match
Weight: 0.35   0.25  0.11   0.25  0.18

Step 3: From the scores in Tables 4 and 5, we compute the transaction weighted utility of each document using formula (4). For example, the transaction weighted utilities of documents d1, d2, and d3 are:

twu(d1) = (0.222 × 0.35 + 0.333 × 0.11 + 0.444 × 0.18) / 3 = 0.064
twu(d2) = (0.333 × 0.25 + 0.167 × 0.25 + 0.5 × 0.18) / 3 = 0.072
twu(d3) = (0.444 × 0.25 + 0.222 × 0.11 + 0.222 × 0.25 + 0.111 × 0.18) / 4 = 0.0535

Repeating this calculation for all documents in the database gives the results in Table 6.

Table 6. Transaction weighted utility of all documents in the database.
TID:  d1     d2     d3      d4     d5     d6     d7     d8      d9     Sum
twu:  0.064  0.072  0.0535  0.063  0.108  0.074  0.073  0.0915  0.058  0.657

Step 4: From Table 3 and Table 6, we calculate the weighted utility support of each word using formula (5), where each word is treated as a single itemset. For example, the word "Paint" occurs in documents d1, d4, d6, and d7, so its weighted utility support is

wus(Paint) = (0.064 + 0.063 + 0.074 + 0.073) / 0.657 = 0.42.

Repeating this calculation for the other single items gives wus(Art) = 0.61, wus(Place) = 0.73, wus(Ball) = 0.49, and wus(Match) = 0.64. Given minwus = 0.2, all of these single items satisfy the threshold; therefore, they are all added to the root class {"Paint", "Art", "Place", "Ball", "Match"}.

Considering the equivalence class of Paint:
- Paint joins Art: the new itemset Paint Art occurs only in document d7, with wus(Paint Art) = 0.073 / 0.657 = 0.12 < minwus; thus Paint Art is not added to the class of Paint.
- Paint joins Place: the new itemset Paint Place occurs in documents d1, d4, d6, and d7, with wus(Paint Place) = (0.064 + 0.063 + 0.074 + 0.073) / 0.657 = 0.42 > minwus; hence Paint Place is added, and the class of Paint becomes {Paint Place}.
- Paint joins Ball: the new itemset Paint Ball occurs in documents d4 and d6, with wus(Paint Ball) = (0.063 + 0.074) / 0.657 = 0.21 > minwus; hence Paint Ball is added, and the class of Paint becomes {Paint Place, Paint Ball}.
- Paint joins Match: the new itemset Paint Match occurs in documents d1 and d4, with wus(Paint Match) = (0.064 + 0.063) / 0.657 = 0.19 < minwus; hence Paint Match is not added to the class of Paint.

In the same fashion, the algorithm recursively generates new equivalence classes after the class of Paint. Considering the equivalence class of Paint Place: Paint Place joins Paint Ball, giving the new itemset Paint Place Ball, which occurs in documents d4 and d6 with wus(Paint Place Ball) = (0.063 + 0.074) / 0.657 = 0.21 > minwus, so Paint Place Ball is added, and the class of Paint Place becomes {Paint Place Ball}. The same procedure is repeated for the equivalence classes of Art, Place, Ball, and Match. Finally, we obtain the FWUI satisfying minwus = 0.2: as shown in Figure 4, they are {Paint, Art, Place, Ball, Match, Paint Place, Paint Ball, Art Place, Art Match, Place Ball, Place Match, Ball Match, Paint Place Ball, Place Ball Match}.

Figure 4. Search tree of frequent weighted utility itemsets with minwus = 0.2.
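The arithmetic in this example is easy to check mechanically. The short script below is an illustrative check using only values quoted above (the small difference in twu(d1) comes from rounding the TF and IDF scores to three and two decimal places, respectively).

```python
# Quick numeric check of the worked example, using only values quoted in the text.
tf_d1 = {"Paint": 0.222, "Place": 0.333, "Match": 0.444}   # TF scores of d1
idf = {"Paint": 0.35, "Art": 0.25, "Place": 0.11, "Ball": 0.25, "Match": 0.18}  # Table 5

# Formula (4): twu(d1) with the rounded table values.
twu_d1 = sum(tf_d1[t] * idf[t] for t in tf_d1) / len(tf_d1)
print(round(twu_d1, 3))                                     # -> 0.065 (paper: 0.064)

# Formula (5): wus(Paint). Paint occurs in d1, d4, d6, d7; twu values from Table 6.
wus_paint = (0.064 + 0.063 + 0.074 + 0.073) / 0.657
print(round(wus_paint, 2))                                  # -> 0.42
```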
Text Clustering

From Table 3 and Figure 4, we construct a similarity matrix A, shown in Table 7, where Aij is the number of common itemsets between documents di and dj. The clustering algorithm is then run on this matrix using the following steps.

Table 7. The similarity matrix of the documents in the database.

Step 1: Find the minimum nonzero value min in the similarity matrix A.

Step 2: Find the maximum value max in A.

Step 3: Search for pairs of documents whose similarities are equal to max; we obtain the pairs (d3, d4), (d3, d9), (d4, d6), and (d4, d9). Hence, documents d3, d4, d6, and d9 are clustered into a group, and their similarities are updated to zero (ref. Table 8). At this step, documents d1, d2, d5, d7, and d8 are not yet clustered.

Table 8. The similarity matrix after documents d3, d4, d6, and d9 are clustered.

Step 4: Repeat Steps 2 and 3 on the current similarity matrix (ref. Table 8). We get the new maximum value max = 5, and the pairs (d1, d4) and (d2, d3) have similarities equal to max. Since documents d3 and d4 have already been grouped into the cluster {d3, d4, d6, d9}, we assign d1 and d2 to this cluster and update the similarities of these pairs to zero (ref. Table 9). At this moment, we have one cluster {d1, d2, d3, d4, d6, d9} and three unclustered documents d5, d7, and d8.

Table 9. The similarity matrix after documents d1, d2, d3, d4, d6, and d9 are clustered.

Step 5: Repeat Steps 2 and 3 on the current similarity matrix (ref. Table 9). The document pairs whose similarities are equal to the new maximum value are (d1, d3), (d1, d6), (d1, d7), (d1, d9), (d2, d4), (d2, d5), (d2, d9), (d3, d5), (d3, d6), (d3, d7), (d3, d8), (d4, d7), (d6, d7), (d6, d9), and (d7, d8). Among these, documents d7 and d8 in pair (d7, d8) do not belong to any cluster; thus they are grouped as a new cluster, and the value of this pair is updated to zero in the matrix. In pair (d2, d5), document d2 belongs to the cluster {d1, d2, d3, d4, d6, d9}; hence document d5 is added to this cluster, and the value of pair (d2, d5) in the similarity matrix is updated to zero (ref. Table 10). At this moment, we have two clusters, {d1, d2, d3, d4, d5, d6, d9} and {d7, d8}. Since all documents have been assigned to clusters, the algorithm terminates.

Table 10. The similarity matrix after all documents are clustered.

Experiments

Data for the experiments were downloaded from the digital newspapers www.vnexpress.net, dantri.com.vn, and thanhnien.vn. The data consist of 1,600 documents in 16 topics and were turned into three data sets, as shown in Table 11.

Table 11. Experiment data sets (100 documents per topic, 1,600 documents in total).
No.  Topic              # of documents
1    Life               100
2    Arts               100
3    Virus - Hacker     100
4    Sports             100
5    Tennis             100
6    Medical            100
7    Music              100
8    Movie and Stage    100
9    Informatics        100
10   Fashion            100
11   Football           100
12   Real estate        100
13   Cuisine            100
14   Influenza          100
15   Auto-mobile        100
16   Crime              100

To evaluate our proposed approach of text clustering using FWUI, we adopt the F-measure (Steinbach, Karypis, and Kumar 2000), which is a harmonic function of Precision and Recall; Precision and Recall are computed as in Eqs. (8) and (9):

$F(i, j) = \frac{2 \times P(i, j) \times R(i, j)}{P(i, j) + R(i, j)}$  (6)

$F = \sum_i \frac{n_i}{n} \max_j F(i, j)$  (7)

$P(i, j) = \frac{n_{ij}}{n_j}$  (8)

$R(i, j) = \frac{n_{ij}}{n_i}$  (9)

where P(i, j) is the precision of cluster j with respect to class i, R(i, j) is the recall of class i in cluster j, nij is the number of documents of class i in cluster j, nj is the number of documents in cluster j, and ni is the number of documents in class i. The F-measure of cluster j and class i, F(i, j), is defined by Eq. (6), and the F-measure of the clustering over the whole data set is computed by Eq. (7), where n is the total number of documents in the data set. Generally, the higher the F-measure, the better the clustering performance on the data set.
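Formulas (6)–(9) translate directly into code. The sketch below is illustrative only and assumes the gold topic labels and the predicted cluster labels are given as two parallel lists.

```python
# Illustrative clustering F-measure, following formulas (6)-(9).
# `classes[k]` is the gold topic of document k; `clusters[k]` is its assigned cluster.
from collections import Counter

def clustering_f_measure(classes: list, clusters: list) -> float:
    n = len(classes)
    n_i = Counter(classes)                      # documents per class i
    n_j = Counter(clusters)                     # documents per cluster j
    n_ij = Counter(zip(classes, clusters))      # documents of class i in cluster j

    def f(i, j):
        p = n_ij[(i, j)] / n_j[j]               # formula (8): precision
        r = n_ij[(i, j)] / n_i[i]               # formula (9): recall
        return 2 * p * r / (p + r) if p + r else 0.0   # formula (6)

    # Formula (7): weight each class by its size and take its best-matching cluster.
    return sum((n_i[i] / n) * max(f(i, j) for j in n_j) for i in n_i)

if __name__ == "__main__":
    gold = ["sports", "sports", "arts", "arts"]
    predicted = [1, 1, 1, 2]
    print(round(clustering_f_measure(gold, predicted), 3))   # -> 0.733
```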
We implemented our system on the .NET Framework 4.0 under Windows 10 (64-bit), on an Intel Core i5 machine with 4 GB of RAM. The vnTokenizer tool (Le et al. 2008), available at http://mim.hus.vnu.edu.vn/phuonglh/softwares/vnTokenizer, is used for tokenizing Vietnamese text. The Vietnamese stopwords are combined from existing lists (http://xltiengviet.wikia.com/wiki/Danh_s%C3%A1ch_stop_word and https://github.com/trungvv91/nlp/blob/master/train-data/VNstopwords.txt) and comprise 880 words.

Results and Evaluations

Text clustering has shown better performance when based on utility itemsets than with bisecting K-means (Steinbach, Karypis, and Kumar 2000), so we compare our system with the FI-based text clustering approach proposed by Zhang et al. (2010). The experimental results on the three data sets are shown in Tables 12–14, with different performances at different values of the threshold minwus. Note that the clustering performs differently for different values of minwus, and determining a suitable value for minwus remains a challenge both for mining FIs and for text clustering using FIs. In the tables, we show the performance of clustering using FIs and using FWUI.

Table 12. Experimental results on data set 1 (F-measure at each minwus).
minwus: 0.15  0.14  0.13  0.12  0.11  0.10  0.09  0.085  0.08  0.07  0.05
FI:     0.57  0.59  0.52  0.56  0.64  0.60  0.64  0.61   0.62  0.63  0.68
FWUI:   0.70  0.76  0.78  0.72  0.82  0.80  0.74  0.82   0.83  0.84  0.83

Table 13. Experimental results on data set 2 (F-measure at each minwus).
minwus: 0.15  0.14  0.13  0.12  0.11  0.10  0.095  0.09  0.085  0.08  0.07
FI:     0.46  0.40  0.51  0.50  0.42  0.41  0.40   0.41  0.49   0.42  0.47
FWUI:   0.48  0.47  0.51  0.50  0.48  0.48  0.49   0.49  0.49   0.48  0.50

Table 14. Experimental results on data set 3 (F-measure at each minwus).
minwus: 0.10  0.095  0.09  0.085  0.08  0.075  0.07  0.065  0.06  0.05  0.045  0.04  0.035
FI:     0.37  0.40   0.37  0.37   0.37  0.38   0.44  0.48   0.52  0.50  0.49   0.50  0.54
FWUI:   0.43  0.46   0.54  0.52   0.56  0.57   0.53  0.51   0.53  0.57  0.54   0.54  0.59

Regarding data set 1, consisting of 400 documents in four topics, the clustering performance is quite high. As shown in Table 12, clustering using FIs achieves a best F-measure of 0.68, while clustering using FWUI achieves a higher score of 0.84. For data set 2, comprising 500 documents in five topics, the best clustering performance for both the FI and FWUI features is 0.51 (ref. Table 13). With regard to data set 3, Table 14 shows that the best performances are 0.54 and 0.59 when clustering using FI and FWUI, respectively. From these results, we conclude that using FWUI is better than using FI for text clustering; in other words, FWUI is beneficial for text clustering.

Conclusions and Future Work

In this article, we propose a new approach to clustering documents based on FWUI. First, the TF-IDF scores of all words in the documents are computed to generate a weight matrix for the collection of documents. Second, we introduce the MWIT-FWUI algorithm to mine FWUI from the weight matrix. Last, the Maximum Capturing algorithm is used to cluster documents based on the FWUI. Our experiments show improvements, i.e., a better F-measure, when using FWUI for document clustering compared to using FIs without weights. As a next step, we are going to explore the use of other linguistic features for text clustering, such as content words and named entities. In addition, we are going to apply FWUI to other problems, such as document retrieval, automatic text summarization, and text classification.

Funding

This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2015.10.
References

Agarwal, R. C., C. C. Aggarwal, and V. V. V. Prasad. 2001. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing 61 (3):350–71. doi:10.1006/jpdc.2000.1693.

Aggarwal, C. C., and C. Zhai. 2012. A survey of text clustering algorithms. In Mining Text Data, 77–128. doi:10.1007/978-1-4614-3223-4_4.

Agrawal, R., T. Imieliński, and A. Swami. 1993. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93), 207–16.

Agrawal, R., and R. Srikant. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), 487–99.

Beil, F., M. Ester, and X. Xu. 2002. Frequent term-based text clustering. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), 436–42.

Duong, V. H., T. C. Truong, and B. Vo. 2014. An efficient method for mining frequent itemsets with double constraints. Engineering Applications of Artificial Intelligence 27:148–54. doi:10.1016/j.engappai.2013.09.006.

Fung, B. C. M., K. Wang, and M. Ester. 2003. Hierarchical document clustering using frequent itemsets. In Proceedings of the Third SIAM International Conference on Data Mining, 59–70.

Han, J., J. Pei, and Y. Yin. 2000. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00), 1–12.

Hernández-Reyes, E., R. A. García-Hernández, J. A. Carrasco-Ochoa, and J. F. M. Trinidad. 2006. Document clustering based on maximal frequent sequences. In Advances in Natural Language Processing, 5th International Conference on NLP (FinTAL 2006), 257–67.

Le, H. P., N. T. M. Huyen, A. Roussanaly, and H. T. Vinh. 2008. A hybrid approach to word segmentation of Vietnamese texts. In Language and Automata Theory and Applications, Second International Conference (LATA 2008), Revised Papers, 240–49.

Li, Y., S. M. Chung, and J. D. Holt. 2008. Text document clustering based on frequent word meaning sequences. Data & Knowledge Engineering 64 (1):381–404. doi:10.1016/j.datak.2007.08.001.

Park, J. S., M. Chen, and P. S. Yu. 1995. An effective hash-based algorithm for mining association rules. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD '95), 175–86.

Salton, G., and M. J. McGill. 1986. Introduction to Modern Information Retrieval. New York: McGraw-Hill.

Steinbach, M., G. Karypis, and V. Kumar. 2000. A comparison of document clustering techniques. In KDD Workshop on Text Mining.

Tao, F., F. Murtagh, and M. Farid. 2003. Weighted association rule mining using weighted support and significance framework. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), 661–66.

Truong, C. T., V. H. Duong, and H. N. T. Ngan. 2016. Structure of frequent itemsets with extended double constraints. Vietnam Journal of Computer Science (2):119–35. doi:10.1007/s40595-015-0056-7.

Vo, B. 2017. An efficient method for mining frequent weighted closed itemsets from weighted items transaction databases. Journal of Information Science and Engineering 33 (1):199–216.

Vo, B., F. Coenen, and B. Le. 2013. A new method for mining frequent weighted itemsets based on WIT-trees. Expert Systems with Applications 40 (4):1256–64. doi:10.1016/j.eswa.2012.08.065.

Vo, B., B. Le, and J. J. Jung. 2012. A tree-based approach for mining frequent weighted utility itemsets. In Proceedings of the 4th International Conference on Computational Collective Intelligence: Technologies and Applications (ICCCI '12), Part I, 114–23.

Vo, B., N. Tran, and D. Ngo. 2013. Mining frequent weighted closed itemsets. In Advanced Computational Methods for Knowledge Engineering 479:379–90.

Zaki, M. J., S. Parthasarathy, M. Ogihara, and W. Li. 1997. New algorithms for fast discovery of association rules. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD '97), 283–86.

Zhang, W., T. Yoshida, X. Tang, and Q. Wang. 2010. Text clustering using frequent itemsets. Knowledge-Based Systems 23 (5):379–88. doi:10.1016/j.knosys.2010.01.011.

