CHAPTER 8  Building a text analysis toolkit

Listing 8.23 shows EqualInverseDocFreqEstimator, a trivial implementation of the InverseDocFreqEstimator interface that assigns the same inverse document frequency to every tag.

Listing 8.23  The EqualInverseDocFreqEstimator implementation

package com.alag.ci.textanalysis.lucene.impl;

import com.alag.ci.textanalysis.InverseDocFreqEstimator;
import com.alag.ci.textanalysis.Tag;

public class EqualInverseDocFreqEstimator implements InverseDocFreqEstimator {
    public double estimateInverseDocFreq(Tag tag) {
        return 1.0;
    }
}

Listing 8.24 contains the interface for TextAnalyzer, the primary class used to analyze text.

Listing 8.24  The TextAnalyzer interface

package com.alag.ci.textanalysis;

import java.io.IOException;
import java.util.List;

public interface TextAnalyzer {
    public List<Tag> analyzeText(String text) throws IOException;

    public TagMagnitudeVector createTagMagnitudeVector(String text)
        throws IOException;
}

The TextAnalyzer interface has two methods. The first, analyzeText, returns the list of Tag objects obtained by analyzing the text. The second, createTagMagnitudeVector, returns a TagMagnitudeVector representation of the text; it takes into account the term frequency and the inverse document frequency of each tag to compute the term vector.

Listing 8.25 shows the first part of the implementation of LuceneTextAnalyzer: the constructor and the analyzeText method.
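Before looking at the Lucene-backed implementation, it's worth sketching what a less trivial estimator might look like. EqualInverseDocFreqEstimator gives every tag the same weight; a corpus-backed estimator would give rarer tags a larger value. The class and formula below are illustrative assumptions, not code from this book: it uses the common smoothed form 1 + ln(N / (1 + df)), keyed by plain strings rather than Tag objects.

```java
import java.util.HashMap;
import java.util.Map;

// A hypothetical corpus-backed estimator: tags that appear in fewer documents
// get a larger inverse-document-frequency value. The smoothed formula
// 1 + ln(N / (1 + df)) is a common choice, not the book's own.
class SimpleIdfEstimator {
    private final Map<String, Integer> docFreq = new HashMap<String, Integer>();
    private int totalDocs = 0;

    // Record that one document contained the given (de-duplicated) tags.
    public void addDocument(Iterable<String> tagsInDoc) {
        totalDocs++;
        for (String tag : tagsInDoc) {
            Integer count = docFreq.get(tag);
            docFreq.put(tag, (count == null) ? 1 : count + 1);
        }
    }

    public double estimateInverseDocFreq(String tag) {
        Integer df = docFreq.get(tag);
        int d = (df == null) ? 0 : df;
        return 1.0 + Math.log((double) totalDocs / (1.0 + d));
    }
}
```

With such an estimator plugged in, a tag that occurs in every document would contribute less to the term vector than one that occurs in a single document.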
Listing 8.25  The core of the LuceneTextAnalyzer class

package com.alag.ci.textanalysis.lucene.impl;

import java.io.*;
import java.util.*;

import org.apache.lucene.analysis.*;

import com.alag.ci.textanalysis.*;
import com.alag.ci.textanalysis.termvector.impl.*;

public class LuceneTextAnalyzer implements TextAnalyzer {
    private TagCache tagCache = null;
    private InverseDocFreqEstimator inverseDocFreqEstimator = null;

    public LuceneTextAnalyzer(TagCache tagCache,
            InverseDocFreqEstimator inverseDocFreqEstimator) {
        this.tagCache = tagCache;
        this.inverseDocFreqEstimator = inverseDocFreqEstimator;
    }

    public List<Tag> analyzeText(String text) throws IOException {
        Reader reader = new StringReader(text);
        Analyzer analyzer = getAnalyzer();
        List<Tag> tags = new ArrayList<Tag>();
        TokenStream tokenStream = analyzer.tokenStream(null, reader);
        Token token = tokenStream.next();
        while (token != null) {
            tags.add(getTag(token.termText()));
            token = tokenStream.next();
        }
        return tags;
    }

    protected Analyzer getAnalyzer() throws IOException {
        return new SynonymPhraseStopWordAnalyzer(new SynonymsCacheImpl(),
            new PhrasesCacheImpl());
    }

The method analyzeText obtains an Analyzer; in this case, we use SynonymPhraseStopWordAnalyzer. LuceneTextAnalyzer is really a wrapper class that adapts Lucene-specific classes to those of our infrastructure. Creating the TagMagnitudeVector from text involves computing the term frequency for each tag and using the tag's inverse document frequency to create the appropriate weights. This is shown in listing 8.26.
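The pipeline just described (tokenize, count term frequencies, weight by inverse document frequency) can be sketched without Lucene so the flow is easy to follow on its own. The class below is a simplified stand-in with illustrative names; the real analyzer also removes stop words, detects phrases, injects synonyms, and stems terms, none of which is done here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A Lucene-free sketch of the same pipeline: tokenize, count term
// frequencies, then weight each count by an inverse document frequency.
class MiniAnalyzer {
    // Lowercase and split on anything that isn't a letter, digit, or '.'
    // (so a token like "web2.0" survives); strip sentence-final dots.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String piece : text.toLowerCase().split("[^a-z0-9.]+")) {
            String token = piece.replaceAll("\\.+$", "");
            if (token.length() > 0) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Mirrors computeTermFrequency: token -> number of occurrences.
    public static Map<String, Integer> termFrequency(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<String, Integer>();
        for (String token : tokens) {
            Integer count = freq.get(token);
            freq.put(token, (count == null) ? 1 : count + 1);
        }
        return freq;
    }

    // Mirrors applyIDF: weight = tf * idf, defaulting idf to 1.0, which is
    // exactly what EqualInverseDocFreqEstimator supplies.
    public static Map<String, Double> weights(Map<String, Integer> freq,
                                              Map<String, Double> idf) {
        Map<String, Double> result = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            Double v = idf.get(e.getKey());
            result.put(e.getKey(), e.getValue() * ((v == null) ? 1.0 : v));
        }
        return result;
    }
}
```

For example, tokenize("users to users and Web2.0") yields five tokens, termFrequency counts "users" twice, and weights multiplies each count by its idf value.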
Listing 8.26  Creating the term vectors in LuceneTextAnalyzer

    public TagMagnitudeVector createTagMagnitudeVector(String text)
            throws IOException {
        List<Tag> tagList = analyzeText(text);                        // analyze text to create tags
        Map<Tag,Integer> tagFreqMap = computeTermFrequency(tagList);  // compute term frequencies
        return applyIDF(tagFreqMap);                                  // use inverse document frequency
    }

    private Map<Tag,Integer> computeTermFrequency(List<Tag> tagList) {
        Map<Tag,Integer> tagFreqMap = new HashMap<Tag,Integer>();
        for (Tag tag : tagList) {
            Integer count = tagFreqMap.get(tag);
            if (count == null) {
                count = new Integer(1);
            } else {
                count = new Integer(count.intValue() + 1);
            }
            tagFreqMap.put(tag, count);
        }
        return tagFreqMap;
    }

    private TagMagnitudeVector applyIDF(Map<Tag,Integer> tagFreqMap) {
        List<TagMagnitude> tagMagnitudes = new ArrayList<TagMagnitude>();
        for (Tag tag : tagFreqMap.keySet()) {
            double idf = this.inverseDocFreqEstimator.estimateInverseDocFreq(tag);
            double tf = tagFreqMap.get(tag);
            double wt = tf * idf;
            tagMagnitudes.add(new TagMagnitudeImpl(tag, wt));
        }
        return new TagMagnitudeVectorImpl(tagMagnitudes);
    }

    private Tag getTag(String text) throws IOException {
        return this.tagCache.getTag(text);
    }
}

To create the TagMagnitudeVector, we first analyze the text to create a list of tags:

List<Tag> tagList = analyzeText(text);

Next we compute the term frequency for each of the tags:

Map<Tag,Integer> tagFreqMap = computeTermFrequency(tagList);

And last, we create the vector by combining the term frequency and the inverse document frequency:

return applyIDF(tagFreqMap);

We're done with all the classes we need to analyze text. Next, let's go through an example of how this infrastructure can be used.

8.2.4  Applying the text analysis infrastructure

We use the same example we introduced in section 4.3.1.
Consider a blog entry with the following text (see also figure 8.2):

Title: "Collective Intelligence and Web2.0"
Body: "Web2.0 is all about connecting users to users, inviting users to participate, and applying their collective intelligence to improve the application. Collective intelligence enhances the user experience."

Let's write a simple program that shows the tags produced by analyzing the title and the body. Listing 8.27 shows the code for our simple program.

Listing 8.27  Computing the tokens for the title and body

    private void displayTextAnalysis(String text) throws IOException {  // method to display tags
        List<Tag> tags = analyzeText(text);
        for (Tag tag : tags) {
            System.out.println(tag);
        }
    }

    public static void main(String[] args) throws IOException {
        String title = "Collective Intelligence and Web2.0";
        String body = "Web2.0 is all about connecting users to users, " +
            " inviting users to participate and applying their " +
            " collective intelligence to improve the application." +
            " Collective intelligence" +
            " enhances the user experience";

        TagCacheImpl t = new TagCacheImpl();
        InverseDocFreqEstimator idfEstimator =
            new EqualInverseDocFreqEstimator();
        TextAnalyzer lta = new LuceneTextAnalyzer(t, idfEstimator);
        System.out.print("Analyzing the title \n");
        lta.displayTextAnalysis(title);
        System.out.print("Analyzing the body \n");
        lta.displayTextAnalysis(body);
    }

First we create an instance of the TextAnalyzer class:

TagCacheImpl t = new TagCacheImpl();
InverseDocFreqEstimator idfEstimator = new EqualInverseDocFreqEstimator();
TextAnalyzer lta = new LuceneTextAnalyzer(t, idfEstimator);

Then we get the tags associated with the title and the body. Listing 8.28 shows the output. Note that the output for each tag consists of the unstemmed text and its stemmed value.
Listing 8.28  Tag listing for our example

Analyzing the title
[collective, collect] [intelligence, intellig] [ci, ci]
[collective intelligence, collect intellig] [web2.0, web2.0]

Analyzing the body
[web2.0, web2.0] [about, about] [connecting, connect] [users, user]
[users, user] [inviting, invit] [users, user] [participate, particip]
[applying, appli] [collective, collect] [intelligence, intellig] [ci, ci]
[collective intelligence, collect intellig] [improve, improv]
[application, applic] [collective, collect] [intelligence, intellig]
[ci, ci] [collective intelligence, collect intellig] [enhances, enhanc]
[users, user] [experience, experi]

It's helpful to visualize the tag cloud using the infrastructure we developed in chapter 3. Listing 8.29 shows the code for visualizing the tag cloud.

Listing 8.29  Visualizing the term vector as a tag cloud

    private TagCloud createTagCloud(TagMagnitudeVector tmVector) {
        List<TagCloudElement> elements = new ArrayList<TagCloudElement>();
        for (TagMagnitude tm : tmVector.getTagMagnitudes()) {  // create TagCloudElement instances
            TagCloudElement element = new TagCloudElementImpl(
                tm.getDisplayText(), tm.getMagnitude());
            elements.add(element);
        }
        return new TagCloudImpl(elements,
            new LinearFontSizeComputationStrategy(3, "font-size: "));
    }

    private String visualizeTagCloud(TagCloud tagCloud) {
        HTMLTagCloudDecorator decorator = new HTMLTagCloudDecorator();  // use decorator to visualize tag cloud
        String html = decorator.decorateTagCloud(tagCloud);
        System.out.println(html);
        return html;
    }

The code for generating the HTML to visualize the tag cloud is fairly simple, since all the work was done earlier in chapter 3. We first need to create a List of TagCloudElement instances, by iterating over the term vector.
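Listing 8.29 configures the strategy as LinearFontSizeComputationStrategy(3, "font-size: "), that is, three font sizes assigned linearly by magnitude. The helper below is a guess at the kind of linear bucketing such a strategy performs; it is an assumption for illustration, not chapter 3's actual code.

```java
// A guess at what a linear font-size strategy does: split the magnitude range
// [min, max] into numSizes equal-width buckets and return the bucket index
// (0 = smallest font). This mirrors the idea behind chapter 3's
// LinearFontSizeComputationStrategy but is not the book's code.
class LinearFontSizer {
    public static int fontSize(double magnitude, double min, double max,
                               int numSizes) {
        if (max <= min) {
            return 0;  // all tags have equal weight: use the smallest size
        }
        int size = (int) Math.floor(numSizes * (magnitude - min) / (max - min));
        return Math.min(size, numSizes - 1);  // magnitude == max lands in top bucket
    }
}
```

With three sizes and the magnitudes of listing 8.31, the dominant tags (around 0.38 to 0.44) would land in the largest bucket and the 0.11 tags in the smallest.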
Once we create a TagCloud instance, we can generate the HTML using the HTMLTagCloudDecorator class.

The title "Collective Intelligence and Web2.0" gets converted into five tags: [collective, collect] [intelligence, intellig] [ci, ci] [collective intelligence, collect intellig] [web2.0, web2.0]. This is also shown in figure 8.12.

Figure 8.12  The tag cloud for the title, consisting of five tags

Similarly, the body gets converted into 15 tags, as shown in figure 8.13.

Figure 8.13  The tag cloud for the body, consisting of 15 tags

We can extend our example to compute the tag magnitude vectors for the title and body, and then combine the two vectors, as shown in listing 8.30.

Listing 8.30  Computing the TagMagnitudeVector

    TagMagnitudeVector tmTitle = lta.createTagMagnitudeVector(title);
    TagMagnitudeVector tmBody = lta.createTagMagnitudeVector(body);
    TagMagnitudeVector tmCombined = tmTitle.add(tmBody);
    System.out.println(tmCombined);
}

The output from the second part of the program is shown in listing 8.31. Note that the top tags for this blog entry are users, collective, ci, intelligence, collective intelligence, and web2.0.

Listing 8.31  Results from displaying the TagMagnitudeVector

[users, user, 0.4364357804719848] [collective, collect, 0.3842122429322726]
[ci, ci, 0.3842122429322726] [intelligence, intellig, 0.3842122429322726]
[collective intelligence, collect intellig, 0.3842122429322726]
[web2.0, web2.0, 0.3345216912320663] [about, about, 0.1091089451179962]
[applying, appli, 0.1091089451179962] [application, applic, 0.1091089451179962]
[enhances, enhanc, 0.1091089451179962] [inviting, invit, 0.1091089451179962]
[improve, improv, 0.1091089451179962] [experience, experi, 0.1091089451179962]
[participate, particip, 0.1091089451179962] [connecting, connect, 0.1091089451179962]

The same data can be better visualized using the tag cloud shown in
figure 8.14.

Figure 8.14  The tag cloud for the combined title and body, consisting of 15 tags

So far, we've developed an infrastructure for analyzing text. The core infrastructure interfaces are independent of Lucene-specific classes and can be implemented by other text analysis packages. The text analysis infrastructure is useful for extracting tags and creating a term vector representation of the text. This term vector representation is helpful for personalization, building predictive models, clustering to find patterns, and so on.

8.3  Use cases for applying the framework

This has been a fairly technical chapter. We've gone through a lot of effort to develop infrastructure for text analysis. It's useful to briefly review some of the use cases where this infrastructure can be applied. These are shown in table 8.5.

Table 8.5  Some use cases for text analysis infrastructure

Analyzing a number of text documents to extract the most relevant keywords: The term vectors associated with the documents can be combined to build a representation for the document set. You can use this approach to build an automated representation for a set of documents visited by a user, or for finding items similar to a set of documents.

Advertising: To show relevant advertisements on a page, you can take the keywords associated with the text and find the subset of keywords that have advertisements assigned.

Classification and predictive models: The term vector representation can be used as an input for building predictive models and classifiers.

We've already demonstrated the process of analyzing text to extract the keywords associated with it. Figure 8.15 shows an example of how relevant terms can be detected and hyperlinked. In this case, relevant terms are hyperlinked and made available to users and web crawlers, inviting them to explore other pages of interest.

Figure 8.15  An example of automatically detecting relevant terms by analyzing text

There are two main approaches to advertising that are normally used in an application. First, sites sell search words: certain keywords that are sold to advertisers. Let's say that the phrase collective intelligence has been sold to an advertiser. Whenever the user types collective intelligence in the search box or visits a page that's related to collective intelligence, we want to show the advertisement related to this keyword. The second approach is to associate text with an advertisement (showing relevant products works the same way): analyze the text, create a term vector representation, and then dynamically associate the relevant ad based on the main context of the page and who's viewing it. This approach is similar to building a content-based recommendation system, which we do in chapter 12.

In the next two chapters, we demonstrate how we can use the term vector representation of text to cluster documents and to build predictive models and text classifiers.

8.4  Summary

Apache Lucene is a Java-based open source text analysis toolkit and search engine. The text analysis package for Lucene contains an Analyzer, which creates a TokenStream. A TokenStream is an enumeration of Token instances and is implemented by a Tokenizer and a TokenFilter. You can create custom text analyzers by subclassing available Lucene classes. In this chapter, we developed two custom text analyzers. The first one normalizes the text, applies a stop word list, and uses the Porter stemming algorithm. The second analyzer normalizes the text, applies a stop word list, detects phrases using a phrase dictionary, and injects synonyms.
Next we discussed developing a text analysis package whose core interfaces are independent of Lucene. A Tag class is the fundamental building block for this package. Tags that have the same stemmed values are considered equivalent. We introduced the following entities: TagCache, through which Tag instances are created; PhrasesCache, which contains the phrases of interest; SynonymsCache, which stores the synonyms used; and InverseDocFreqEstimator, which provides an estimate of the inverse document frequency for a particular tag. All these entities are used by the TextAnalyzer to create tags and to develop a term (tag) magnitude vector representation of the text.

The text analysis infrastructure we developed can be used to derive the metadata associated with text. This metadata can be used to find other similar content, to build predictive models, and to find other patterns by clustering the data. Having built the infrastructure to decompose text into individual tags and magnitudes, we next take a deeper look at clustering data. In the next chapter, we use the infrastructure developed here, along with the infrastructure to search the blogosphere developed in chapter 5.

8.5  Resources

Ackerman, Rich. "Vector Model Information Retrieval." 2003. http://www.hray.com/5264/math.htm
Gospodnetic, Otis, and Erik Hatcher. Lucene in Action. 2004. Manning.
"Term vector theory and keywords." http://forums.searchenginewatch.com/archive/index.php/t-489.html

CHAPTER 9  Discovering patterns with clustering

It's fascinating to analyze the results found by machine learning algorithms. One of the most commonly used methods for discovering groups of related users or content is the process of clustering, which we discussed briefly in chapter 7. Clustering algorithms run in an automated manner and can create pockets, or clusters, of related items.
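Grouping related items presupposes a way to compare two of them. For the sparse term vectors built in chapter 8, a common choice is the dot product over shared tags, which equals the cosine similarity once each vector is scaled to unit length (the magnitudes in listing 8.31 are already unit length: their squares sum to 1). The class below is a generic sketch with illustrative names, not the book's TagMagnitudeVector implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Similarity between sparse term vectors, sketched over plain string maps.
class VectorOps {
    // Scale a sparse vector to unit (L2) length.
    public static Map<String, Double> normalize(Map<String, Double> v) {
        double norm = 0.0;
        for (double value : v.values()) {
            norm += value * value;
        }
        norm = Math.sqrt(norm);
        if (norm == 0.0) {
            return new HashMap<String, Double>(v);  // empty or all-zero vector
        }
        Map<String, Double> result = new HashMap<String, Double>();
        for (Map.Entry<String, Double> e : v.entrySet()) {
            result.put(e.getKey(), e.getValue() / norm);
        }
        return result;
    }

    // Dot product over the tags the two vectors share; 0.0 if none.
    // For unit-length vectors this is the cosine similarity.
    public static double dot(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }
}
```

Two documents with identical tag distributions score 1.0; documents sharing no tags score 0.0. A similarity of this shape is what lets a clustering algorithm decide which blog entries belong together.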
Results from clustering can be leveraged to build classifiers, to build predictors, or in collaborative filtering. These unsupervised learning algorithms can provide insight into how your data is distributed.

In the last few chapters, we built a lot of infrastructure. It's now time to have some fun and leverage this infrastructure to analyze some real-world data. In this chapter, we focus on understanding and applying some of the key clustering algorithms.

This chapter covers
■ k-means, hierarchical clustering, and probabilistic clustering
■ Clustering blog entries
■ Clustering using WEKA
■ Clustering using the JDM APIs

K-means, hierarchical clustering, and expectation maximization (EM) are three of the most commonly used clustering algorithms. As discussed in section 2.2.6, there are two main representations for data. The first is the low-dimension, densely populated dataset; the second is the high-dimension, sparsely populated dataset, which we use with text term vectors and to represent user click-through. In this chapter, we look at clustering techniques for both kinds of datasets.

We begin the chapter by creating a dataset that contains blog entries retrieved from Technorati.¹ Next, we implement the k-means clustering algorithm to cluster the blog entries. We leverage the infrastructure developed in chapter 5 to retrieve blog entries and combine it with the text analysis toolkit we developed in chapter 8. We also demonstrate how another clustering algorithm, hierarchical clustering, can be applied to the same problem. We look at some of the other practical data, such as user clickstream data, that can be analyzed in a similar manner. Next, we look at how WEKA can be leveraged for clustering densely populated datasets and illustrate the process using the EM algorithm.
We end the chapter by looking at the clustering-related interfaces defined by JDM and develop code to cluster instances using the JDM APIs.

9.1  Clustering blog entries

In this section, we demonstrate the process of developing and applying various clustering algorithms by discovering groups of related blog entries from the blogosphere. This example retrieves live blog entries from the blogosphere on the topic of "collective intelligence" and converts them to the tag vector format, to which we apply different clustering algorithms. Figure 9.1 illustrates the steps involved in this example:

1  Using the APIs developed in chapter 5 to retrieve a number of current blog entries from Technorati.
2  Using the infrastructure developed in chapter 8 to convert the blog entries into a tag vector representation.
3  Developing a clustering algorithm to cluster the blog entries.

Of course, we keep our infrastructure generic so that the clustering algorithms can be applied to any tag vector representation. We begin by creating the dataset associated with the blog entries. The clustering algorithms implemented in WEKA are for finding clusters in a dense dataset; therefore, we develop our own implementations of the different clustering algorithms, beginning with k-means clustering and following with hierarchical clustering. It's helpful to look at the set of classes that we need to build for our clustering infrastructure. We review these classes next.

¹ You can use any of the blog-tracking providers we discussed in chapter 5.
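The two alternating steps at the heart of k-means (assign each item to its nearest center, then recompute each center as the mean of its members) can be sketched on one-dimensional numeric points before we apply them to tag vectors. The class below is an illustrative simplification: it starts the centers at the first k points so the result is deterministic, whereas the book's implementation picks random starting points and works over tag magnitude vectors with a similarity measure instead of absolute distance.

```java
import java.util.Arrays;

// The core k-means loop on one-dimensional points: assign each point to the
// nearest center, recompute each center as the mean of its points, and
// repeat until the assignment stops changing.
class KMeans1D {
    public static double[] cluster(double[] points, int k, int maxIterations) {
        double[] centers = Arrays.copyOf(points, k);  // deterministic init
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Step 1: assign each point to its nearest center.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(points[i] - centers[c])
                            < Math.abs(points[i] - centers[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (iter > 0 && !changed) {
                break;  // assignments are stable: converged
            }
            // Step 2: recompute each center as the mean of its members.
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int i = 0; i < points.length; i++) {
                sum[assignment[i]] += points[i];
                count[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) {
                    centers[c] = sum[c] / count[c];
                }
            }
        }
        return centers;
    }
}
```

On the points {1, 2, 10, 11} with k = 2, the loop converges in a few iterations to centers 1.5 and 10.5. Clustering blog entries works the same way, with tag vectors in place of numbers and vector similarity in place of absolute distance.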