
Data Analysis, Machine Learning and Applications, Episode 3, Part 5 (PDF)



Document information: 25 pages, 541.39 KB

Content

As an application, we operate on automatic address extraction from web pages for the tourist domain.

1.1 Motivation: Address extraction from the web

In an open-domain spoken dialog system, the automatic learning of ontological concepts and the corresponding relations between them is essential, as complete manual modeling is neither practicable nor feasible due to the continuously changing denotation of real-world objects. The emergence of new entities in the world therefore entails the need for a method to deal with those entities in a spoken dialog system, as described in Loos (2006).

As a use case for this challenging problem, imagine a user asking the dialog system for a newly established restaurant in a city, e.g. "How do I get to the Auerstein?". So far, the system has no information about the object and needs the help of an incremental learning component to be able to give the demanded answer to the user. A classification, as well as any other information for the word "Auerstein", is hitherto not modeled in the knowledge base and can be obtained by text mining methods as described in Faulhaber et al. (2006). As soon as the object is classified and located in the system's domain ontology, it can be concluded that it is a building and that all buildings have addresses. At this stage the work described herein comes into play, which deals with the extraction of addresses from unstructured text. With a web service (as part of the dialog system's infrastructure), the newly found address for the demanded object can be used for a route instruction.

Even though structured and semi-structured texts such as online directories can be harvested as well, they often do not contain addresses of new places and therefore do not cover all addresses needed. However, a search in such directories can be combined with the method described herein, which can serve as a fallback solution.

1.2 Unsupervised learning supporting supervised methods

Current research in supervised approaches to NLP often tries to reduce the amount of human effort required for collecting labeled examples by defining methodologies and algorithms that make better use of the training set provided. Another promising direction to tackle this problem is to empower standard learning algorithms with unlabeled data in addition to the labeled texts. In the machine learning literature, this learning scheme has been called semi-supervised learning (Sarkar and Haffari, 2006). The underlying idea behind our approach is that syntactic and semantic similarity of words is an inherent property of corpora, and that it can be exploited to help a supervised classifier build a better categorization hypothesis, even if the amount of labeled training data provided for learning is very low. We emphasize that every contribution to widening the acquisition bottleneck is useful, as long as its application does not cause more extra work than the contribution is worth. Here, we provide a methodology to plug an unsupervised tagger into an address extraction system and measure its contribution.

2 Data preparation

In our semi-supervised setting, we require two different data sets: a small, manually annotated dataset used for training our supervised component, and a large, unannotated dataset for training the unsupervised part of the system. This section describes how both datasets were obtained.
For both datasets we used the results of Google queries for places such as restaurants, cinemas, shops etc. To obtain the annotated data set, 400 of the resulting Google pages for the addresses of the corresponding named entities were annotated manually with the labels street, house, zip and city; all other tokens received the label O. As the unsupervised learning method needs large amounts of data, we used a list of about 20,000 Google queries, each returning about 10 pages, to obtain an appropriate amount of plain text. After filtering the resulting 700 MB of raw data for German language and applying cleaning procedures as described in Quasthoff et al. (2006), we ended up with about 160 MB totaling 22.7 million tokens. This corpus was used for training the unsupervised tagger.

3 Unsupervised tagging

3.1 Approach

Unlike in standard (supervised) tagging, the unsupervised variant relies neither on a set of predefined categories nor on any labeled text. As a tagger is not an application in its own right, but serves as a pre-processing step for systems building upon it, the names and the number of syntactic categories are very often not important.

The system presented in Biemann (2006a) uses Chinese Whispers clustering (Biemann, 2006b) on graphs constructed by distributional similarity to induce a lexicon of supposedly non-ambiguous words with respect to part of speech (PoS), selecting only safe bets and excluding questionable cases from the category-building process. In this implementation two clusterings are combined, one for high- and medium-frequency words, the other collecting medium- and low-frequency words. High- and medium-frequency words are clustered by similarity of their stop word context feature vectors: a graph is built, including only words that are endpoints of highly similar pairs. Clustering this graph of typically 5,000 vertices results in several hundred clusters, which are subsequently used as PoS categories. To extend the lexicon, words of medium and low frequency are clustered using a graph that encodes similarity of significant neighbor co-occurrences (as defined in Dunning, 1993). Both clusterings are mapped by overlapping elements into a lexicon that provides PoS information for some 50,000 words.
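The paper gives no code for this step; as an illustration, the following is a minimal sketch of the Chinese Whispers idea in R (the environment also used later in this volume), with a hypothetical edge list standing in for the distributional-similarity graph: every word starts in its own class and repeatedly adopts the class carrying the most edge weight in its neighborhood.

```r
# Minimal Chinese Whispers sketch (illustration only, not the authors' code).
# 'edges' is a hypothetical undirected word-similarity graph given as a data
# frame with character columns 'from'/'to' and a numeric 'weight'.
chinese_whispers <- function(edges, max_iter = 20) {
  nodes <- union(edges$from, edges$to)
  label <- setNames(seq_along(nodes), nodes)     # each word in its own class
  for (iter in seq_len(max_iter)) {
    changed <- FALSE
    for (v in sample(nodes)) {                   # randomized update order
      out <- edges$from == v
      inc <- edges$to == v
      nb  <- c(edges$to[out], edges$from[inc])   # neighbors of v
      w   <- c(edges$weight[out], edges$weight[inc])
      if (length(nb) == 0) next
      votes <- tapply(w, label[nb], sum)         # edge weight per neighbor class
      best  <- as.integer(names(which.max(votes)))
      if (best != label[v]) { label[v] <- best; changed <- TRUE }
    }
    if (!changed) break                          # labeling has converged
  }
  split(names(label), label)                     # clusters of word types
}

# Toy graph with two obvious groups (cf. the location/verb tags in Section 3.2)
edges <- data.frame(from   = c("Potsdam", "Passau", "Potsdam", "habe", "lernt"),
                    to     = c("Passau", "Jena", "Jena", "lernt", "wohnte"),
                    weight = 1)
chinese_whispers(edges)
```

Each iteration costs time linear in the number of edges, which is what makes clustering graphs with thousands of vertices and large lexica feasible.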
For obtaining a clustering on datasets of this size, an efficient algorithm like Chinese Whispers is crucial. Increased lexicon size is the main difference between this and other approaches (e.g. Schütze, 1995; Freitag, 2004), which typically operate with 5,000 words. Using the lexicon, a trigram tagger with a morphological extension is trained, which can be used to assign tags to all tokens in a text. The tag sets obtained with this method are usually more fine-grained than standard tag sets and reflect syntactic as well as semantic similarity. In Biemann (2006a), the tagger output was directly evaluated against supervised taggers for English, German and Finnish via information-theoretic measures. While such a scale makes it possible to compare the relative performance of different components of a system, or of different systems, it gives only a poor impression of the utility of the unsupervised tagger's output. Therefore, an application-based evaluation is undertaken here.

3.2 Resulting tagset

As described in Section 2, we had a relatively small corpus in comparison to previous work with the same tagger, which typically operates on about 50 million tokens. Nonetheless, the domain specificity of the corpus leads to an appropriate tagging, as can be seen in the following examples from the resulting tag set (numbers in parentheses give the number of words in the lexicon per tag):

1. Nouns: Verhandlungen, Schritt, Organisation, Lesungen, Sicherung (800)
2. Verbs: habe, lernt, wohnte, schien, hat, reicht, suchte (191)
3. Adjectives: französischen, künstlerischen, religiösen (142)
4. Locations: Potsdam, Passau, Innsbruck, Ludwigsburg, Jena (320)
5. Street names: Bismarckstr, Leonrodstr, Schillerstr, Ungererstr (150)

On the one hand, big clusters are formed that correspond to syntactic tags, as shown for the example tags 1 to 3. Items 4 and 5 show that the clustering process creates not only syntactic tags, but also domain-specific tags, which are useful for address extraction. Note that the actual tagger is capable of tagging all words, not only words in the lexicon; the number of words in the lexicon is merely the number of types used for training. We emphasize that the comparatively small training corpus (usually, 50M–500M tokens are employed) leaves room for improvement, as more training text has shown a positive impact on tagging quality in previous studies.

4 Experiments and evaluation

This section describes the supervised system, the evaluation methodology and the results we obtained in a comparative evaluation of either providing or not providing the unsupervised tags.

4.1 Conditional random field tagger

We cast address extraction as a tagging task: labels indicating city, street, house number, zip code or other (O) are learned from the training set and applied to unseen examples. Note that this is not comparable to a standard task like Named Entity Recognition (cf. Roth and van den Bosch, 2002), since we are only interested in labeling the address of the target location, and not other addresses that might be contained in the same document. Rather, this is an instance of Information Extraction (see Grishman, 1997). For performing the task, we train the MALLET tagger (McCallum, 2002), which is based on Conditional Random Fields (CRFs, see Lafferty et al., 2001). CRFs define a conditional probability distribution over label sequences given a particular observation sequence. CRFs have been shown to have equal or superior performance at tagging tasks compared to other systems like Hidden Markov Models or the Maximum Entropy framework. The flexibility of CRFs to include arbitrary, non-independent features allows us to supply unsupervised tags or no tags to the system without changing the overall architecture.
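The text does not restate the model form; for reference, the linear-chain CRF of Lafferty et al. (2001) that underlies MALLET's sequence tagging defines the conditional distribution

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Bigl( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Bigr),
\qquad
Z(x) = \sum_{y'} \exp\Bigl( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Bigr),
```

where the feature functions f_k may inspect the entire observation sequence x. Because no independence among features is assumed, the word, position and unsupervised-tag features described below can be supplied or withheld without touching the model; a second-order CRF simply lets f_k condition on y_{t-2} as well.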
The tagger can operate on different sets of features ranging over different distances. The following features per instance are made available to the CRF:

• the word itself
• the relative position to the target name
• the unsupervised tag

We experimented with different orders as well as with different time shifts.

CRF order

The order of the CRF defines how many preceding labels are used for the determination of the current label. An order of 1 means that only the previous label is used, order 2 allows the usage of the two previous labels, etc. As higher orders mean more information, which is in turn supported by fewer training examples, an optimum at some small order can be expected.

Time shifting

Time shifting is an operation that allows the CRF to use not only the features of the current position, but also features from surrounding positions. This is achieved by copying the features from surrounding positions, marking them with the relative position they were copied from. As with orders, an optimum can be expected for some small range of time shifting, exhibiting the same information/sparseness trade-off. For illustration, the following listing shows an original training instance with time shift 0, as well as the same instance with time shifts -2, -1, 0, 1, 2, for the scenario with unsupervised tags. Note that relative positions are not copied in time shifting because of redundancy. The following items show these shifts:

• shift 0:
  – Extrablatt 0 T115 O
  – 53 1 T215 house
  – Hauptstr 2 T64 street
  – Heidelberg 3 T15 city
  – 69117 4 T215 zip
• shift 1:
  – 1 -1:Extrablatt -1:T115 0:53 0:T215 1:Hauptstr 1:T64 house
  – 2 -1:53 -1:T215 0:Hauptstr 0:T64 1:Heidelberg 1:T15 street
• shift 2:
  – 1 -2:Cafe -2:T10 -1:Extrablatt -1:T115 0:53 0:T215 1:Hauptstr 1:T64 2:Heidelberg 2:T15 house
  – 2 -2:Extrablatt -2:T115 -1:53 -1:T215 0:Hauptstr 0:T64 1:Heidelberg 1:T15 2:69117 2:T215 street

In the example for shift 0, a full address with all features is shown: word, relative position to the target "Extrablatt", unsupervised tag and classification label. For exemplifying shifts 1 and 2, only two lines are given, with -2:, -1:, 0:, 1: and 2: marking the relative position of copied features. In the scenario without unsupervised tags, all features "T<number>" are omitted.
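A small sketch of this feature copying (hypothetical helper; the uncopied relative-position feature and the label are left out for brevity):

```r
# Sketch of time-shift feature expansion (illustration only).
# 'feats' is a list holding, per token, its copyable features -- here the
# word and a hypothetical unsupervised tag such as "T115".
shift_features <- function(feats, shift = 2) {
  n <- length(feats)
  sapply(seq_len(n), function(i) {
    parts <- character(0)
    for (d in -shift:shift) {            # surrounding positions
      j <- i + d
      if (j < 1 || j > n) next           # no padding beyond the instance
      parts <- c(parts, paste0(d, ":", feats[[j]]))
    }
    paste(parts, collapse = " ")         # one feature string per token
  })
}

feats <- list(c("Extrablatt", "T115"), c("53", "T215"), c("Hauptstr", "T64"),
              c("Heidelberg", "T15"), c("69117", "T215"))
shift_features(feats, shift = 1)[2]
# "-1:Extrablatt -1:T115 0:53 0:T215 1:Hauptstr 1:T64"
```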
4.2 Evaluation methodology

For evaluation, we split the training set into 5 equally sized parts and performed 5 sub-experiments per parameter setting and scenario, using 4 parts for training and the remaining part for evaluation, in a 5-fold cross-validation fashion. The split was performed per target location: locations in the test set were never contained in the training set. To determine our system's performance, we measured the number of correctly classified, incorrectly classified (false positive) and missed (false negative) instances per class, and report the standard measures precision, recall and F1 as described in van Rijsbergen (1979). The 5 sub-experiments were combined and checked against the full training set.
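Spelled out, the per-class measures reduce to the usual definitions; a minimal sketch with hypothetical gold and predicted label vectors:

```r
# Precision, recall and F1 for one class from gold vs. predicted labels
# (sketch; 'gold' and 'pred' are hypothetical inputs).
prf <- function(gold, pred, class) {
  tp <- sum(pred == class & gold == class)   # correctly classified
  fp <- sum(pred == class & gold != class)   # false positives
  fn <- sum(pred != class & gold == class)   # missed instances
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, F1 = f1)
}

gold <- c("street", "O", "zip", "city", "street")
pred <- c("street", "O", "O",   "city", "O")
prf(gold, pred, "street")   # precision 1.00, recall 0.50, F1 0.67
```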
4.3 Results

Our objective is to examine to what extent the unsupervised tagger influences classification results. Conducting the experiments with different CRF parameters as outlined in Section 4.1, we found different behaviors for our four target classes: whereas results for street and house number were slightly better in the second-order CRF experiments, the first-order CRF scored clearly higher for city and zip code. Restricting experiments to first-order CRFs and considering different shifts, a shift of 2 in both directions scored best for all classes except city, where shifts of 0 and 1 resulted in slightly higher scores. The best overall setting, therefore, was determined to be the first-order CRF with a shift of 2. For this setting, Figure 1 presents the results in terms of precision, recall and F1.

Fig. 1. Results in precision, recall and F1 for all classes, obtained with a first-order CRF and a shift of 2.

What can be observed not only in Figure 1 but across all parameter settings is the following: using unsupervised tags as features, as compared to no tagging, leads to a slightly decreased precision but a substantial increase in recall, and always affects the F1 measure positively. The reason can be sought in the generalization power of the tagger: having at hand syntactic-semantic tags instead of merely plain words, the system is able to classify more instances correctly, as the tag (but not the word) has occurred with the correct classification in the training set before. Due to overgeneralization or tagging errors, however, precision is decreased. The effect is strongest for street, with a loss of 7% in precision against a recall boost of 14%. In general, unsupervised tagging clearly helps at this task, as a little loss in precision is more than compensated by a boost in recall.

5 Conclusion and further work

In this research we have shown that the use of large, unannotated text can improve classification results on small, manually annotated training sets, by building a tagger model with unsupervised tagging and using the unsupervised tags as features in the learning algorithm. The benefit of unsupervised tagging is especially significant in domain-specific settings, where standard pre-processing steps such as supervised tagging do not capture the abstraction granularity necessary for the task, or where simply no tagger for the target language is available. For further work, we aim at combining the possibly several addresses found per target location. Given the evaluation values obtained with our method, the task of dynamically extracting addresses from web pages to support address search for the tourist domain is feasible and a valuable, dynamic add-on to directory-based address search.

References

BIEMANN, C. (2006a): Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering. Proceedings of the COLING/ACL-06 Student Research Workshop, Sydney, Australia.
BIEMANN, C. (2006b): Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. Proceedings of the HLT-NAACL-06 Workshop on Textgraphs, New York, USA.
DUNNING, T. (1993): Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics 19(1), 61–74.
FAULHABER, A., LOOS, B., PORZEL, R. and MALAKA, R. (2006): Towards Understanding the Unknown: Open-class Named Entity Classification in Multiple Domains. Proceedings of the OntoLex Workshop at LREC, Genoa, Italy.
FREITAG, D. (2004): Toward Unsupervised Whole-corpus Tagging. Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland.
GRISHMAN, R. (1997): Information Extraction: Techniques and Challenges. In: M. T. Pazienza (Ed.): Information Extraction. Springer, Lecture Notes in Artificial Intelligence, Rome.
LAFFERTY, J., McCALLUM, A. K. and PEREIRA, F. (2001): Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of ICML-01, 282–289.
LOOS, B. (2006): On2L - A Framework for Incremental Ontology Learning in Spoken Dialog Systems. Proceedings of the COLING/ACL-06 Student Research Workshop, Sydney, Australia.
McCALLUM, A. K. (2002): MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu.
QUASTHOFF, U., RICHTER, M. and BIEMANN, C. (2006): Corpus Portal for Search in Monolingual Corpora. Proceedings of LREC-06, Genoa, Italy.
ROTH, D. and VAN DEN BOSCH, A. (Eds.) (2002): Proceedings of the Sixth Workshop on Computational Language Learning (CoNLL-02), Taipei, Taiwan.
SARKAR, A. and HAFFARI, G. (2006): Inductive Semi-supervised Learning Methods for Natural Language Processing. Tutorial at HLT-NAACL-06, New York, USA.
SCHÜTZE, H. (1995): Distributional Part-of-Speech Tagging. Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics, Dublin, Ireland.
VAN RIJSBERGEN, C. J. (1979): Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow.

Text Mining of Supreme Administrative Court Jurisdictions

Ingo Feinerer and Kurt Hornik

Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, A-1090 Wien, Austria
{h0125130, Kurt.Hornik}@wu-wien.ac.at

Abstract. Within the last decade text mining, i.e., extracting sensitive information from text corpora, has become a major factor in business intelligence. The automated textual analysis of law corpora is highly valuable because of its impact on a company's legal options and the raw amount of available jurisdiction. The study of supreme court jurisdiction and international law corpora is equally important due to its effects on business sectors. In this paper we use text mining methods to investigate Austrian supreme administrative court jurisdictions concerning dues and taxes. We analyze the law corpora using R with the new text mining package tm. Applications include clustering the jurisdiction documents into groups modeling tax classes (like income or value-added tax) and identifying jurisdiction properties. The findings are compared to results obtained by law experts.

1 Introduction

A thorough discussion and investigation of existing jurisdictions is a fundamental activity of law experts, since convictions provide insight into the interpretation of legal statutes by supreme courts. On the other hand, text mining has become an effective tool for analyzing text documents in automated ways. Conceptually, clustering and classification of jurisdictions, as well as identifying patterns in law corpora, are of key interest, since they aid law experts in their analyses. E.g., clustering of primary and secondary law documents as well as actual law firm data has been investigated by Conrad et al. (2005). Schweighofer (1999) has conducted research on automatic text analysis of international law. In this paper we use text mining methods to investigate Austrian supreme administrative court jurisdictions concerning dues and taxes. The data is described in Section 2 and analyzed in Section 3. Results of applying clustering and classification techniques are compared to those found by tax law experts. We also propose a method for automatic feature extraction (e.g., of the senate size) from Austrian supreme court jurisdictions. Section 4 concludes.

2 Administrative Supreme Court jurisdictions

2.1 Data

The data set for our text mining investigations consists of 994 text documents. Each document contains a jurisdiction of the Austrian supreme administrative court (Verwaltungsgerichtshof, VwGH) in German language. Documents were obtained through the legal information system (Rechtsinformationssystem, RIS; http://ris.bka.gv.at/) coordinated by the Austrian Federal Chancellery. Unfortunately, documents delivered through the RIS interface are HTML documents oriented for browser viewing and possess no explicit metadata describing additional jurisdiction details (e.g., the senate with its judges or the date of decision).
The data set corresponds to a subset of about 1000 documents of material used for the research project "Analyse der abgabenrechtlichen Rechtsprechung des Verwaltungsgerichtshofes", supported by a grant from the Jubiläumsfonds of the Austrian National Bank (Oesterreichische Nationalbank, OeNB); see Nagel and Mamut (2006). Based on the work of Achatz et al. (1987), who analyzed tax law jurisdictions in the 1980s, this project investigates whether and how the results and trends found by Achatz et al. compare to jurisdictions between 2000 and 2004, giving insight into legal norm changes and their effects, and unveiling information on the quality of executive and juristic authorities. In the course of the project, jurisdictions especially related to dues (e.g., on a federal or communal level) and taxes (e.g., income, value-added or corporate taxes) were classified by human tax law experts. These classifications will be employed for validating the results of our text mining analyses.

2.2 Data preparation

We use the open source software environment R for statistical computing and graphics, in combination with the R text mining package tm, to conduct our text mining experiments. R provides premier methods for clustering and classification, whereas tm provides a sophisticated framework for text mining applications, offering functionality for managing text documents, abstracting the process of document manipulation and easing the usage of heterogeneous text formats.

Technically, the jurisdiction documents in HTML format were downloaded through the RIS interface. To work with this inhomogeneous set of malformed HTML documents, HTML tags and unnecessary white space were removed, resulting in plain text documents. We wrote a custom parsing function to handle the automatic import into tm's infrastructure and to extract basic document metadata (like the file number).
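The custom import code is not part of the paper; a rough sketch of the kind of cleaning and corpus construction described above might look as follows (hypothetical directory layout; tm's generic Corpus/VectorSource constructor is assumed):

```r
library(tm)

# Strip HTML tags and collapse white space into plain text (sketch).
html_to_text <- function(file) {
  raw <- paste(readLines(file, warn = FALSE), collapse = " ")
  txt <- gsub("<[^>]+>", " ", raw)   # drop HTML markup
  gsub("\\s+", " ", txt)             # normalize unnecessary white space
}

files <- list.files("jurisdictions", pattern = "\\.html$", full.names = TRUE)
docs  <- vapply(files, html_to_text, character(1))

# Plain texts become a tm corpus; in the real setup a custom parsing
# function would also extract metadata such as the file number.
corpus <- Corpus(VectorSource(docs))
```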
3 Investigations

3.1 Grouping the jurisdiction documents into tax classes

When working with larger collections of documents, it is useful to group them into clusters in order to provide homogeneous document sets for further investigation by experts specialized on the relevant topics. Thus, we investigate different methods known in the text mining literature and compare their results with the results found by law experts.

k-means clustering

We start with the well-known k-means clustering method on term-document matrices. Let tf_{t,d} be the frequency of term t in document d, m the number of documents, and df_t the number of documents containing the term t. Term-document matrices M with respective entries Z_{t,d} are obtained by suitably weighting the term-document frequencies. The most popular weighting schemes are term frequency (tf), where Z_{t,d} = tf_{t,d}, and term frequency inverse document frequency (tf-idf), with Z_{t,d} = tf_{t,d} log2(m / df_t), which reduces the impact of irrelevant terms and highlights discriminative ones by normalizing each matrix element under consideration of the number of all documents. We use both weightings in our tests. In addition, text corpora were stemmed before computing term-document matrices, via the Rstem (Temple Lang, 2006) and Snowball (Hornik, 2007) R packages, which provide the Snowball stemming algorithm (Porter, 1980).

Domain experts typically suggest a basic partition of the documents into three classes (income tax, value-added tax, and other dues). Thus, we investigated the extent to which this partition is obtained by automatic classification. We used our data set of about 1000 documents and performed k-means clustering for k ∈ {2, ..., 10}. The best results were in the range between k = 3 and k = 6 when considering the improvement of the within-cluster sum of squares. These results are shown in Table 1. For each k, we compute the agreement between the k-means results, based on the term-document matrices with either tf or tf-idf weighting, and the expert rating into the basic classes, using both the Rand index (Rand) and the Rand index corrected for agreement by chance (cRand). The row "Average" shows the average agreement over the four values of k. Results are almost identical for the two weightings employed.

Table 1. Rand index and Rand index corrected for agreement by chance of the contingency tables between k-means results, for k ∈ {3, 4, 5, 6}, and expert ratings for tf and tf-idf weightings.

              Rand           cRand
  k        tf    tf-idf    tf    tf-idf
  3       0.48   0.49     0.03   0.03
  4       0.51   0.52     0.03   0.03
  5       0.54   0.53     0.02   0.02
  6       0.55   0.56     0.02   0.03
  Average 0.52   0.52     0.02   0.03

Agreements are rather low, indicating that the "basic structure" cannot easily be captured by straightforward term-document frequency classification.
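Put together, the weighting and clustering steps correspond to a pipeline like the following sketch (continuing the hypothetical corpus from the previous snippet; the expert labels expert_classes are assumed given, and the corrected Rand index is available, e.g., as adjustedRandIndex in the mclust package):

```r
library(tm)

# Term-document matrices under both weightings:
# tf:     Z_{t,d} = tf_{t,d}
# tf-idf: Z_{t,d} = tf_{t,d} * log2(m / df_t)
tdm_tf    <- TermDocumentMatrix(corpus, control = list(weighting = weightTf))
tdm_tfidf <- TermDocumentMatrix(corpus, control = list(weighting = weightTfIdf))

# k-means on the documents (columns), compared against the expert rating
m <- t(as.matrix(tdm_tfidf))
for (k in 3:6) {
  cl <- kmeans(m, centers = k)
  agree <- mclust::adjustedRandIndex(cl$cluster, expert_classes)  # cRand
  print(c(k = k, cRand = agree))
}
```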
We note that clustering collections of large documents like law corpora presents formidable computational challenges due to the dimensionality of the term-document [...]

[...] employed for representing the text documents.

3.3 Deriving the senate size

Table 3. Number of jurisdictions ordered by senate size, obtained by fully automated text mining heuristics. The percentage is compared to the percentage identified by humans.

  Senate size          0       3       5      9
  Documents            0     255     739      0
  Percentage       0.000  25.654  74.346  0.000
  Human percentage 2.116  27.306  70.551  0.027

Jurisdictions of the Austrian supreme [...]

[...] Donaulandes. Buchdruckerei Egon Hutzler, Reutlingen.

JAIN, A. K., MURTY, M. N. and FLYNN, P. J. (1999): Data Clustering: A Review. ACM Computing Surveys 31(3), 264–323.

KLEIWEG, P., NERBONNE, J. and BOSVELD, L. (2004): Geographic Projection of Cluster Composites. In: A. Blackwell, K. Marriott and A. Shimojima (Eds.): Diagrammatic Representation and [...] Shippagan, New Brunswick, 33–49.

FELSENSTEIN, J. (2004): Inferring Phylogenies. Sinauer, Sunderland, MA.

FISCHER, M. (1980): Regional Taxonomy: A Comparison of Some Hierarchic and Non-Hierarchic Strategies. Regional Science and Urban Economics 10, 503–537.

GOEBL, H. (1984): Dialektometrische Studien: Anhand italoromanischer, rätoromanischer und galloromanischer Sprachmaterialien aus AIS und ALF. 3 Vol. Max Niemeyer, [...]

Fig. 3. A composite dendrogram where labels indicate how often a group of sites was clustered and the (horizontal) length of the brackets reflects mean cophenetic distance.

If we let M_i stand in this case for the matrix obtained by adding noise (in the i-th iteration), [...]

[...] use the data analyzed by Nerbonne and Siedle (2005), consisting of 201 word pronunciations recorded and transcribed at 186 sites throughout all of contemporary Germany. The data was collected and transcribed by researchers at Marburg between 1976 and 1991; it was digitized and analyzed in 2003–2004. The distance between word pronunciations was measured using a modified version of edit distance, and full [...]

[...] Stanford, 1–44.

MANNI, F., HEERINGA, W. and NERBONNE, J. (2006): To what Extent are Surnames Words? Comparing Geographic Patterns of Surnames and Dialect Variation in the Netherlands. Literary and Linguistic Computing 21(4), 507–528.

MUCHA, H. J. and HAIMERL, E. (2005): Automatic Validation of Hierarchical Cluster Analysis with Application in Dialectometry. In: C. Weihs and W. Gaul (Eds.): Classification - the [...] für Klassifikation, Dortmund, Mar 9–11, 2004. Springer, Berlin, 513–520.

NERBONNE, J., HEERINGA, W. and KLEIWEG, P. (1999): Edit Distance and Dialect Proximity. In: D. Sankoff and J. Kruskal (Eds.): Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, 2nd ed. CSLI, Stanford, v–xv.

NERBONNE, J. and SIEDLE, Ch. (2005): Dialektklassifikation auf der Grundlage aggregierter Ausspracheunterschiede. [...] 72(2), 129–147.

PAGE, R. D. M. and HOLMES, E. C. (2006): Molecular Evolution: A Phylogenetic Approach (1st ed. 1998). Blackwell, Oxford.

SCHILTZ, G. (1996): German Dialectometry. In: H.-H. Bock and W. Polasek (Eds.): Data Analysis and Information Systems: Statistical and Conceptual Approaches. Proc. of the 19th Mtg. of the Gesellschaft für Klassifikation, Basel, Mar 8–10, 1995. Springer, Berlin, 526–539.

SPRUIT, M. (2006): Measuring [...]

[...] 0.0-1.

KARATZOGLOU, A. and FEINERER, I. (2007): Text Clustering with String Kernels in R. In: Advances in Data Analysis (Proceedings of the 30th Annual Conference of the GfKl), 91–98. Springer-Verlag.

KARATZOGLOU, A., SMOLA, A. and HORNIK, K. (2006): kernlab: Kernel-based Machine Learning Methods Including Support Vector Machines. R package version 0.9-1.

KARATZOGLOU, A., SMOLA, A., HORNIK, K. and ZEILEIS, A. (2004): [...]

[...] tests with tf and tf-idf weighting, where we used the first 200 rows (i.e., entries in the matrix representing documents) as training set and the next 50 rows as test set.

Table 2. Rand index and Rand index corrected for agreement by chance of the contingency tables between SVM classification results and expert ratings for documents under federal fiscal code regulations.

         tf    tf-idf
  Rand   0.59  0.61
  cRand  0.18  0.21
