Data Mining and Knowledge Discovery Handbook, 2nd Edition, part 83

Table 40.6. Benefits (US $) using Single Classifiers and Classifier Ensembles (Original Stream).

Chunk    G_0      G_1=E_1   G_2      E_2      G_4      E_4      G_8      E_8
12000    201717   203211    197946   253473   211768   269290   215692   289129
 6000    103763    98777    101176   121057   102447   138565   106576   143620
 4000     69447    65024     68081    80996    69346    90815    70325    96153
 3000     43312    41212     42917    59293    44977    67222    46139    71660

Cost-sensitive Learning

For cost-sensitive applications, we aim at maximizing benefits. In Figure 40.7(a), we compare the single-classifier approach with the ensemble approach using the credit card transaction stream. The benefits are averaged over multiple runs with different chunk sizes (ranging from 3000 to 12000 transactions per chunk). Starting from K = 2, the advantage of the ensemble approach becomes obvious. In Figure 40.7(b), we average the benefits of E_K and G_K (K = 2, ..., 8) for each fixed chunk size. The benefits increase as the chunk size does, since more fraudulent transactions are discovered in each chunk. Again, the ensemble approach outperforms the single-classifier approach.

To study the impact of concept drifts of different magnitudes, we derive data streams from the credit card transactions. The simulated stream is obtained by sorting the original 5 million transactions by their transaction amount. We perform the same test on the simulated stream, and the results are shown in Figures 40.7(c) and 40.7(d). Detailed results of the above tests are given in Tables 40.5 and 40.6.

40.5 Discussion and Related Work

Data stream processing has recently become a very important research domain. Much work has been done on modeling (Babcock et al., 2002), querying (Babu and Widom, 2001; Gao and Wang, 2002; Greenwald and Khanna, 2001), and mining data streams; for instance, several papers have been published on classification (Domingos and Hulten, 2000; Hulten et al., 2001; Street and Kim, 2001), regression analysis (Chen et al., 2002), and clustering (Guha et al., 2000).

Traditional Data Mining algorithms are challenged by two characteristic features of data streams: the infinite data flow and the drifting concepts. Because methods that require multiple scans of the data set (Shafer et al., 1996) cannot handle infinite data flows, several incremental algorithms (Gehrke et al., 1999; Domingos and Hulten, 2000) that refine models by continuously incorporating new data from the stream have been proposed. To handle drifting concepts, these methods are further revised so that the effect of old examples is eliminated at a certain rate. For an incremental decision tree classifier, this means discarding subtrees, re-growing subtrees, or building alternative subtrees under a node (Hulten et al., 2001). The resulting algorithms are often complicated, which indicates that substantial effort is required to adapt state-of-the-art learning methods to the infinite, concept-drifting streaming environment. Aside from this undesirable aspect, incremental methods also suffer in prediction accuracy: since old examples are discarded at a fixed rate (whether or not they represent the changed concept), the learned model is supported only by the current snapshot, a relatively small amount of data. This usually results in larger prediction variance.

Classifier ensembles are increasingly gaining acceptance in the data mining community.
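To ground the discussion, the sketch below shows a chunk-based weighted classifier ensemble of the kind evaluated in Table 40.6. It is an illustration rather than the chapter's exact algorithm: classifiers here are weighted by plain accuracy on the most recent chunk, whereas the chapter derives weights from expected prediction error or benefits, and the scikit-learn decision trees, binary 0/1 labels, and synthetic chunks are assumptions made for the example.

```python
# Illustrative sketch of a weighted ensemble over a chunked data stream.
# Assumes binary 0/1 labels and scikit-learn-style estimators; weights are a
# simplification (accuracy on the newest chunk) of the chapter's scheme.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def stream_weighted_ensemble(chunks, K=8):
    """chunks: iterable of (X, y) arrays arriving in time order."""
    ensemble = []          # list of (classifier, weight) pairs, at most K entries
    predictions = []
    for X, y in chunks:
        if ensemble:
            # Weighted vote of the current ensemble on the incoming chunk;
            # prediction happens before the chunk is used for training.
            votes = np.zeros((len(y), 2))
            for clf, w in ensemble:
                for i, c in enumerate(clf.predict(X)):
                    votes[i, int(c)] += w
            predictions.append(votes.argmax(axis=1))
        # Train a new classifier on the chunk just seen.
        clf = DecisionTreeClassifier().fit(X, y)
        ensemble.append((clf, 1.0))
        # Re-weight every classifier by its accuracy on this most recent chunk
        # (the newest classifier is scored on its own training data here, a
        # simplification; the chapter uses cross-validation for that case).
        ensemble = [(c, float((c.predict(X) == y).mean())) for c, _ in ensemble]
        # Keep only the K best-weighted classifiers.
        ensemble = sorted(ensemble, key=lambda cw: cw[1], reverse=True)[:K]
    return predictions

# Example with synthetic chunks (illustrative only).
rng = np.random.default_rng(0)
chunks = [(rng.normal(size=(200, 5)), rng.integers(0, 2, size=200)) for _ in range(4)]
preds = stream_weighted_ensemble(chunks, K=3)
```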
The popular approaches to creating ensembles include changing the instances used for training through techniques such as Bagging (Bauer and Kohavi, 1999) and Boosting (Freund and Schapire, 1996). Classifier ensembles have several advantages over single-model classifiers. First, classifier ensembles offer a significant improvement in prediction accuracy (Freund and Schapire, 1996; Tumer and Ghosh, 1996). Second, building a classifier ensemble is more efficient than building a single model, since most model construction algorithms have super-linear complexity. Third, classifier ensembles naturally lend themselves to scalable parallelization (Hall et al., 2000) and to on-line classification of large databases. Previously, we used an averaging ensemble for scalable learning over very large datasets (Fan, Wang, Yu, and Stolfo, 2003). We showed that a model's performance can be estimated before it is completely learned (Fan, Wang, Yu, and Lo, 2002; Fan, Wang, Yu, and Lo, 2003).

In this work, we use weighted ensemble classifiers on concept-drifting data streams. Our approach combines multiple classifiers weighted by their expected prediction accuracy on the current test data. Compared with incremental models trained on the data in the most recent window, our approach combines the talents of a set of experts based on their credibility and adjusts much more gracefully to the underlying concept drifts. We also introduced the dynamic classification technique (Fan, Chu, Wang, and Yu, 2002) to the concept-drifting streaming environment, and our results show that it enables us to dynamically select a subset of classifiers in the ensemble for prediction without loss in accuracy.

Acknowledgements

We thank Wei Fan of IBM T. J. Watson Research Center for providing us with a revised version of the C4.5 decision tree classifier and for running some experiments.

References

Babcock B., Babu S., Datar M., Motwani R., and Widom J., Models and issues in data stream systems. In ACM Symposium on Principles of Database Systems (PODS), 2002.
Babu S. and Widom J., Continuous queries over data streams. SIGMOD Record, 30:109-120, 2001.
Bauer E. and Kohavi R., An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105-139, 1999.
Chen Y., Dong G., Han J., Wah B. W., and Wang B. W., Multi-dimensional regression analysis of time-series data streams. In Proc. of Very Large Data Bases (VLDB), Hong Kong, China, 2002.
Cohen W., Fast effective rule induction. In Int'l Conf. on Machine Learning (ICML), pages 115-123, 1995.
Domingos P. and Hulten G., Mining high-speed data streams. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 71-80, Boston, MA, 2000. ACM Press.
Fan W., Wang H., Yu P., and Lo S., Progressive modeling. In Int'l Conf. on Data Mining (ICDM), 2002.
Fan W., Wang H., Yu P., and Lo S., Inductive learning in less than one sequential scan. In Int'l Joint Conf. on Artificial Intelligence, 2003.
Fan W., Wang H., Yu P., and Stolfo S., A framework for scalable cost-sensitive learning based on combining probabilities and benefits. In SIAM Int'l Conf. on Data Mining (SDM), 2002.
Fan W., Chu F., Wang H., and Yu P. S., Pruning and dynamic scheduling of cost-sensitive ensembles. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI), 2002.
Freund Y. and Schapire R. E., Experiments with a new boosting algorithm. In Int'l Conf. on Machine Learning (ICML), pages 148-156, 1996.
Gao L. and Wang X., Continually evaluating similarity-based pattern queries on a streaming time series. In Int'l Conf. on Management of Data (SIGMOD), Madison, Wisconsin, June 2002.
Gehrke J., Ganti V., Ramakrishnan R., and Loh W., BOAT: optimistic decision tree construction. In Int'l Conf. on Management of Data (SIGMOD), 1999.
Greenwald M. and Khanna S., Space-efficient online computation of quantile summaries. In Int'l Conf. on Management of Data (SIGMOD), pages 58-66, Santa Barbara, CA, May 2001.
Guha S., Mishra N., Motwani R., and O'Callaghan L., Clustering data streams. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 359-366, 2000.
Hall L., Bowyer K., Kegelmeyer W., Moore T., and Chao C., Distributed learning on very large data sets. In Workshop on Distributed and Parallel Knowledge Discovery, 2000.
Hulten G., Spencer L., and Domingos P., Mining time-changing data streams. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), pages 97-106, San Francisco, CA, 2001. ACM Press.
Quinlan J. R., C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
Rokach L., Mining manufacturing data using genetic algorithm-based feature set decomposition. Int. J. Intelligent Systems Technologies and Applications, 4(1):57-78, 2008.
Shafer C., Agrawal R., and Mehta M., SPRINT: A scalable parallel classifier for Data Mining. In Proc. of Very Large Data Bases (VLDB), 1996.
Stolfo S., Fan W., Lee W., Prodromidis A., and Chan P., Credit card fraud detection using meta-learning: Issues and initial results. In AAAI-97 Workshop on Fraud Detection and Risk Management, 1997.
Street W. N. and Kim Y. S., A streaming ensemble algorithm (SEA) for large-scale classification. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), 2001.
Tumer K. and Ghosh J., Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3-4):385-403, 1996.
Utgoff P. E., Incremental induction of decision trees. Machine Learning, 4:161-186, 1989.
Wang H., Fan W., Yu P. S., and Han J., Mining concept-drifting data streams using ensemble classifiers. In Int'l Conf. on Knowledge Discovery and Data Mining (SIGKDD), 2003.

41 Mining High-Dimensional Data

Wei Wang (Department of Computer Science, University of North Carolina at Chapel Hill) and Jiong Yang (Department of Electronic Engineering and Computer Science, Case Western Reserve University)

Summary. With the rapid growth of computational biology and e-commerce applications, high-dimensional data has become very common. Thus, mining high-dimensional data is an urgent problem of great practical importance. However, there are unique challenges in mining data of high dimensionality, including (1) the curse of dimensionality and, more crucially, (2) the meaningfulness of the similarity measure in a high-dimensional space. In this chapter, we present several state-of-the-art techniques for analyzing high-dimensional data, e.g., frequent pattern mining, clustering, and classification. We will discuss how these methods deal with the challenges of high dimensionality.

Key words: high-dimensional Data Mining, frequent pattern, clustering high-dimensional data, classifying high-dimensional data

41.1 Introduction

The emergence of various new application domains, such as bioinformatics and e-commerce, underscores the need for analyzing high-dimensional data. In a gene expression microarray data set, there could be tens or hundreds of dimensions, each of which corresponds to an experimental condition.
In a customer purchase behavior data set, there may be up to hundreds of thousands of merchandise items, each of which is mapped to a dimension. Researchers and practitioners are very eager to analyze these data sets.

Various Data Mining models have proven very successful for analyzing very large data sets. Among them, frequent patterns, clusters, and classifiers are three widely studied models used to represent, analyze, and summarize large data sets. In this chapter, we focus on state-of-the-art techniques for constructing these three Data Mining models on massive high-dimensional data sets.

41.2 Challenges

Before presenting any algorithm for building individual Data Mining models, we first discuss two common challenges in analyzing high-dimensional data. The first one is the curse of dimensionality. The complexity of many existing Data Mining algorithms is exponential with respect to the number of dimensions. With increasing dimensionality, these algorithms soon become computationally intractable and therefore inapplicable in many real applications.

Secondly, the specificity of similarities between points in a high-dimensional space diminishes. It was proven in (Beyer et al., 1999) that, for any point in a high-dimensional space, the expected gap between the Euclidean distance to the closest neighbor and that to the farthest point shrinks as the dimensionality grows. This phenomenon may render many Data Mining tasks (e.g., clustering) ineffective and fragile, because the model becomes vulnerable to the presence of noise. In the remainder of this chapter, we present several state-of-the-art algorithms for mining high-dimensional data sets.

41.3 Frequent Pattern

The frequent pattern is a useful model for extracting salient features of the data. It was originally proposed for analyzing market basket data (Agrawal and Srikant, 1994). A market basket data set is typically represented as a set of transactions. Each transaction contains a set of items from a finite vocabulary. In principle, we can represent the data as a matrix in which each row represents a transaction and each column represents an item. The goal is to find the collection of itemsets appearing in a large number of transactions, as defined by a support threshold t.

Most algorithms for mining frequent patterns utilize the Apriori property, stated as follows. If an itemset A is frequent (i.e., present in at least t transactions), then every subset of A must be frequent. Conversely, if an itemset A is infrequent (i.e., present in fewer than t transactions), then any superset of A is also infrequent. This property is the basis of all level-wise search algorithms. The general procedure consists of a series of iterations, beginning with counting item occurrences and identifying the set of frequent items (or, equivalently, frequent 1-itemsets). During each subsequent iteration, candidates for frequent k-itemsets are proposed from the frequent (k-1)-itemsets using the Apriori property. These candidates are then validated by explicitly counting their actual occurrences. The value of k is incremented before the next iteration starts. The process terminates when no more frequent itemsets can be generated.
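To make the level-wise procedure concrete, here is a minimal Apriori-style sketch. It illustrates the generic scheme described above rather than any of the specific algorithms cited in this section, and the toy basket data and threshold are invented for the example.

```python
# Minimal Apriori-style level-wise frequent itemset mining (illustrative sketch).
# Candidate generation is a simple join of (k-1)-itemsets; real implementations
# use far more efficient counting structures.
from itertools import combinations

def apriori(transactions, t):
    """transactions: list of sets of items; t: absolute support threshold."""
    transactions = [frozenset(tx) for tx in transactions]
    # Level 1: frequent items (frequent 1-itemsets).
    items = {i for tx in transactions for i in tx}
    freq = {frozenset([i]) for i in items
            if sum(i in tx for tx in transactions) >= t}
    all_frequent = set(freq)
    k = 2
    while freq:
        # Propose candidate k-itemsets from frequent (k-1)-itemsets.
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # Validate candidates by counting their actual occurrences.
        freq = {c for c in candidates
                if sum(c <= tx for tx in transactions) >= t}
        all_frequent |= freq
        k += 1
    return all_frequent

# Example usage with a toy market basket data set.
baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread", "butter"}, {"milk", "bread", "butter"}]
print(apriori(baskets, t=2))
```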
We often refer to this level-wise approach as the breadth-first approach, because it evaluates the itemsets residing at the same depth of the lattice formed by the subset-superset partial order on itemsets.

It is a well-known problem that the full set of frequent patterns contains significant redundant information, and consequently the number of frequent patterns is often too large. To address this issue, Pasquier et al. (1999) proposed to mine a selective subset of the frequent patterns, called closed frequent patterns. A pattern is closed if none of its immediate superpatterns has the same number of occurrences. The CLOSET algorithm (Pei et al., 2000) was proposed to expedite the mining of closed frequent patterns. CLOSET uses a novel frequent pattern tree (FP-tree) structure as a compact representation to organize the data set. It performs a depth-first search; that is, after discovering a frequent itemset A, it searches for superpatterns of A before checking A's siblings.

A more recent algorithm for mining frequent closed patterns is CHARM (Zaki and Hsiao, 2002). Similar to CLOSET, CHARM searches for patterns in a depth-first manner. The difference between CHARM and CLOSET is that CHARM stores the data set in a vertical format, where a list of row IDs is maintained for each dimension. These row ID lists are then merged during a "column enumeration" procedure that generates the row ID lists for other nodes in the enumeration tree. In addition, a technique called diffset is used to reduce the length of the row ID lists as well as the computational cost of merging them.

The above algorithms can find frequent closed patterns when the dimensionality is low to moderate. When the number of dimensions is very high, e.g., greater than 100, their efficiency can be significantly impacted. CARPENTER (Pan et al., 2003) was therefore proposed to solve this problem. It first transposes the matrix representing the data set. Next, CARPENTER performs a depth-first, row-wise enumeration on the transposed matrix. It has been shown that this algorithm can greatly reduce the computation time, especially when the dimensionality is high.

41.4 Clustering

Clustering is a widely adopted Data Mining model that partitions data points into a set of groups, each of which is called a cluster. A data point has a shorter distance to points within its own cluster than to points outside the cluster. In a high-dimensional space, for any point, the distance to its closest point and that to the farthest point tend to be similar. This phenomenon may render the clustering result sensitive to any small perturbation of the data due to noise and make the exercise of clustering useless. To solve this problem, Agrawal et al. proposed the subspace clustering model (Agrawal et al., 1998). A subspace cluster consists of a subset of objects and a subset of dimensions such that the distance among these objects is small within the given set of dimensions. The CLIQUE algorithm (Agrawal et al., 1998) was proposed to find subspace clusters.

In many applications, users are more interested in objects that exhibit a consistent trend (rather than points having similar values) within a subset of dimensions. One such example is the bicluster model (Cheng and Church, 2000), proposed for analyzing gene expression profiles.
A bicluster is a subset of objects (U) and a subset of dimensions (D) such that the objects in U have the same trend (i.e., fluctuate simultaneously) across the dimensions in D. This is particularly useful in analyzing gene expression levels in a microarray experiment, since the expression levels of some genes may be inflated or deflated systematically in some experiments. Thus, the absolute value is not as important as the trend. If two genes have similar trends across a large set of experiments, they are likely to be co-regulated. In the bicluster model, the mean squared residue is used to qualify a bicluster. Cheng and Church (2000) used a heuristic randomized algorithm to find biclusters. It consists of a series of iterations, each of which locates one bicluster. To prevent the same bicluster from being reported again in subsequent iterations, each time a bicluster is found, the values in the bicluster are replaced by uniform noise before the next iteration starts. This procedure continues until a desired number of biclusters have been discovered.

Although the bicluster model and algorithm have been used in several bioinformatics applications, they have two major drawbacks: (1) the mean squared residue may not be the best measure to qualify a bicluster, since a big cluster may have a small mean squared residue even if it includes a small number of objects whose trends are vastly different in the selected dimensions; (2) the heuristic algorithm may be misled by the noise artificially injected after each iteration and hence may not discover overlapping clusters properly. To solve these two problems, the authors of (Wang et al., 2002) proposed the p-cluster model. A p-cluster consists of a subset of objects U and a subset of dimensions D where, for each pair of objects u_1 and u_2 in U and each pair of dimensions d_1 and d_2 in D, the change of u_1 from d_1 to d_2 should be similar to that of u_2 from d_1 to d_2. A threshold is used to evaluate the dissimilarity between two objects on two dimensions. Given a subset of objects and a subset of dimensions, if the dissimilarity between every pair of objects on every pair of dimensions is less than the threshold, then these objects constitute a p-cluster in the given dimensions. A novel deterministic algorithm is developed in (Wang et al., 2002) to find all maximal p-clusters; it utilizes an Apriori property that holds on p-clusters.

41.5 Classification

Classification is also a very powerful data analysis tool. In a classification problem, the dimensions of an object can be divided into two types: one dimension records the class type of the object, and the remaining dimensions are attributes. The goal of classification is to build a model that captures the intrinsic associations between the class type and the attributes, so that the (unknown) class type can be accurately predicted from the attribute values. For this purpose, the data is usually divided into a training set and a test set, where the training set is used to build the classifier, which is then validated on the test set. Several models have been developed for classifying high-dimensional data, e.g., naive Bayes, neural networks, decision trees (Mitchell, 1997), SVMs, rule-based classifiers, and so on.

The support vector machine (SVM) (Vapnik, 1998) is one of the more recently developed classification models. The success of SVMs in practice stems from their solid mathematical foundation, which conveys the following two salient properties.
(1) The classification boundary functions of SVMs maximize the margin, which in turn optimizes the generalization performance for a given training data set. (2) SVMs handle nonlinear classification efficiently using the kernel trick, which implicitly transforms the input space into another, higher-dimensional feature space. However, SVMs suffer from two problems. First, the complexity of training an SVM is at least O(N^2), where N is the number of objects in the training data set. This can be too costly when the training data set is large. Second, since an SVM essentially draws a hyperplane in a transformed high-dimensional space, it is very difficult to identify the principal (original) dimensions that are most responsible for the classification.

Rule-based classifiers (Liu et al., 2000) offer some potential to address the above two problems. A rule-based classifier consists of a set of rules of the following form: A_1[l_1, u_1] ∩ A_2[l_2, u_2] ∩ ... ∩ A_m[l_m, u_m] → C, where A_i[l_i, u_i] means that the value of attribute A_i falls in the range [l_i, u_i], and C is a class type. The rule can be interpreted as follows: if an object's attribute values fall in the ranges on the left-hand side, then its class type is likely to be C (with some high probability). Each rule is also associated with a confidence level that depicts the probability that the rule holds. When an object satisfies several rules, either the rule with the highest confidence (e.g., CBA (Liu et al., 2000)) or a weighted vote over all valid rules (e.g., CPAR (Yin and Han, 2003)) may be used for class prediction. However, neither CBA nor CPAR is targeted at high-dimensional data. An algorithm called FARMER (Cong et al., 2004) was proposed to generate rule-based classifiers for high-dimensional data sets. It first quantizes the attributes into a set of bins; each bin is subsequently treated as an item. FARMER then generates the closed frequent itemsets using a method similar to CARPENTER. These closed frequent itemsets form the basis for generating rules. Since the dimensionality is high, the number of possible rules in the classifier could be very large. FARMER finally organizes all rules into compact rule groups.

References

Agrawal R., Gehrke J., Gunopulos D., and Raghavan P., Automatic subspace clustering of high dimensional data for Data Mining applications. In Proc. ACM SIGMOD Int. Conf. on Management of Data, Seattle, WA, pp. 94-105, 1998.
Agrawal R. and Srikant R., Fast algorithms for mining association rules in large databases. In Proc. of the 20th VLDB Conf., pages 487-499, 1994.
Beyer K. S., Goldstein J., Ramakrishnan R., and Shaft U., When is "nearest neighbor" meaningful? In Proceedings of the 7th International Conference on Database Theory (ICDT'99), pp. 217-235, Jerusalem, Israel, 1999.
Cheng Y. and Church G., Biclustering of expression data. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pp. 93-103, San Diego, CA, August 2000.
Cong G., Tung A. K. H., Xu X., Pan F., and Yang J., FARMER: Finding interesting rule groups in microarray datasets. In the 23rd ACM SIGMOD International Conference on Management of Data, 2004.
Liu B., Ma Y., and Wong C. K., Improving an association rule based classifier. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, pp. 504-509, September 13-16, 2000.
Mitchell T., Machine Learning. WCB McGraw-Hill, 1997.
Pan F., Cong G., Tung A. K. H., Yang J., and Zaki M. J., CARPENTER: Finding closed patterns in long biological data sets. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
Pasquier N., Bastide Y., Taouil R., and Lakhal L., Discovering frequent closed itemsets for association rules. In Beeri C., Buneman P., eds., Proc. of the 7th Int'l Conf. on Database Theory (ICDT'99), Jerusalem, Israel, Volume 1540 of Lecture Notes in Computer Science, pp. 398-416, Springer-Verlag, January 1999.
Pei J., Han J., and Mao R., CLOSET: An efficient algorithm for mining frequent closed itemsets. In D. Gunopulos and R. Rastogi, eds., ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 21-30, 2000.
Vapnik V. N., Statistical Learning Theory. John Wiley and Sons, 1998.
Wang H., Wang W., Yang J., and Yu P., Clustering by pattern similarity in large data sets. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 394-405, 2002.
Yin X. and Han J., CPAR: Classification based on predictive association rules. In Proceedings of the SIAM International Conference on Data Mining, San Francisco, CA, pp. 331-335, 2003.
Zaki M. J. and Hsiao C., CHARM: An efficient algorithm for closed itemset mining. In Proceedings of the Second SIAM International Conference on Data Mining, Arlington, VA, 2002. SIAM.

42 Text Mining and Information Extraction

Moty Ben-Dov (MDX University, London) and Ronen Feldman (Hebrew University, Israel)

Summary. Text mining is the automatic discovery of new, previously unknown information by automatic analysis of various textual resources. Text mining starts by extracting facts and events from textual sources and then enables forming new hypotheses that are further explored by traditional Data Mining and data analysis methods. In this chapter we will define text mining and describe the three main approaches for performing information extraction. In addition, we will describe how we can visually display and analyze the outcome of the information extraction process.

Key words: text mining, content mining, structure mining, text classification, information extraction, rule-based systems

42.1 Introduction

The information age has made it easy for us to store large amounts of text. The proliferation of documents available on the Web, on corporate intranets, on news wires, and elsewhere is overwhelming. However, while the amount of information available to us is constantly increasing, our ability to absorb and process this information remains constant. Search engines only exacerbate the problem by making more and more documents available in a matter of a few keystrokes, and so-called "push" technology makes the problem even worse by constantly reminding us that we are failing to track news, events, and trends everywhere. We experience information overload, and miss important patterns and relationships even as they unfold before us. As the old adage goes, "we can't see the forest for the trees."

Text mining (TM), also known as knowledge discovery from text (KDT), refers to the process of extracting interesting patterns from very large text databases for the purpose of discovering knowledge. Text mining applies the same analytical functions as data mining, but also applies analytic functions from natural language (NL) and information retrieval (IR) techniques (Dorre et al., 1999). The text-mining tools are used for:
