Data Analysis, Machine Learning and Applications (Episode 2, Part 6)

[...] all their parameters. Also, 86 commonly used classification datasets were taken from the UCI repository and inserted together with their calculated characteristics. Then, to generate a sample of classification experiments that covers a wide range of conditions, while also allowing us to test the performance of some algorithms under very specific conditions, some algorithms were explored more thoroughly than others. First, we ran all experiments with their default parameter settings on all datasets. Secondly, we defined sensible values for the most important parameters of the algorithms SMO (which trains a support vector machine), MultilayerPerceptron, J48 (a C4.5 implementation), 1R (a simple rule learner) and Random Forests (an ensemble learner), and varied each of these parameters one by one while keeping all other parameters at their defaults. Finally, we further explored the parameter spaces of J48 and 1R by selecting random parameter settings until we had about 1000 experiments on each dataset. For all randomized algorithms, each experiment was repeated 20 times with different random seeds. All experiments (about 250,000 in total) were evaluated using 10-fold cross-validation, using the same folds for each dataset.

An online interface is available at http://www.cs.kuleuven.be/~dtai/expdb/ for those who want to reuse experiments for their own purposes, together with a full description and code which may be of use to set up similar databases, for example to store, analyse and publish the results of large benchmark studies.

4 Using the database

We will now illustrate how easy it is to use this experiment database to investigate a wide range of questions on the behavior of learning algorithms, simply by writing the right queries and interpreting the results, or by applying data mining algorithms to model more complex interactions.

4.1 Comparing different algorithms

A first question may be “How do all algorithms in this database compare on a specific dataset D?” To investigate this, we query for the learning algorithm name and evaluation result (e.g. predictive accuracy), linked to all experiments on (an instance of) dataset D, which yields the following query:

SELECT l.name, v.pred_acc
FROM experiment e, learner_inst li, learner l, data_inst di, dataset d, evaluation v
WHERE v.eid = e.eid and e.learner_inst = li.liid and li.lid = l.lid
  and e.data_inst = di.diid and di.did = d.did and d.name = 'D'

We can now interpret the returned results, e.g. by drawing a scatterplot. For dataset monks-problems-2 (a near-parity problem), this yields Fig. 2, giving a clear overview of how each algorithm performs and (for those algorithms whose parameters were varied) how much variance is caused by different parameter settings. Only a few algorithms surpass default accuracy (67%), and while some cover a wide spectrum (like J48), others jump to 100% accuracy for certain parameter settings (SMO with higher-order polynomial kernels, and MultilayerPerceptron when enough hidden nodes are used).

Fig. 2. Algorithm performance comparison on the monks-problems-2_test dataset.
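The same query can also be run programmatically, so that the returned (algorithm, accuracy) pairs can be fed directly into a plotting or analysis tool. The sketch below is only an illustration under our own assumptions: it presumes a copy of the experiment database loaded into a local MySQL server with the MySQL JDBC driver on the classpath; the connection URL and credentials are placeholders, and only the table and column names are taken from the queries above.

import java.sql.*;

public class AccuracyPerAlgorithm {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for a local copy of the experiment database.
        String url = "jdbc:mysql://localhost:3306/expdb";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT l.name, v.pred_acc " +
                 "FROM experiment e, learner_inst li, learner l, " +
                 "     data_inst di, dataset d, evaluation v " +
                 "WHERE v.eid = e.eid and e.learner_inst = li.liid and li.lid = l.lid " +
                 "  and e.data_inst = di.diid and di.did = d.did and d.name = ?")) {
            ps.setString(1, "monks-problems-2_test");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // One (algorithm name, predictive accuracy) pair per experiment;
                    // these are the points plotted in Fig. 2.
                    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }
}

Each returned row is one experiment, so an algorithm explored with many parameter settings contributes many points, which is exactly what makes the per-algorithm variance visible in the scatterplot.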
We can also compare two algorithms A1 and A2 on all datasets by joining their performance results (with default settings) on each dataset and plotting them against each other, as shown in Fig. 3 (the FROM and WHERE clauses of the two subqueries are abbreviated here):

SELECT s1.name, avg(s1.pred_acc) AS A1_acc, avg(s2.pred_acc) AS A2_acc
FROM (SELECT d.name, e.pred_acc FROM ... WHERE l.name = 'A1') AS s1
  JOIN (SELECT d.name, e.pred_acc FROM ... WHERE l.name = 'A2') AS s2
  ON s1.name = s2.name
GROUP BY s1.name

Fig. 3. Comparing relative performance of J48 and OneR with a single query.

Moreover, querying also allows us to use aggregates and to order results, e.g. to directly build rankings of all algorithms by their average error over all datasets, using default parameters:

SELECT l.name, avg(v.mn_abs_err) AS avg_err
FROM experiment e, learner l, learner_inst li, evaluation v
WHERE v.eid = e.eid and e.learner_inst = li.liid and li.lid = l.lid
  and li.default = true
GROUP BY l.name
ORDER BY avg_err asc

Similar questions can be answered in the same vein. With small adjustments, we can query for the variance of each algorithm's error (over all datasets or a single one), study how much error rankings differ from one dataset to another, or study how parameter optimization affects these rankings.

4.2 Querying for parameter effects

The previous queries generalized over all parameter settings. Yet, starting from our first query, we can easily study the effect of a specific parameter P by “zooming in” on the results of algorithm A (by adding this constraint) and selecting the value of P linked to (an instantiation of) A, yielding Fig. 4a:

SELECT v.pred_acc, lv.value
FROM experiment e, learner_inst li, learner l, data_inst di, dataset d,
     evaluation v, learner_parameter lp, learner_parval lv
WHERE v.eid = e.eid and e.learner_inst = li.liid and li.lid = l.lid and l.name = 'A'
  and lv.liid = li.liid and lv.pid = lp.pid and lp.name = 'P'
  and e.data_inst = di.diid and di.did = d.did and d.name = 'D'

Sometimes the effect of a parameter P may depend on the value of another parameter. Such a parameter P2 can however be controlled (e.g. by demanding that its value be larger than V) by extending the previous query with a constraint requiring that the learner instances additionally be amongst those where parameter P2 obeys this constraint:

WHERE ... and lv.liid IN
  (SELECT lv.liid FROM learner_parval lv, learner_parameter lp
   WHERE lv.pid = lp.pid and lp.name = 'P2' and lv.value > V)

Launching and visualizing such queries yields results such as those in Fig. 4, clearly showing the effect of the selected parameter and the variation caused by the remaining parameters. As such, it is immediately obvious how general an observed trend is: all constraints are explicitly mentioned in the query.

Fig. 4. The effect of the minimal leaf size of J48 on monks-problems-2_test (a), after requiring binary trees (b), and after also suppressing reduced error pruning (c).

4.3 Querying for the effect of dataset properties

It also becomes easy to investigate the interactions between data properties and learning algorithms. For instance, we can use our experiments to study the effect of a dataset's size on the performance of algorithm A:

SELECT v.pred_acc, d.nr_examples
FROM experiment e, learner_inst li, learner l, data_inst di, dataset d, evaluation v
WHERE v.eid = e.eid and e.learner_inst = li.liid and li.lid = l.lid and l.name = 'A'
  and e.data_inst = di.diid and di.did = d.did

To control the value of additional dataset properties, simply add these constraints to the list, e.g. WHERE ... and d.nr_attributes > 5.

4.4 Applying data mining techniques to the experiment database

There can be very complex interactions between parameter settings, dataset characteristics and the resulting performance of learning algorithms.
However, since a large number of experimental results are available for each algorithm, we can apply data mining algorithms to model those interactions. For instance, to automatically learn which of J48's parameters have the greatest impact on its performance on monks-problems-2_test (see Fig. 4), we queried for the available parameter settings and the corresponding results. We discretized the performance with thresholds at 67% (default accuracy) and 85%, and used J48 to generate a (meta-)decision tree that, given the parameter settings used, predicts in which interval the accuracy lies. The resulting tree (with 97.3% accuracy) is shown in Fig. 5. It clearly shows which are the most important parameters to tune, and how they affect J48's performance.

Likewise, we can study for which dataset characteristics one algorithm greatly outperforms another. Starting from the query in Fig. 3, we additionally queried for a wide range of data characteristics and discretized the performance gain of J48 over 1R into three classes: “draw”, “win_J48” (4% to 20% gain) and “large_win_J48” (20% to 70% gain). The tree returned by J48 on this meta-dataset is shown in Fig. 6, and clearly shows for which kinds of datasets J48 has a clear advantage over OneR.

Fig. 5. Impact of parameter settings.

Fig. 6. Impact of dataset properties.
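To make this meta-learning step concrete, the following sketch builds such a meta-decision tree with Weka's J48. It assumes the query results have already been exported to an ARFF file in which each row holds one parameter configuration plus a nominal class attribute with the discretized accuracy interval; the file name and attribute layout are our own assumptions, not part of the original study.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MetaTree {
    public static void main(String[] args) throws Exception {
        // Hypothetical export of the queried parameter settings, with the
        // discretized accuracy (e.g. low / medium / high) as the last attribute.
        Instances meta = DataSource.read("j48_parameters_monks2.arff");
        meta.setClassIndex(meta.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(meta);

        // Printing the classifier shows the induced tree, i.e. which parameters
        // are split on first and therefore matter most (cf. Fig. 5).
        System.out.println(tree);
    }
}

The same recipe applies to the dataset-characteristics question: replace the parameter attributes by data characteristics and the class by the discretized performance gain, and the resulting tree corresponds to Fig. 6.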
4.5 On user-friendliness

The above SQL queries are relatively complicated. Part of this is, however, a consequence of the relatively complex structure of the database. A good user interface, including a graphical query tool and an integrated visualization tool, would greatly improve the usability of the database.

5 Conclusions

We have presented an experiment database for classification, providing a well-structured repository of fully described classification experiments, thus allowing them to be easily verified, reused and related to theoretical properties of algorithms and datasets. We show how easy it is to investigate a wide range of questions on the behavior of these learning algorithms by simply writing the right queries and interpreting the results, or by applying data mining algorithms to model more complex interactions. The database is available online and can be used to gain new insights into classifier learning and to validate and refine existing results. We believe this database and the underlying software may become a valuable resource for research in classification and, more broadly, machine learning and data analysis.

Acknowledgements

We thank Anneleen Van Assche and Celine Vens for their useful comments and help in building meta-decision trees, and Anton Dries for implementing the dataset characterizations. Hendrik Blockeel is a Postdoctoral Fellow of the Fund for Scientific Research - Flanders (Belgium) (FWO-Vlaanderen), and this research is further supported by GOA 2003/08 “Inductive Knowledge Bases”.

References

BLOCKEEL, H. (2006): Experiment databases: A novel methodology for experimental research. Lecture Notes in Computer Science, 3933, 72-85.
BLOCKEEL, H. and VANSCHOREN, J. (2007): Experiment Databases: Towards an Improved Experimental Methodology in Machine Learning. Lecture Notes in Computer Science, 4702, to appear.
KALOUSIS, A. and HILARIO, M. (2000): Building Algorithm Profiles for prior Model Selection in Knowledge Discovery Systems. Engineering Intelligent Systems, 8(2).
PENG, Y. et al. (2002): Improved Dataset Characterisation for Meta-Learning. Lecture Notes in Computer Science, 2534, 141-152.
VAN SOMEREN, M. (2001): Model Class Selection and Construction: Beyond the Procrustean Approach to Machine Learning Applications. Lecture Notes in Computer Science, 2049, 196-217.
WITTEN, I.H. and FRANK, E. (2005): Data Mining: Practical Machine Learning Tools and Techniques (2nd edition). Morgan Kaufmann.

KNIME: The Konstanz Information Miner

Michael R. Berthold, Nicolas Cebron, Fabian Dill, Thomas R. Gabriel, Tobias Kötter, Thorsten Meinl, Peter Ohl, Christoph Sieb, Kilian Thiel and Bernd Wiswedel

ALTANA Chair for Bioinformatics and Information Mining, Department of Computer and Information Science, University of Konstanz, Box M712, 78457 Konstanz, Germany
contact@knime.org

Abstract. The Konstanz Information Miner is a modular environment which enables easy visual assembly and interactive execution of a data pipeline. It is designed as a teaching, research and collaboration platform, which enables simple integration of new algorithms and tools as well as data manipulation or visualization methods in the form of new modules or nodes. In this paper we describe some of the design aspects of the underlying architecture and briefly sketch how new nodes can be incorporated.

1 Overview

The need for modular data analysis environments has increased dramatically over the past years. In order to make use of the vast variety of data analysis methods around, it is essential that such an environment is easy and intuitive to use, allows for quick and interactive changes to the analysis process and enables the user to visually explore the results. To meet these challenges, data pipelining environments have gathered incredible momentum over the past years. Some of today's well-established (but unfortunately also commercial) data pipelining tools are InforSense KDE (InforSense Ltd.), Insightful Miner (Insightful Corporation), and Pipeline Pilot (SciTegic). These environments allow the user to visually assemble and adapt the analysis flow from standardized building blocks, which are then connected through pipes carrying data or models. An additional advantage of these systems is the intuitive, graphical way to document what has been done.

KNIME, the Konstanz Information Miner, provides such a pipelining environment. Figure 1 shows a screenshot of an example analysis flow. In the center, a flow reads in data from two sources and processes it in several parallel analysis flows, consisting of preprocessing, modeling, and visualization nodes. On the left a repository of nodes is shown. From this large variety of nodes, one can select data sources, data preprocessing steps, model building algorithms, as well as visualization tools and drag them onto the workbench, where they can be connected to other nodes.

Fig. 1. An example analysis flow inside KNIME.

The ability to have all views interact graphically (visual brushing) creates a powerful environment to visually explore the data sets at hand. KNIME is written in Java and its graphical workflow editor is implemented as an Eclipse (Eclipse Foundation (2007)) plug-in. It is easy to extend through an open API and a data abstraction framework, which allows for new nodes to be quickly added in a well-defined way. In this paper we describe some of the internals of KNIME in more detail. More information as well as downloads can be found at http://www.knime.org.
2 Architecture

The architecture of KNIME was designed with three main principles in mind.

• Visual, interactive framework: Data flows should be combined by simple drag&drop from a variety of processing units. Customized applications can be modeled through individual data pipelines.
• Modularity: Processing units and data containers should not depend on each other, in order to enable easy distribution of computation and allow for independent development of different algorithms. Data types are encapsulated, that is, no types are predefined; new types can easily be added, bringing along type-specific renderers and comparators. New types can be declared compatible to existing types.
• Easy expandability: It should be easy to add new processing nodes or views and distribute them through a simple plugin mechanism, without the need for complicated install/deinstall procedures.

In order to achieve this, a data analysis process consists of a pipeline of nodes, connected by edges that transport either data or models. Each node processes the arriving data and/or model(s) and produces results on its outputs when requested. Figure 2 schematically illustrates this process. The type of processing ranges from basic data operations such as filtering or merging, through simple statistical functions such as the computation of means, standard deviations or linear regression coefficients, to computation-intensive data modeling operators (clustering, decision trees, neural networks, to name just a few). In addition, most of the modeling nodes allow for an interactive exploration of their results through accompanying views. In the following we briefly describe the underlying schemata of data handling, nodes and workflow management, and how the interactive views communicate.

Fig. 2. A schematic for the flow of data and models in a KNIME workflow.

2.1 Data structures

All data flowing between nodes is wrapped within a class called DataTable, which holds meta-information concerning the type of its columns in addition to the actual data. The data can be accessed by iterating over instances of DataRow. Each row contains a unique identifier (or primary key) and a specific number of DataCell objects, which hold the actual data. The reason to avoid access by row ID or index is scalability, that is, the desire to be able to process large amounts of data and therefore not be forced to keep all of the rows in memory for fast random access. KNIME employs a powerful caching strategy which moves parts of a data table to the hard drive if it becomes too large. Figure 3 shows a UML diagram of the main underlying data structure.

2.2 Nodes

Nodes in KNIME are the most general processing units and usually resemble one node in the visual workflow representation. The class Node wraps all functionality and makes use of user-defined implementations of a NodeModel, possibly a NodeDialog, and one or more NodeView instances if appropriate. Neither dialog nor view must be implemented if no user settings or views are needed. This schema follows the well-known Model-View-Controller design pattern. In addition, for the input and output connections, each node has a number of Inport and Outport instances, which can either transport data or models. Figure 4 shows a UML diagram of this structure.
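To make the interplay of these classes concrete, the following is a deliberately simplified sketch of a node implementation. The interfaces used here only mimic the roles of DataTable, DataRow, DataCell and NodeModel as described above; they are not the actual KNIME API, whose class and method signatures differ in detail.

import java.util.List;

// Simplified stand-ins for the roles described above (not the real KNIME API).
interface DataCell { boolean isMissing(); String toText(); }
interface DataRow  { String getKey(); List<DataCell> getCells(); }
interface DataTable extends Iterable<DataRow> { }

// The "model" part of a node: it consumes the tables arriving at its inports and
// produces the tables offered at its outports. A dialog and views would only be
// added if user settings or visualizations are needed (Model-View-Controller).
abstract class SimpleNodeModel {
    abstract DataTable[] execute(DataTable[] inData) throws Exception;
}

// Example: a node model that counts missing cells per row and passes the data on.
class MissingValueCounter extends SimpleNodeModel {
    @Override
    DataTable[] execute(DataTable[] inData) {
        for (DataRow row : inData[0]) {              // rows are accessed by iteration,
            int missing = 0;                         // not by random access, for scalability
            for (DataCell cell : row.getCells()) {
                if (cell.isMissing()) missing++;
            }
            System.out.println(row.getKey() + ": " + missing + " missing cells");
        }
        return new DataTable[] { inData[0] };        // hand the input table to the outport
    }
}

Because the node only sees tables on its inports and outports, it is independent of where its input was produced, which is precisely what makes the pipeline modular.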
2.3 Workflow management

Workflows in KNIME are essentially graphs connecting nodes or, more formally, directed acyclic graphs (DAGs). The WorkflowManager allows new nodes to be inserted and directed edges (connections) to be added between two nodes. It also keeps track of the status of nodes (configured, executed, ...) and returns, on demand, a pool of executable nodes. This way the surrounding framework can freely distribute the workload among a couple of parallel threads or, in the future, even a distributed cluster of servers. Thanks to the underlying graph structure, the workflow manager is able to determine all nodes required to be executed along the paths leading to the node the user actually wants to execute.

Fig. 3. A UML diagram of the data structure and the main classes it relies on.

Fig. 4. A UML diagram of the Node and the main classes it relies on.

2.4 Views and interactive brushing

Each Node can have an arbitrary number of views associated with it. Through receiving events from a HiLiteHandler (and sending events to it) it is possible to mark selected points in such a view to enable visual brushing. Views can range from simple table views to more complex views on the underlying data (e.g. scatterplots, parallel coordinates) or the generated model (e.g. decision trees, rules).

2.5 Meta nodes

So-called Meta Nodes wrap a sub-workflow into an encapsulating special node. This provides a series of advantages, such as enabling the user to design much larger, more complex workflows and the encapsulation of specific actions. To this end some customized meta nodes are available which allow for a repeated execution of the enclosed sub-workflow, offering the ability to model more complex scenarios such as cross-validation, bagging and boosting, ensemble learning etc. Due to the modularity of KNIME, these techniques can then be applied to virtually any (learning) algorithm available in the repository. Additionally, the concept of Meta Nodes helps to assign dedicated servers to this sub-workflow or to export the wrapped flow to other users as a predefined module.

2.6 Distributed processing

Due to the modular architecture it is easy to designate specific nodes to be run on separate machines. But to accommodate the increasing availability of multi-core machines [...]
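The pool of executable nodes described in Section 2.3 suggests a simple scheduling loop for such parallel execution. The sketch below is our own illustration of that idea, with hypothetical class and method names rather than KNIME code: nodes whose predecessors have all finished are submitted together to a thread pool sized to the number of available cores.

import java.util.*;
import java.util.concurrent.*;

// Hypothetical scheduler: repeatedly execute all currently ready nodes in parallel.
class WorkflowScheduler {
    enum State { CONFIGURED, RUNNING, EXECUTED }

    static class WNode {
        final String name;
        final List<WNode> predecessors = new ArrayList<>();
        volatile State state = State.CONFIGURED;
        WNode(String name) { this.name = name; }
        void execute() { /* node-specific work would happen here */ state = State.EXECUTED; }
    }

    private final ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // A node is ready when it is configured and all of its predecessors are executed.
    private boolean isExecutable(WNode n) {
        if (n.state != State.CONFIGURED) return false;
        for (WNode p : n.predecessors) {
            if (p.state != State.EXECUTED) return false;
        }
        return true;
    }

    // Assumes the workflow is a DAG of configured nodes, as guaranteed by the
    // workflow manager; otherwise this loop would not terminate.
    void run(List<WNode> workflow) throws InterruptedException {
        while (workflow.stream().anyMatch(n -> n.state != State.EXECUTED)) {
            List<Callable<Void>> batch = new ArrayList<>();
            for (WNode n : workflow) {
                if (isExecutable(n)) {
                    n.state = State.RUNNING;
                    batch.add(() -> { n.execute(); return null; });
                }
            }
            pool.invokeAll(batch);   // run every ready node of this wave in parallel
        }
        pool.shutdown();
    }
}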
[...]

...summarized in the next section.

3.1 Standard nodes

• Data I/O: generic file reader, and reader for the attribute-relation file format (ARFF), database connector, CSV and ARFF writer, Excel spreadsheet writer
• Data manipulation: row and column filtering, data partitioning and sampling, sorting or random shuffling, data joiner and merger
• Data transformation: missing value replacer, matrix transposer, [...]

On the Analysis of [...]

Description of the dataset. We analyze a dataset generated by the political stock market system PSM used for the prediction of the 2006 state elections in Baden-Wuerttemberg, Germany. The traders were mainly readers of rather serious and politically balanced newspapers all over the election region. The market ran from January 31st, 2006 until election day on March 26th, 2006, for about twelve weeks, and was stopped [...]

Traders in the market are given 100,000 monetary units (MU) as initial endowment. The market itself ran a continuous double auction market mechanism where offers by traders are executed immediately if they match. For each share an order book is provided by the system where buy and sell offers are added and subsequently removed in the case of matching or withdrawal.

4.2 Generating the network

In [...]

Fig. 1. Eigenvectors of the traders within the most prominent clusters.

...behavior of the group of traders with IDs 1922, 1775, 1898 and 1858. Here it can be stated that the connection between 1922, 1898 and 1858 is quite strong, and the trading behavior was nearly balanced between 1922 and 1858, while the behavior between 1922 and 1898 has a stronger outbound direction from 1922 to 1858. Between the first and second block the eigenvectors 3 and 4 describe normal trading behavior [...] with IDs 1924 and 1948. These again show a nearly balanced behavior, as do the traders with IDs 1816 and 1826 in the lower right-hand corner of the figure. These results were compared to the trading data in the database. The result is given in Figure 2. The setup is similar to Figure 1 and it can easily be verified that the trading behavior is consistent with the results from the eigensystem analysis. Whenever [...]

Fig. 2. Reduced adjacency matrix entries for the traders within the most prominent clusters among themselves.

...visualize and illustrate the results of the eigensystem analysis as a graph, we have taken the respective subgraph which shows the relevant actors as found by the eigensystem analysis, embedded into the network of all their trading counterparts in Figure 3. As can be seen, the relevant actors really have many connections within [...]