In order to achieve good behavior, the agent must explore its environment. Exploration means trying different sorts of actions in various situations. While exploring, some of the choices may be poor ones, which may lead to severe costs. In such cases, it is more appropriate to train the agent on a computer-simulated model of the environment. It is sometimes possible to simulate an environment without explicitly understanding it.

RL methods have been used to solve a variety of problems in a number of domains. Pednault et al. (2002) solved targeted-marketing problems. Tesauro (1994, 1995) built an artificial backgammon player with RL. Hong and Prabhu (2004) and Zhang and Dietterich (1996) used RL to solve manufacturing problems. Littman and Boyan (1993) used RL for the solution of a network-routing problem. Using RL, Crites and Barto (1996) trained an elevator-dispatching controller.

20.6 Reinforcement-Learning and Data-Mining

This chapter presents an overview of some of the ideas and computational methods in RL. In this section, the relation and relevance of RL to DM are discussed.

Most DM learning methods are taken from ML. It is common to distinguish between three categories of learning methods: Supervised Learning (SL), Unsupervised Learning, and Reinforcement Learning. In SL, the learner is programmed to extract a model from a set of observations, where each observation consists of explanatory variables and a corresponding response. In unsupervised learning there are observations but no responses, and the learner is expected to extract a helpful representation of the domain from which the observations were drawn. RL requires the learner to extract a model of response from experience: observations that include states, responses and the corresponding reinforcements.

SL methods are central in DM, and a correspondence between SL and RL may be established in the following manner. Consider a learner that needs to extract a model of response for different situations. A supervised learner relies on a set of observations, each of which is labeled by an advisor (or an oracle). The label of each observation is regarded by the agent as the desired response for the situation introduced by the explanatory variables of this observation. In RL, the privilege of having an advisor is not given. Instead, the learner views situations (in RL these are called states), autonomously chooses responses (in RL these are called actions), and obtains rewards that indicate how good the choices were. In this view of SL and RL, states and realizations of explanatory variables are actually the same.

In some DM problems, cases arise in which the responses in one situation affect future outcomes. This is typically the case in cost-sensitive DM problems. Since SL relies on labeled observations and assumes no dependence between observations, it is sometimes inappropriate for such problems. The RL model, on the other hand, fits cost-sensitive DM problems perfectly.⁵ For example, Pednault et al. (2002) used RL to solve a problem in targeted marketing: deciding on the optimal targeting of promotion efforts in order to maximize the benefits due to promotion.

⁵ Despite this claim, there are several difficulties in applying RL methods to DM problems. A serious issue is that DM problems suggest batches of observations stored in a database, whereas RL methods require incremental accumulation of observations through interaction.
Targeted marketing is a classical DM problem in which the desired response is unknown, and responses taken at one point in time affect the future. (For example, deciding on an extensive campaign for a specific product this month may reduce the effectiveness of a similar campaign the following month.)

Finally, DM may be defined as a process in which computer programs manipulate data in order to provide knowledge about the domain that produced the data. From the point of view implied by this definition, RL definitely needs to be considered a certain type of DM.

20.7 An Instructive Example

In this section, an example problem from the area of supply-chain management is presented and solved through RL. Specifically, the modeling of the problem as an MDP with unknown reward and state-transition functions is shown; the application of Q-Learning is demonstrated; and the relations between RL and DM are discussed with respect to the problem.

The term "supply-chain management" refers to the attempts of an enterprise to optimize the processes involved in purchasing, producing, shipping and distributing goods. Among other objectives, enterprises seek to formulate a cost-effective inventory policy. Consider the problem of an enterprise that purchases a single product from a manufacturer and sells it to end-customers. The enterprise may maintain a stock of the product in one or more warehouses. The stock helps the enterprise respond to customer demand, which is usually stochastic. On the other hand, the enterprise has to invest in purchasing the stock and maintaining it. These activities lead to costs.

Consider an enterprise that has two warehouses in two different locations and behaves as follows. At the beginning of epoch t, the enterprise observes the stock levels s_1(t) and s_2(t) at the first and the second warehouses, respectively. As a response, it may order from the manufacturer quantities a_1(t) and a_2(t) for the first and second warehouses, respectively. The decision of how many units to order for each of the warehouses is taken centrally (i.e., simultaneously by a single decision-maker), but the actual orders are issued separately by the two warehouses. The manufacturer charges c_d for each unit ordered, and an additional c_K for delivering an order to a warehouse (i.e., if the enterprise issues orders at both warehouses, it is charged a fixed 2c_K in addition to the direct costs of the units ordered). It is assumed that there is no lead time (i.e., the units ordered become available immediately after the orders are issued). Subsequently, each of the warehouses observes a stochastic demand. A warehouse that has enough units in stock sells the units and charges p for each sold unit.

If one of the warehouses fails to respond to the demand, whereas the other warehouse, after delivering to its own customers, can spare units, transshipment is initiated. Transshipment means transporting units between the warehouses in order to meet demand. Transshipment costs c_T for each unit transshipped. Any unit remaining in stock by the end of the epoch costs the enterprise c_i for that epoch. The successive epoch begins with the number of units available at the end of the current epoch, and so on. The enterprise wants to formulate an optimal inventory policy (i.e., given the stock levels, and in order to maximize its long-run expected profits, the enterprise wants to know when to issue orders and in what quantities).
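Before the MDP formulation is given, the dynamics described above can be made concrete with a short simulation of a single decision epoch. The sketch below is illustrative only: the function and variable names are not taken from the chapter, the cost and price values (c_d = 2, c_K = 10, c_T = 1, c_i = 0.5, p = 10) are those assumed later in the worked example, and the Poisson demand is the distribution used later in the chapter's simulated experiments (to the learning agent, the demand distribution remains unknown).

```python
import numpy as np

# Cost and price values assumed later in the chapter's worked example.
C_D, C_K, C_T, C_I, P = 2.0, 10.0, 1.0, 0.5, 10.0

def epoch_step(stock, order, demand=None, demand_means=(5, 3), rng=None):
    """Simulate one decision epoch of the two-warehouse system.

    stock and order are (warehouse 1, warehouse 2) pairs; returns the stock
    levels at the start of the next epoch and the immediate profit (reward).
    """
    s1, s2 = stock
    a1, a2 = order

    # Ordering cost: per-unit cost plus a fixed delivery charge for every
    # warehouse that actually places an order (no lead time, so the stock
    # is replenished immediately).
    order_cost = C_D * (a1 + a2) + C_K * ((a1 > 0) + (a2 > 0))
    q1, q2 = s1 + a1, s2 + a2

    # Stochastic demand; Poisson is used here purely for illustration (it is
    # also the distribution assumed in the chapter's simulated experiments).
    if demand is None:
        if rng is None:
            rng = np.random.default_rng()
        demand = (rng.poisson(demand_means[0]), rng.poisson(demand_means[1]))
    d1, d2 = demand

    # Direct sales, then transshipment of spare units to cover unmet demand
    # at the other warehouse (at most one direction can be positive).
    sold1, sold2 = min(q1, d1), min(q2, d2)
    left1, left2 = q1 - sold1, q2 - sold2
    ship_to_1 = min(d1 - sold1, left2)
    ship_to_2 = min(d2 - sold2, left1)
    left1, left2 = left1 - ship_to_2, left2 - ship_to_1
    sold = sold1 + sold2 + ship_to_1 + ship_to_2

    reward = (P * sold - order_cost
              - C_T * (ship_to_1 + ship_to_2)    # transshipment cost
              - C_I * (left1 + left2))           # end-of-epoch holding cost
    return (left1, left2), reward

# With stock (4, 2), order (0, 8) and demand realization (5, 1), this
# reproduces the worked example below: next stock (0, 8), reward 29.
next_stock, r = epoch_step((4, 2), (0, 8), demand=(5, 1))
```

Written this way, the stock levels at the start of an epoch are the state, the ordered quantities are the action, and the returned profit is the reward, which is exactly the MDP reading developed next.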
This problem can be modeled as an MDP (see the definition of an MDP in Section 20.2). The stock levels s_1(t) and s_2(t) at the beginning of an epoch are the states faced by the enterprise's decision-makers. The possible quantities for the two orders are the possible actions given a state. As a consequence of choosing a certain action in a certain state, each warehouse obtains a deterministic quantity-on-hand. As the demand is observed and met (either directly or through transshipment), the actual immediate profit r_t can be calculated as the revenue gained from selling products minus the costs due to purchasing the products, delivering the orders, the transshipments and maintaining inventory. The stock levels at the end of the period, and thus the state for the successive epoch, are also determined. Since the demand is stochastic, both the reward (the profit) and the state-transition function are stochastic.

Assuming that the demand functions at the two warehouses are unknown, the problem of the enterprise is how to solve an MDP with unknown reward and state-transition functions. In order to solve the problem via RL, a large number of experience episodes needs to be presented to an agent. Gathering such experience is expensive, because in order to learn an optimal policy, the agent must explore its environment while simultaneously exploiting its current knowledge (see the discussion of the exploration-exploitation dilemma in Section 20.3.2). However, in many cases learning may be based on simulated experience.

Consider using Q-Learning (see Section 20.3.2) for the solution of the enterprise's problem. Let this application be demonstrated for epoch t = 158 and the initial stock levels s_1(158) = 4, s_2(158) = 2. The agent maintains a unique Q-value for each combination of initial stock levels and ordered quantities (i.e., for each state-action pair). Assuming that the capacity at both warehouses is limited to 10 units of stock, the possible actions given the state are:

A(s_1(t), s_2(t)) = { ⟨a_1, a_2⟩ : a_1 + s_1(t) ≤ 10, a_2 + s_2(t) ≤ 10 }     (20.18)

The agent chooses an action from this set based on some heuristic that resolves the exploration-exploitation dilemma (see the discussion in Section 20.3.2). Assume that the current Q-values for the state s_1(158) = 4, s_2(158) = 2 are as described in Figure 20.1. The heuristic used should tend to choose actions whose corresponding Q-value is high, while allowing each action to be chosen with a positive probability. Assume that the action chosen is a_1(158) = 0, a_2(158) = 8. This action means that the first warehouse does not issue an order while the second warehouse orders 8 units. Assume that the direct cost per unit is c_d = 2 and that the fixed cost for an order is c_K = 10. Since only the second warehouse issued an order, the enterprise's ordering costs are 10 + 8·2 = 26. The quantities-on-hand after receiving the order are 4 units in the first warehouse and 10 units in the second warehouse. Assume the demand realizations are 5 units at the first warehouse and a single unit at the second warehouse. Although the first warehouse can provide only 4 units directly, the second warehouse can spare a unit from its stock, transshipment occurs, and both warehouses meet demand. Assume the transshipment cost is c_T = 1 for each unit transshipped. Since only one unit needs to be transshipped, the total transshipment cost is 1. In epoch 158, six units were sold. Assuming the enterprise charges p = 10 for each unit sold, the revenue from selling products in this epoch is 60. At the end of the epoch, the stock levels are zero units in the first warehouse and 8 units in the second warehouse. Assuming the inventory cost is 0.5 per unit in stock for one period, the total inventory cost for epoch 158 is 4 (= 8·0.5). The immediate reward for the epoch is therefore 60 − 26 − 1 − 4 = 29. The state for the next epoch is s_1(159) = 0 and s_2(159) = 8.

The agent can calculate V_158(s_1(159), s_2(159)) by maximizing over the Q-values corresponding to s_1(159) and s_2(159) that it holds by the end of epoch 158. Assume that the result of this maximization is 25. Assume that the appropriate learning rate for s_1 = 4, s_2 = 2, a_1 = 0, a_2 = 8 and t = 158 is α_158(4, 2, 0, 8) = 0.1, and that the discount factor is γ = 0.9. The agent updates the appropriate entry according to the update rule in Equation 20.14 as follows:

Q_159(⟨4, 2⟩, ⟨0, 8⟩) = (1 − α_158)·Q_158(⟨4, 2⟩, ⟨0, 8⟩) + α_158·[r_158 + γ·V_158(0, 8)]
                      = 0.9·45 + 0.1·[29 + 0.9·25] = 45.65     (20.19)

This update changes the corresponding Q-value as indicated in Figure 20.2.
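The arithmetic of this update is compact enough to verify directly. The following sketch is a minimal tabular Q-Learning update, assuming a dictionary keyed by (state, action) pairs; the function and variable names are illustrative rather than taken from the chapter. It reproduces the epoch-158 numbers, and the same update would be applied once per epoch when training against a simulated environment such as the one sketched earlier.

```python
from collections import defaultdict

GAMMA = 0.9   # discount factor assumed in the text
ALPHA = 0.1   # learning rate alpha_158(4, 2, 0, 8) assumed in the text

# Tabular Q-function: unseen (state, action) entries default to 0.
Q = defaultdict(float)
Q[((4, 2), (0, 8))] = 45.0   # value before the update (as in Figure 20.1)
Q[((0, 8), (0, 2))] = 25.0   # illustrative entry making V_158(0, 8) equal 25

def feasible_actions(state, capacity=10):
    """All order pairs allowed by Equation 20.18 for the given stock levels."""
    s1, s2 = state
    return [(a1, a2)
            for a1 in range(capacity - s1 + 1)
            for a2 in range(capacity - s2 + 1)]

def q_update(Q, state, action, reward, next_state, alpha=ALPHA, gamma=GAMMA):
    """One tabular Q-Learning update (the rule referred to as Equation 20.14)."""
    v_next = max(Q[(next_state, a)] for a in feasible_actions(next_state))
    Q[(state, action)] = ((1 - alpha) * Q[(state, action)]
                          + alpha * (reward + gamma * v_next))
    return Q[(state, action)]

# Epoch 158 of the worked example: reward 29, next state (0, 8).
print(q_update(Q, (4, 2), (0, 8), reward=29.0, next_state=(0, 8)))  # ~45.65
```

An exploration heuristic such as Boltzmann (softmax) selection over these Q-values, mentioned below in connection with the full training run, would then pick which feasible action to try in the next epoch.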
Figure 20.3 shows the learning curve of a Q-Learning agent that was trained to solve the enterprise's problem in accordance with the parameters assumed in this section. The agent was presented with 200,000 simulated experience episodes, in each of which the demands were drawn from Poisson distributions with means 5 and 3 for the first and second warehouses, respectively. The learning rates were set to 0.05 for all t, and a heuristic based on the Boltzmann distribution was used to resolve the exploration-exploitation dilemma (see Sutton and Barto, 1998). The figure plots the moving-average reward (over 2,000 episodes) against the experience of the agent while gaining these rewards.

This section has shown how RL algorithms (specifically, Q-Learning) can be used to learn from observed data. As discussed in Section 20.6, this by itself makes RL, in this case, a DM tool. However, the term DM may imply the use of an SL algorithm. Within the scope of the problem discussed here, SL is inappropriate. A supervised learner could induce an optimal (or at least a near-optimal) policy based on examples of the form ⟨s_1, s_2, a_1, a_2⟩, where s_1 and s_2 describe a certain state and a_1 and a_2 are the optimal responses (order quantities) for that state. However, in the case discussed here, such examples are probably not available.

The methods presented in this chapter are useful for many application domains, such as manufacturing [lr18, lr14], security [lr7, lr10] and medicine [lr2, lr9], and for many data mining techniques, such as decision trees [lr6, lr12, lr15], clustering [lr13, lr8], ensemble methods [lr1, lr4, lr5, lr16] and genetic algorithms [lr17, lr11].

Fig. 20.1. Q-values for the state encountered in epoch 158, before the update. The value corresponding to the action finally chosen is marked.

Fig. 20.2. Q-values for the state encountered in epoch 158, after the update. The value corresponding to the action finally chosen is marked.

Fig. 20.3. The learning curve of a Q-Learning agent assigned to solve the enterprise's transshipment problem.

References

Arbel, R. and Rokach, L., Classifier evaluation under limited resources, Pattern Recognition Letters, 27(14):1619–1631, 2006, Elsevier.

Averbuch, M., Karson, T., Ben-Ami, B., Maimon, O. and Rokach, L., Context-sensitive medical information retrieval, The 11th World Congress on Medical Informatics (MEDINFO 2004), San Francisco, CA, September 2004, IOS Press, pp. 282–286.
Bellman R. Dynamic Programming. Princeton University Press, 1957.

Bertsekas D.P. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, 1987.

Bertsekas D.P., Tsitsiklis J.N. Neuro-Dynamic Programming. Athena Scientific, 1996.

Claus C., Boutilier C. The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems. AAAI-97 Workshop on Multiagent Learning, 1998.

Cohen S., Rokach L., Maimon O. Decision Tree Instance Space Decomposition with Grouped Gain-Ratio. Information Science, 2007; 177(17):3592-3612.

Crites R.H., Barto A.G. Improving Elevator Performance Using Reinforcement Learning. Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, 1996.

Filar J., Vrieze K. Competitive Markov Decision Processes. Springer, 1997.

Hong J., Prabhu V.V. Distributed Reinforcement Learning for Batch Sequencing and Sizing in Just-In-Time Manufacturing Systems. Applied Intelligence, 2004; 20:71-87.

Howard R.A. Dynamic Programming and Markov Processes. M.I.T. Press, 1960.

Hu J., Wellman M.P. Multiagent Reinforcement Learning: Theoretical Framework and Algorithm. In Proceedings of the 15th International Conference on Machine Learning, 1998.

Jaakkola T., Jordan M.I., Singh S.P. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Computation, 1994; 6:1185-201.

Kaelbling L.P., Littman M.L., Moore A.W. Reinforcement Learning: a Survey. Journal of Artificial Intelligence Research, 1996; 4:237-85.

Littman M.L., Boyan J.A. A Distributed Reinforcement Learning Scheme for Network Routing. In Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications, 1993.

Littman M.L. Markov Games as a Framework for Multi-Agent Reinforcement Learning. In Proceedings of the 7th International Conference on Machine Learning, 1994.

Littman M.L. Friend-or-Foe Q-Learning in General-Sum Games. In Proceedings of the 18th International Conference on Machine Learning, 2001.

Maimon O., Rokach L. Data Mining by Attribute Decomposition with semiconductors manufacturing case study. In Data Mining for Design and Manufacturing: Methods and Applications, D. Braha (ed.), Kluwer Academic Publishers, pp. 311-336, 2001.

Maimon O., Rokach L. Improving supervised learning by feature decomposition. In Proceedings of the Second International Symposium on Foundations of Information and Knowledge Systems, Lecture Notes in Computer Science, Springer, pp. 178-196, 2002.

Maimon O., Rokach L. Decomposition Methodology for Knowledge Discovery and Data Mining: Theory and Applications. Series in Machine Perception and Artificial Intelligence, Vol. 61, World Scientific Publishing, ISBN 981-256-079-3, 2005.

Moskovitch R., Elovici Y., Rokach L. Detection of unknown computer worms based on behavioral classification of the host. Computational Statistics and Data Analysis, 2008; 52(9):4544-4566.

Pednault E., Abe N., Zadrozny B. Sequential Cost-Sensitive Decision Making with Reinforcement Learning. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.

Puterman M.L. Markov Decision Processes. Wiley, 1994.

Rokach L. Decomposition methodology for classification tasks: a meta decomposer framework. Pattern Analysis and Applications, 2006; 9:257-271.
Rokach L. Genetic algorithm-based feature set partitioning for classification problems. Pattern Recognition, 2008; 41(5):1676-1700.

Rokach L. Mining manufacturing data using genetic algorithm-based feature set decomposition. Int. J. Intelligent Systems Technologies and Applications, 2008; 4(1):57-78.

Rokach L., Maimon O. Theory and applications of attribute decomposition. IEEE International Conference on Data Mining, IEEE Computer Society Press, pp. 473-480, 2001.

Rokach L., Maimon O. Feature Set Decomposition for Decision Trees. Journal of Intelligent Data Analysis, 2005b; 9(2):131-158.

Rokach L., Maimon O. Clustering methods. In Data Mining and Knowledge Discovery Handbook, Springer, pp. 321-352, 2005.

Rokach L., Maimon O. Data mining for improving the quality of manufacturing: a feature set decomposition approach. Journal of Intelligent Manufacturing, 2006; 17(3):285-299, Springer.

Rokach L., Maimon O. Data Mining with Decision Trees: Theory and Applications. World Scientific Publishing, 2008.

Rokach L., Maimon O., Lavi I. Space Decomposition in Data Mining: A Clustering Approach. In Proceedings of the 14th International Symposium on Methodologies for Intelligent Systems, Maebashi, Japan, Lecture Notes in Computer Science, Springer-Verlag, pp. 24-31, 2003.

Rokach L., Maimon O., Averbuch M. Information Retrieval System for Medical Narrative Reports. Lecture Notes in Artificial Intelligence 3055, Springer-Verlag, pp. 217-228, 2004.

Rokach L., Maimon O., Arbel R. Selective voting: getting more for less in sensor fusion. International Journal of Pattern Recognition and Artificial Intelligence, 2006; 20(3):329-350.

Ross S. Introduction to Stochastic Dynamic Programming. Academic Press, 1983.

Sen S., Sekaran M., Hale J. Learning to Coordinate Without Sharing Information. In Proceedings of the Twelfth National Conference on Artificial Intelligence, 1994.

Sutton R.S., Barto A.G. Reinforcement Learning: An Introduction. MIT Press, 1998.

Szepesvári C., Littman M.L. A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms. Neural Computation, 1999; 11:2017-60.

Tesauro G.T. TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play. Neural Computation, 1994; 6:215-19.

Tesauro G.T. Temporal Difference Learning and TD-Gammon. Communications of the ACM, 1995; 38:58-68.

Watkins C.J.C.H. Learning from Delayed Rewards. Ph.D. thesis, Cambridge University, 1989.

Watkins C.J.C.H., Dayan P. Technical Note: Q-Learning. Machine Learning, 1992; 8:279-92.

Zhang W., Dietterich T.G. High Performance Job-Shop Scheduling With a Time Delay TD(λ) Network. Advances in Neural Information Processing Systems, 1996; 8:1024-30.

21 Neural Networks For Data Mining

G. Peter Zhang
Georgia State University, Department of Managerial Sciences, gpzhang@gsu.edu

Summary. Neural networks have become standard and important tools for data mining. This chapter provides an overview of neural network models and their applications to data mining tasks. We provide a historical development of the field of neural networks and present three important classes of neural models, including feedforward multilayer networks, Hopfield networks, and Kohonen's self-organizing maps. Modeling issues and applications of these models for data mining are discussed.
Key words: neural networks, regression, classification, prediction, clustering

21.1 Introduction

Neural networks, or artificial neural networks, are an important class of tools for quantitative modeling. They have enjoyed considerable popularity among researchers and practitioners over the last 20 years and have been successfully applied to solve a variety of problems in almost all areas of business, industry, and science (Widrow, Rumelhart & Lehr, 1994). Today, neural networks are treated as a standard data mining tool and used for many data mining tasks such as pattern classification, time series analysis, prediction, and clustering. In fact, most commercial data mining software packages include neural networks as a core module.

Neural networks are computing models for information processing and are particularly useful for identifying the fundamental relationship among a set of variables or patterns in the data. They grew out of research in artificial intelligence; specifically, attempts to mimic the learning of biological neural networks, especially those in the human brain, which may contain more than 10^11 highly interconnected neurons. Although the artificial neural networks discussed in this chapter are extremely simple abstractions of biological systems and are very limited in size, ability, and power compared to biological neural networks, they do share two very important characteristics: 1) parallel processing of information and 2) learning and generalizing from experience.