Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science LNAI Series Editors Randy Goebel University of Alberta, Edmonton, Canada Yuzuru Tanaka Hokkaido University, Sapporo, Japan Wolfgang Wahlster DFKI and Saarland University, Saarbrücken, Germany LNAI Founding Series Editor Joerg Siekmann DFKI and Saarland University, Saarbrücken, Germany 7113 Peter Vrancx Matthew Knudson Marek Grze´s (Eds.) Adaptive and Learning Agents International Workshop, ALA 2011 Held at AAMAS 2011 Taipei, Taiwan, May 2, 2011 Revised Selected Papers 13 Series Editors Randy Goebel, University of Alberta, Edmonton, Canada Jörg Siekmann, University of Saarland, Saarbrücken, Germany Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany Volume Editors Peter Vrancx Vrije Universiteit Brussel AI and Computational Modeling Lab 1050 Brussel, Belgium E-mail: pvrancx@vub.ac.be Matthew Knudson Carnegie Mellon University NASA Ames Research Park Moffet Field, CA 94035, USA E-mail: matt.knudson@sv.cmu.edu Marek Grze´s University of Waterloo School of Computer Science Waterloo, ON, N2L 3G1, Canada E-mail: mgrzes@cs.uwaterloo.ca ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-28498-4 e-ISBN 978-3-642-28499-1 DOI 10.1007/978-3-642-28499-1 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2012931796 CR Subject Classification (1998): I.2.6, I.2.11, I.6.8, F.1, I.5 LNCS Sublibrary: SL – Artificial Intelligence © Springer-Verlag Berlin Heidelberg 2012 This work is subject to copyright All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer Violations are liable to prosecution under the German Copyright Law The use of general descriptive names, registered names, trademarks, etc in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com) Preface This book contains selected papers from the 2011 Adaptive and Learning Agents Workshop (ALA2011), held during the Autonomous Agents and Multi-Agent Systems Conference (AAMAS) in Taipei, Taiwan The ALA workshop resulted from the merger of the ALAMAS and ALAg workshops ALAMAS was an annual European workshop on adaptive and learning agents and multi-agent systems, held eight times ALAg was the international workshop on adaptive and learning agents, typically held in conjunction with AAMAS To increase the strength, visibility, and quality of the workshops, ALAMAS and ALAg were combined into the ALA workshop, and a Steering Committee was appointed to guide its development The goal of ALA is to increase awareness and interest in adaptive agent research, encourage collaboration, and provide a representative overview of current research in the area of adaptive and learning agents It aims at bringing together not only different areas of computer science (e.g., agent architectures, 
reinforcement learning, and evolutionary algorithms), but also different fields studying similar concepts (e.g., game theory, bio-inspired control, and mechanism design) The workshop serves as an interdisciplinary forum for the discussion of ongoing or completed work in adaptive and learning agents and multi-agent systems This book contains seven carefully selected papers, which were presented at the ALA2011 workshop Each paper was thoroughly reviewed and revised over two separate review rounds The accepted papers cover a wide range of topics, including: single and multi-agent reinforcement learning, transfer learning, agent simulation, minority games, and agent coordination In addition to these papers, we are also pleased to present an invited chapter by Peter McBurney of the Agents and Intelligent Systems Group at King’s College London Prof McBurney presented his work on “Co-learning Segmentation in Marketplaces” in an invited talk at the ALA2011 workshop We would like to extend our gratitude to everyone who contributed to this edition of the ALA workshop Organizing an event such as ALA would not be possible without the efforts of many motivated people First, we would like to thank all authors who responded to our call-for-papers, as well as our invited speaker, Peter McBurney We are also thankful to the members of our Program Committee for their high-quality reviews, which ensured the strong scientific content of the workshop Finally, we would like to thank the members of the ALA Steering Committee for their guidance, and the AAMAS conference for providing an excellent venue for our workshop October 2011 Peter Vrancx Matt Knudson Marek Grze´s Organization Steering Committee Franziska Klă ugl Daniel Kudenko Ann Nowe Lynne E Parker Sandip Sen Peter Stone Kagan Tumer Karl Tuyls University of Orebro, Sweden University of York, UK Vrije Universiteit Brussel, Belgium University of Tennessee, USA University of Tulsa, USA University of Texas at Austin, USA Oregon State University, USA Maastricht University, The Netherlands Program Chairs Peter Vrancx Matt Knudson Marek Grze´s Vrije Universiteit Brussel, Belgium Carnegie Mellon University, USA University of Waterloo, Canada Program Committee Adrian Agogino Bikramjit Banerjee Vincent Corruble Steven de Jong Enda Howley Franziska Klă ugl W Bradley Knox Daniel Kudenko Ann Now´e Lynne Parker Scott Proper Michael Rovatsos Sandip Sen Istv´an Szita Kagan Tumer Karl Tuyls Katja Verbeeck Pawel Wawrzy´ nski UCSC, NASA Ames Research Center, USA University of Southern Mississippi, USA University of Paris 6, France Maastricht University, The Netherlands National University of Ireland, Ireland University of Orebro, Sweden University of Texas at Austin, USA University of York, UK Vrije Universiteit Brussel, Belgium University of Tennessee, USA Oregon State University, USA University of Edinburgh, UK University of Tulsa, USA University of Alberta, Canada Oregon State University, USA Maastricht University, The Netherlands KaHo Sint-Lieven, Belgium Warsaw University of Technology, Poland Table of Contents Invited Contribution Co-learning Segmentation in Marketplaces Edward Robinson, Peter McBurney, and Xin Yao Workshop Contributions Reinforcement Learning Transfer via Common Subspaces Haitham Bou Ammar and Matthew E Taylor 21 A Convergent Multiagent Reinforcement Learning Approach for a Subclass of Cooperative Stochastic Games Thomas Kemmerich and Hans Kleine Bă uning 37 Multi-agent Reinforcement Learning for Simulating Pedestrian Navigation Francisco 
Martinez-Gil, Miguel Lozano, and Fernando Fern´ andez 54 Leveraging Domain Knowledge to Learn Normative Behavior: A Bayesian Approach Hadi Hosseini and Mihaela Ulieru 70 Basis Function Discovery Using Spectral Clustering and Bisimulation Metrics Gheorghe Comanici and Doina Precup 85 Heterogeneous Populations of Learning Agents in the Minority Game David Catteeuw and Bernard Manderick 100 Solving Sparse Delayed Coordination Problems in Multi-Agent Reinforcement Learning Yann-Michaăel De Hauwere, Peter Vrancx, and Ann Now´e 114 Author Index 135 Co-learning Segmentation in Marketplaces Edward Robinson1, Peter McBurney2, and Xin Yao1 School of Computer Science, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK {e.r.robinson,x.yao}@cs.bham.ac.uk Department of Informatics, King’s College London, Strand London WC2R 2LS, UK peter.mcburney@kcl.ac.uk Abstract We present the problem of automatic co-niching in which potential suppliers of some product or service need to determine which offers to make to the marketplace at the same time as potential buyers need to determine which offers (if any) to purchase Because both groups typically face incomplete or uncertain information needed for these decisions, participants in repeated market interactions engage in a learning process, making tentative decisions and adjusting these in the light of experiences they gain Perhaps surprisingly, real markets typically then exhibit a form of parallel clustering: buyers cluster into segments of similar preferences and buyers into segments of similar offers For computer scientists, the interesting question is whether such co-niching behaviours can be automated We report on the first simulation experiments showing automated co-niching is possible using reinforcement learning in a multi-attribute product model The work is of relevance to designers of online marketplaces, of computational resource allocation systems, and of automated software trading agents Introduction In a famous 1929 paper in economics, Harold Hotelling showed that competing sellers in a marketplace may end up offering to their customers very similar products to one another [6] Imagine that potential customers for ice-cream are distributed uniformly along a beach If there is only one supplier of ice-creams to these customers, the rational location for his or her ice-cream stand is in the middle of the beach, since this minimizes the average distance that customers would need to walk to reach the stand If, however, there are two competing ice-cream suppliers, the rational location for these two suppliers is right beside each other in the middle of the beach This is because this location means that each supplier maximizes the number of potential customers for whom his or her stand is the nearest Any other position for the two seller-stands means that for one seller more than half the potential customers are closer to the other stand; that seller therefore has an incentive to move closer to the middle of the beach Given a fixed and uniform distribution of potential customers, then, the final position of the two sellers will be side-by-side in the middle of the beach If the distribution of potential customers is not uniform but has a finite mean, then the final position of the two sellers will be side-by-side at this mean.1 Assuming that both sellers are rational and both know the distribution of potential customers P Vrancx, M Knudson, and M Grze´s (Eds.): ALA 2011, LNCS 7113, pp 1–20, 2012 c Springer-Verlag Berlin Heidelberg 2012 E Robinson, P 
McBurney, and X Yao Most suppliers would prefer not to be located immediately beside their direct competitors in the conceptual space of their product category Indeed, one view of marketing as an activity is that rational marketers always seek to differentiate a product (or service, or a supplier) in the minds of target customers from alternative products (or services, or suppliers) or alternative means of satisfying the need satisfied by the product, and to differentiate sufficiently that a premium can be charged or that would-be competitors are deterred [10] In addition, because for most product categories customer preferences are complex and diverse — only rarely are they arrayed along a single dimension, as in Hotelling’s model — the range of possible product positionings (seller locations) is immense and their rational selection by a seller complex Not only would a rational seller deciding its product positions consider the distribution of preferences of its potential customers, but also the varied costs and technical requirements of provisioning different offers and positionings, along with the known or likely offers and positionings of competitors.2 Many of these factors essential for a rational positioning decision are unknown to a seller ahead of launch of the product, particularly for those products which create a new product category (e.g., new technologies) As a consequence, seller positioning is often an incremental learning process, in which one or more offers are made to potential customers, with the reactions of buyers, potential buyers, and competitors then being observed, and then the offers modified, withdrawn or replaced with others As a result, what we see in many market categories is a process of self-organization of sellers, in which the sellers gradually settle into a situation where each seller offers different products or services to different target segments of customers, in a form of dynamic positioning But potential customers too may be ignorant of relevant aspects of the marketplace Customers need to learn not only what offers are being made by which suppliers, but also their own preferences across these offers For new product categories, a customer’s preferences may not be known even to themselves in advance of purchase and use For so-called network goods — goods (such as fax machines or networking protocols) where the utility gained by one user depends on the utilities of other users — rational potential customers need to know the preferences of other customers in order to determine their own preferences Thus, customers as well as suppliers may be engaged in incremental co-learning and self-organization, in a form of dynamic segmentation Thus, marketplaces typically exhibit a process of dynamic and incremental co-learning (or co-evolution, or co-self-organization) of product positions and customer segments, with suppliers learning from customers and from other suppliers, and also customers learning from suppliers and from other customers These parallel learning processes inform, and are informed by, each other This phenomenon has also been seen in the games of the CAT Market Design Tournament, where entrants compete to provide exchange services for automated software traders [2,9] However, the clientserver architecture of the CAT Tournament makes it impossible to know to what extent entrant strategies are controlled by humans or are automated For computer scientists interested in software trading and online marketplaces, the question arises whether and to 
what extent these co-learning processes can be automated This paper reports on See [5] for a detailed guide to market positioning decisions in just one product category, that of mobile-phone services Co-learning Segmentation in Marketplaces the first work undertaken in this domain, work which has application for the design of computational resource allocation systems as well as to automated marketplaces This paper is structured as follows Section presents a formal model of a computational marketplace in which sellers offer multi-attribute products or services to potential buyers Potential buyers have different preferences over these attribute-bundles, and so need to decide which to purchase Likewise, sellers have different costs of provision and need to decide which attribute bundles to offer Section discusses in detail what we call the automatic co-niching problem, and considers reinforcement learning approaches to tackle it Section then reports on a simulation study to compare strategies for automatically locating market niches, and Section concludes the paper A Model of Multi-attribute Resource Allocation This section presents the model describing the distributed approach to multi-attribute resource allocation via a set M of distributed competing double auction marketplaces, which are able to choose the type of resource to be traded within their market, while a set of traders T trade in the resource markets that most suit their preferences and constraints While other models and platforms for studying competition between marketplaces exist, e.g., JCAT [2], they only consider single-attribute resource allocation across marketplaces Thus, the work presented here is motivated by the need for a new model of both trader and marketplace behaviour, which will enable study of the proposed approach, because unlike previous models: (i) the resources are multi-attribute in nature, and traders have preferences and constraints over them; and (ii) marketplaces have to specifically choose what types of multi-attribute resources can be traded within their market 2.1 Abstract Computational Resources Many types of computational resource can be accurately specified in terms of a bundle of attributes, because of their often quantifiable nature In this model we consider abstract computational resources, only assuming that a resource comprises a vector π of n non-price attributes: (1) π = π1 , π2 , , πn , where πi ∈ [0, 1] refers to the attribute-level of the i th attribute Resources can be differentiated by their type, which is defined by the levels of each of their attributes Two resources can be considered identical iff all of their attribute-levels are equal, i.e., π ≡ π ⇐⇒ ∀j , πj1 = πj2 Different consumers will have varying minimum resource requirements, which must be satisfied in order that the resource is useful to them Realistically, these requirements might fall upon a minimum level of storage or randomaccess memory for large data-oriented tasks, or processing power for time-sensitive tasks A user can impart these requirements on their trading agent using a vector rai of minimum constraints: rai = ra1i , ra2i , , rani , 116 Y.-M De Hauwere, P Vrancx, and A Now´e where γ ∈ [0, 1) is the discount factor and expectations are taken over stochastic rewards and transitions This goal can also be expressed using Q-values which explicitly store the expected discounted reward for every state-action pair: T (s, a, s ) max Q(s , a ) Q(s, a) = R(s, a) + γ a s (2) So in order to find the optimal policy, one can 
learn this Q-function and subsequently use greedy action selection over these values in every state. Watkins described an algorithm to iteratively approximate the optimal values Q*. In the Q-learning algorithm [15], a table of state-action pairs is stored. Each entry contains the value Q̂(s, a), which is the learner's current hypothesis about the actual value of Q*(s, a). The Q-values are updated according to the following update rule:

Q̂(s, a) ← Q̂(s, a) + α_t [R(s, a) + γ max_{a'} Q̂(s', a') − Q̂(s, a)]        (3)

where α_t is the learning rate at time step t.
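To make the tabular update in Eq. (3) concrete, here is a minimal sketch of how it is commonly implemented; the table layout, action names and parameter values are illustrative assumptions, not the implementation used in this paper.

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step, Eq. (3): move Q(s, a) towards the TD target."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def greedy_action(Q, s, actions):
    """Greedy action selection over the current Q-values, used once learning is done."""
    return max(actions, key=lambda a: Q[(s, a)])

# Q is a table of state-action values, returning 0 for unseen pairs.
Q = defaultdict(float)
actions = ["NORTH", "EAST", "SOUTH", "WEST"]
```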
2.2 Markov Game Definition

In a Markov Game (MG), actions are the joint result of multiple agents choosing an action individually [9]. A_k = {a_k^1, ..., a_k^r} is now the action set available to agent k, with k = 1, ..., n, n being the total number of agents present in the system. Transition probabilities T(s_i, a_i, s_j) now depend on a starting state s_i, an ending state s_j and a joint action taken in state s_i, i.e. a_i = (a_{i1}, ..., a_{in}) with a_{ik} ∈ A_k. The reward function R_k(s_i, a_i) is now individual to each agent k, meaning that agents can receive different rewards for the same state transition. In a special case of the general Markov game framework, the so-called team games or multi-agent MDPs (MMDPs), optimal policies still exist [1,2]. In this case, all agents share the same reward function and the Markov game is purely cooperative. This specialisation allows us to define the optimal policy as the joint agent policy which maximises the payoff of all agents. In the non-cooperative case one typically tries to learn an equilibrium between agent policies [6,5,14]. These systems require each agent to calculate equilibria between possible joint actions in every state, and as such assume that each agent retains estimates over all joint actions in all states.

2.3 Sparse Interactions

Recent research in multi-agent reinforcement learning (MARL) is trying to bridge the gap between a completely independent view of the system state and a fully cooperative system in which agents share all information. Terms such as local or sparse interactions were introduced to describe this new avenue in MARL. This intuition is also captured by a Decentralised Sparse Interaction MDP (DEC-SIMDP). This multi-agent framework consists of a set of single-agent MDPs for the states in which agents are not interacting, together with a collection of MMDPs containing the states and agents in which the agents have to coordinate [11].

Melo and Veloso [10] introduced an algorithm in which agents learn in which states they need to condition their actions on other agents. This approach is called Learning of Coordination and will be referred to in this paper as LoC. As such, their approach is a way of solving a DEC-SIMDP. To achieve this, they augment the action space of each agent with a pseudo-coordination action. This action performs an active perception step: it could, for instance, be a broadcast asking the other agents to divulge their location, or the use of a camera or sensors to detect the location of the other agents. This active perception step decides whether coordination is necessary or whether it is safe to ignore the other agents. Since the penalty of miscoordination is bigger than the cost of using the active perception, the agents learn to take this action only in the interaction states of the underlying DEC-SIMDP. This approach solves the coordination problem by deferring it to the active perception mechanism.

De Hauwere et al. introduced CQ-learning for solving a DEC-SIMDP [3]. This algorithm maintains statistics on the obtained immediate rewards and compares these against a baseline, obtained either from training the agents independently of each other or by tracking the evolution of the rewards over time [4]. As such, states in which coordination should occur can be identified, and the state information of these states is augmented to include the state information of the other agents. These are states in which there is a statistically significant difference between the rewards of acting alone in the environment and acting with multiple agents, or in which the rewards change radically over time. Hence, this technique does not rely on external mechanisms, such as the active perception used in LoC, to do so.

All of these approaches, however, assume that states in which coordination is required can be identified solely on the basis of the immediate rewards. In the following section we show that this assumption is not always met, and that there is thus a need for more general algorithms capable of dealing with this issue, since all currently existing techniques fail in such settings.

3 Learning with Delayed Coordination Problems

3.1 Problem Statement

In single-agent RL, the reward signal an agent receives for an action may be delayed. When multiple agents are acting together and influencing each other, the effect of such interactions may only become apparent during the course of action. Let us consider a variant of the TunnelToGoal environment as an example, depicted in the figure below. Agents have to reach the goal location in a predetermined order, i.e. one designated agent must reach the goal location before the other. This requirement is reflected in the reward signal the agents receive when they reach the goal. If the designated agent is first, both receive a reward of +20; if the other agent is first in the goal state, both agents only get a reward of +10. Independent learners are unable to detect the reason for this change in the reward signal, since they are unaware of the other agent, and as such cannot learn the optimal policy. Agents observing the complete system state are able to solve this problem, but this imposes high requirements on the observability the agents have of the system, or on their communication abilities.

Fig. A variant of the TunnelToGoal environment in which the order in which the agents enter the goal influences the reward they observe (entering in the correct order yields +20 for both agents, the wrong order only +10).

Since the path to the goal is surrounded by walls, the agents must coordinate at the entrance of the goal in order to enter the goal in the correct order. They will, however, only observe the fact that they had to coordinate when it is already too late, i.e. when they have reached the absorbing goal state.

In this section we explain our approach to dealing with delayed coordination problems. So far, all research within the sparse interaction framework has used only immediate rewards as a way to detect the need for coordination. As explained in the previous section, this view is too limited, since it is not acceptable to assume that the need for coordination is reflected immediately in the reward signal following the action. Using the full MG view of the system, such delayed reinforcement signals are propagated through the joint state space, and algorithms using this MG view can still learn optimal policies. It should be clear by now, however, that this view of the system is not a realistic one. A DEC-SIMDP is more realistic, since it models exactly the coordination dependencies that exist between agents in a limited set of states. Since it does not model how these dependencies can be resolved, the DEC-SIMDP remains applicable as a framework for delayed coordination problems.

We will introduce two variants of an algorithm called Future Coordinating Q-learning (FCQ-learning). This algorithm is closely related to CQ-learning, described in [3]. As with CQ-learning, coordination states are detected by means of statistical tests on the reward signal, after which the conflicting states are augmented to include the state information of the agents participating in the interaction.
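To picture this detection idea concretely before turning to FCQ-learning, the sketch below tests a sliding window of recent observations against a baseline recorded while the agent acted alone, using a two-sample Kolmogorov–Smirnov test from SciPy. This is only an illustration under assumed parameters (window length, confidence level); the exact tests and the quantities they are applied to (immediate rewards for CQ-learning, Q-values and sampled future rewards for FCQ-learning) are those described in the cited papers and in the next subsections.

```python
from collections import deque
from scipy import stats

WINDOW = 10          # assumed window length, for illustration only
CONFIDENCE = 0.9999  # flag a state only with very high confidence

class ChangeDetector:
    """Flags a state-action pair whose recent observations deviate from a baseline."""

    def __init__(self, baseline_samples):
        # Samples gathered while the agent learned alone in the environment.
        self.baseline = list(baseline_samples)
        self.window = deque(maxlen=WINDOW)

    def observe(self, value):
        """Add one observation; return True once the window differs significantly."""
        self.window.append(value)
        if len(self.window) < WINDOW:
            return False
        _, p_value = stats.ks_2samp(list(self.window), self.baseline)
        return p_value < (1.0 - CONFIDENCE)
```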
exactly the coordination dependencies that exist between agents in a limited set of states Since it is not modeled how these dependencies can be resolved, the DEC-SIMDP is still applicable as a framework for delayed coordination problems We will introduce two variants of an algorithm, called Future Coordinating Q-learning (FCQ-learning) This algorithm is closely related to CQ-learning described in [3] As with CQ-learning, coordination states are detected by means of statistical tests on the reward signal, after which the conflicting states are augmented to include the state information of the agents participating in the interaction Solving Sparse Delayed Coordination Problems in MARL 3.2 119 FCQ-Learning with Initialised Agents The first variant of FCQ-learning assumes that agents have been learning for some time alone in the environment As a result, their Q-values have converged to the true state-action values These Q-values will be the baseline for the statistical tests that will determine in which states coordination is needed The basic idea is that if agents experience a negative influence from each other, the Q-values for certain state-action pairs will decrease Since the Q-values are used to bootstrap, this influence will gradually spread throughout the Q-table We illustrate this effect in the environment depicted in Figure 2(a) The agent’s initial position is marked with an X, its goal, with the letter G One agent was learning alone in the environment and was given a reward of +20 for reaching the goal Moving into a wall was penalised with −1 All other actions resulted in a payoff of The agent was trained using a learning rate of 0.02 and acted completely random until its Q-values converged This exploration strategy ensures that all state-action pairs are visited enough to allow the Q-values to converge to the true state-action values After convergence, this Q-table was stored in Q∗ 20 (12,4) (13,4) (14,4) (15,4) (10,3) (9,2) (8,2) (7,2) (6,2) (1,3) 18 16 10 G 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 14 12 10 (a) 50 100 150 200 250 Episodes 300 350 400 450 500 (b) Fig Left: Evolution of the states in which a KS-test for goodness of fit detects a change in the Q-values The darker the shade of the cell, the earlier the change is detected Right: Evolution of the Q-values for the optimal policy after the reward signal for reaching the goal was altered from +20 to +10 After the learning process, the reward for reaching the goal was decreased to +10 and the agent selected its actions using an -greedy strategy with = 0.1 In Figure 2(b) we show the evolution of the Q-values for the actions of the policy to which the agent converged In the legend of this figure we show the index of the state (which corresponds to the indices in Figure 2) together with the index of the action (1 = NORTH, = EAST, = SOUTH, = WEST) The state at the top of the legend is the one closest to the goal, the one at the bottom is the 120 Y.-M De Hauwere, P Vrancx, and A Now´e initial position of the agent We see that the Q-values quickly drop near the goal, followed by the Q-values for states further and further away from the goal until the start location of the agent To detect these changes statistically, FCQ-learning uses a KolmogorovSmirnov test (KS-test) for goodness of fit This statistical test can determine the significance of the difference between a given population of samples and a specified distribution Since the agents have converged to the correct Q-values, the algorithm will compare the evolution of the 
Q-values when multiple agents are present to the values it learned when acting alone in the environment To validate this idea we tested its concepts in the TunnelToGoal environment A window of Q-values was maintained with the last N values of the Q-value of that particular state-action pair in the experiment described above We will refer to this window as WkQ (sk , ak ) This window, contains the evolution of the Q-value of that state-action pair over the last N updates after we decreased the reward for reaching the goal A KS-test for goodness of fit was used to compare the values of WkQ (sk , ak ), to the optimal Q-value Q∗ (sk , ak ) The order in which significant changes in the Q-values are detected is shown in Figure The darker the shade of the cell, the earlier the change was detected The KS-test detected this change first in the Q-values of the cell adjacent to the goal state Since the Q-values are still being updated, the KS-test continued detecting changes further away from the goal, towards the starting position of the agent This experiment was done using a confidence level of 99.99% for the KS-test Even with this confidence, the test correctly identifies the states in which the Q-values change due to the changed reward signal and does not identify additional changes due to small fluctuations in the Q-values These states narrow down the set of states we have to consider to detect in which state we actually have to coordinate In these states, our Q-values are significantly deteriorating and since the Q-values give an indication of the best possible future rewards an agent can expect from that state onward, it is in these states that FCQ-learning will sample the state information of other agents, together with the received rewards until the episode ends This approach of collecting rewards until termination of an episode is known as Monte Carlo sampling Again, this principle is similar to how CQ-learning samples, but in FCQ-learning the collected rewards until termination of the episode are stored instead of just the immediate rewards These rewards are also grouped, based on the state information of the other agents This is shown in Figure In every local state in which a change in the Q-values was detected, the agent will observe the state information of the other agents when it is at that local state and collect the rewards until termination of the episode When the algorithm has collected enough samples, it performs a Friedmann test This non-parametric statistical test is used to compare observations repeated on the same subjects, or in this case, on the same local states Using a multiple comparison test on this statistical information, the algorithm can determine which state information of other agents is influencing these future rewards and hence augment the local state of the agent with the relevant information about other agents It should Solving Sparse Delayed Coordination Problems in MARL 121 be noted that these states will be augmented in a similar order as the changes in the Q-values are being detected The algorithm will however continue augmenting states, until it reaches a state in which the coordination problem can actually be solved For every augmented state a confidence value is maintained which indicates how certain the algorithm is that this is indeed a state in which coordination might be beneficial Fig Detecting conflict states with FCQ-learning This value is updated at every visit of the local state from which the augmented state was created If an action is selected 
using augmented state information, the confidence value is increased If only local state information is used, the confidence values of all the augmented states, which were created from this local state are decreased Increasing the value was done by multiplying the confidence value with 1.1, decreasing it by multiplying the value with 0.99 These values allow for some robustness before augmented states are being reduced again to local states 122 Y.-M De Hauwere, P Vrancx, and A Now´e The action selection works as follows The agent will check if its current local state is a state which has been augmented to include the state information of other agents If so, it will check if it is actually in the augmented state This means that it will observe the global state to determine if it contains its augmented state If this is the case, it will condition its action based on this augmented state information, otherwise it acts independently using only its own local state information If its local state information has never been augmented it can also act without taking the other agents into consideration We distinguish two cases for updating the Q-values: An agent is in a state in which it used the global state information to select an action In this situation the following update rule is used: Qjk (sk , ak ) ← (1 − αt )Qjk (sk , ak ) + αt [rk (s, a) + γ maxak Qk (sk , ak )] where Qk stands for the Q-table containing the local states, and Qjk contains the augmented states using global information (sk ) Note that this augmented Q-table is initially empty The Q-values of the local states of an agent are used to bootstrap the Q-values of the states that were augmented with global state information Using this update scheme represents the local adaptation an agent performs to its policy in order to solve the coordination problem, before following its optimal policy again An agent is in a state in which it selected an action using only its local state information In this case the standard single agent Q-learning rule is used with only local state information The pseudo code for FCQ-Learning is given in Algorithm 3.3 FCQ-Learning with Random Initial Q-Values Having initialised agents beforehand which have learned the correct Q-values to complete the single agent task is an ideal situation, since agents can transfer the knowledge they learned in a single agent setting to a multi-agent setting, adapting only their policy when they have to Since this is not always possible, we propose a simple variant of FCQ-learning In the algorithm presented in Section 3.2, the initialised Q-values are being used for the KS-test which will detect in which states the agent should start sampling rewards As such, this test prevents sampling rewards and state information about the other agents in those states where this is not necessary, since it allows an agent to only sample in those states that are being visited by the current policy and in which a change has been detected If this limited set of states in which coordination problems should be explored cannot be obtained because it is impossible to train the agents independently first, it is possible to collect samples for every state-action pair at every timestep This results in a lot more data to run statistical tests on, Solving Sparse Delayed Coordination Problems in MARL 123 most of which will be irrelevant, but relaxes the assumption of having the optimal Q-values of the single agent problem beforehand The changes in Algorithm for this variant are to remove the lines 
regarding the KS-test on lines 11 to 14 and line 19 and to change the training of the agents on line Algorithm FCQ-Learning algorithm for agent k 1: Train Qk independently first and store a copy in Qk , initialise Qaug to zero, and k list of sample states to {}; 2: Set t = 3: while true 4: observe local state sk (t) 5: if sk (t) is part of an augmented state sk and the information of sk is present in s(t) then using sk 6: Select ak (t) according to Qaug k 7: else 8: Select ak (t) according to Qk using sk 9: end if 10: observe rk = Rk (s(t), a(t)), s k from T (s(t), a(t)) 11: if KS-test fails to reject the hypothesis that the Q-values of Qk (sk (t), ak (t)) are the same as Qk (sk (t), ak (t)) then 12: add state sk (t) to the list of sample states 13: end if 14: if sk (t) is a sample state then 15: Store the state information of other agents, and collect the rewards until termination of the episode 16: if enough samples have been collected then 17: perform Friedmann test on the samples for the state information of the other agents If the test indicates a significant difference, augment sk to include state information of the other agents for which a change was detected 18: end if 19: end if 20: if sk (t) is part of an augmented state sk and the information of sk is present in s(t) then aug 21: Update Qaug k (sk , ak ) ← (1 − αt )Qk (sk ) + αt [rk + γ maxa k Qk (s k , a k )] 22: increment confidence value for sk 23: else 24: Update Qk (sk ) ← (1 − αt )Qk (sk ) + αt [rk + γ maxa k Qk (s k , a k )] 25: decrease confidence value for all sk = sk , sl for which sl is not present in s(t) 26: end if 27: t = t+1 28: end while 124 Y.-M De Hauwere, P Vrancx, and A Now´e Experimental Results We use a set of two and three-agent gridworld games in which we introduced delayed coordination problems These environments are shown in Figure Agents can collide with each other in every cell, and in environments (a), (b) and (c) the agents also have to enter the goal location in a specific order In environment (d), if agents adopt the shortest path to the goal, they collide in the middle of the corridor The initial positions of the agents are marked by an X, their goals are indicated with a bullet in the same colour as the initial position of the agent For environments in which the agents share the same goal, the goal is indicated with a linear blend The reward function is as follows A reward of +20 is given if they reach the goal in the correct order, otherwise they only receive a reward of +10 If agents collide, a penalty of −10 is given Moving into a wall is penalised with −1 These environments are a simplified version of a production process where the different parts that constitute the finished product have to arrive in a certain order to the assembly unit All experiments were run for 20,000 episodes (an episode was completed when all agents were in the goal state) using a learning rate of 0.1 with a time limit of 500,000 steps per episode Exploration was regulated using a fixed -greedy policy with = 0.1 If agents collided they remained in their respective original locations and receive a penalty for colliding On all other occasions, transitions and rewards were deterministic The results described in the remainder of this paragraph are the running averages over 50 episodes taken over 50 independent runs The size of the queue with the stored samples was 10 We compared both FCQ-variants to independent Q-learners (Indep) that learned without any information about the presence of other agents in the environment, 
joint-state learners (JS), which received the joint location of the agents as state information but chose their actions independently and with LoC For LoC we could not implement a form of virtual sensory input to detect when coordination was necessary for the active perception step The reason for this is that a sensor cannot determine the need for interaction in the future To circumvent this issue, we used a list of joint states in which coordination with the other agent would be better than to play independent1 For environment (b) for instance (TunnelToGoal ), this list contained all the joint states around the entrance of the tunnel Note that FCQ-Learning is learning this list of states in which the active perception function returns true and this information should not be given beforehand An overview of the final solutions found by the different algorithms is given in Table 1 As such this implementation could be seen as incorporating domain knowledge in the algorithm If this knowledge however is not available, an active perception function that always returns true, might be a good option Indep JS LOC FCQ FCQ NI Indep JS LOC FCQ FCQ NI Indep JS LOC FCQ FCQ NI Indep JS LOC FCQ FCQ NI Grid game Bottleneck TunnelToGoal (3 agents) TunnelToGoal Algorithm Environment 81 9.0 ± 0.0 19.4 ± 4.4 21.7 ± 3.1 25 625 29.7 ± 2.4 71.3 ± 23.4 71.3 ± 28.0 55 166, 375 67.08 ± 10.4 148.0 ± 79.9 146.34 ± 76.3 43 1849 54.0 ± 0.8 124.5 ± 32.8 135.0 ± 88.7 #states 4 4 4 4 4 4 4 4 2.4 ± 0.0 0.1 ± 0.0 1.8 ± 0.0 0.1 ± 0.0 0.1 ± 0.0 0.7 ± 0.0 0.0 ± 0.0 0.5 ± 0.0 0.2 ± 0.0 0.2 ± 0.0 0.7 ± 0.1 0.0 ± 0.0 0.6 ± 0.0 0.3 ± 0.0 0.3 ± 0.0 n.a 0.0 ± 0.0 1.7 ± 0.6 0.1 ± 0.0 0.2 ± 0.0 #actions #collisions reward 22.7 ± 30.4 −24.3 ± 35.6 6.3 ± 0.3 18.2 ± 0.6 10.3 ± 2.7 −6.8 ± 8.0 8.1 ± 13.9 17.6 ± 3.7 7.1 ± 6.9 17.9 ± 0.7 37.9 ± 171.0 6.4 ± 3.6 14.9 ± 8.5 16.5 ± 19.7 20.0 ± 33.0 5.7 ± 15.7 14.8 ± 10.7 13.6 ± 12.8 16.4 ± 31.3 11.6 ± 41.7 23.5 ± 67.8 0.2 ± 48.8 24.9 ± 36.6 9.1 ± 28.6 24.5 ± 38.2 −1.9 ± 827.3 14.4 ± 2.5 14.0 ± 3.1 14.4 ± 3.7 14.1 ± 3.8 n.a n.a 23.3 ± 30.8 13.1 ± 36.1 167.2 ± 19, 345.1 −157.5 ± 10, 327.0 17.3 ± 1.3 16.6 ± 0.4 19.2 ± 5.6 15.4 ± 2.3 #steps Table Size of the state space, number of collisions and number of steps for different approaches in the different games (Indep = Independent Q-Learners, JS = Joint-state learners, FCQ = FCQ-Learners, with correctly initialised Q-values, FCQ NI = FCQ-Learners without correctly initialised Q-values.) 
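As a compact reference, the learning parameters and reward scheme used in these experiments, as stated in the text above, can be collected in a small configuration; the class and constant names below are our own illustrative choices, not part of the original implementation.

```python
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    episodes: int = 20_000      # learning episodes per run
    runs: int = 50              # independent runs averaged in the results
    step_limit: int = 500_000   # time limit per episode
    alpha: float = 0.1          # learning rate
    epsilon: float = 0.1        # fixed epsilon-greedy exploration
    sample_queue: int = 10      # size of the queue of stored samples

# Reward scheme of the gridworld games, as described in the text.
REWARD_GOAL_CORRECT_ORDER = +20
REWARD_GOAL_WRONG_ORDER = +10
PENALTY_COLLISION = -10
PENALTY_WALL = -1
```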
Solving Sparse Delayed Coordination Problems in MARL 125 126 Y.-M De Hauwere, P Vrancx, and A Now´e Besides collision free, these solutions should yield the highest reward per episode and the least number of steps to complete an episode The results are shown All values are averaged over the last 100 episodes after agents converged to a policy In the smallest environments the agents always using the joint state space perform best This is due to the fact that since agents actively have to coordinate and enter the goal in a particular order, always observing the other agents provides all the sufficient information In small environments this is still manageable LoC is unable to reach acceptable results compared to the other approaches Its active perception function is giving the correct states in which coordination should occur, but since this is not reflected in the immediate reward signal, the penalty for using this action is too big An adaptation to use the sum of the rewards until termination of an episode could be beneficial, but as shown in [10], there is an important relation between the immediate rewards and the penalty for miscoordination Finding the right balance for the reward signal when this dependency between agents is reflected in the future rewards might prove to be very hard or even impossible, since this is not necessary uniform over the state space FCQ-learning does not require such fine tuning of the reward signal for the specific problem task at hand and is as such more suitable for these future coordination issues In environments with larger state spaces, both FCQ variants reach policies that require a smaller number of steps to complete an episode than the other approaches In the largest environment, TunnelToGoal with agents, FCQ-learning outperforms all others in both number of steps to complete an episode and the average reward collected per episode Independent learners simply don’t have the required information to complete this task, whereas joint-state learners have too much information which causes the learning process to be very slow Moreover, a lack of sufficient exploration still results in suboptimal policies after 20,000 learning episodes In Figure we show some sample solutions found by FCQ-learning for the different environments Agent is indicated in red, Agent in blue and Agent 3, if present, in green Arrows with full tails represent actions taken using only local state information The arrows with dotted tails represent actions taken based on augmented state information For environments (a), (b) and (c), Agent (red) has to reach the goal before Agent (blue) and Agent in its turn had to enter the goal state before Agent (green) if there are three agents present In all environments FCQ-learning correctly coordinated In Environment (b), we see that Agent performed a small loop to let Agent pass first Similarly Agent also ’delayed’ for quite some time before going towards the entrance of the tunnel to reach the goal Note that these policies are still using an -greedy strategy with = 0.1, so the agents sometimes performed an exploratory action This why Agent (in red) did not follow the shortest path in environment (b) In the other environments we see that agents also correctly coordinate either by performing a ’wait’ action, or by taking the appropriate actions to let the other agent pass Solving Sparse Delayed Coordination Problems in MARL 127 G G (a) (b) G2 G1 G (c) (d) Fig Sample solutions found by FCQ-learning for the different environments Agent is indicated 
in red, Agent 2 in blue and Agent 3 in green.

So far we have shown through these experiments that FCQ-learning manages to find good policies which are both collision free and in which the agents have successfully solved the future interactions between them. Next, we are also concerned with the learning speed, as this is the issue most multi-agent approaches suffer from when using the complete joint-state joint-action space. In Figure 5 we show the evolution of the rewards the agents collect per episode. Both independent learners and LoC have trouble reaching the goal in the correct order and quickly settle for a suboptimal policy. JS improves its reward over time, but in the TunnelToGoal environment with three agents (Figure 5(b)) this approach needs over 2,000 learning episodes more than the FCQ-variants to obtain a reward level that is still slightly lower than that of FCQ. With FCQ we see the effect of the sampling phase, during which a decrease in the reward is observed; this decrease allows for the augmentation of states, and the reward quickly increases again. In the Bottleneck environment (Figure 5(d)), FCQ-learning needs more time than the joint-state learners to reach a stable policy, but this policy results in a higher average payoff than the policy of JS.

[Figure 5: four panels of learning curves, reward per episode versus learning episodes, for Indep, JS, LOC, FCQ and FCQ_NI.]
Fig. 5. Reward collected per episode by the different algorithms for the (a) Grid game 2, (b) TunnelToGoal 3, (c) TunnelToGoal and (d) Bottleneck environments.
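The curves in Figure 5, and in the step and collision plots that follow, are running averages over 50 episodes taken over 50 independent runs, as described in the experimental setup. A small sketch of how such curves can be computed from logged per-episode rewards is given below; the array layout is an assumption made for illustration.

```python
import numpy as np

def learning_curve(rewards_per_run, window=50):
    """rewards_per_run: array of shape (runs, episodes), reward obtained per episode.

    Returns the running average over `window` episodes of the mean across runs."""
    mean_over_runs = np.asarray(rewards_per_run, dtype=float).mean(axis=0)
    kernel = np.ones(window) / window
    return np.convolve(mean_over_runs, kernel, mode="valid")

# Example: curve = learning_curve(logged_rewards); plot curve against the episode index.
```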
episodes (c) TunnelToGoal 8000 10000 2000 4000 6000 8000 10000 episodes (d) Bottleneck Fig Number of steps needed to complete an episode by the different algorithms for the (a) Grid game 2, (b) TunnelToGoal 3, (c) TunnelToGoal and (d) Bottleneck environments 130 Y.-M De Hauwere, P Vrancx, and A Now´e 3.0 In this environment agents have to take four consecutive actions to pass through the corridor to reach their goal If the Q-values of these actions in these states are not the highest ones, the probability on this happening through consecutive exploratory actions is 0.0001 Finally, we show the average number of collisions per episode in Figure Again we see the effect of the sampling phase of both FCQ-learning variants The number of collisions between the agents using this algorithm increases until the states in which coordination is required are augmented, after which this number drops to Again, JS-learners need more episodes in the TunnelToGoal environment compared to both FCQ-learning algorithms and both independent learners as LoC are unable to learn collision free policies 2.5 Indep JS FCQ FCQ_NI LOC 1.5 collisions 0.0 0.5 1.0 collisions 2.0 Indep JS FCQ FCQ_NI LOC 500 1000 1500 2000 2500 3000 2000 4000 episodes 8000 10000 (b) TunnelToGoal 3.0 3.0 (a) Grid Game 2.5 JS FCQ FCQ_NI LOC 0.0 0.0 0.5 0.5 1.0 1.5 1.5 collisions 2.0 2.0 2.5 Indep JS FCQ FCQ_NI LOC 1.0 collisions 6000 episodes 2000 4000 6000 episodes (c) TunnelToGoal 8000 10000 2000 4000 6000 8000 10000 episodes (d) Bottleneck Fig Number of collisions per episode for the different algorithms for the (a) Grid game 2, (b) TunnelToGoal 3, (c) TunnelToGoal and (d) Bottleneck environments ... e-ISSN 161 1-3 349 ISBN 97 8-3 -6 4 2-2 849 8-4 e-ISBN 97 8-3 -6 4 2-2 849 9-1 DOI 10.1007/97 8-3 -6 4 2-2 849 9-1 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 20129 31796 CR Subject... incremental co -learning and self-organization, in a form of dynamic segmentation Thus, marketplaces typically exhibit a process of dynamic and incremental co -learning (or co-evolution, or co-self-organization)... ALAMAS and ALAg workshops ALAMAS was an annual European workshop on adaptive and learning agents and multi-agent systems, held eight times ALAg was the international workshop on adaptive and learning