International Journal of Advanced Robotic Systems
ARTICLE - Regular Paper

Affordance Learning Based on Subtask's Optimal Strategy

Huaqing Min, Chang'an Yi*, Ronghua Luo, Sheng Bi, Xiaowen Shen and Yuguang Yan
South China University of Technology, Guangzhou, China
*Corresponding author. E-mail: yi.changan@mail.scut.edu.cn

Received 22 January 2014; Accepted 12 February 2015
DOI: 10.5772/61087

© 2015 Author(s). Licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Affordances define the relationships between the robot and the environment, in terms of actions that the robot is able to perform. Prior work is mainly about predicting the possibility of a reactive action, and the object's affordance is invariable. However, in the domain of dynamic programming, a robot's task can often be decomposed into several subtasks, and each subtask can limit the search space. As a result, the robot only needs to replan its sub-strategy when an unexpected situation happens, and an object's affordance might change over time depending on the robot's state and current subtask. In this paper, we propose a novel affordance model linking the subtask, object, robot state and optimal action. An affordance represents the first action of the optimal strategy under the current subtask when an object is detected, and its influence is promoted from a primitive action to the subtask strategy. Furthermore, hierarchical reinforcement learning and a state abstraction mechanism are introduced to learn the task graph and reduce the state space. In the navigation experiment, the robot, equipped with a camera, could learn the objects' crucial characteristics and gain their affordances in different subtasks.
Keywords: cognitive robotics, affordance, subtask strategy, hierarchical reinforcement learning, state abstraction

1. Introduction

Humans can solve different tasks in a routine and very efficient way by selecting the appropriate actions or tools to obtain the desired effect. Furthermore, their skills are acquired incrementally and continuously through interactions with the world and other people. Research on human and animal behaviour has long emphasized its hierarchical structure: the divisibility of ongoing behaviour into subtask sequences, which in turn are built of simple actions. For example, a long-distance driver knows how to reach the destination following the shortest path even if some roads are unexpectedly blocked. In this paper, we discuss such cognitive skills in the context of robots capable of acting in a dynamic world and interacting with objects in a flexible way. What knowledge representations or cognitive architecture should such a system possess to act in such an unpredictable environment? How can the system acquire task- or domain-specific knowledge to be used in new situations?

To answer these questions, we resort to the concept of affordance originated by the American psychologist J. J. Gibson [1], who defined an affordance as the potential action between the environment and the organism. According to Gibson, some affordances are learned in infancy when the child experiments with external objects. Infants first notice the affordances of objects and only later begin to recognize their properties; they are active perceivers and can perceive the affordances of objects early in development. Although Gibson did not give a specific way to learn affordances, the term has been adopted and further developed in many research fields, ranging from art design [2] and human-computer interaction [3] to robot cognition [4].

Affordances play an important role in a robot's basic cognitive capabilities such as prediction and planning; however, two points should be stressed here. First, an affordance is an inherent property jointly determined by the robot and the environment. For instance, the climb-ability of a stair step is determined not only by the metric measure of the step height, but also by the robot's leg length. Second, the robot system must first know how to perform a number of actions and develop some perceptual capabilities before learning the affordances. Under the concept of affordance, what the robot perceives is not necessarily object names (e.g., doors, cups, desks), but action possibilities (e.g., passable, graspable, sittable). Furthermore, the affordance of an object might change over time depending on its use; e.g., a cup might first be reachable, then graspable, and finally pourable. From the perspective of cognitive robotics, affordances are extremely powerful, since they capture essential object and environment properties, in terms of the actions that the robot is able to perform, and enable the robot to become aware of action possibilities early [6].

Compared with previous research, the main contribution of this paper lies in our novel affordance model: (i) the influence of the affordance is promoted from a primitive action to the subtask strategy; (ii) an object's affordance is related to the optimal strategy of the current subtask, and it might change over time in a dynamic environment; (iii) hierarchical reinforcement learning (HRL) and a state abstraction mechanism are applied to learn the subtasks simultaneously and to reduce the state space.

The rest of this paper is organized as follows. We start with a review of the related work in section 2. Section 3 introduces our affordance model. Section 4 describes the navigation example that is used throughout the paper. Section 5 is about the learning framework. Section 6 presents the experiment carried out in our simulation platform. Finally, we conclude the paper in section 7.

2. Related Work

In this section, we discuss affordance research in the robotics field. According to the interaction target of the robot, current research can be classified into four categories: an object's manipulation affordance, an object's traversability affordance, an object's affordance in a human-robot context, and a tool's affordance. Under these affordance models, the perceptual representation is discrete or continuous, and some typical learning methods applied in the models are shown in Table 1. Affordance formalization, which could provide a unified autonomous control framework, has also gained a great deal of attention [5].
Typical learning method | Affordance model | Advantages
Reinforcement learning [17, 18] | Object's manipulation affordance | Incremental learning of primitive actions, and context generalization
Bayesian network [6, 7] | Object's manipulation affordance | Prediction and planning in a bi-directional way
Statistical relational learning [14] | Object's manipulation affordance | Models multi-object relationships
Ontology knowledge [15, 16] | Object's manipulation affordance | Handles an object's sudden appearance or disappearance
Support vector machine [9, 10, 21] | Object's manipulation and traversability affordance | Prediction and multi-step planning
Probabilistic graphical model [19, 20] | Object's traversability affordance | Discriminative and generative model for incremental learning
Markov random field [23] | Object's affordance in human-robot context | Learns object affordances in a human context from 3D data, in which the human activities span long durations

Table 1. Typical learning methods under current affordance models

2.1 Object's Manipulation Affordance

This kind of research is focused on predicting the opportunities or effects of exploratory behaviours. For instance, Montesano et al. used a probabilistic network that captured the stochastic relations between objects, actions and effects. That network allowed bi-directional relation learning and prediction, but could not perform more than one-step prediction [6, 7]. Hermans et al. proposed the use of physical and visual attributes as a mid-level representation for affordance prediction, and that model could result in superior generalization performance [8]. Ugur et al. encoded the effects and objects in the same feature space; their learning system shared crucial elements, such as goal-free exploration and self-observation, with infant development [9, 10]. Hart et al. introduced a paradigm for programming adaptive robot control strategies that could be applied in a variety of contexts; furthermore, behavioural affordances are explicitly grounded in the robot's dynamic sensorimotor interactions with its environment [11-13]. Moldovan et al. employed recent advances in statistical relational learning to learn affordance models for multiple objects that interact with each other, and their approach could be generalized to arbitrary objects [14]. Hidayat et al. proposed an affordance-based ontology for semantic robots; their model divided the robot's actions into two levels, object selection and manipulation. Based on these semantic attributes, that model could handle situations where objects appear or disappear suddenly [15, 16]. Paletta et al. presented a framework of reinforcement learning for perceptual cueing to opportunities for interaction of robotic agents; features relevant for prediction towards affordance-like control in interaction could be successfully selected, and they considered affordance perception to be a basic cognitive capability of robots [17, 18].

2.2 Object's Traversability Affordance

This kind of research is about robot traversal in large spaces. Sun et al. provided a probabilistic graphical model which utilized discriminative and generative training algorithms to support incremental affordance learning: their model casts visual object categorization as an intermediate inference step in affordance prediction, and could predict the traversability of terrain regions [19, 20]. Ugur et al. studied the learning and perception of traversability affordances on mobile robots, and their method is useful for researchers from both ecological psychology and autonomous robotics [21].
2.3 Object's Affordance in a Human-Robot Context

Unlike the working environments presented above, Koppula et al. showed that human-actor based affordances are essential for robots working in human spaces in order for them to interact with objects in a human-desirable way [22]. They treated affordance detection as a classification problem: their model was based on a Markov random field and could detect human activities and object affordances from RGB-D videos. Heikkila formulated a new affordance model for astronaut-robot task communication, which could involve the robot having a human-like ability to understand the affordances in task communication [23].

2.4 Tool's Affordance

The ability to use tools is an adaptation mechanism used by many organisms to overcome the limitations imposed on them by their anatomy. For example, chimpanzees use stones to crack nuts open and sticks to reach food, dig holes, or attack predators [24]. However, studies of autonomous robotic tool use are still rare. One representative example is from Stoytchev, who formulated a behaviour-grounded computational model of tool affordances in the behavioural repertoire of the robot [25, 26].

3. Our Affordance Model

Affordance-like perception could enable the robot to react to environmental stimuli both more efficiently and more autonomously. Furthermore, when planning based on an object's affordance, the robot system will be less complex and still more flexible and robust [27], and the robot could use learned affordance relations to achieve goal-directed behaviours with its simple primitive behaviours [28]. The hierarchical structure of behaviour has also been of enduring interest within neuroscience, where it has been widely considered to reflect prefrontal cortical functions. The intrinsic motivation approach to subgoal discovery in HRL dovetails with psychological theories suggesting that human behaviour is motivated by a drive toward exploration or mastery, independent of external reward [29].

In the existing approaches, the affordance is related to only one action, and the task is finished after that action has been executed.
However, sometimes the task can be divided into several subtasks, which can be described in a hierarchical graph, and the robot needs a number of actions to finish each subtask following the optimal strategy. In this paper, we propose an affordance model as the natural mapping from the subtask, object and robot state to the optimal action, as illustrated in Figure 1. In this model, the affordance represents the action upon the object under the optimal strategy of the current subtask. Furthermore, each subtask has its own goal, and the optimal strategy of a subtask often needs to change when an unexpected situation happens in a dynamic environment. Based on Figure 1, the formalization of our affordance model is:

optimal_action = f(subtask, object, robot_state)     (1)

Affordance prediction is a key task in autonomous robot learning, as it allows a robot to reason about the actions it can perform in order to accomplish its goals [8]. This affordance model is somewhat similar to the models proposed by Montesano and Sahin [5-7]; they all emphasize the relationship among the action, object and effect, whereas ours pays more attention to the goal and strategy of the subtask.

Figure 1. Our affordance model describes the mapping from subtask, object and robot state to the optimal action, which represents the first action of the optimal strategy.
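As a concrete, purely illustrative reading of equation (1), the sketch below shows how the mapping could be evaluated once the subtask value functions have been learned: the affordance is the greedy first action of the current subtask's strategy, recomputed from the robot state and the detected object. The q_value table and all names here are our own assumptions, not the paper's implementation (which is written in C++).

    PRIMITIVE_ACTIONS = ["North", "South", "West", "East"]

    def affordance(subtask, detected_object, robot_state, q_value):
        # Equation (1): optimal_action = f(subtask, object, robot_state).
        # The detected object is folded into the state on which the subtask's
        # learned action values condition; the affordance is then the greedy
        # first action of the subtask's optimal strategy.
        state = (robot_state, detected_object)
        return max(PRIMITIVE_ACTIONS,
                   key=lambda a: q_value[(subtask, state, a)])

Under this reading, the same object can map to different actions in different subtasks, which is exactly the property the model is intended to capture.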
4. Navigation Example

Robot navigation is a typical example where a whole task can be decomposed into several subtasks, and the robot should adjust its optimal strategy when detecting an obstacle. In this work, we use the robot navigation example to explain our affordance model. The navigation environment is shown in Figure 2: the thick black lines represent walls, and they divide the eight-by-eight maze into two rooms (A, B). Two neighbouring grids are reachable if there is no wall between them. There are four candidate trigger grids (T1, T2, T3, T4) in room A and four candidate goal grids (G1, G2, G3, G4) in room B; the start place is a random grid in room A. A trigger means that when the robot arrives at that grid, the two doors both open immediately, as shown in Figure 3. Obstacles appear dynamically and randomly; some can be rolled away while the others cannot. The robot's task is to first navigate from the start grid to a trigger to make the doors open, then pass a door, and finally reach the goal, all following the shortest routine.

Figure 2. Initial environment

The robot has four primitive actions, North, South, West and East, and they are always executable. The task can be decomposed into three successive subtasks, GotoTrigger, GotoDoor and GotoGoal, which are all realized through the primitive actions. The task graph is illustrated in Figure 4, where t represents the target grid of the current subtask. Here, the goal of subtask GotoDoor is to reach grid D1 or D2.

Figure 3. The two doors are open
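For reference, the hierarchy just described (and drawn in Figure 4) can be written down as a small nested structure. This is only an illustrative sketch of the decomposition, with names taken from the figure; it is not code from the paper.

    PRIMITIVES = ["North", "South", "West", "East"]

    TASK_GRAPH = {
        "Root": ["GotoTrigger", "GotoDoor", "GotoGoal"],  # three successive subtasks
        "GotoTrigger": ["Navigate(t)"],   # t is the chosen trigger grid, T1..T4
        "GotoDoor": ["Navigate(t)"],      # t is D1 or D2
        "GotoGoal": ["Navigate(t)"],      # t is the chosen goal grid, G1..G4
        "Navigate(t)": PRIMITIVES,        # realized through the primitive actions
    }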
The navigation process is a Markov decision process (MDP). When the robot detects an obstacle, it should replan the best routine, and the affordance represents the current action, which may vary depending on the subtask and robot state. Moreover, the robot does not necessarily touch the obstacle when executing its affordance; for example, the robot may need to avoid it. As a result, the existing affordance models could not fulfil this mission; however, ours could work well.

Figure 4. Task graph of the robot

5. The Learning Framework

HRL might be better suited to learning the task graph, as it is more biologically plausible. Among the existing HRL methods, MAXQ is notable because it can learn the value functions of all subtasks simultaneously: there is no need to wait for the value function of subtask j to converge before learning the value function of its parent task i. Furthermore, a state abstraction mechanism can be applied to reduce the state space of the value functions [30-32]. As a result, we choose MAXQ as the learning method for our affordance model.

5.1 Value Function in Task Graph

Generally, the MAXQ method decomposes an MDP M into a set of subtasks {M0, M1, ..., Mn}, where M0 is the root task and solving it solves the entire task. The hierarchical policy π is learned for M, with π = {π0, ..., πn}; each subtask Mi is an MDP and has its own policy πi.

The value function Q(i, s, a) is decomposed into the sum of two components. The first is the expected total reward received while executing a, which is denoted by V(a, s). The second is the completion function C(i, s, a), which describes the expected cumulative discounted reward of completing subtask Mi after invoking the subroutine for subtask Ma in state s. In MAXQ, a is a subtask or a primitive action.
The optimal value function V(i, s) represents the cumulative reward of doing subtask i in state s, and it can be described as in (2). In this formula, P(s' | s, i) is the probability of a transition from state s to the resulting state s' when primitive action i is performed, and R(s' | s, i) is the reward received when primitive action i is performed and the state changes from s to s'.

V(i, s) = max_a Q(i, s, a)                       if i is a subtask
V(i, s) = Σ_{s'} P(s' | s, i) R(s' | s, i)       if i is a primitive action      (2)

The relationship between the functions Q, V and C is:

Q(i, s, a) = V(a, s) + C(i, s, a)      (3)

The value function for the root, V(0, s), is decomposed recursively into a set of value functions, as illustrated in equation (4):

V(0, s) = V(am, s) + C(am-1, s, am) + ... + C(a1, s, a2) + C(0, s, a1)      (4)

In this manner, learning the value function of a task is replaced by learning a number of completion functions and primitive action values. Now we take the first subtask, GotoTrigger, as an example to explain the relationship between the V and C values. If the robot is in grid s and should navigate to s3, as shown in Figure 5, the value of this subtask is computed as follows:

V(GotoTrigger, s)
  = V(South, s) + C(GotoTrigger, s, South)
  = (-1) + V(GotoTrigger, s1)
  = (-1) + V(East, s1) + C(GotoTrigger, s1, East)
  = (-1) + (-1) + V(GotoTrigger, s2)
  = (-2) + V(South, s2) + C(GotoTrigger, s2, South)
  = (-2) + (-1) + 0
  = -3

This process can also be represented in a tree structure, as in Figure 6; the values of each C and V are shown on top of them. The reward from s to s3 is -3, i.e., three steps are needed.
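To make the arithmetic above easy to check, here is a tiny Python sketch (our own illustration with a hypothetical route table, not code from the paper) that unrolls the decomposition of equations (3) and (4) along the three-step route of Figure 5 and recovers the value of -3.

    # Hypothetical three-step route s -> s1 -> s2 -> s3 from Figure 5, with a
    # reward of -1 per primitive step and the optimal sub-strategy shown above.
    ROUTE = {"s": ("South", "s1"), "s1": ("East", "s2"), "s2": ("South", "s3")}

    def v(subtask, state):
        # Each step contributes V(action, state) = -1 plus the completion value
        # C(subtask, state, action), which along the optimal route equals
        # V(subtask, next_state); this is eq. (3) applied recursively as in (4).
        if state == "s3":                      # terminal grid of GotoTrigger here
            return 0
        action, next_state = ROUTE[state]
        return -1 + v(subtask, next_state)

    assert v("GotoTrigger", "s") == -3         # matches the derivation above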
Figure 5. A sample route for subtask GotoTrigger

Figure 6. Value function decomposition

5.2 Learning Algorithm

The learning algorithm is illustrated in Table 2. αt(i) is the learning rate, which can be gradually decreased because the update speed should become slower in later stages; γ is the discount factor, with 0 < αt(i) < 1 and 0 < γ ≤ 1.

    Function MAXQ(subtask i, start_state s) {
      if i is a primitive node                      // leaf node
        execute i, receive r, and observe the result state s'
        v_{t+1}(i, s) = (1 - αt(i)) * vt(i, s) + αt(i) * rt
        return 1                                    // one primitive step taken
      else
        let count = 0
        while s is not the terminal state of subtask i
          choose an action a according to π(i, s)
          let N = MAXQ(a, s)                        // recursive call
          observe the result state s'
          c_{t+1}(i, s, a) = (1 - αt(i)) * ct(i, s, a) + αt(i) * γ^N * vt(i, s')
          count = count + N
          s = s'
        end                                         // while
        for all states s in subtask i
          vt(i, s) = max_a [ vt(a, s) + ct(i, s, a) ]
        end                                         // for
        return count
      end                                           // if
    }

    // Main program
    Initialize all v(i, s) and c(i, s, j) arbitrarily
    MAXQ(subtask 0, start_state s0)

Table 2. Algorithm to learn the task graph of our affordance model
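Table 2 is pseudocode; below is one way it could be transcribed into runnable Python, under our own assumptions about the environment interface (children, is_primitive, is_terminal, execute) and with an epsilon-greedy stand-in for the policy π(i, s). It is an illustrative sketch, not the authors' C++ implementation, and it folds the final recomputation of v into the value vt(i, s') used by the completion update.

    import random
    from collections import defaultdict

    v = defaultdict(float)          # v[(i, s)]: value of executing i in s
    c = defaultdict(float)          # c[(i, s, a)]: completion value
    ALPHA, GAMMA, EPS = 0.9, 1.0, 0.1

    def q(i, s, a):
        return v[(a, s)] + c[(i, s, a)]            # eq. (3)

    def choose(i, s, env):
        acts = env.children(i)
        if random.random() < EPS:                  # exploratory choice
            return random.choice(acts)
        return max(acts, key=lambda a: q(i, s, a))

    def maxq(i, s, env):
        """Run subtask i from state s; return (primitive steps taken, final state)."""
        if env.is_primitive(i):
            r, s2 = env.execute(i, s)
            v[(i, s)] = (1 - ALPHA) * v[(i, s)] + ALPHA * r
            return 1, s2
        count = 0
        while not env.is_terminal(i, s):
            a = choose(i, s, env)
            n, s2 = maxq(a, s, env)                # recursive call
            v_i_s2 = max(q(i, s2, a2) for a2 in env.children(i))   # vt(i, s')
            c[(i, s, a)] = (1 - ALPHA) * c[(i, s, a)] + ALPHA * (GAMMA ** n) * v_i_s2
            count += n
            s = s2
        return count, s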
5.3 State Abstraction in Task Graph

Based on flat Q-learning, which is the standard Q-learning algorithm without subtasks, there are 64 possible states for the robot, 4 candidate trigger grids, 4 candidate goal grids and 4 executable actions; thus, we need 64×4×4×4 = 4096 states to represent the value functions.

GotoDoor(D1) and GotoDoor(D2) are different subtasks because they have different goals, so the number of subtasks is 4: GotoTrigger, GotoDoor(D1), GotoDoor(D2) and GotoGoal. With subtasks but without state abstraction, a state variable contains the robot state (64), trigger position (4), target position (4), current action (4) and subtask number (4), and the number of states is 64×4×4×4×4 = 16384. Hence, we can see that without state abstraction, the subtask representation requires four times the memory of a flat Q table!

In our work, two kinds of state abstraction are applied [30]: one is "Subtask Irrelevance" and the other is "Leaf Irrelevance", which are described briefly in the following subsections. In order to explain this clearly, we draw a new task graph in Figure 7, where the state is determined by the current action, subtask number and target. The completion functions are stored in the third level. The robot's movement, with or without obstacles to avoid, is realized in terms of the four primitive actions, and the execution of the third level is ultimately transformed into the fourth level. Take Ni(t) for example (the same rule applies to Si(t), Wi(t) and Ei(t)): N represents "North", i is the subtask number, and t is the target grid of this subtask.

Figure 7. Task decomposition graph of our example

5.3.1 Subtask Irrelevance

Let Mi be a subtask of MDP M. A set of state variables Y is irrelevant to subtask i if the state variables of M can be partitioned into two sets X and Y such that, for any stationary abstract hierarchical policy π executed by the descendants of Mi, the following two properties hold: (a) the state transition probability distribution P^π(s', N | s, j) for each child action j of Mi can be factored into the product of two distributions:

P^π(x', y', N | x, y, j) = P^π(x', N | x, j) · P^π(y' | x, y, j)      (5)

where x and x' give values for the variables in X, and y and y' give values for the variables in Y; (b) for any pair of states s1 = (x, y1) and s2 = (x, y2), and any child action j, we have:

V^π(j, s1) = V^π(j, s2)      (6)

In our example, the doors and the final goal are irrelevant to the subtask GotoTrigger: only the current robot position and the trigger point are relevant.

Take N1(t) in subtask GotoTrigger for example: there are 32 possible positions for the robot, because its working space is an eight-by-four room, and four candidate goals for the current subtask. As a result, 32×4 = 128 states are needed to represent N1(t), and the same holds for S1(t), W1(t) and E1(t), so 512 values are required for this subtask. For subtask GotoDoor, there are 32 grids and two candidate goals in room A, so 32×2 = 64 states are required to represent each of N2(t), S2(t), W2(t) and E2(t), 256 states in total; under state abstraction, GotoDoor(D1) and GotoDoor(D2) have the same state space and can be treated as a single subtask GotoDoor. For subtask GotoGoal, there are 32 grids and four candidate goals in room B, so 32×4 = 128 states are required to represent each of N3(t), S3(t), W3(t) and E3(t), 512 states in total. All of these states are for the completion functions in the third level of Figure 7, and the total number is 512+256+512 = 1280.

5.3.2 Primitive Action Irrelevance

A set of state variables Y is irrelevant for a primitive action a if, for any pair of states s1 and s2 that differ only in their values for the variables in Y, (7) holds:

Σ_{s1'} P(s1' | s1, a) R(s1' | s1, a) = Σ_{s2'} P(s2' | s2, a) R(s2' | s2, a)      (7)

In our example, this condition is satisfied by the primitive actions North, South, West and East, because the reward is constant; then only one state is required for each action. As a result, four abstract states are needed for the fourth level, and the total state space of this task graph is 1280+4 = 1284: far fewer than 4096, and the storage space is reduced. The essence of this abstraction is that only the information relevant to a state is considered. With state abstraction, the learning problem still converges [30].
6. Experimental Validation

We test the navigation example in our own simulation environment, which is built in C++, as shown in Figure 8 and Figure 9. The physics engine is Open Dynamics Engine (ODE) [33], and the rendering engine is Irrlicht, an open source, high-performance, real-time 3D engine written in C++ [34]. The floor is painted in blue or white, and any two adjacent grids are in different colours. The robot's abilities include a camera and the four primitive directional actions, North, South, West and East, and each action is deterministic. The camera captures the front scene when the robot reaches the centre of a grid, and we can obtain the R (red), G (green) and B (blue) values of each pixel in the picture.

Figure 8. Simulation environment with robot and obstacles

Figure 9. Simulation environment with the two doors open

Because an object's affordance changes according to the current subtask it is involved in, the object's characteristics and the subtask strategy should be learned first. As a result, this experiment contains three parts: (i) learning the obstacles' rollable affordances in a static environment; (ii) subtask learning without obstacles; (iii) the testing process, which involves affordance calculation in a dynamic environment.

The robot and obstacles are shown in Figure 10, and the obstacles differ in shape, colour and size. The shapes include a cube and a sphere, while the sizes include small, middle and large.

Figure 10. Robot and obstacles: (a) Robot, (b) Cube, (c) Sphere

6.1 Affordances in a Static Environment

This subsection discusses the obstacles' rollable affordances, learned in a goal-free manner in a static environment, because they impact the traversability of the preplanned routine. As this experiment is carried out in a simulation environment, we restrict the size of the obstacles to a certain scope and assume that a sphere is rollable while a cube is unrollable. As a result, the affordance in the static environment can be described in terms of this rollable character.

Figure 11. Obstacle detection: (a) Cube, (b) Sphere

For (a) and (b), the left picture is what the robot captures with its own camera, and the right picture, in blue, is the detected shape. This rollable character is the basis for calculating an obstacle's affordance in a dynamic environment, and shape is the critical feature.

6.2 Subtask Learning

This subsection is about learning the optimal strategy without obstacles, which is also the basis of the affordance calculations in a dynamic environment. We define the grid place as the robot's state; the execution of an action lasts from one grid's centre to the next one's centre. The reward of any primitive action is -1, but the robot remains in the same place if it hits the wall or a cube. At any time, each grid can contain one obstacle at the most, and it is assumed that each obstacle is created at the centre of a grid. The subtask graph, learning algorithm and state abstraction mechanism have been described in sections 4 and 5.

For each state and its current subtask, there is a value(j, s) to represent the total reward of subtask j starting from state s. The policy executed during learning is a GLIE (Greedy in the Limit with Infinite Exploration) policy, which has three rules: it executes each action in every state infinitely often; it converges with probability 1 to a greedy policy; and the recursively optimal policy is unique [32].

In the learning process, the four triggers and goals are chosen randomly, and the two doors can both be traversed when they are open. We have executed all 16 pairs of (Trigger_ID, Goal_ID); for simplicity, we take (Trigger=T4, Goal=G2) as the example pair to illustrate the learning and testing results in the following part. The learning rate α and the discount factor γ are initialized at the beginning, with α = 0.9; for every 40 episodes, the learning rate is discounted by 0.9.

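As a final illustration of the learning setup described above, the exploration policy and the learning-rate schedule could be sketched as follows. The epsilon schedule is our own assumption (the text only requires the three GLIE properties), while the alpha schedule follows the stated rule of multiplying by 0.9 every 40 episodes.

    import random

    def glie_action(actions, q_of, episode):
        # Any schedule that keeps every action tried infinitely often while the
        # policy becomes greedy in the limit satisfies the GLIE requirements.
        eps = 1.0 / (1 + episode)              # assumed schedule, decays to zero
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=q_of)          # otherwise act greedily

    def learning_rate(episode, alpha0=0.9):
        # Stated schedule: alpha starts at 0.9 and is multiplied by 0.9
        # every 40 episodes.
        return alpha0 * (0.9 ** (episode // 40))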