… (Burke, 2000). Most of these recommenders employ some kind of knowledge-based decision rules for recommendation. This type of recommendation is heavily dependent on knowledge engineering by system designers to construct a rule base in accordance with the specific characteristics of the domain. While the user profiles are generally obtained through explicit interactions with users, there have also been some attempts at exploiting machine learning techniques for automatically deriving decision rules that can be used for personalization, e.g. (Pazzani, 1999).

In content-based filtering systems, the user profile represents a content model of items in which that user has previously shown interest (Pazzani & Billsus, 2007). These systems are rooted in information retrieval and information filtering research. The content model for an item is represented by a set of features or attributes characterizing that item. Recommendation generation usually consists of comparing the features extracted from new items with the content model in the user profile and recommending items that have adequate similarity to the user profile.

Collaborative techniques (Resnick & Varian, 1997; Herlocker et al., 2000) are the most successful and the most widely used techniques in recommender systems, e.g. (Deshpande & Karypis, 2004; Konstan et al., 1998; Wasfi, 1999). In the simplest form of this class of systems, users are requested to rate the items they know, and the target user is then recommended the items that people with similar tastes have liked in the past.

Recently, Web mining, and especially web usage mining techniques, have been used widely in web recommender systems (Cooley et al., 1999; Fu et al., 2000; Mobasher et al., 2000a; Mobasher et al., 2000b). A common approach in these systems is to extract navigational patterns from usage data with data mining techniques such as association rules and clustering, and to make recommendations based on the extracted patterns. These approaches differ fundamentally from our method, in which no static pattern is extracted from the data. More recently, systems that take advantage of a combination of content, usage and even structural information of websites have been introduced and have shown superior results on the web page recommendation problem (Li & Zaiane, 2004; Mobasher et al., 2000b; Nakagawa & Mobasher, 2003). In (Nakagawa & Mobasher, 2003) the degree of connectivity based on the link structure of the website is used to choose among different usage-based recommendation techniques, showing that sequential and non-sequential techniques can each achieve better results on web pages with different degrees of connectivity. A new method for generating navigation models is presented in (Li & Zaiane, 2004), which exploits the usage, content and structure data of the website. This method introduces the concept of users' missions to represent users' concurrent information needs. These missions are identified by finding content-coherent pages that the user has visited. Website structure is also used both for enhancing the content-based mission identification and for ranking the pages in recommendation lists. In another approach (Eirinaki et al., 2004, 2003) the content of web pages is used to augment usage profiles with semantics, using a domain ontology and then performing data mining on the augmented profiles.
Most recently, concept hierarchies were incorporated in a novel recommendation method based on web usage mining and optimal sequence alignment to find similarities between user sessions in (Bose et al., 2007).

Markov Decision Process and Reinforcement Learning

Reinforcement learning (Sutton & Barto, 1998) is primarily known in machine learning research as a framework in which agents learn to choose the optimal action in each situation or state they are in. The agent is in a specific state s; in each step it performs some action and transitions to another state. After each transition the agent receives a reward. The goal of the agent is to learn which actions to perform in each state to receive the greatest cumulative reward on its path to the goal states. The set of actions chosen in each state is called the agent's policy. One variation of this method is Q-Learning, in which the agent does not compute explicit values for each state and instead computes a value function Q(s, a), which indicates the value of performing action a in state s (Sutton & Barto, 1998; Mitchell, 1997). Formally, the value of Q(s, a) is the discounted sum of future rewards that will be obtained by doing action a in s and subsequently choosing optimal actions. In order to solve the problem with Q-Learning we need to define appropriate states and actions, choose a reward function that suits the problem, and devise a procedure to train the system using the web logs available to us.

The learning process of the agent can be formalized as a Markov Decision Process (MDP). The MDP model of the problem includes:

1. A set of states S, which represents the different 'situations' that the agent can observe. Basically, a state s in S must define what is important for the agent to know in order to take a good action. For a given problem, the complete set of states is called the state space.
2. A set of possible actions A that the agent can perform in a given state s (s ∈ S) and that will produce a transition into a next state s' ∈ S. As mentioned, the selection of the particular action depends on the policy of the agent. We formally define the policy as a function that indicates, for each state s, the action a ∈ A taken by the agent in that state. In general, it is assumed that the environment with which the agent interacts is non-deterministic, i.e., after executing an action the agent can transition into many alternative states.
3. A reward function rew(s, a), which assigns a scalar value, also known as the immediate reward, to the performance of each action a ∈ A taken in state s ∈ S. For instance, if the agent takes an action that is satisfactory for the user, then the agent should be rewarded with a positive immediate reward. On the other hand, if the action is unsatisfactory, the agent should be punished through a negative reward. The agent cannot know the reward function exactly, because the reward is assigned to it through the environment. This function can play a very important role in an MDP problem.
4. A transition function T(s, a, s'), which gives the probability of making a transition from state s to state s' when the agent performs the action a. This function completely describes the non-deterministic nature of the agent's environment. Explicit use of this function can be absent in some versions of Q-Learning.
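To make the Q-Learning machinery above concrete, the following is a minimal, generic sketch of tabular Q-Learning over these MDP elements. It is not taken from the chapter; the environment interface (reset, actions, step) and the parameter values are illustrative assumptions.

# Minimal tabular Q-Learning sketch over an assumed environment interface.
import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    """env is assumed to expose reset() -> state, actions(state) -> list of
    actions, and step(state, action) -> (next_state, reward, done)."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated value

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            acts = env.actions(s)
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(s, a)
            # one-step temporal-difference update toward r + gamma * max Q(s', .)
            best_next = max((Q[(s_next, a2)] for a2 in env.actions(s_next)), default=0.0)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q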
Reinforcement Learning in Recommender Systems

Reinforcement Learning (RL) has previously been used for recommendations in several applications. Web Watcher (Joachims et al., 1997) exploits Q-Learning to guide users to their desired pages. Pages correspond to states and hyperlinks to actions; rewards are computed based on the similarity of the page content and the user profile keywords. There are fundamental differences between Web Watcher and our approach, two of the most significant being: (a) our approach requires no explicit user interest profile in any form, and (b) unlike our method, Web Watcher makes no use of previous usage-based data. In most other systems, reinforcement learning is used to reflect user feedback and update the current state of recommendations. A general framework is presented in (Golovin & Rahm, 2004), which consists of a database of recommendations generated by various models and a learning module that updates the weight of each recommendation based on user feedback. In (Srivihok & Sukonmanee, 2005) a travel recommendation agent is introduced which considers various attributes for trips and customers, computes each trip's value with a linear function and updates the function coefficients after receiving each user's feedback. RL is used for information filtering in (Zhang & Seo, 2001), which maintains a profile for each user containing keywords of interest and updates each word's weight according to the implicit and explicit feedback received from the user. In (Shany et al., 2005) the recommendation problem is modeled as an MDP. The system's states correspond to the user's previous purchases, rewards are based on the profit achieved by selling the items, and the recommendations are made using MDP theory and a novel state-transition function. In a more recent work (Mahmood & Ricci, 2007) RL is used in the context of a conversational travel recommender system in order to learn optimal interaction strategies. They model the problem with a finite state space based on variables like the interaction stage, the user action and the result size of a query. The set of actions represents what the system chooses to perform in each state, e.g., executing a query or suggesting a modification. Finally, RL is used to learn an optimal strategy based on a user behavior model. To the best of our knowledge our method differs from previous work, as none of these systems used reinforcement learning to train a system to make website recommendations merely from web usage data.

REINFORCEMENT LEARNING FOR USAGE-BASED WEB PAGE RECOMMENDATION

The specific problem our system is supposed to solve can be summarized as follows: the system has, as input data, the log files of users' past visits to the website. These log files are assumed to be in any standard log format, containing records each with a user ID, the sequence of pages the user visited during a session, and typically the time of each page request. A user session is defined as a sequence of temporally compact accesses by a user. Since web servers do not typically log usernames, sessions are taken to be accesses from the same IP address that satisfy certain constraints, e.g. the duration of time elapsed between any two consecutive accesses in the session is within a pre-specified threshold (Cooley et al., 1999).
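As an illustration of the sessionization constraint just described, the sketch below groups log records into sessions by IP address and a maximum gap between consecutive requests. The record format and the 30-minute threshold are assumptions, not values prescribed by the chapter.

# Rough sessionization sketch: split each IP's requests into sessions
# whenever the gap between consecutive requests exceeds a threshold.
from collections import defaultdict

def sessionize(records, max_gap_seconds=30 * 60):
    """records: iterable of (ip, timestamp, url), timestamp in seconds."""
    by_ip = defaultdict(list)
    for ip, ts, url in records:
        by_ip[ip].append((ts, url))

    sessions = []
    for ip, visits in by_ip.items():
        visits.sort()                       # order each user's requests by time
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > max_gap_seconds:
                sessions.append(current)    # gap too large: close the session
                current = [url]
            else:
                current.append(url)
        sessions.append(current)
    return sessions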
A user enters our website and begins requesting web pages, like a typical browser, mostly by following the hyperlinks on web pages. Considering the pages this user has requested so far, the system has to predict which other pages the user is probably interested in and recommend them to her. Table 1 illustrates a sample scenario. Predictions are considered successful if the user chooses to visit those pages in the remainder of that session, e.g. page c recommended in the first step in Table 1. Obviously the goal of the system is to make the most successful recommendations.

Modeling Recommendations as a Q-Learning Problem Using the Analogy of a Game

In order to better present our approach to the problem, we use the notion of a game. In a typical scenario a web user visits pages sequentially from a website; let's say the sequence a user u requested is composed of pages a, b, c and d. Each page the user requests can be considered a step or move in our game. After each step the user takes, it is the system's turn to make a move. The system's purpose is to predict the user's next move(s) with the knowledge of his previous moves. Whenever the user makes a move (requests a page), if the system has previously predicted the move, it receives positive points; otherwise it receives none or negative points. For example, predicting a visit to page d after the user has viewed pages a and b in the above example yields positive points for the system. The ultimate goal of the system is to gather as many points as possible during a game, or actually during a user's visit to the website.

Some important issues can be inferred from this simple analogy. First of all, we can see the problem certainly has a stochastic nature: as in most games, the next state cannot be computed deterministically from the current state and the action the system performs, because the user can choose from a great number of moves. This must be considered in our learning algorithm and our update rules for Q values. The second issue is what the system's actions should be, as they are what we ultimately expect the system to perform. Actions will be the prediction or recommendation of web pages by the system in each state. Regarding the information each state must contain, considering our definition of actions, we can deduce that each state should at least capture the history of pages visited by the user so far. This way we have the least information needed to make the recommendations. This analogy also determines the basics of the reward function. In its simplest form, it should reward an action positively if it recommends a page that will be visited in one of the subsequent states, not necessarily the immediate next state. Of course, this would be an oversimplification, and in practice the reward depends on various factors described in the coming sections. One last issue worth noting about the analogy is that this game cannot be categorized as a typical 2-player game in which opponents try to defeat each other, as in this game the user clearly has no intention to mislead the system and prevent it from gathering points. It might be more suitable to consider the problem as a competition among different recommender systems to gather more points rather than a 2-player game. Because of this intrinsic difference, we cannot use self-play, a typical technique used in training RL systems (Sutton & Barto, 1998), to train our system, and we need actual web usage data for training.
Modeling States and Actions

Considering the above observations, we begin the definitions. We tend to keep our states as simple as possible, at least in order to keep their number manageable. Regarding the states, we can see that keeping only the user trail can be insufficient. With that definition it would not be possible to reflect the effect of an action a performed in state s_i in any subsequent state s_{i+n} where n > 1. This means the system would only learn actions that predict the immediate next page, which is not the purpose of our system. Another issue we should take into account is the number of possible states: if we allow the states to contain any given sequence of page visits, we are potentially faced with an infinite number of states. What we chose to do was to limit the page visit sequences to a constant length. For this purpose we adopted the notion of N-Grams, which is commonly applied in similar personalization systems based on web usage mining (Mobasher et al., 2000a; Mobasher et al., 2000b). In this model we put a sliding window of size w on the user's page visits, resulting in states containing only the last w pages requested by the user. The assumption behind this model is that knowing only the last w page visits of the user gives us enough information to predict his future page requests. The same problem arises when considering the sequence of recommended pages in the states, for which we take the same approach of considering the last w' recommendations.

Table 1. A sample user session and system recommendations
Visited Page:       a    b     c      d       e        f
Navigation Trail:   a    ab    abc    abcd    abcde    abcdef
System Prediction:  c    d     e      s       f        h

Regarding the actions, we chose simplicity. Each action is a single page recommendation in each state. Considering multiple-page recommendations might have shown us the effect of the combination of recommended pages on the user, at the expense of making our state space and reward policy much more complicated. Thus, we consider each state s at time t to consist of two sequences V_s and R_s, indicating the sequences of visited and previously recommended pages respectively:

V_s = ⟨v_{t-w+1}, v_{t-w+2}, ..., v_t⟩,   R_s = ⟨r_{t-w'+1}, r_{t-w'+2}, ..., r_t⟩   (1)

where v_{t-w+i} indicates the i-th visited page in the state and r_{t-w'+i} indicates the i-th recommended page in state s. The corresponding states and actions of the user session of Table 1 are presented in Figure 1, where straight arrows represent the actions performed in each state and the dashed arrows represent the reward received for performing each action.

Figure 1. States and actions in the recommendation problem
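A minimal sketch of how the state representation in Equation (1) might be encoded is given below. The class and method names, and the use of fixed-length deques for the V and R windows, are our own illustrative choices rather than the chapter's implementation.

# State sketch: keep the last w visited pages (V_s) and the last w'
# recommended pages (R_s); the hashable key can index a Q table.
from collections import deque

class State:
    def __init__(self, w, w_rec):
        self.visited = deque(maxlen=w)          # V_s: last w page visits
        self.recommended = deque(maxlen=w_rec)  # R_s: last w' recommendations

    def key(self):
        # hashable identifier usable as a key in the Q table
        return (tuple(self.visited), tuple(self.recommended))

    def advance(self, next_page, recommendation):
        """Return the successor state after the system recommends a single
        page (the action) and the user then visits next_page."""
        nxt = State(self.visited.maxlen, self.recommended.maxlen)
        nxt.visited.extend(self.visited)
        nxt.recommended.extend(self.recommended)
        nxt.visited.append(next_page)
        nxt.recommended.append(recommendation)
        return nxt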
Choosing a Reward Function

The basis of reinforcement learning lies in the rewards the agent receives and how it updates state and action values. As with most stochastic environments, we should reward the actions performed in each state with respect to the consequent state resulting both from the agent's action and from other factors in the environment over which we might not have control. These consequent states are sometimes called the after-states (Sutton & Barto, 1998). Here this factor is the page the user actually chooses to visit. We certainly do not have a predetermined reward function rew(s, a) or even a state transition function δ(s, a) that gives us the next state according to the current state s and the performed action a. It can be inferred that the rewards depend on the after-state, and more specifically on the intersection of the previously recommended pages in each state and the current page sequence of the state. The reward for each action is thus a function of V_{s'} and R_{s'}, where s' is our next state. One tricky issue worth considering is that, though tempting, we should not base our rewards on |V_{s'} ∩ R_{s'}|, since it would give extra credit for a single correct move. Considering the above example, a recommendation of page b in the first state should be rewarded only in the transition to the second state, where the user goes to page b, although it will also be present in our recommendation list in the third state. To avoid this, we simply consider only the occurrence of the last visited page in state s' in the recommended pages list to reward the action performed in the previous state s. To complete our rewarding procedure we take into account common metrics used in web page recommender systems. One issue is considering when the page was predicted by the system and when the user actually visited the page. According to the goal of the system, this may influence our rewarding. If we consider shortening user navigation as a sign of successful guidance of the user to his required information, as is the most common case in recommender systems (Li & Zaiane, 2004; Mobasher et al., 2000a), we should give a greater reward to pages predicted sooner in the user's navigation path, and vice versa. Another factor commonly considered in these systems (Mobasher et al., 2000a; Liu et al., 2004; Fu et al., 2000) is the time the user spends on a page, assuming that the more time the user spends on a page, the more interested he probably has been in that page. Taking this into account, we should reward a successful page recommendation in accordance with the time the user spends on the page. The rewarding can be summarized as follows:

Algorithm 1. Usage-Based Reward Function
1: Assume δ(s, a) = s'
2: K_{s'} = V_{s',w} ∩ R_{s'} = {v_{t+1}} ∩ R_{s'}
3: If K_{s'} ≠ Ø
4:   For each page k in K_{s'}
5:     rew(s, a) += UBR(Dist(R_{s'}, k), Time(v_{t+1}))
6:   End For
7: End If

In line 1, δ(s, a) = s' denotes the transition of the system to the next state s' after performing a in state s. K_{s'} represents the set of correct recommendations in each step, and rew(s, a) is the reward of performing action a in state s. Dist(R_{s'}, k) is the distance of page k from the end of the recommended pages list in state s', and Time(v_{t+1}) indicates the time the user has spent on the last page of the state. Here, UBR is the Usage-Based Reward function, combining these values to calculate the reward rew(s, a). We chose a simple linear combination of these values, as in Equation (2):

UBR(Dist, Time) = α × Dist + β × Time   (2)

where α + β = 1 and both α and β include a normalizing factor according to the maximum values Dist and Time can take.
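The usage-based reward of Algorithm 1 and Equation (2) could be sketched as follows. The normalization constants and the exact convention for Dist are assumptions, since the chapter only states that both terms are normalized by their maximum values.

# Sketch of the usage-based reward: reward an action only if the page the
# user just visited appears in the previous recommendations (K_s' non-empty),
# combining recommendation distance and viewing time linearly as in Eq. (2).
def usage_based_reward(next_state, time_on_last_page,
                       alpha=0.5, beta=0.5, max_dist=10, max_time=600.0):
    """next_state.recommended plays the role of R_s' and
    next_state.visited[-1] the role of v_{t+1}; constants are assumed."""
    last_visited = next_state.visited[-1]
    recs = list(next_state.recommended)
    if last_visited not in recs:          # K_s' is empty: no reward
        return 0.0
    # distance of the page from the end of R_s' (larger = recommended earlier);
    # the exact distance convention is an assumption
    dist = len(recs) - 1 - recs.index(last_visited)
    # linear combination of normalized distance and viewing time, Eq. (2)
    return (alpha * dist / max_dist) + (beta * min(time_on_last_page, max_time) / max_time)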
The last modification we experimented with was changing our reward function. We noticed that by putting a sliding window on the sequence of previously recommended pages, we had practically limited the effect of each action to the next w' states, as can be seen in Figure 2. As the example in this figure shows, a correct recommendation of page f in state s_i will not be rewarded in state s_{i+3} when using a window of size 2 on the R sequence (w' = 2). After training the system using this definition, the system was mostly successful in recommending pages visited around w' steps ahead. Although this might be quite acceptable when choosing an appropriate value for w', it tends to limit the system's prediction ability, as large values of w' make our state space enormous. To overcome this problem, we devised a rather simple modification of our reward function: what we needed was to reward the recommendation of a page if it is likely to be visited an unknown number of states ahead. Fortunately, our definition of states and actions gives us just the information we need, and this information is stored in the Q values of each state. The basic idea is that when an action/recommendation is appropriate in state s_i, indicating the recommended page is likely to occur in the following states, it should also be considered appropriate in state s_{i-1} and for the actions in that state that frequently lead to s_i. Following this recursive procedure we can propagate the value of performing a specific action beyond the limits imposed by w'. This change is easily reflected in our learning system by considering the value of Q(s', a) in the computation of rew(s, a), with a coefficient like γ. It should be taken into account that the effect of this modification on our reward function must certainly be limited: in its most extreme case, where we only take this next Q value into account, we are practically encouraging the recommendation of pages that tend to occur mostly at the end of user sessions.

Having put all the pieces of the model together, we can get an initial idea of why reinforcement learning might be a good candidate for the recommendation problem. It does not rely on any previous assumptions regarding the probability distribution of visiting a page after having visited a sequence of pages, which makes it general enough for diverse usage patterns, as this distribution can take different shapes for different sequences. The nature of this problem matches perfectly with the notion of delayed reward, or what is commonly known as temporal difference: the value of performing an action/recommendation might not be revealed to us in the immediate next state, and a sequence of actions might have led to a successful recommendation for which we must credit rewards. What the system learns is directly what it should perform; though it is possible to extract rules from the learned policy model, its decisions are not based on rules or patterns explicitly extracted from the data. One issue commonly faced in systems based on patterns extracted from training data is the need to periodically update these patterns in order to make sure they still reflect the trends in user behavior and the changes in the site structure or content. With reinforcement learning the system is intrinsically learning even while operating in the real world, as the recommendations are the actions the system performs, and it is commonplace for the learning procedure to take place during the interaction of the system with its environment.
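Returning to the reward-propagation modification described earlier in this section (folding a discounted share of the next state's Q value into the reward so that a recommendation useful beyond the w'-state horizon still earns credit), a minimal sketch is shown below. The Q-table layout and the value of the coefficient are our own assumptions.

# Sketch of reward propagation: augment the base usage-based reward for
# (s, a) with a discounted share of Q(s', a).
def propagated_reward(base_reward, Q, next_state_key, action, gamma_prop=0.3):
    # keep gamma_prop well below 1 so the propagated term does not dominate;
    # otherwise pages occurring mostly at the end of sessions get over-recommended
    return base_reward + gamma_prop * Q.get((next_state_key, action), 0.0)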
Training the System

We chose Q-Learning as our learning algorithm. This method is primarily concerned with estimating an evaluation of performing specific actions in each state, known as Q-values. Each Q(s, a) indicates an estimate of the cumulative reward achievable by performing action a in state s and then performing the action a' with the highest Q(s', a') in each future state s'. In this setting we are not concerned with evaluating each state in the sense of the cumulative rewards reachable from that state, which, with respect to our system's goal, can be useful only if we can estimate the probability of visiting the following states when performing each action. Q-Learning, on the other hand, provides us with a structure that can be used directly in the recommendation problem, as the recommendations are in fact the actions, and the value of each recommendation/action is an estimate of how successful that prediction can be. Another decision is the update rule for Q values. Because of the non-deterministic nature of this problem we use the following update rule (Sutton & Barto, 1998):

Q_n(s, a) = (1 - α_n) Q_{n-1}(s, a) + α_n [rew(s, a) + γ max_{a'} Q_{n-1}(δ(s, a), a')]   (3)

with

α_n = 1 / (1 + visits_n(s, a))   (4)

where Q_n(s, a) is the Q-value of performing a in state s after n iterations, and visits_n(s, a) indicates the total number of times this state-action pair, i.e. (s, a), has been visited up to and including the n-th iteration. This rule takes into account the fact that doing the same action can yield different rewards each time it is performed in the same state. The decreasing value of α_n causes these values to gradually converge and decreases the impact of changing reward values as the training continues.

Figure 2. An example of limited action effectiveness due to the size of the recommendation window

What remains about the training phase is how we actually train the system using the available web usage logs. As mentioned before, these logs consist of previous user sessions on the website. Considering the game analogy, they can be seen as a set of the opponent's previous games and the moves he tends to make. We are actually provided with a set of actual episodes that occurred in the environment, with the difference that no recommendations were actually made during these episodes. The training process is summarized in Figure 3 (Algorithm 2). One important issue in the training procedure is the method used for action selection. One obvious strategy would be for the agent in each state s to select the action a that maximizes Q(s, a), thereby exploiting its current approximation. However, with this greedy strategy there is the risk of overcommitting to actions that are found during early training to have high Q values, while failing to explore other actions that might have even higher values (Mitchell, 1997). For this reason, it is common in Q-Learning to use a probabilistic approach to selecting actions. A simple alternative is to behave greedily most of the time, but with a small probability ε to instead select an action at random. Methods using this near-greedy action selection rule are called ε-greedy methods (Sutton & Barto, 1998). The choice of ε-greedy action selection is quite important for this specific problem, as exploration, especially in the beginning phases of training, is vital. The Q values will converge if each episode, or more precisely each state-action pair, is visited infinitely often. In our implementation of the problem, convergence was reached after a few thousand (between 3000 and 5000) visits of each episode. This definition of the learning algorithm completely follows a TD(0) off-policy learning procedure (Sutton & Barto, 1998), as we take an estimate of the future reward accessible from each state after performing each action by considering the maximum Q value in the next state.
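The update rule of Equations (3) and (4), together with ε-greedy action selection, might be implemented along the following lines. The dictionary-based Q table and the parameter values are illustrative assumptions, and the full episode loop of Algorithm 2 is omitted.

# Sketch of the non-deterministic Q update (Eq. 3) with a visit-count
# learning rate (Eq. 4), plus epsilon-greedy action selection.
import random
from collections import defaultdict

Q = defaultdict(float)
visits = defaultdict(int)

def q_update(s_key, a, reward, next_s_key, candidate_actions, gamma=0.9):
    visits[(s_key, a)] += 1
    alpha_n = 1.0 / (1.0 + visits[(s_key, a)])                       # Equation (4)
    best_next = max((Q[(next_s_key, a2)] for a2 in candidate_actions), default=0.0)
    Q[(s_key, a)] = ((1.0 - alpha_n) * Q[(s_key, a)]
                     + alpha_n * (reward + gamma * best_next))       # Equation (3)

def select_action(s_key, candidate_actions, epsilon=0.2):
    # epsilon-greedy: explore with probability epsilon, otherwise exploit
    if random.random() < epsilon:
        return random.choice(candidate_actions)
    return max(candidate_actions, key=lambda a: Q[(s_key, a)])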
Figure 3. Algorithm 2: Training procedure

EXPERIMENTAL EVALUATION OF THE USAGE-BASED APPROACH

We evaluated system performance in the different settings described above. We used simulated log files generated by a web traffic simulator to tune our reward functions. The log files were simulated for a website containing 700 web pages. We pruned user sessions with a length smaller than 5 and were provided with 16,000 user sessions with an average length of eight. As our evaluation data set we used the web logs of the DePaul University website, one of the few publicly available and widely used datasets, made available by the author of (Mobasher et al., 2000a). This dataset is pre-processed and contains 13,745 user sessions recording visits to 687 pages. These sessions have an average length of around 6. The website structure is categorized as a dense one with high connectivity between web pages according to (Nakagawa & Mobasher, 2003). 70% of the data set was used as the training set and the remainder was used to test the system. For our evaluation we presented each user session to the system and recorded the recommendations it made after seeing each page the user had visited. The system was allowed to make r recommendations in each step, with r < 10 and r < O_v, where O_v is the number of outgoing links of the last page v visited by the user. This limitation on the number of recommendations is adopted from (Li & Zaiane, 2004). The recommendation set in each state is composed by selecting the top r actions with the highest Q-values, again using a variation of the ε-greedy action selection method.

Evaluation Metrics

To evaluate the recommendations we use the metrics presented in (Li & Zaiane, 2004), because of the similarity of the settings in both systems and because we believe these co-dependent metrics can reveal the true performance of the system more clearly than simpler metrics. Recommendation accuracy and coverage are two metrics quite similar to the precision and recall metrics commonly used in the information retrieval literature.

Recommendation accuracy measures the ratio of correct recommendations among all recommendations, where correct recommendations are the ones that appear in the remainder of the user session. If we have M sessions in our test log, then for each session m, after considering each page p, the system generates a set of recommendations Rec(p). To compute the accuracy, Rec(p) is compared with the rest of the session, Tail(p), as in Equation (5). This way any correct recommendation is evaluated exactly once.

Accuracy = (1/M) Σ_m [ Σ_p |Tail(p) ∩ Rec(p)| / Σ_p |Rec(p)| ]   (5)

Recommendation coverage, on the other hand, measures the ratio of the pages in the user session that the system is able to predict before the user visits them:

Coverage = (1/M) Σ_m [ Σ_p |Tail(p) ∩ Rec(p)| / Σ_p |Tail(p)| ]   (6)

As is the case with precision and recall, these metrics are useful indicators of system performance only when used together, and they lose their credibility when used individually. As an example, consider a system that recommends all the pages in each step: this system will gain 100% coverage, of course at the price of very low accuracy.
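A rough sketch of how the accuracy and coverage of Equations (5) and (6) can be computed over a set of test sessions is given below; recommend_after is an assumed interface standing in for the recommender under test, and the exact counting of repeated hits is simplified.

# Sketch of per-session accuracy (Eq. 5) and coverage (Eq. 6), averaged
# over the M test sessions.
def accuracy_and_coverage(test_sessions, recommend_after):
    acc_sum, cov_sum = 0.0, 0.0
    for session in test_sessions:
        hits, n_recs, n_tail = 0, 0, 0
        for i in range(1, len(session)):
            recs = set(recommend_after(session[:i]))  # Rec(p)
            tail = set(session[i:])                   # Tail(p)
            hits += len(recs & tail)
            n_recs += len(recs)
            n_tail += len(tail)
        if n_recs:
            acc_sum += hits / n_recs      # per-session accuracy term
        if n_tail:
            cov_sum += hits / n_tail      # per-session coverage term
    m = max(len(test_sessions), 1)
    return acc_sum / m, cov_sum / m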
Another metric used for evaluation is the shortcut gain, which measures how many page visits users can save if they follow the recommendations. The shortened session is derived by eliminating the intermediate pages in the session that the user could have skipped by following the recommendations. A visit time threshold on the page visits is used to decide which pages are auxiliary pages, as proposed by Li and Zaiane (2004). If we call the shortened session m', the shortcut gain over the M test sessions is measured as follows:

ShortcutGain = (1/M) Σ_m (|m| - |m'|) / |m|   (7)

Evaluation Results

In the first set of experiments we tested the effect of different decisions regarding the state definition, the reward function, and the learning algorithm on the system behavior. Afterwards we compared the system performance with other common techniques used in recommendation systems.

Sensitivity to Active Window Size on User Navigation Trail

In our state definition we used the notion of N-Grams by putting a sliding window on user navigation paths. The implication of using a sliding window of size w is that we base the prediction of the user's future visits on his w past visits. The choice of this sliding window size can affect the system in several ways. A large sliding window seems to provide the system a longer memory, while on the other hand causing a larger state space with sequences that occur less frequently in the usage logs. We trained our system with different window sizes on the user trail and evaluated its performance, as shown in Figure 4. In these experiments we used a fixed window size of 3 on the recommendation history. As our experiments show, the best results are achieved when using a window of size 3. It can be inferred from this diagram that a window of size 1, which considers only the user's last page visit, does not hold enough information in memory to make the recommendation; the accuracy of recommendations improves with increasing window size, and the best results are achieved with a window size of 3. Using a window size larger than 3 results in weaker performance (only shown up to w = 4 in Figure 4 for the sake of readability); this seems to be due to the fact that, as mentioned above, in these models states contain sequences of page visits that occur less frequently in web usage logs, causing the system to make decisions based on weaker evidence. In our evaluation of the shortcut gain there was only a slight difference when using different window sizes.

Sensitivity to Active Window Size on Recommendations

In the next step we performed similar experiments, this time using a constant sliding window of size 3 on the user trail and changing the size of the active window on the recommendation history. As this window size was increased, rather interesting results were achieved, as shown in Figure 5. In evaluating system accuracy, we observed improvement up to a window of size 3; after that, increasing the window size caused no improvement while resulting in a larger number of states. This increase in the number of states is more intense than when the window size on the user trail was increased. This is mainly due to the fact that the system is exploring and tries any combination of recommendations to learn the good ones. The model consisting of this great number of states is in no way efficient, as in our experiments on the test data only 25% of these states were actually visited.
In terms of the shortcut gain the system achieved, we observed that the shortcut gain increased almost constantly with the increase in window size, which seems a natural consequence, as described in the section "Reinforcement Learning for Usage-Based Web Page Recommendation".