Telecommunications Optimization: Heuristic and Adaptive Techniques (P13)

13 Adaptive Demand-based Heuristics for Traffic Reduction in Distributed Information Systems

George Bilchev and Sverrir Olafsson

From: Telecommunications Optimization: Heuristic and Adaptive Techniques. Edited by David W. Corne, Martin J. Oates, George D. Smith. Copyright © 2000 John Wiley & Sons Ltd. ISBNs: 0-471-98855-3 (Hardback); 0-470-84163X (Electronic)

13.1 Introduction

As Internet connectivity is reaching the global community, information systems are becoming more and more distributed. Inevitably, this overnight exponential growth has also caused traffic overload at various places in the network. Until recently, it was believed that scaling the Internet was simply an issue of adding more resources, i.e. bandwidth and processing power could be brought to where they were needed. The Internet’s exponential growth, however, exposed this impression as a myth. Information access has not been and will not be evenly distributed. As has been observed, user requests create ‘hot-spots’ of network load, with the same data transmitted over the same network links again and again. These hot-spots are not static but move around, making it impossible to accurately predict the right network capacity to be installed. All this justifies the requirement to develop new infrastructure for data dissemination on an ever-increasing scale, and the design of adaptive heuristics for traffic reduction.

In this chapter, we develop a distributed file system model and use it as an experimental simulation tool to design, implement and test network adaptation algorithms. Section 13.2 describes in detail the distributed file system model and explains the implemented simulation environment. Two adaptation algorithms are developed in section 13.3. One is based on the ‘greedy’ heuristic principle and the other is a genetic algorithm tailored to handle the constraints of our problem. Experiments are shown in section 13.4, and section 13.5 gives conclusions and discusses possible future research directions.

Figure 13.1 A schematic representation of the network and the distributed file system.

13.2 The Adaptation Problem of a Distributed File System

The World Wide Web is rapidly moving us towards a distributed, interconnected information environment, in which an object will be accessed from multiple locations that may be geographically distributed worldwide. For example, a database of customers’ information can be accessed from the location where a salesman is working for the day. In another example, an electronic document may be co-authored and edited by several users.

In such distributed information environments, the replication of objects has crucial implications for system performance, since reading an object locally is faster and less costly than reading it from a remote server. In general, the optimal replication scheme of an object depends on the request pattern, i.e. the number of times users request the data.

Presently, the replication scheme of a distributed database is established in a static fashion when the database is designed. The replication scheme remains fixed until the designer manually intervenes to change the number of replicas or their location. If the request pattern is fixed and known a priori, then this is a reasonable solution. However, in practice the request patterns are often dynamic and difficult to predict. Therefore, we need an adaptive network that manages to optimize itself as the pattern changes.

We proceed with the development of a mathematical model of a distributed information/file system. A distributed file system consists of interconnected nodes where each node $i$, $i = 1, \dots, N$, has a local disk with capacity $d_i$ to store files (see Figures 13.1 and 13.2). There is a collection of $M$ files, each of size $s_j$, $j = 1, \dots, M$. Copies of the files can reside on any one of the disks provided there is enough capacity. The communication cost $c_{i,k}$ between nodes $i$ and $k$ (measured as transferred bytes per simulated second) is also given.

Figure 13.2 Users connect to each node from the distributed file system and generate requests.

In our model each node runs a file manager which is responsible for downloading files from the network (Figure 13.3). To do that, each file manager $i$ maintains an index vector $l_{i,j}$ containing the location from which each file $j$ is downloaded. User applications running on the nodes generate file requests, the frequency of which can be statistically monitored in a matrix $\{p_{i,j}\}$. To account for contention and to distribute the file load across the network, we model how busy the file manager at each node $k$ is as follows:

$$b_k = \frac{\sum_{i,j \,:\, l_{i,j} = k} p_{i,j} \cdot s_j}{\sum_{n=1}^{N} \sum_{m=1}^{M} p_{n,m} \cdot s_m}$$

Figure 13.3 Each node runs a file manager (responsible for allocating files on the network) and a number of user applications which generate the requests.

Thus, the response time of the file manager at node $k$ can be expressed as waiting time in a buffer (Schwartz, 1987):

$$r_k = \begin{cases} \dfrac{1}{\tau_k - b_k} & \text{if } \tau_k > b_k \\ \infty & \text{otherwise} \end{cases}$$

where $\tau_k$ reflects the maximum response capacity of the individual servers.
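As a rough illustration of the two definitions above, the following Python sketch computes the busy factor $b_k$ and the response time $r_k$ for every node. The function names, the list-of-lists layout of `p` and `l`, and the zero-based node indices are assumptions made for this example and do not come from the chapter.

```python
import math

def busy_factors(p, s, l, n_nodes):
    """Return b_k for each node k (hypothetical helper, not from the chapter).

    p[i][j] : request frequency of file j at node i
    s[j]    : size of file j
    l[i][j] : index of the node from which node i downloads file j
    """
    total = sum(p[i][j] * s[j] for i in range(len(p)) for j in range(len(s)))
    served = [0.0] * n_nodes
    for i in range(len(p)):
        for j in range(len(s)):
            # Requested bytes are charged to the node that serves them.
            served[l[i][j]] += p[i][j] * s[j]
    if total == 0:
        return [0.0] * n_nodes  # assumption: no requests means no load
    return [x / total for x in served]

def response_times(b, tau):
    """Return r_k = 1 / (tau_k - b_k) if tau_k > b_k, otherwise infinity."""
    return [1.0 / (t - x) if t > x else math.inf for x, t in zip(b, tau)]

if __name__ == "__main__":
    p = [[1, 1], [1, 1]]   # two nodes, two files, equal demand
    s = [10, 30]
    l = [[0, 1], [0, 1]]   # file 0 served by node 0, file 1 by node 1
    b = busy_factors(p, s, l, n_nodes=2)
    print(b, response_times(b, tau=[0.5, 0.5]))
```

In this toy setting node 1 serves three quarters of all requested bytes, which exceeds its capacity parameter of 0.5, so its response time is infinite.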
The overall performance at node $i$ can be measured as the time during which applications wait for files to download (i.e. response time):

$$O_i = \sum_{j=1}^{M} \left( \frac{s_j}{c_{i,l_{i,j}}} + r_{l_{i,j}} \right) \cdot p_{i,j}$$

The first term in the sum represents the time needed for the actual transfer of the data and the second term reflects the waiting time for that transfer to begin. The goal is to minimize the average network response time:

$$\min_{\{l_{i,j}\}} \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \frac{s_j}{c_{i,l_{i,j}}} + r_{l_{i,j}} \right) \cdot p_{i,j}$$

The minimization is over the index matrix $\{l_{i,j}\}$. There are two constraints: (1) the available disk capacity on each node should not be exceeded; and (2) each file must have at least one copy somewhere on the network.

13.2.1 The Simulation Environment

In our distributed file system model the users generate file requests and the network responds by allocating and downloading the necessary files. The file requests can be statistically monitored and future requests predicted from observed patterns. The simulation environment captures these ideas by modeling the user file requests as random walks:

$$p_{i,j}(t) = p_{i,j}(t-1) + \gamma$$

where $\gamma$ is drawn from a uniform distribution $U(-r, r)$. The parameter $r$ determines the ‘randomness’ of the walk. If it is close to zero then $p_{i,j}(t) \approx p_{i,j}(t-1)$.

During the simulated interval $[t, t+1]$ the model has information about the file requests that have occurred in the previous interval $[t-1, t]$. Thus the dynamics of the simulation can be formally defined as:

For t = 1, 2, 3, ...
    generate new file requests: $P(t) = P(t-1) + \{\gamma\}_{i,j}$
    simulate network: $O(t) = \frac{1}{N} \sum_{i=1}^{N} O_i(t) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \frac{s_j}{c_{i,l_{i,j}}} + r_{l_{i,j}} \right) \cdot p_{i,j}(t)$

An adaptive distributed file system would optimize its file distribution according to the user requests.
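The objective and the random-walk request model can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the cost matrix `c` is assumed to include a diagonal entry `c[i][i]` for local reads, and request frequencies are clamped at zero after each step. The chapter does not specify either of the last two points.

```python
import math
import random

def average_response_time(p, s, c, l, tau):
    """O(t): mean over nodes of sum_j (s_j / c[i][l_ij] + r[l_ij]) * p_ij."""
    n, m = len(p), len(s)
    total = sum(p[i][j] * s[j] for i in range(n) for j in range(m))
    served = [0.0] * n
    for i in range(n):
        for j in range(m):
            served[l[i][j]] += p[i][j] * s[j]
    # Busy factor b_k and response time r_k of every file manager.
    b = [x / total if total > 0 else 0.0 for x in served]
    r = [1.0 / (tau[k] - b[k]) if tau[k] > b[k] else math.inf for k in range(n)]
    o = 0.0
    for i in range(n):
        for j in range(m):
            if p[i][j] > 0:  # skip unused files so that inf * 0 never occurs
                src = l[i][j]
                o += (s[j] / c[i][src] + r[src]) * p[i][j]
    return o / n

def random_walk_step(p, r, rng):
    """One step p_ij(t) = p_ij(t-1) + gamma with gamma ~ U(-r, r).

    Clamping at zero is an assumption of this sketch, not part of the model.
    """
    return [[max(0.0, x + rng.uniform(-r, r)) for x in row] for row in p]

if __name__ == "__main__":
    rng = random.Random(0)
    p = [[1.0], [1.0]]
    c = [[100.0, 10.0], [10.0, 100.0]]
    for t in range(3):
        p = random_walk_step(p, 0.1, rng)
        print(t, average_response_time(p, [10.0], c, [[0], [0]], [2.0, 2.0]))
```

Calling `random_walk_step` once per simulated interval and re-evaluating `average_response_time` with a fixed `l` corresponds to the non-adaptive simulation loop given above.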
Since the future user requests are not known and can only be predicted, the optimization algorithm would have to use an expected value of the requests derived from previous observations:

$$\tilde{P}(t) = \mathrm{Prediction}(P(t-1))$$

Thus an adaptive distributed file system can be simulated as follows:

For t = 1, 2, 3, ...
    file requests prediction: $\tilde{P}(t) = \mathrm{Prediction}(P(t-1))$
    optimization: $L(t) = \mathrm{Optimize}(\tilde{P}(t))$
    generate new file requests: $P(t) = P(t-1) + \{\gamma\}_{i,j}$
    simulate network: $O(t) = \frac{1}{N} \sum_{i=1}^{N} O_i(t) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \frac{s_j}{c_{i,l_{i,j}}} + r_{l_{i,j}} \right) \cdot p_{i,j}(t)$

The next section describes the developed optimization algorithms in detail.

13.3 Optimization Algorithms

13.3.1 Greedy Algorithm

The ‘greedy’ principle consists of selfishly allocating resources (provided constraints allow it) without regard to the performance of the other members of the network (Cormen et al., 1990). While greedy algorithms are optimal for certain problems (e.g. the minimum spanning tree problem), in practice they often produce only near-optimal solutions. Greedy algorithms, however, are very fast and are usually used as a heuristic method. The greedy approach seems very well suited to our problem since the uncertainties in the file request prediction mean that we never actually optimize the real problem, but our expectation of it.

The implemented greedy algorithm works as follows: For each file $j$, check every node $i$ to see if there is enough space to accommodate it and, if enough space is available, calculate the response time of the network if file $j$ was at node $i$:

$$\sum_{k=1}^{N} \left( \frac{s_j}{c_{k,i}} + r_i \right) \cdot p_{k,j}$$

After all nodes are checked, copy the file to the best node found. The algorithm described above loads only one copy of each file into the distributed file system.
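The single-copy greedy placement can be sketched as below. The chapter does not say how $r_i$ is evaluated while files are still being placed, so this sketch assumes it is computed from the load already assigned to node $i$ plus the candidate file. The function name, the processing of files in index order, and the error raised when no node has room are likewise assumptions of the example.

```python
import math

def greedy_single_copy(p, s, c, d, tau):
    """Place one copy of each file on the node with the lowest estimated cost.

    p[k][j]: predicted request frequency, s[j]: file size,
    c[k][i]: transfer rate between nodes k and i, d[i]: disk capacity,
    tau[i] : response capacity of node i.
    Returns location[j], the node chosen for file j.
    """
    n, m = len(d), len(s)
    total = sum(p[k][j] * s[j] for k in range(n) for j in range(m)) or 1.0
    free = list(d)        # remaining disk space per node
    load = [0.0] * n      # requested bytes already assigned to each node
    location = [None] * m
    for j in range(m):
        demand = sum(p[k][j] for k in range(n)) * s[j]
        best_node, best_cost = None, math.inf
        for i in range(n):
            if s[j] > free[i]:
                continue  # capacity constraint: file does not fit on node i
            b_i = (load[i] + demand) / total
            r_i = 1.0 / (tau[i] - b_i) if tau[i] > b_i else math.inf
            cost = sum((s[j] / c[k][i] + r_i) * p[k][j]
                       for k in range(n) if p[k][j] > 0)
            if best_node is None or cost < best_cost:
                best_node, best_cost = i, cost
        if best_node is None:
            raise ValueError(f"no node has room for file {j}")
        free[best_node] -= s[j]
        load[best_node] += demand
        location[j] = best_node
    return location

if __name__ == "__main__":
    p = [[5, 0], [0, 5]]                 # node 0 wants file 0, node 1 file 1
    c = [[100.0, 10.0], [10.0, 100.0]]   # local reads are ten times faster
    print(greedy_single_copy(p, [10, 10], c, d=[10, 10], tau=[2.0, 2.0]))
```

With the toy data each file ends up on the node that requests it, since local transfer is faster and each disk has room for exactly one file.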
If multiple copies are allowed, then add copies of the files in the following way: For each node $i$, get the most heavily used file (i.e. $\max_j (p_{i,j} \cdot s_j)$) which is not already present. Check if there is enough space to accommodate it. If yes, copy it. Continue until all files are checked.

13.3.2 Genetic Algorithm

Genetic Algorithms (GAs) are very popular due to their simple idea and wide applicability (Holland, 1975; Goldberg, 1989). The simple GA is a population-based search in which the individuals (each representing a point from the search space) exchange information (i.e. reproduce) to move through the search space. The exchange of information is done through operators (such as mutation and crossover) and is based on the ‘survival of the fittest’ principle, i.e. better individuals have a greater chance to reproduce.

It is well established that in order to produce good results the basic GA must be tailored to the problem at hand by designing problem-specific representation and operators. The overall flow of control in our implemented GA for the distributed file system model is similar to the steady state genetic algorithm described in Chapter 1.

In order to describe the further implementation details of our GA we need to answer the following questions: How are the individuals represented? How is the population initialized? How is the selection process implemented? What operators are used?

Individuals representation: each individual from the population represents a distribution state for the file system captured by the matrix $\{l_{i,j}\}$.

Initialization: it is important to create a random population of feasible individuals. The implemented initialization process randomly generates a node index for each object and tries to accommodate it on that node. In case of a failure the process is repeated for the same object.
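The initialization step just described can be sketched as follows. The function name, the `max_tries` bound on retries, and the choice to start each individual with exactly one copy of every file are assumptions made for this example. The chapter only states that a random node is drawn and the draw is repeated on failure.

```python
import random

def random_feasible_individual(s, d, rng, max_tries=1000):
    """Build one feasible index matrix l[i][j] by random placement.

    s[j]: file sizes, d[i]: disk capacities, rng: a random.Random instance.
    max_tries is a hypothetical safeguard against endless retries.
    """
    n, m = len(d), len(s)
    free = list(d)
    holder = [None] * m
    for j in range(m):
        for _ in range(max_tries):
            i = rng.randrange(n)      # randomly generated node index
            if s[j] <= free[i]:       # the file fits: accommodate it
                free[i] -= s[j]
                holder[j] = i
                break
        else:
            raise RuntimeError(f"could not place file {j}")
    # With a single copy per file, every node downloads file j from its holder.
    return [[holder[j] for j in range(m)] for _ in range(n)]

if __name__ == "__main__":
    rng = random.Random(42)
    population = [random_feasible_individual([4, 4, 4], [8, 4], rng)
                  for _ in range(5)]
    for individual in population:
        print(individual)
```

Both constraints hold by construction: a file is only placed where it fits, and every file receives exactly one copy.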
Selection process: the individuals are first linearly ranked according to their fitness and then are selected by a roulette-wheel process using their rank value.

Operators: the main problem is the design of operators which preserve feasibility of the solutions (Bilchev and Parmee, 1996). This is important for our problem since it intrinsically has two constraints: (i) the disk capacity of the nodes must not be exceeded; and (ii) each file must have at least one copy somewhere on the network. (If feasibility is not preserved by the operators, then the fitness function would need to be modified by an appropriate penalty function in order to drive the population into the feasible region.) We have developed two main operators, called Safe-add and Safe-delete, both of which preserve feasibility.

Safe-add works as follows: For each node, randomly select and copy a file which is not already locally present and whose size is smaller than the available disk space. Check to see if any of the nodes would respond faster by downloading files from the new locations and, if yes, update the matrix $\{l_{i,j}\}$.

Safe-delete is as follows: For each node, randomly select and delete a file provided it is not the last copy. Update the matrix $\{l_{i,j}\}$ to reflect the above changes.

In our experiments we have used a population size of 70 individuals for 30 generations. During each generation 50 safe-add and three safe-delete operators are applied. During the selection process the best individual has 5% more chance of being selected than the second best, and so on.

13.4 Simulation Results

In our experiments we start with an offline simulation during which the optimization algorithms are run when the file system is not used (e.g. overnight).
In this scenario, we assume that both algorithms have enough time to finish their optimization before the file system is used again. A typical simulation is shown in Figure 13.4. Tests are done using seven nodes and 100 files. All simulation graphs start from the same initial state of the distributed file system. Then the two optimization algorithms are compared against a non-adaptive (static) file system (i.e. when no optimization is used). The experiments show that the adaptive distributed file system produces better results than a static file system. The graphs also indicate that the GA optimizer consistently outperforms the greedy algorithm.

Figure 13.4 An offline simulation. Adaptive distributed file systems utilizing a genetic algorithm and a greedy algorithm respectively are compared against a static distributed file system. The experiments use seven nodes and 31 files. (Axes: t, O(t) response time.)

To show the effect of delayed information, we run the greedy algorithm once using the usage pattern collected from the previous simulation step, P(t-1) (which is available in practice), and once using the actual P(t) (which is not known in practice). The difference in performance reveals how much better we could do if perfect information were available (Figure 13.5).

Figure 13.5 A greedy algorithm with perfect information is compared to a greedy algorithm with delayed information. In practice, we only have delayed information. (Axes: t, O(t) response time.)

Figure 13.6 Online simulation. The circles indicate when the GA optimization takes place. The GA/greedy algorithm ratio is 60 (i.e. the GA is run once for every 60 runs of the greedy algorithm). (Axes: t, O(t) response time.)
Figure 13.7 Online simulation. The GA/greedy algorithm ratio is 40. This is the critical ratio where the average performance of both algorithms is comparable. (Axes: t, O(t) response time.)

Figure 13.8 Online simulation. The GA/greedy algorithm ratio is 30. The GA manages to maintain its performance advantage. (Axes: t, O(t) response time.)

[...]

Next we consider online simulations. In an online simulation a faster optimization algorithm would be executed at a greater rate than... in a real network where there are limited resources and slow communication links. The simulation results confirmed our initial hypothesis that a tailored genetic algorithm would outperform ‘classical’ heuristics such as the greedy algorithm. However, while being able to find better solutions, the GA is considerably slower and doesn’t scale as well as the greedy algorithm. Saying that, it is important ...

Date posted: 01/07/2014, 10:20
