The compute method creates the strip processes using send invocations and then waits for them to complete using a semaphore. That code could not be replaced (only) by declaring the equivalent family of processes (using the process abbreviation), because those processes might execute before the code in the compute method initializes the instance variables used within strip. (See Exercise 15.3.)

Many shared-memory multiprocessors employ caches, with one cache per processor. Each cache contains the memory blocks most recently referenced by the processor. (A block is typically a few contiguous words.) The purpose of caches is to increase performance, but they must be used with care by the programmer or they can actually decrease performance (due to cache conflicts). Reference [22] gives three rules of thumb programmers need to keep in mind:

- Perform all operations on a variable, especially updates, in one process.
- Align data so that variables updated by different processors are in different cache blocks.
- Reuse data quickly when possible, so it remains in the cache and does not get “spilled” back to main memory.

A two-dimensional array in Java is an array of references to single-dimensional arrays. A matrix is therefore stored in row-major order (i.e., by rows), although adjacent rows are not necessarily contiguous. The above program consequently uses caches well. Each strip process reads one distinct strip of A and writes one distinct strip of C, and it references elements of A and C by sweeping across rows. Every process references all elements of B, but that is unavoidable. (If B were transposed, so that columns were actually stored in rows, it too could be referenced efficiently.)

15.2 Dynamic Scheduling: A Bag of Tasks

The algorithm in the previous section statically assigned an equal amount of work to each strip process. If the processes execute on homogeneous processors without interruption, they are likely to finish at about the same time. However, if the processes execute on processors of different speeds, or if they can be interrupted (e.g., in a timesharing system), then different processes might complete at different times. To assign work to processes dynamically, we can employ a shared bag of tasks, as in the solution to the adaptive quadrature problem in Section 7.7. Here we present a matrix multiplication program that implements such a solution. The structure of the solution is illustrated in Figure 15.2.

Figure 15.2. Replicated workers and bag of tasks

As in the previous program, we employ two classes. The main class is identical to that in the previous section: it again creates a multiplier object, calls the object's compute method, and then prints out the results. The multiplier class is similar to that in the previous section in that it declares N, A, and B. It also declares and initializes W, the number of worker processes. The class declares an operation, bag, which is shared by the worker processes. The code in method compute sends each row index to bag. It then creates the worker processes, waits for them to terminate, and returns the results to the invoker. Each worker process repeatedly receives a row index r from bag and computes N inner products, one for each element of row r of result matrix C. However, if the bag is empty, the worker process notifies the compute method that it has completed and terminates itself. (See Exercises 15.5 and 15.6.)
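The preview describes but does not show the JR code for this program. As a rough plain-Java approximation of the same replicated-workers structure (a sketch of ours, with invented names: a concurrent queue stands in for JR's shared bag operation, and join() stands in for the worker's completion notification):

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of the bag-of-tasks structure in plain Java (not JR).
// The bag holds row indices; each worker repeatedly grabs a row
// index and computes that entire row of C = A * B.
class BagOfTasksMultiply {
    static double[][] multiply(double[][] a, double[][] b, int workers)
            throws InterruptedException {
        final int n = a.length;
        final double[][] c = new double[n][n];
        final ConcurrentLinkedQueue<Integer> bag = new ConcurrentLinkedQueue<>();
        for (int r = 0; r < n; r++) bag.add(r);   // fill the bag with tasks

        Thread[] ws = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            ws[w] = new Thread(() -> {
                Integer r;
                // An empty bag means no more work, so the worker terminates.
                // (This works here only because no tasks are added later.)
                while ((r = bag.poll()) != null) {
                    for (int col = 0; col < n; col++) {
                        double sum = 0.0;
                        for (int k = 0; k < n; k++) sum += a[r][k] * b[k][col];
                        c[r][col] = sum;          // row-wise writes: cache-friendly
                    }
                }
            });
            ws[w].start();
        }
        for (Thread t : ws) t.join();             // wait for workers to terminate
        return c;
    }
}
```

Because each worker reads and writes whole, distinct rows, it sweeps across rows of A and C just as the rules of thumb above recommend, and it largely avoids having different workers update the same cache blocks of C.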
This way of detecting when to terminate works here because once the bag becomes empty, no new tasks are added to it. It would not work for problems in which the bag might become empty before additional tasks are placed into it; for examples, see the adaptive quadrature program in Section 7.7 and the two solutions to the traveling salesman problem in Sections 17.2 and 17.3.

This program should show nearly perfect speedup (over the one-worker, one-processor case) for reasonable-size matrices, e.g., when N is 100 or more. In that case the amount of computation per iteration of a worker process far outweighs the overhead of receiving a message from the bag. Like the previous program, this one uses caches well, since JR stores matrices in row-major order and each worker fills in an entire row of c. If the bag of tasks contained column indices instead of row indices, performance would be much worse because workers would encounter cache update conflicts.

15.3 A Distributed Broadcast Algorithm

The program in the previous section can be modified so that the workers do not share the matrices or the bag of tasks. In particular, each worker (or address space) could be given a copy of A and B, and an administrator process could dispense tasks and collect results (see Exercise 15.4). With these changes, the program could execute on a distributed-memory machine. This section and the next present two additional distributed algorithms.

To simplify the presentation, we use N² processes, one to compute each element C[r][c]. Initially each such process also has the corresponding values of A and B, i.e., A[r][c] and B[r][c]. In this section we have each process broadcast its value of A to the other processes on the same row and broadcast its value of B to the other processes on the same column. In the next section we have each process interact only with its four neighbors. Both algorithms are inefficient as given, since the grain size is far too small to compensate for communication overhead. However, the algorithms can readily be generalized to use fewer processes, each of which is responsible for a block of matrix C (see Exercises 15.11 and 15.12).

Our broadcast implementation of matrix multiplication uses three classes: a main class, a multiplier class, and a point class. The main class is identical to those in the previous sections. Instances of class Point carry out the computation; the multiplier class creates one instance for each element C[r][c]. Each instance provides three public operations: one to start the computation, one to exchange row values, and one to exchange column values. Operation compute is serviced by a method; it is invoked by a send statement in the multiplier class and hence executes as a process. The arguments of the compute operation are references for other instances of Point. Operations rowval and colval are serviced by receive statements; they are invoked by other instances of Point in the same row r and column c, respectively. The instances of Point interact as shown in Figure 15.3.

Figure 15.3. Broadcast algorithm interaction pattern

The compute process in Point first sends its value of A[r][c] to the other instances of Point in the same row and receives their elements of A. The compute process then sends its value of B[r][c] to the other instances of Point in the same column and receives their elements of B.
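The preview omits the Point class's JR code. As a rough plain-Java model of the exchange pattern just described (names are ours throughout; per-point LinkedBlockingQueues stand in for the rowval and colval operations):

```java
import java.util.concurrent.BlockingQueue;

// Sketch of the broadcast algorithm's message pattern (plain Java, not JR).
// boxA[r][c] receives row-value messages for point (r,c); boxB[r][c]
// receives column-value messages. Each message is an [index, value] pair.
class BroadcastPoint implements Runnable {
    final int n, r, c;
    final double arc, brc;                        // this point's A and B elements
    final BlockingQueue<double[]>[][] boxA, boxB;
    volatile double result;                       // becomes C[r][c]

    BroadcastPoint(int n, int r, int c, double arc, double brc,
                   BlockingQueue<double[]>[][] boxA, BlockingQueue<double[]>[][] boxB) {
        this.n = n; this.r = r; this.c = c; this.arc = arc; this.brc = brc;
        this.boxA = boxA; this.boxB = boxB;
    }

    public void run() {
        try {
            double[] rowA = new double[n], colB = new double[n];
            rowA[c] = arc; colB[r] = brc;
            // Broadcast A[r][c] along row r and B[r][c] down column c.
            for (int j = 0; j < n; j++) if (j != c) boxA[r][j].put(new double[]{c, arc});
            for (int i = 0; i < n; i++) if (i != r) boxB[i][c].put(new double[]{r, brc});
            // Receive the other n-1 row elements of A and column elements of B.
            for (int k = 0; k < n - 1; k++) { double[] m = boxA[r][c].take(); rowA[(int) m[0]] = m[1]; }
            for (int k = 0; k < n - 1; k++) { double[] m = boxB[r][c].take(); colB[(int) m[0]] = m[1]; }
            double sum = 0.0;                     // inner product of row r and column c
            for (int k = 0; k < n; k++) sum += rowA[k] * colB[k];
            result = sum;
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

The two mailbox arrays would be allocated up front (e.g., boxA[i][j] = new LinkedBlockingQueue<>()) before any point starts; unbounded queues keep the broadcast sends from ever blocking.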
After these two data exchanges, Point(r, c) has row r of A and column c of B. It then computes the inner product of these two vectors. Finally, it sends its value of C[r][c] back to the multiplier class.

The multiplier class creates the instances of Point and gets back a reference for each, which it stores in matrix pref. It then invokes the compute operations, passing each instance of Point references for the other instances in the same row and column. We use pref[r] to pass row r of pref to compute; however, we must extract the elements in column c of pref and store them in a new array, cpref, which we then pass to compute. The multiplier then waits for all points to finish their computations and gathers the results, which it returns to the invoker.

As noted, this program can readily be modified to have each instance of Point start with a block of A and a block of B and compute all elements of a block of C. It can also be modified so that the blocks are not square, i.e., strips can be used. In either case the basic algorithmic structure and communication pattern are identical. The program can also be modified to execute on multiple virtual machines: the multiplier class first creates the virtual machines and then creates instances of Point on them.

15.4 A Distributed Heartbeat Algorithm

In the broadcast algorithm, each instance of Point acquires an entire row of A and an entire column of B and then computes their inner product. Also, each instance of Point communicates with all other instances on the same row and the same column. Here we present a matrix multiplication algorithm that employs the same number of instances of a Point class. However, each instance holds only one value of A and one value of B at a time, and each instance of Point communicates only with its four neighbors, as shown in Figure 15.4.

Figure 15.4. Heartbeat algorithm interaction pattern

Again, the algorithm can readily be generalized to work on blocks of points and to execute on multiple virtual machines. As in the broadcast algorithm, we use one process to compute each element of matrix C, and each process initially also has the corresponding elements of A and B. The algorithm consists of three stages [37]. In the first stage, processes shift values in A circularly to the left: values in row r of A are shifted left r columns. In the second stage, processes shift values in B circularly up: values in column c of B are shifted up c rows. The result of this initial rearrangement of the values of A and B for a 3 × 3 matrix is shown in Figure 15.5. (Other initial rearrangements are possible; see Exercise 15.9.)

Figure 15.5. Initial rearrangement of 3 × 3 matrices A and B

In the third stage, each process multiplies its current elements of A and B and adds the product to its element of C; it then shifts its element of A circularly left one column and its element of B circularly up one row. After the initial multiply-add, this shift-and-compute sequence is repeated N-1 times, at which point the matrix product has been computed. We call this kind of algorithm a heartbeat algorithm because the actions of each process are like the beating of a heart: first send data out to neighbors, then bring data in from neighbors and use it.

To implement the algorithm in JR, we again use three classes, as in the broadcast algorithm. Once again, the main class is identical to those in the previous sections.
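The per-process version is what the book implements in JR. Purely as an aid to understanding, here is a sequential plain-Java simulation of the three stages (a sketch of ours; rotating whole matrices stands in for the message passing between Point instances):

```java
// Sequential simulation of the heartbeat (Cannon-style) algorithm.
class HeartbeatMultiply {
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        // Stage 1: shift row r of A circularly left by r columns.
        // Stage 2: shift column col of B circularly up by col rows.
        double[][] A = new double[n][n], B = new double[n][n];
        for (int r = 0; r < n; r++)
            for (int col = 0; col < n; col++) {
                A[r][col] = a[r][(col + r) % n];
                B[r][col] = b[(r + col) % n][col];
            }
        // Stage 3: N multiply-add rounds, shifting A left and B up
        // by one position between rounds (N-1 shifts in all).
        for (int step = 0; step < n; step++) {
            for (int r = 0; r < n; r++)
                for (int col = 0; col < n; col++)
                    c[r][col] += A[r][col] * B[r][col];
            if (step == n - 1) break;              // no shift after the last round
            double[][] A2 = new double[n][n], B2 = new double[n][n];
            for (int r = 0; r < n; r++)
                for (int col = 0; col < n; col++) {
                    A2[r][col] = A[r][(col + 1) % n];  // shift left one column
                    B2[r][col] = B[(r + 1) % n][col];  // shift up one row
                }
            A = A2; B = B2;
        }
        return c;
    }
}
```

The initial alignment by r and c positions ensures that, at every round, position (r, col) holds a matching pair a[r][k] and b[k][col] for some k, and over the N rounds k takes every value exactly once.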
The computation is carried out by instances of a Point class, which provides three public operations, as in the broadcast algorithm. However, here the compute operation is passed references for only the left and upward neighbors, and the rowval and colval operations are invoked by only one neighbor. Also, the body of Point implements a different algorithm, as seen in the following.

Method compute in the multiplier class creates the instances of Point and passes each one references for its left and upward neighbors. The compute method starts up the computation in the Point objects and gathers the results from all the points. The prev method uses modular arithmetic so that instances of Point on the left and top borders communicate with instances on the right and bottom borders, respectively.

Exercises

15.1 Determine the execution times of the programs in this chapter. To do so, place an invocation of System.currentTimeMillis() just before the computation begins and another just after the computation completes. The difference between the two values returned by this method is the time, in milliseconds, that the JR program has been executing.

15.2 Modify the prescheduled strip algorithm so that N does not have to be a multiple of PR.

15.3 Rewrite the MMMultiplier class in Section 15.1 so that the strip processes are declared as a family of processes using the process abbreviation. Be sure your solution prevents the potential problem mentioned in the text; i.e., it prevents these processes from executing before the instance variables have been initialized.

15.4 Change the bag of tasks program so that it does not use shared variables.

15.5 Suppose we change the code in the MMMultiplier class in Section 15.2 so that the compute method does not create the processes. Instead they are created using the process abbreviation: [...] Is this new program correct?

15.6 Suppose we change the code in the MMMultiplier class in Section 15.2 so that the worker process executes the following code: [...]

[...] check_diffs stores the maximum difference of the elements in its row in diff[i][1]. The code then updates the local variable maxdiff, which contains the maximum of all the differences. If this value is at most epsilon, we exit the loop and return the results.

16.2 Prescheduled Strips

The main loop in the algorithm in the previous section repeatedly creates numerous processes and then waits for them to terminate [...]

[...] with no public static variables. The main class and the result class are identical to those in the previous section. The Worker class implements the computation proper. The Jacobi class creates PR instances of Worker and then starts the computation in each instance. During each iteration of the computation, instances of Worker exchange the boundaries of their strip of the grid. The Worker class provides three [...]

[...] Jacobi in Section 16.3 uses the length method. Show how to rewrite the code in the following ways so as not to use the length method.
(a) Change the interface to the terminate operation and/or introduce another operation.
(b) Do not change the interface to the terminate operation. Instead use the forward statement.
(c) Do not change the interface to the terminate operation or use the forward statement. Instead [...]
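The Jacobi fragments above mention a per-row difference array diff, a running maximum maxdiff, and a convergence bound epsilon. A minimal plain-Java sketch of that convergence test, with our own names, signature, and a flattened diff array where the preview is silent:

```java
// Sketch of the Jacobi convergence test described above (plain Java, not JR).
// Worker i records the maximum change in its strip (rows lo..hi-1) in diff[i];
// accumulating into a local first avoids cache update conflicts on diff.
class Convergence {
    static boolean converged(double[][] oldGrid, double[][] newGrid,
                             double[] diff, int i, int lo, int hi, double epsilon) {
        double localMax = 0.0;                    // local accumulator
        for (int r = lo; r < hi; r++)
            for (int c = 0; c < oldGrid[r].length; c++)
                localMax = Math.max(localMax, Math.abs(newGrid[r][c] - oldGrid[r][c]));
        diff[i] = localMax;                       // one shared write per iteration
        double maxdiff = 0.0;                     // reduce over all strips
        for (double d : diff) maxdiff = Math.max(maxdiff, d);
        return maxdiff <= epsilon;                // exit the loop when small enough
    }
}
```

Accumulating into localMax first also matches the cache advice quoted later in the excerpt: each strip process performs a single shared write per iteration instead of updating diff[i] inside the inner loop.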
[...] that region. Third, using the lists, propagate the largest label of any of the boundary pixels to the others that are on the boundary. (The pixel in a region that has the largest label will be on its boundary.) Finally, propagate the label for each region to the pixels in the interior of the region.

(g) Construct a set of experiments to compare the performance of the programs you wrote for the previous [...]

[...] noop for the bottommost strip; this makes the corresponding send statements have no effect. The Jacobi class provides one public method, compute, which controls the computation. The compute method creates the instances of Worker, starts the computation, checks for termination, gathers the results from the workers, and returns the overall result to the invoker. The compute [...]

[...] responsible for S rows of the grid. The solution also illustrates one way to implement a monitor [25] in JR. Our program employs four classes: a main class, a barrier class, a Jacobi class, and a results class. The results class is the same as in the previous section. The main class is similar to the one in the previous section; the differences are that it now also reads in PR, computes the strip size, and passes [...]

[...] termination. Its gather operation is invoked by instances of Worker to return their parts of the overall result. The code that creates the instances of Worker passes each worker its id and appropriate values for the worker to initialize its part of the grid. It saves references for the workers for use when starting the computation. (The code assumes that PR > 1; see Exercise 16.12.) The code that starts the computation [...]

[...] only the first process executes the swap statement that switches the roles of the grids. Variable iters is global to the processes so that it is accessible to the compute method. The strip processes are not created using the process abbreviation, for the same reason discussed in Section 15.1. (See Exercise 16.7.) To avoid cache update conflicts, each strip process uses a local variable to accumulate the [...]

[...] label the corresponding sub-image, and put the labeled sub-images into a second bag. Other workers repeatedly take pairs of adjacent sub-images from the second bag, combine them into a larger labeled image, and put the combined image back into the second bag. The computation terminates when the entire image has been properly labeled. Implement the bags by means of operations that are shared by the worker [...]

[...] computation. The main method then invokes compute in the Jacobi object, gets back the results, and prints them out. The Jacobi class provides the compute method. This method is passed the grid size N, the border values (left, top, right, and bottom), and the convergence criterion epsilon. It initializes an array that contains old and new grid values and two variables that are used to index grid. The current [...]
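The excerpt says the solution implements a barrier as a monitor [25]. In plain Java, a monitor-style cyclic barrier for PR workers looks roughly like the following sketch (ours, not the book's JR code):

```java
// Monitor-style cyclic barrier (plain Java). Workers call await()
// once per iteration; the last arrival releases everyone.
class Barrier {
    private final int parties;
    private int arrived = 0;
    private int generation = 0;   // distinguishes successive uses of the barrier

    Barrier(int parties) { this.parties = parties; }

    synchronized void await() throws InterruptedException {
        int gen = generation;
        if (++arrived == parties) {            // last one in: reset and release
            arrived = 0;
            generation++;
            notifyAll();
        } else {
            while (gen == generation) wait();  // guards against spurious wakeups
        }
    }
}
```

Each Worker would call await() between the update and convergence-check phases of an iteration; java.util.concurrent.CyclicBarrier provides the same behavior off the shelf.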
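Finally, the two-grid arrangement described at the end of the excerpt (old and new grid values selected by two index variables whose roles swap each iteration) can be sketched sequentially; the names, method signature, and border handling below are ours:

```java
// Sketch of the two-grid Jacobi iteration: grid[cur] holds current values,
// grid[nxt] the new ones, and cur/nxt swap roles each iteration.
class JacobiSketch {
    static double[][] jacobi(int n, double left, double top, double right,
                             double bottom, double epsilon) {
        double[][][] grid = new double[2][n + 2][n + 2];
        for (int g = 0; g < 2; g++)
            for (int i = 0; i < n + 2; i++) {      // fixed border values
                grid[g][i][0] = left;  grid[g][i][n + 1] = right;
                grid[g][0][i] = top;   grid[g][n + 1][i] = bottom;
            }
        int cur = 0, nxt = 1;
        double maxdiff;
        do {
            maxdiff = 0.0;
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= n; j++) {
                    grid[nxt][i][j] = 0.25 * (grid[cur][i - 1][j] + grid[cur][i + 1][j]
                                            + grid[cur][i][j - 1] + grid[cur][i][j + 1]);
                    maxdiff = Math.max(maxdiff, Math.abs(grid[nxt][i][j] - grid[cur][i][j]));
                }
            int t = cur; cur = nxt; nxt = t;       // swap the roles of the grids
        } while (maxdiff > epsilon);               // convergence criterion epsilon
        return grid[cur];
    }
}
```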