VIETNAM NATIONAL UNIVERSITY, HANOI
HANOI UNIVERSITY OF SCIENCE
FACULTY OF MATHEMATICS - MECHANICS - INFORMATICS

DONG VAN VIET

A Parallel Algorithm Based on Convexity for the Computing of Delaunay Tessellation

THESIS
Major: COMPUTATIONAL GEOMETRY
Instructor: PROF. PHAN THANH AN

HANOI, 2012


Contents

Preface
Introduction
1 Delaunay Tessellation and Convex Hull
  1.1 Geometric Preliminaries
  1.2 Delaunay Tessellation
    1.2.1 Definition of Delaunay Tessellation
    1.2.2 Properties of Delaunay Tessellation
  1.3 Delaunay Tessellation and Connection to Convex Hull
2 Graham's Algorithm
  2.1 Pseudocode, Version A
    2.1.1 Start and Stop of Loop
    2.1.2 Sorting Origin
    2.1.3 Collinearities
  2.2 Pseudocode, Version B
  2.3 Implementation of Graham's Algorithm
    2.3.1 Data Representation
    2.3.2 Sorting
    2.3.3 Main
    2.3.4 Code for the Graham Scan
    2.3.5 Complexity
  2.4 Example
3 Algorithms for Computing Delaunay Tessellation
  3.1 Sequential Algorithm
  3.2 Parallel Algorithm
  3.3 Correctness and Implementation of the Parallel Algorithm
  3.4 Concluding Remarks and Open Problems
Appendix: Introduction to MPI Library
  Getting Started With MPI on the Cluster
  Compilation
  Running MPI
  The Basics of Writing MPI Programs
  Initialization, Communicators, Handles, and Clean-Up
  MPI Indispensable Functions
  A Simple MPI Program - Hello.c
  Timing Programs
  Debugging Methods
References


Introduction

Computational geometry is a branch of computer science concerned with the design and analysis of algorithms for geometric problems (arising, for example, in pattern recognition, computer graphics, operations research, computer-aided design, and robotics) that require real-time speeds. Until recently, these problems were solved on conventional sequential computers, computers whose design more or less follows the model proposed by John von Neumann and his team in the late 1940s (see [1]). The model consists of a single processor capable of executing exactly one instruction of a program during each time unit. Computers built according to this paradigm have been able to perform at tremendous speeds, thanks to inherently fast electronic components. However, it seems today that this approach has been pushed as far as it will go, and that the simple laws of physics will stand in the way of further progress. For example, the speed of light imposes a limit that cannot be surpassed by any electronic device. On the other hand, our appetite appears to grow continually for ever more powerful computers capable of processing large amounts of data at great speeds.

One solution to this predicament that has recently gained credibility and popularity is parallel processing. The main purpose of parallel processing is to perform computations faster than can be done with a single processor, by using a number of processors concurrently. The pursuit of this goal has had a tremendous influence on almost all the activities related to computing. The need for faster solutions and for solving larger-size problems arises in a wide variety of applications.

Three main factors have contributed to the current strong trend in favor of parallel processing (see [12]). First, the hardware cost has been falling steadily; hence, it is now possible to build systems with many processors at a reasonable cost. Second, very large scale integration circuit technology has advanced to the point where it is possible to design complex systems requiring millions of transistors on a single chip. Third, the fastest cycle
time of a von Neumann-type processor seems to be approaching fundamental physical limitations beyond which no improvement is possible; in addition, as higher performance is squeezed out of a sequential processor, the associated cost increases dramatically. All these factors have pushed researchers into exploring parallelism and its potential use in important applications.

A parallel computer is simply a collection of processors, typically of the same type, interconnected in a certain fashion to allow the coordination of their activities and the exchange of data (see [12]). The processors are assumed to be located within a small distance of one another, and are primarily used to solve a given problem jointly. Contrast such computers with distributed systems, where a set of possibly many different types of processors are distributed over a large geographic area, and where the primary goals are to use the available distributed resources and to collect information and transmit it over a network connecting the various processors. Parallel computers can be classified according to a variety of architectural features and modes of operation. In particular, these criteria include the type and the number of processors, the interconnections among the processors and the corresponding communication schemes, the overall control and synchronization, and the input/output operations.

In order to solve a problem efficiently on a parallel machine, it is usually necessary to design an algorithm that specifies multiple operations on each step, i.e., a parallel algorithm. This algorithm can be executed a piece at a time on many different processors, and the pieces are then put back together at the end to get the correct result. As an example, consider the problem of computing the sum of a sequence A of n numbers. The standard algorithm computes the sum by making a single pass through the sequence, keeping a running sum of the numbers seen so far. It is not difficult, however, to devise an algorithm for computing the sum that performs many operations in parallel. For example, suppose that, in parallel, each element of A with an even index is paired and summed with the next element of A, which has an odd index, i.e., A[0] is paired with A[1], A[2] with A[3], and so on. The result is a sequence of n/2 numbers that sum to the same value as the sum that we wish to compute. This pairing and summing step can be repeated until, after log2 n steps, a sequence consisting of a single value is produced, and this value is equal to the final sum.
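To make the pairing-and-summing scheme concrete, the following is a minimal C sketch; it only simulates the log2 n rounds on a single processor (the array contents and size are illustrative, not taken from the thesis), but each round contains exactly the independent pairwise additions that would be issued in parallel.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative data; in a real parallel setting each round below
           would be executed by n/2, n/4, ... processors concurrently.    */
        double A[8] = {3, 1, 4, 1, 5, 9, 2, 6};
        int n = 8;                              /* assumed a power of two */

        for (int len = n; len > 1; len /= 2)    /* log2(n) rounds         */
            for (int i = 0; i < len / 2; i++)   /* independent pairs      */
                A[i] = A[2 * i] + A[2 * i + 1]; /* A[0]+A[1], A[2]+A[3], ... */

        printf("sum = %g\n", A[0]);             /* single remaining value */
        return 0;
    }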
As in sequential algorithm design, in parallel algorithm design there are many general techniques that can be used across a variety of problem areas, including parallel divide-and-conquer, randomization, and parallel pointer manipulation. The divide-and-conquer strategy splits the problem to be solved into subproblems that are easier to solve than the original problem, solves the subproblems, and merges the solutions to the subproblems to construct a solution to the original problem.

Throughout this thesis, our main goal is to present a parallel algorithm based on a divide-and-conquer strategy for computing the n-dimensional Delaunay tessellation of a set of m distinct points in E^n (see [12]). In E^n, the Delaunay tessellation (i.e., the Delaunay triangulation in the plane), along with its dual, the Voronoi diagram, is an important problem in many domains, including pattern recognition, terrain modeling, and mesh generation for the solution of partial differential equations. In many of these domains the tessellation is a bottleneck in the overall computation, making it important to develop fast algorithms. As a result, there are many sequential algorithms available for Delaunay tessellation, along with efficient implementations (see [14, 16]). Among them is Aurenhammer et al.'s method, based on a beautiful connection between the Delaunay tessellation and the convex hull in one higher dimension (see [7, 9, 11]). Since these sequential algorithms are time and memory intensive, parallel implementations are important both for improved performance and to allow the solution of problems that are too large for sequential machines. However, although several parallel algorithms for Delaunay triangulation have been presented (see [1]), practical implementations have been slower to appear (see [6, 8, 10, 13]). For the convex hull problem in 2D and 3D, the convex hull boundary is first sought in the domain formed by a rectangle (or rectangular parallelepiped); the domain is then reduced to a smaller one, the so-called restricted area, so that a simple detection replaces a complete computation (see [2, 3, 5]).

In this thesis, we present a parallel algorithm based on a divide-and-conquer strategy. At each process of the parallel algorithm, Aurenhammer et al.'s method (the lift-up to the paraboloid of revolution) is used. Convexity in the plane is shown to be a crucial factor in the efficiency of the new parallel algorithm over the corresponding sequential algorithm. In particular, a restricted area obtained from a paraboloid given in [8] is used to discard non-Delaunay edges (Proposition 3.4). Some advantages of the parallel algorithm are shown. Its implementation in the plane is executed easily on PC clusters (Section 3.3). Compared with previous work, the resulting implementation achieves significantly better speedups over the corresponding sequential code given in [15] (Table 3.1).

This thesis has three chapters and one appendix:

Chapter I: Delaunay Tessellation and Convex Hull. This chapter deals with basic geometric preliminaries, the notion of Delaunay tessellation, and some properties of Delaunay tessellation. It shows a beautiful connection between Delaunay tessellations and convex hulls in one higher dimension.

Chapter II: Graham's Algorithm. Chapter II is concerned with Graham's scan for computing the convex hull of a set of points in the plane.

Chapter III: Algorithms for Computing Delaunay Tessellation. In this chapter we come into contact with algorithms for computing the Delaunay tessellation. The programming language used in this thesis is C.

Appendix: Introduction to MPI Library. This guide is designed to give a brief overview of some of the basic and important routines of the MPI library.
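Before turning to the chapters, it may help to preview the lift-up mentioned above. In its standard planar formulation, Aurenhammer et al.'s connection maps each input point (x1, x2) to the point (x1, x2, x1^2 + x2^2) on the paraboloid of revolution; the faces of the lower convex hull of the lifted points project back onto the Delaunay triangles. The following is a minimal C sketch of the lifting step only; the type and function names are illustrative and not taken from the thesis code.

    #include <stddef.h>

    typedef struct { double x1, x2; } Point2;        /* planar input point  */
    typedef struct { double x1, x2, x3; } Point3;    /* lifted point in E^3 */

    /* Lift each planar point onto the paraboloid of revolution
       x3 = x1^2 + x2^2. The 3D lower convex hull of the lifted points,
       projected back to the plane, is the Delaunay triangulation.       */
    void lift_to_paraboloid(const Point2 *p, Point3 *q, size_t m)
    {
        for (size_t i = 0; i < m; i++) {
            q[i].x1 = p[i].x1;
            q[i].x2 = p[i].x2;
            q[i].x3 = p[i].x1 * p[i].x1 + p[i].x2 * p[i].x2;
        }
    }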
Chapter 1
Delaunay Tessellation and Convex Hull

1.1 Geometric Preliminaries

The objects considered in computational geometry are normally sets of points in Euclidean space. A coordinate system of reference is assumed, so that each point is represented as a vector of cartesian coordinates of the appropriate dimension. The geometric objects do not necessarily consist of finite sets of points, but must comply with the convention of being finitely specifiable. So we shall consider, besides individual points, the straight line containing two given points, the straight line segment defined by its two extreme points, the plane containing three given points, the polygon defined by an (ordered) sequence of points, etc. This section has no pretence of providing formal definitions of the geometric concepts used in this thesis; it has just the objectives of refreshing notions that are certainly known to the reader and of introducing the adopted notation.

By E^d we denote the d-dimensional Euclidean space, i.e., the space of the d-tuples (x1, ..., xd) of real numbers xi, i = 1, ..., d, with metric (x1^2 + · · · + xd^2)^{1/2}. We shall now review the definitions of the principal objects considered by computational geometry.

Point: A d-tuple (x1, ..., xd) denotes a point p of E^d; this point may also be interpreted as a d-component vector applied to the origin of E^d, whose free terminus is the point p.

Line: Given two distinct points q1 and q2 in E^d, the linear combination αq1 + (1 − α)q2 (α ∈ R) is a line in E^d.

Line segment: Given two distinct points q1 and q2 in E^d, if in the expression αq1 + (1 − α)q2 we add the condition 0 ≤ α ≤ 1, we obtain the convex combination of q1 and q2, i.e., αq1 + (1 − α)q2 (α ∈ R, 0 ≤ α ≤ 1). This convex combination describes the straight line segment joining the two points q1 and q2. Normally this segment is denoted as q1q2 (unordered pair).

Convex set: A domain D in E^d is convex if, for any two points q1 and q2 in D, the segment q1q2 is entirely contained in D. In formal terms, we have the following definition:

Definition 1.1. Given k distinct points p1, p2, ..., pk in E^d, the set of points
    p = α1 p1 + α2 p2 + · · · + αk pk   (αj ∈ R, αj ≥ 0, α1 + α2 + · · · + αk = 1)
is the convex set generated by p1, p2, ..., pk, and p is a convex combination of p1, p2, ..., pk. For example, the midpoint of a segment is the convex combination of its two endpoints with α1 = α2 = 1/2.

Figure 1.1: a) Convex set, b) nonconvex set.

It should be clear from Fig. 1.1 that any region with a "dent" is not convex, since two points straddling the dent can be found such that the segment they determine contains points exterior to the region.

Convex hull: The convex hull of a set of points S in E^d is the boundary of the smallest convex domain in E^d containing S. In the mathematics literature, the convex hull of a set S is denoted by CH(S) (see Fig. 1.2).

Figure 1.2: Convex hull of a finite set.

3.3 Correctness and Implementation of the Parallel Algorithm

Proof. Let u, v ∈ P with u_{x1} < xmin and v_{x1} > xmax (see Fig. 3.4). Assume on the contrary that uv is a Delaunay edge. By Proposition 3.3, the path H separates Delaunay triangles. It follows that the strip {p ∈ CH(P) : xmin ≤ p_{x1} ≤ xmax} splits CH(P) into the disjoint areas A = {q ∈ CH(P) : q_{x1} < xmin} and B = {q ∈ CH(P) : q_{x1} > xmax}. Moreover, q*, q** ∈ H and they are extreme points of CH(P). Because u ∈ A, v ∈ B and CH(P) is convex, the segment uv intersects the part of the path H bounded by q* and q** (see [18]). Hence, the segment uv intersects H, and so uv must contain either some Delaunay edge of H or some point of P ∩ H. By [14] and [16], uv is then not a Delaunay edge, a contradiction.

Clearly, the strip {p ∈ CH(P) : xmin ≤ p_{x1} ≤ xmax} splits D(P) into disjoint areas, and if the pair (u, v) lies in the restricted area R² \ {p : xmin ≤ p_{x1} ≤ xmax}, then u and v belong to the two opposite sides of the strip. Therefore, by Proposition 3.4, in the case n = 2 we will use the condition DT2(Pj*) instead of DT(Pj*) in the parallel algorithm.
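As a minimal illustration of how Proposition 3.4 is used to discard non-Delaunay edges, the following C sketch tests whether a candidate pair lies strictly on opposite sides of the strip [xmin, xmax]; the function and field names are illustrative and not the thesis's actual code.

    typedef struct { double x1, x2; } Point2;

    /* Returns 1 when, by Proposition 3.4, the pair (u, v) cannot form a
       Delaunay edge because u and v lie strictly on opposite sides of the
       strip xmin <= x1 <= xmax; returns 0 when the test is inconclusive.  */
    int discard_by_strip(Point2 u, Point2 v, double xmin, double xmax)
    {
        if (u.x1 < xmin && v.x1 > xmax) return 1;
        if (v.x1 < xmin && u.x1 > xmax) return 1;
        return 0;
    }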
The sequential algorithm in Section 3.1 is implemented in C (the O(m^4) code is given in [15]). Our parallel algorithm in Section 3.2 is written in MPI (Message Passing Interface) for C. The codes are compiled and executed on an IBM 1350 (Center for High Performance Computing, Hanoi University of Science). The 2D lower convex hull of P is determined by a code of Graham's convex hull algorithm given in [15]. The point q = (q_{x1}, q_{x2}) is chosen as the corresponding median L along both the x1-axis and the x2-axis of all internal points of P. For the comparison to be meaningful, both implementations use the same code for file reading.

Input (random points) | Number of lower faces | Sequential algorithm | Parallel algorithm, k proc | Parallel algorithm, k = 10 proc
  500 |   891 |   1.231672 |   0.736172 |   0.547753
  800 |  1480 |   4.493053 |   2.185690 |   1.519047
 1000 |  1884 |   8.538490 |   3.860057 |   2.489995
 1200 |  2278 |  14.043461 |   8.899976 |   5.524225
 3000 |  5970 | 213.648546 |  67.733744 |  39.774030
 5000 | 10075 | 989.834358 | 370.466784 | 217.712243

Table 3.1: The actual run time of our parallel algorithm using the restricted area in Section 3.2.

As we see from Table 3.1, unlike our parallel Algorithm 7, the running time of the sequential algorithm in Section 3.1 is unacceptable. Fig. 3.5 shows the number of processes versus the total running time of the parallel algorithm. It also shows that the restricted area (included in DT2(Pj*)) is necessary to reduce the total running time of our parallel algorithm.

Figure 3.5: Number of processes with respect to total running time, Series1 (Series2, respectively), for the parallel algorithm finding the Delaunay triangulation of 3,000 random points using the restricted area R² \ {p : xmin ≤ p_{x1} ≤ xmax} (without using the restricted area, respectively).

3.4 Concluding Remarks and Open Problems

We observe that if the width of the strip {p ∈ CH(P) : xmin ≤ p_{x1} ≤ xmax} is large, the number of pairs u, v discarded by Proposition 3.4 decreases. That is the reason why we can use H instead of the restricted area R² \ {p : xmin ≤ p_{x1} ≤ xmax} (i.e., if two points u, v belong to the two opposite sides of the path H, then u, v do not form a Delaunay edge).

In our parallel Algorithm 7, we use one median line L parallel to the x2-axis. We could also change it to a median line parallel to the x1-axis. Instead of using one median line, we could use two median lines, one parallel to the x1-axis and one parallel to the x2-axis. Then we would have a bigger restricted area, and the number of pairs u, v discarded by Proposition 3.4 would increase.

Graham's convex hull algorithm in the implementation can be replaced by a better convex hull algorithm ([3] or [5]); if so, our parallel algorithm runs even faster. In Graham's algorithm we have to reorder the set of points P, and this takes the longest time in finding the convex hull (O(n log n)). The points in P are sorted angularly from smallest to largest; how we can exploit this order to discard non-Delaunay edges is a big question. If we could use this order, then our parallel algorithm would achieve significantly better speedups.

Also, the O(m^4) implementation of the sequential algorithm in Section 3.1 can be replaced by an O(m^2) implementation given in [15]; if so, we would obtain a corresponding parallel implementation in Section 3.2. But, as stated in [15, pp. 186-187], if O(m^4) is acceptable, the Delaunay triangulation can be computed with fewer than thirty lines of code in the O(m^4) implementation. This also leads to our explicit parallel implementation. Furthermore, the importance of convexity for the efficiency of the parallel algorithm over the corresponding sequential one is also visible from this implementation. That is the reason why, in this thesis, the O(m^4) implementation is our choice.
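The remarks above single out the angular sort and the turn test as the dominant costs of Graham's scan. For reference, the core geometric predicate is the signed-area (cross-product) test; a minimal C sketch follows, with illustrative names that are not the thesis's Chapter 2 code.

    typedef struct { double x1, x2; } Point2;

    /* Twice the signed area of triangle (a, b, c): positive for a left turn,
       negative for a right turn, zero when the three points are collinear.
       Graham's scan pops b whenever (a, b, c) is not a strict left turn.     */
    double area2(Point2 a, Point2 b, Point2 c)
    {
        return (b.x1 - a.x1) * (c.x2 - a.x2) - (c.x1 - a.x1) * (b.x2 - a.x2);
    }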
Appendix: Introduction to MPI Library

MPI, the Message Passing Interface, is one of the most popular libraries for message passing within a parallel program. MPI is a library and a software standard developed by the MPI Forum to make use of the most attractive features of existing message passing systems for parallel programming. An MPI process consists of a C, C++, or Fortran program which communicates with other MPI processes by calling MPI routines. The MPI routines provide the programmer with a consistent interface across a wide variety of different platforms.

The MPI specification is based on a message passing model of computation, where processes communicate and coordinate by sending and receiving messages. These messages can be of any size, up to the physical memory limit of the machine. MPI provides a variety of message passing options, offering maximal flexibility in message passing. MPI is a specification (like C or Fortran) and there are a number of implementations. The MPICH implementation is a library of several hundred C and Fortran routines that let you write programs that run in parallel and communicate with each other. There are about six to twenty-four functions from which many useful and efficient programs can be written. This guide is designed to give the user a brief overview of some of the basic and important routines of MPICH.

Getting Started With MPI on the Cluster

Compilation

MPI allows you to have your source code in any directory. For convenience, you should probably put your files together in subdirectories under ~yourusername/mpi. You can compile simple C programs that call MPI routines with:

    mpicc -o program_name program_name.c
or
    mpicc program_name.c -o program_name

where program_name.c is the name of your C source file and program_name is the executable produced by the compiler. Some examples:

    mpicc -o hello hello.c
    mpicc hello.c -o hello

Running MPI

In order to run a compiled MPI program, you must type:

    mpirun -np <number of processes> [mpirun_options] <program_name> [arguments]

where you specify the number of processors on which you want to run your parallel program, the mpirun options, and your program name and its expected arguments. Some examples:

    mpirun -np hello
    mpirun -np dt4 test.txt

Type man mpirun or mpirun -help for a complete list of mpirun options.

The Basics of Writing MPI Programs

You should now be able to successfully compile and execute MPI programs, check the status of your MPI processes, and halt MPI programs that have gone astray. This section gives an overview of the basics of parallel programming with MPI.

Initialization, Communicators, Handles, and Clean-Up

The first MPI routine called in any MPI program must be the initialization routine MPI_Init. Every MPI program must call this routine once, before any other MPI routines. Making multiple calls to MPI_Init is erroneous. MPI_Init defines something called MPI_COMM_WORLD for each process that calls it. MPI_COMM_WORLD is a communicator. All MPI communication calls require a communicator argument, and MPI processes can only communicate if they share a communicator.

Every communicator contains a group, which is a list of processes. A group is in fact local to a particular process. The group contained within a communicator has been previously agreed across the processes at the time when the communicator was set up. The processes are ordered and numbered consecutively from zero, the number of each process being known as its rank. The rank identifies each process within the communicator. The group of MPI_COMM_WORLD is the set of all MPI processes.

MPI maintains internal data structures related to communicators and the like, and these are referenced by the user through handles. Handles are returned to the user from some MPI calls and can be used in other MPI calls.

An MPI program should call the MPI routine MPI_Finalize when all communications have completed. This routine cleans up all MPI data structures. It does NOT cancel outstanding communications, so it is the responsibility of the programmer to make sure all communications have completed. Once this routine is called, no other calls can be made to MPI routines, not even MPI_Init, so a process cannot later re-enroll in MPI.
MPI Indispensable Functions

This section contains the basic functions needed to manipulate processes running under MPI. The following is the basic structure used to build most MPI programs:

+ A C/C++ program that calls the MPI library must include the header file mpi.h: #include <mpi.h> or #include "mpi.h".
+ All MPI programs must call MPI_Init as the first MPI call, to initialize themselves.
+ Most MPI programs call MPI_Comm_size to get the number of processes that are running.
+ Most MPI programs call MPI_Comm_rank to determine their rank, which is a number between 0 and size − 1.
+ Conditional processing and general message passing can then take place, for example using the calls MPI_Send and MPI_Recv.
+ All MPI programs must call MPI_Finalize as the last call to an MPI library routine.
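A minimal skeleton following this structure might look as below; it is only a sketch (it sends nothing and simply reports each rank), and the message-passing step is left as a comment.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int size, rank;

        MPI_Init(&argc, &argv);                    /* first MPI call       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);      /* how many processes   */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* my rank: 0..size-1   */

        /* Conditional processing and message passing (MPI_Send/MPI_Recv)
           would go here.                                                  */
        printf("Process %d of %d is alive\n", rank, size);

        MPI_Finalize();                            /* last MPI call        */
        return 0;
    }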
It is said that MPI is both small and large. What is meant is that the MPI standard has many functions in it, approximately 125. However, many of the advanced routines represent functionality that can be ignored until one pursues added flexibility (data types), robustness (nonblocking send/receive), efficiency ("ready mode"), modularity (groups, communicators), or convenience (collective operations, topologies). MPI is said to be small because there are six indispensable functions from which many useful and efficient programs can be written. The six functions are:

    MPI_Init        // Initialize MPI
    MPI_Comm_size   // Find out how many processes there are
    MPI_Comm_rank   // Find out which process I am
    MPI_Send        // Send a message
    MPI_Recv        // Receive a message
    MPI_Finalize    // Terminate MPI

You can add functions to your working knowledge incrementally without having to learn everything at once. For example, you can accomplish a lot by just adding the collective communication functions MPI_Bcast and MPI_Reduce to your repertoire. These functions are detailed below in addition to the six indispensable functions.

a) MPI_Init
The call to MPI_Init is required in every MPI program and must be the first MPI call. It establishes the MPI execution environment.
    int MPI_Init(int *argc, char ***argv)
Input:
    argc - pointer to the number of arguments
    argv - pointer to the argument vector

b) MPI_Comm_size
This routine determines the size (i.e., the number of processes) of the group associated with the communicator given as an argument.
    int MPI_Comm_size(MPI_Comm comm, int *size)
Input:
    comm - communicator (handle)
Output:
    size - number of processes in the group of comm

c) MPI_Comm_rank
This routine determines the rank (i.e., which process number am I?) of the calling process in the communicator.
    int MPI_Comm_rank(MPI_Comm comm, int *rank)
Input:
    comm - communicator (handle)
Output:
    rank - rank of the calling process in the group of comm (integer)

d) MPI_Send
This routine performs a basic send; it may block until the message is received, depending on the specific implementation of MPI.
    int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
Input:
    buf - initial address of send buffer (choice)
    count - number of elements in send buffer (nonnegative integer)
    datatype - datatype of each send buffer element (handle)
    dest - rank of destination (integer)
    tag - message tag (integer)
    comm - communicator (handle)

e) MPI_Recv
This routine performs a basic receive.
    int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
Input:
    count - maximum number of elements in receive buffer (integer)
    datatype - datatype of each receive buffer element (handle)
    source - rank of source (integer)
    tag - message tag (integer)
    comm - communicator (handle)
Output:
    buf - initial address of receive buffer
    status - status object, providing information about the message received; status is a structure of type MPI_Status, the element status.MPI_SOURCE is the source of the message received, and the element status.MPI_TAG is the tag value

f) MPI_Finalize
This routine terminates the MPI execution environment; all processes must call this routine before exiting.
    int MPI_Finalize(void)

g) MPI_Bcast
This routine broadcasts data from the process with rank root to all processes of the group.
    int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
Input/Output:
    buffer - starting address of buffer (choice)
    count - number of entries in buffer (integer)
    datatype - data type of buffer (handle)
    root - rank of broadcast root (integer)
    comm - communicator (handle)

h) MPI_Reduce
This routine combines values on all processes into a single value using the operation defined by the parameter op.
    int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
Input:
    sendbuf - address of send buffer (choice)
    count - number of elements in send buffer (integer)
    datatype - data type of elements of send buffer (handle)
    op - reduce operation (handle)
    root - rank of root process (integer)
    comm - communicator (handle)
Output:
    recvbuf - address of receive buffer (choice, significant only at root)
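The six functions above are already enough for a complete message-passing exchange. The following sketch is not taken from the thesis (the tag value and the data sent are arbitrary): process 1 sends a small integer array to process 0 with MPI_Send, and process 0 receives it with MPI_Recv and inspects the status object.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, data[3] = {0, 0, 0};
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {                       /* process 1 fills and sends */
            data[0] = 10; data[1] = 20; data[2] = 30;
            MPI_Send(data, 3, MPI_INT, 0, 99, MPI_COMM_WORLD);
        } else if (rank == 0) {                /* process 0 receives        */
            MPI_Recv(data, 3, MPI_INT, 1, 99, MPI_COMM_WORLD, &status);
            printf("Node 0 received %d %d %d from node %d (tag %d)\n",
                   data[0], data[1], data[2], status.MPI_SOURCE, status.MPI_TAG);
        }

        MPI_Finalize();
        return 0;
    }

This sketch must be run with at least two processes, e.g. mpirun -np 2 <program_name>.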
A Simple MPI Program - Hello.c

Consider this demo program:

    /* The Parallel Hello World Program */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("Hello World from Node %d\n", rank);
        MPI_Finalize();
        return 0;
    }

To compile and execute this demo, we use the following commands:

    mpicc -o hello hello.c
    mpirun -np 10 hello

In a nutshell, this program sets up a communication group of 10 processes, where each process gets its rank, prints it, and exits. It is important to understand that in MPI this program starts simultaneously on all machines. For example, if we had ten machines, then running this program would mean that ten separate instances of it would start running together on the different machines. This is a fundamental difference from ordinary C programs, where, when someone says "run the program", it is assumed that there is only one instance of the program running.

The first line,

    #include <stdio.h>

should be familiar to all C programmers. It includes the standard input/output routines like printf. The second line,

    #include <mpi.h>

includes the MPI functions. The file mpi.h contains prototypes for all the MPI routines in this program; this file is located in /usr/include/mpi/mpi.h in case you actually want to look at it. The program starts with the main line, which takes the usual two arguments argc and argv, and the program declares one integer variable, rank. The first step of the program,

    MPI_Init(&argc, &argv);

calls MPI_Init to initialize the MPI environment and generally set everything up. This should be the first command executed in all programs. This routine takes pointers to argc and argv, looks at them, pulls out the purely MPI-relevant things, and generally fixes them so you can use command line arguments as normal.

Next, the program runs MPI_Comm_rank, passing it an address to rank:

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

MPI_Comm_rank sets rank to the rank of the machine on which the program is running; each process receives a unique number from MPI_Comm_rank. Because the program is running on multiple machines, each will execute not only all of the commands explained thus far, but also the hello world printf, which includes its own rank:

    printf("Hello World from Node %d\n", rank);

If the program is run on ten computers, printf is called ten times on the different machines simultaneously. The order in which each process prints the message is undetermined, based on when each process reaches that point in its execution of the program and how its output travels on the network. So the ten messages will get dumped to your screen in some undetermined order, one line of the form "Hello World from Node <rank>" per process.

Note that all the printf's, though they come from different machines, will send their output intact to your shell window; this is generally true of output commands. Input commands, like scanf, will only work on the process with rank zero. After doing everything else, the program calls MPI_Finalize, which generally terminates everything and shuts down MPI. This should be the last command executed in all programs.

Timing Programs

For timing parallel programs, MPI includes the routine MPI_Wtime(), which returns elapsed wall clock time in seconds. The timer has no defined starting point, so in order to time something, two calls are needed and the difference should be taken between the returned times. As a simple example, we can time each of the processes in the Hello World program as below:

    /* Timing the Parallel Hello World Program */
    #include <stdio.h>
    #include <mpi.h>

    /* NOTE: The MPI_Wtime calls can be placed anywhere between the
       MPI_Init and MPI_Finalize calls. */
    int main(int argc, char **argv)
    {
        int rank;
        double mytime;      /* variable to hold the time returned */

        MPI_Init(&argc, &argv);
        mytime = MPI_Wtime();            /* time just before the work to be timed */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("Hello World from Node %d\n", rank);
        mytime = MPI_Wtime() - mytime;   /* elapsed time */
        printf("Timing from node %d is %lf seconds.\n", rank, mytime);
        MPI_Finalize();
        return 0;
    }

Run this code with the commands:

    mpicc -o hello hello.c
    mpirun -np hello

And what we may get is:

    Hello World from Node
    Timing from node is 0.000026 seconds.
    Hello World from Node
    Timing from node is 0.000028 seconds.
    Hello World from Node
    Timing from node is 0.000030 seconds.
    Hello World from Node
    Timing from node is 0.000034 seconds.
    Hello World from Node
    Timing from node is 0.000055 seconds.
    Hello World from Node
    Timing from node is 0.000131 seconds.
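When timing a whole parallel run, one often wants a single number rather than one time per process. A common pattern, sketched below under the assumption that the work has already been timed into mytime as above (the work itself is left as a comment), is to combine the per-process times with MPI_Reduce using the MPI_MAX operation, so that only the root prints the slowest, i.e., overall, time.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double mytime, maxtime;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        mytime = MPI_Wtime();
        /* ... the work to be timed would go here ... */
        mytime = MPI_Wtime() - mytime;

        /* Combine the per-process elapsed times: the maximum over all
           processes is delivered to the root (rank 0) only.            */
        MPI_Reduce(&mytime, &maxtime, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Total (slowest-process) time: %lf seconds.\n", maxtime);

        MPI_Finalize();
        return 0;
    }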
Debugging Methods

The following method is suggested for debugging MPI programs. First, if possible, write the program as a serial program. This will allow you to debug most syntax, logic, and indexing errors. Then, modify the program and run it with 2-4 processes on the same machine. This step will allow you to catch syntax and logic errors concerning intertask communication. A common error found at this point is the use of non-unique message tags. The final step in debugging your application is to run the same processes on different machines. You should first try to find the bug using a few printf statements: if some of them do not execute, you can narrow down the place where the program stops working, and so identify where the bug is.

References

[1] Akl, S.G., Lyons, K.A. (1992): "Parallel Computational Geometry", Prentice Hall, Englewood Cliffs.
[2] An, P.T., Trang, L.H. (2011): "An efficient convex hull algorithm in 3D based on the method of orienting curves", Optimization, pp. 1-14.
[3] An, P.T. (2010): "Method of orienting curves for determining the convex hull of a finite set of points in the plane", Optimization, 59(2), pp. 175-179.
[4] An, P.T., Giang, D.T., Hai, N.N. (2010): "Some computational aspects of geodesic convex sets in a simple polygon", Numer. Funct. Anal. Optim., 31(3), pp. 221-231.
[5] An, P.T. (2007): "A modification of Graham's algorithm for determining the convex hull of a finite planar set", Ann. Math. Inf., 34, pp. 269-274.
[6] Atallah, M.J. (1995): "Parallel computational geometry", In: Zomaya, A.Y. (ed.) Parallel and Distributed Computing Handbook, McGraw-Hill, New York.
[7] Aurenhammer, F., Edelsbrunner, H. (1984): "An optimal algorithm for constructing the weighted Voronoi diagram in the plane", Pattern Recogn., 17(2), pp. 251-257.
[8] Blelloch, G.E., Miller, G.L., Hardwick, J.C., Talmor, D. (1999): "Design and implementation of a practical parallel Delaunay algorithm", Algorithmica, 24(3 & 4), pp. 243-269.
[9] Brown, K.Q. (1979): "Voronoi diagrams from convex hulls", Inf. Process. Lett., 9(5), pp. 223-228.
[10] Chen, M.-B., Chuang, T.-R., Wu, J.-J. (2006): "Parallel divide-and-conquer scheme for 2D Delaunay triangulation", Concurrency and Computation: Practice and Experience, 18, pp. 1595-1612.
[11] Edelsbrunner, H., Seidel, R. (1986): "Voronoi diagrams and arrangements", Discrete Comput. Geom., 1(1), pp. 25-44.
[12] JáJá, J. (1992): "Introduction to Parallel Algorithms", Addison-Wesley, Reading.
[13] Kolingerová, I., Kohout, J. (2002): "Optimistic parallel Delaunay triangulation", Vis. Comput., 18(8), pp. 511-529.
[14] Okabe, A., Boots, B., Sugihara, K. (1992): "Spatial Tessellations: Concepts and Applications of Voronoi Diagrams", 1st edn., Wiley, New York.
[15] O'Rourke, J. (1998): "Computational Geometry in C", 2nd edn., Cambridge University Press, Cambridge.
[16] Preparata, F.P., Shamos, M.I. (1988): "Computational Geometry - An Introduction", 2nd edn., Springer, New York.
[17] Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C. (2001): "Introduction to Algorithms", The MIT Press, London.
[18] Valentine, F.A. (1964): "Convex Sets", McGraw-Hill, New York.