A parallel implementation on modern hardware for geo-electrical tomographical software
ĐẠI HỌC QUỐC GIA HÀ NỘI
TRƯỜNG ĐẠI HỌC CÔNG NGHỆ
Abstract

Geo-electrical tomographical software plays a crucial role in geophysical research. However, imported software is expensive and does not provide much customizability, which is essential for more advanced geophysical study. Besides, these programs are unable to exploit the full potential of modern hardware, so the running time is inadequate for large-scale geophysical surveys. It is therefore an essential task to develop domestic software to overcome all these problems. The development of this software is based on our research in using parallel programming on modern multi-core processors and stream processors for high performance computing. While this project, with its inter-disciplinary aspect, poses many challenges, it has also enabled us to gain valuable insights into making scientific software and especially into the new field of personal supercomputing.
Table of Contents

INTRODUCTION
CHAPTER 1 HIGH PERFORMANCE COMPUTING ON MODERN HARDWARE
1.1 An overview of modern parallel architectures
1.1.1 Instruction-Level Parallel Architectures
1.1.2 Process-Level Parallel Architectures
1.1.3 Data parallel architectures
1.1.4 Future trends in hardware
1.2 Programming tools for scientific computing on personal desktop systems
1.2.1 CPU Thread-based Tools: OpenMP, Intel Threading Building Blocks, and Cilk++
1.2.2 GPU programming with CUDA
1.2.3 Heterogeneous programming and OpenCL
CHAPTER 2 THE FORWARD PROBLEM IN RESISTIVITY TOMOGRAPHY
2.1 Inversion theory
2.2 The geophysical model
2.3 The forward problem by differential method
3.1 CPU implementation
3.2 Example Results
3.3 GPU Implementation using CUDA
List of Acronyms
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
GPU Graphics Processing Unit
OpenMP Open Multi-Processing
OpenCL Open Computing Language
TBB Intel Threading Building Blocks
Introduction

Geophysical methods are based on studying the propagation of the different physical fields within the Earth's interior. One of the most widely used fields in geophysics is the electromagnetic field generated by natural or artificial (controlled) sources. Electromagnetic methods comprise one of the three principal technologies in applied geophysics (the other two being seismic methods and potential field methods). There are many geo-electromagnetic methods currently in use around the world. Of these electromagnetic methods, resistivity tomography is the most widely used and it is of major interest in our work.
Resistivity tomography [17] or resistivity imaging is a method used in exploration geophysics [18] to measure underground physical properties in mineral, hydrocarbon, ground water or even archaeological exploration. It is closely related to the medical imaging technique called electrical impedance tomography (EIT), and mathematically it is the same inverse problem. In contrast to medical EIT, however, resistivity tomography is essentially a direct current method. This method is relatively new compared to other geophysical methods. Since the 1970s, extensive research has been done on the inversion theory for this method and it is still an active research field today. A detailed historical description can be found in [27].
Resistivity tomography surveys searching for oil and gas (left) or water (right)
Resistivity tomography has the advantage of being relatively easy to carry out with inexpensive equipment and therefore has seen widespread use all over the world for many decades.
With the increasing computing power of personal computers, inversion software for resistivity tomography has been developed, most notably Res2Dinv by Loke [5].
According to geophysicists at the Institute of Geology (Vietnam Academy of Science and Technology), the use of imported resistivity software encounters the following serious problems:
The user interface is not user-friendly;
Some computation steps cannot be modified to adapt to measurement methods used in Vietnam;
With large datasets, the computational power of modern hardware is not fully exploited;
High cost for purchasing and upgrading software.
Resistivity software is a popular tool for both short-term and long-term projects in research, education and exploration by Vietnamese geophysicists. Replacing imported software is therefore essential not only to reduce cost but also to enable more advanced research on the theoretical side, which requires custom software implementations. The development of this software is based on research in using modern multi-core processors and stream processors for scientific software. This can also be the basis for solving larger geophysical problems on distributed systems if necessary.
Our resistivity tomographical software is an example of applying high performance computing on modern hardware to computational geoscience. For 2-D surveys with small datasets, sequential programs still provide results in acceptable time. Parallelizing these cases gives faster response times and therefore increases research productivity, but it is not a critical feature. However, for 3-D surveys, datasets are much larger and computationally expensive. One solution for this situation is using clusters. Clusters, however, are not a feasible option for many scientific institutions in Vietnam. They are expensive and have high power consumption. With limited availability only in large institutions, getting access to clusters is also inconvenient. Clusters are not suitable for field trips either, because of difficulties in transportation and power supply. Exploiting the parallel capabilities of modern hardware is therefore a must to enable cost-effective scientific computing on desktop systems for such problems. This can help reduce hardware cost and power consumption and increase user convenience and software development productivity. These benefits are especially valuable to scientific software customers in Vietnam, where cluster deployment is costly in both money and human resources.
Chapter 1 High Performance Computing on Modern Hardware
1.1 An overview of modern parallel architectures
Computer speed is crucial in most software, especially scientific applications. As a result, computer designers have always looked for mechanisms to improve hardware performance. Processor speed and packaging densities have been enhanced greatly over the past decades. However, due to the physical limitations of electronic components, other mechanisms have been introduced to improve hardware performance.
According to [1], the objectives of architectural acceleration mechanisms are to:
decrease latency, the time from start to completion of an operation;
increase bandwidth, the width and rate of operations.
Direct hardware implementations of expensive operations help reduce execution latency. Memory latency has been improved with larger register files, multiple register sets and caches, which exploit the spatial and temporal locality of reference in programs.
For the bandwidth problem, the solutions can be classified into two forms of parallelism: pipelining and replication.
Pipelining [22] divides an operation into different stages to enable the concurrent execution of these stages for a stream of operations. If all of the stages of the pipeline are filled, a new result is available every unit of time it takes to complete the slowest stage. Pipelines are used in many kinds of processors. In the picture below, a generic pipeline with four stages is shown. Without pipelining, four instructions take 16 clock cycles to complete; with pipelining, this is reduced to just 8 clock cycles.
On the other hand, replication duplicates hardware components to enable the concurrent execution of different operations.
Pipelining and replication appear at different architectural levels and in various forms, complementing each other. While numerous, these architectures can be divided into three groups [1]:
Instruction-Level Parallel (Fine-Grained Control Parallelism)
Process-Level Parallel (Coarse-Grained Control Parallelism)
Data Parallel (Data Parallelism)
These categories are not exclusive of each other. A hardware device (such as the CPU) can belong to all three groups.
1.1.1 Instruction-Level Parallel Architectures
There are two common kinds of instruction-level parallel architecture.
The first is superscalar pipelined architectures, which subdivide the execution of each machine instruction into a number of stages. As short stages allow for high clock frequencies, the recent trend has been to use longer pipelines; for example, the Pentium 4 uses a 20-stage pipeline and the latest Pentium 4 core contains a 31-stage pipeline.
Figure 1 Generic 4-stage pipeline; the colored boxes represent instructions
independent of each other [21].
A common problem with these pipelines is branching. When a branch occurs, the processor has to wait until the branch is resolved before it knows which instruction to fetch next. A branch prediction unit is built into the CPU to guess which path will be executed. However, if branches are predicted poorly, the performance penalty can be high. Some programming techniques for making branches in code more predictable for hardware can be found in [2]. Tools such as the Intel VTune Performance Analyzer can be of great help in profiling programs for missed branch predictions.
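As a small illustration (our own sketch, not taken from [2]), consider summing only the positive elements of an array. On random data the branchy version below is mispredicted frequently, while the second version expresses the choice as a conditional expression that compilers can usually turn into a conditional move or SIMD code, avoiding the mispredicted branch:

#include <cstddef>

long sum_positive_branchy(const int* a, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (a[i] > 0)                      // hard to predict when signs are random
            sum += a[i];
    return sum;
}

long sum_positive_branchless(const int* a, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += (a[i] > 0) ? a[i] : 0;      // no data-dependent jump to mispredict
    return sum;
}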
The second kind of instruction-level parallel architecture is VLIW (very long instruction word) architectures. A very long instruction word usually controls 5 to 30 replicated execution units. An example of a VLIW architecture is the Intel Itanium processor [23]. As of 2009, Itanium processors can execute up to six instructions per cycle. In ordinary architectures, superscalar execution and out-of-order execution are used to speed up computing, which increases hardware complexity: the processor must decide at runtime whether instruction parts are independent so that they can be executed simultaneously. In VLIW architectures, this is decided at compile time, which shifts the hardware complexity to software complexity. All operations in one instruction must be independent, so efficient code generation is a hard task for compilers. The problems of writing compilers and porting legacy software to the new architecture have made the Itanium architecture unpopular.
1.1.2 Process-Level Parallel Architectures
Process-level parallel architectures are architectures that exploit coarse-grained control parallelism in loops, functions or complete programs. They replicate complete asynchronously executing processors to increase execution bandwidth and, hence, fit the multiple-instruction-multiple-data (MIMD) paradigm.
Until a few years ago, these architectures consisted of multiprocessors and multicomputers.
A multiprocessor uses a shared memory address space for all processors. There are two kinds of multiprocessors:
Symmetric Multiprocessor or SMP computers: the cost of accessing an address in memory is the same for each processor. Furthermore, the processors are all equal in the eyes of the operating system.
Non-uniform Memory Architecture or NUMA computers: the cost of accessing a given address in memory varies from one processor to another.

In a multicomputer, each processor has its own local memory. Access to remote memory requires explicit message passing over the interconnection network. These systems are also called distributed memory architectures or message-passing architectures. An example is the cluster. A cluster consists of many computing nodes, which can be built using high-performance hardware or commodity desktop hardware. All the nodes in a cluster are connected via InfiniBand or Gigabit Ethernet. Big clusters can have thousands of nodes with special interconnect topologies. Clusters are currently the only affordable way to do large-scale supercomputing at the level of hundreds of teraflops or more.
Figure 2 Example SMP system (left) and NUMA system (right)
A recent derivative of cluster computing is grid computing [19]. While traditional clusters often consist of similar nodes located close to each other, grids incorporate heterogeneous collections of computers, possibly distributed geographically. They are, therefore, optimized for workloads consisting of many independent packets of work. The two biggest grid computing networks are Folding@home and SETI@home (BOINC). Both have a computing capability of a few petaflops, while the most powerful traditional cluster can barely reach over 1 petaflops.
Figure 3 Intel CPU trends [12].
The most notable change to process-level parallel architectures happened in the last few years. Figure 3 shows that although the number of transistors a CPU contains still increases according to Moore's law (doubling roughly every 18 months), the clock speed has virtually stopped rising due to heat and manufacturing problems. CPU manufacturers have now turned to adding more cores to a single CPU while the clock speed stays the same or decreases. An individual core is a distinct processing element and is basically the same as a CPU in an older single-core PC. A multi-core chip can now be considered an SMP MIMD parallel processor. A multi-core chip can run at a lower clock speed and therefore consumes less power, but still provides an increase in processing power.
The latest Intel Core i7-980 (Gulftown) CPU has 6 cores and 12 MB of cache. With hyper-threading, it can support up to 12 hardware threads. Future multi-core CPU generations may have 8, 16 or even 32 cores in the next few years. These new architectures, especially in multi-processor nodes, can provide a level of parallelism that was previously available only to cluster systems.
Figure 4 Intel Gulftown CPU
1.1.3 Data parallel architectures
Data parallel architectures appeared very early in the history of computing. They utilize data parallelism to increase execution bandwidth. Data parallelism is common in many scientific and engineering tasks where a single operation is applied to a whole data set, usually a vector or a matrix. This allows applications to exhibit a large amount of independent parallel workload. Both pipelining and replication have been applied in hardware to utilize data parallelism.
Pipelined vector processors, such as the Cray 1 [15], operate on vectors rather than scalars. After an instruction is decoded, vectors of data stream directly from memory into the pipelined functional units. Separate pipelines can be chained together to achieve higher performance. The translation of sequential code into vector instructions is called vectorization, and vectorizing compilers played a crucial role in programming for vector processors. This significantly pushed the maturity of compilers in generating efficient parallel code.
Through replication, processor arrays can utilize data parallelism: a single control unit orders a large number of simple processing elements to perform the same instruction on different data elements. These massively parallel supercomputers fit into the single-instruction-multiple-data (SIMD) paradigm.
Although both kinds of supercomputers mentioned above have virtually disappeared from common use, they are precursors of current data parallel architectures, most notably CPU SIMD processing and GPUs.
The CPU SIMD extension instruction sets for Intel CPUs include MMX, SSE, SSE2, SSE3, SSE4 and AVX. They allow the CPU to use a single instruction to operate on several data elements simultaneously. AVX, the latest extension instruction set, is expected to be implemented in both Intel and AMD products in 2010 and 2011. With AVX, the size of the SIMD vector registers is increased from 128 bits to 256 bits, which means the CPU can operate on 8 single-precision or 4 double-precision floating point numbers in one instruction. CPU SIMD processing has been used widely by programmers in applications such as multimedia and encryption, and compiler code generation for these architectures is now considerably good. Even though multicore CPUs are now common, understanding SIMD extensions is still vital for optimizing program execution on each CPU core. A good handbook on software vectorization is [1].
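As a small illustration (our own sketch, not from [1]), the SSE intrinsics below add two float arrays four elements at a time using 128-bit vector registers, with a scalar loop for the remaining elements:

#include <xmmintrin.h>   // SSE intrinsics

void add_arrays_sse(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);           // load 4 floats (unaligned)
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  // 4 additions in one instruction
    }
    for (; i < n; ++i)                             // scalar remainder
        c[i] = a[i] + b[i];
}

With AVX, the same pattern would use 256-bit registers (__m256 and _mm256_add_ps) to process 8 single-precision values per instruction.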
However, graphics processing units (GPUs) are perhaps the hardware with the most dramatic growth in processing power over the last few years.

Graphics chips started as fixed-function graphics pipelines. Over the years, these chips became increasingly programmable with newer graphics APIs and shaders. In the 1999-2000 timeframe, computer scientists, along with researchers in fields such as medical imaging and electromagnetics, started using GPUs to run general purpose computational applications. They found that the excellent floating point performance of GPUs led to a huge performance boost for a range of scientific applications. This was the advent of the movement called GPGPU, or General Purpose computing on GPUs.
With the advent of programming languages such as CUDA and OpenCL, GPUs are now easier to program. With processing power of a few Tflops, GPUs are now massively parallel processors at a much smaller scale. They are also termed stream processors, as data is streamed directly from memory into the execution units without the kind of latency found on CPUs. As can be seen in Figure 5, GPUs have currently outpaced CPUs many times over in both speed and bandwidth.
Figure 5 Comparison between CPU and GPU speed and bandwidth (CUDA Programming Guide) [8].
The two most notable GPU architectures now are the ATI Radeon 5870 (Cypress) and the Nvidia GF100 (Fermi) processor.
The Radeon 5870 processor has 20 SIMD engines, each of which has 16 thread processors inside it. Each of those thread processors has five arithmetic logic units, or ALUs. With a total of 1600 stream processors and a clock speed of 850 MHz, the Radeon 5870 has a single-precision computing power of 2.72 Tflops, while top-of-the-line CPUs still have processing power counted in Gflops. Double-precision computing is done at one fifth of the single-precision rate, at 544 Gflops. This card supports both OpenCL and DirectCompute. The dual version, the Radeon 5970 (Hemlock) dual graphics processor, has a single-precision computing power of 4.7 Tflops in a single graphics card with a thermal envelope of less than 300 W. Custom overclocked versions made by graphics card manufacturers can offer even more computing power than the original version.
Figure 6 ATI Radeon 5870 (Cypress) graphics processor
The Nvidia GF100 processor has 3 billion transistors with 15 SM (Streaming Multiprocessor) units, each of which has 32 shader cores, or CUDA processors, compared to 8 in previous Nvidia GPUs. Each CUDA processor has a fully pipelined integer arithmetic logic unit and floating point unit with better standards conformance and a fused multiply-add instruction for both single and double precision. The integer precision was raised from 24 bits to 32 bits, so multi-instruction emulation is no longer required. Special function units in each SM can execute transcendental instructions such as sine, cosine, reciprocal and square root.
Figure 7 Nvidia GF100 (Fermi) processor with parallel kernel execution
Single-precision performance of the GF100 is about 1.7 Tflops, while double-precision performance is about half of that, at 800 Gflops, significantly better than the Radeon 5870. Previous architectures required all SMs in the chip to work on the same kernel (function/program/loop) at the same time; in this generation, the GigaThread scheduler can execute threads from multiple kernels in parallel. This chip is specifically designed to provide better support for GPGPU, with memory error correction, native support for C++ (including virtual functions, function pointers, dynamic memory management using new and delete, and exception handling), and compatibility with CUDA, OpenCL and DirectCompute. A true two-level cache hierarchy is added, with more shared memory than in previous GPU generations. Context switching and atomic operations are also faster. Fortran compilers are available from PGI. Specific versions for scientific computing will have from 3 GB to 6 GB of GDDR5 memory.
1.1.4 Future trends in hardware
Although the current parallel architectures are very powerful, especially for parallel workloads, they will not stay the same in the future. From the current situation, we can identify some trends for hardware in the next few years.
The first is the change in the composition of clusters. A cluster node can now have several multicore processors and some graphics processors. Consequently, clusters with fewer nodes can still have the same processing power. This also allows the maximum limit of cluster processing capabilities to increase. Traditional clusters consisting of only CPU nodes have virtually reached their peak at about 1 Pflops: adding more nodes would result in more system overhead with only a marginal increase in speed, and electricity consumption for such systems is enormous. Supercomputing now accounts for 2 percent of the total electricity consumption of the United States. Building a supercomputer at the exascale (1000 Pflops) using traditional clusters would be far too costly. Graphics processors or similar architectures provide a good Gflops/W ratio and are therefore vital to building supercomputers with larger processing power. The IBM Roadrunner supercomputer [21] using Cell processors is a clear example of this trend.
The second trend is the convergence of stream processors and CPUs.
Graphics cards currently act as co-processors to the CPU in floating point intensive tasks. In the long term, all the functionality of the graphics card may reside on the CPU, just as happened with math co-processors, which are now CPU floating point units. The Cell processor by Sony, Toshiba and IBM is heading in that direction. AMD has also been continuously pursuing this with its Fusion project. The Nvidia GF100 is a GPU with many CPU features such as memory error correction and large caches. Intel's Larrabee experimental project even went further by aiming to produce an x86-compatible GPU that would later be integrated into Intel CPUs. These efforts could all lead to a new kind of processor called an Accelerated Processing Unit (APU).
The third trend is the evolution of multicore CPUs into many-core processors in which the individual cores form a cluster system. In December 2009, Intel unveiled the newest product of its Terascale Computing Research program, a 48-core x86 processor.
Figure 8 The Intel 48-core processor. To the right is a dual-core tile. The processor has 24 such tiles in a 6 by 4 layout.
It is the sequel to Intel's 2007 Polaris 80-core prototype, which was based on simple floating point units. This device is called a "Single-chip Cloud Computer" (SCC). The structure of the chip resembles that of a cluster, with cores connected through a message-passing network with 256 GB/s bandwidth. Shared memory is simulated in software. Cache coherence and power management are also software-based. Each core can run its own OS and software, which resembles a cloud computing center. Each tile (2 cores) can have its own frequency, and groupings of four tiles (8 cores) can each run at their own voltage. The SCC can run all 48 cores at one time over a range of 25 W to 125 W and can selectively vary the voltage and frequency of the mesh network as well as sets of cores. This 48-core device consists of 1.3 billion transistors produced using a 45 nm high-k metal gate process. Intel is currently handing out these processors to its partners in both industry and academia to enhance further research in parallel computing.
Tilera Corporation is also producing processors with one hundred cores. Each core can run a Linux OS independently. The processor also has Dynamic Distributed Cache technology, which provides a fully coherent shared cache system across an arbitrarily sized array of tiles. Programming can be done normally on a Linux derivative with full support for C and C++ and the Tilera parallel libraries. The processor utilizes VLIW (Very Long Instruction Word) with RISC instructions for each core. The primary focus of this processor is networking, multimedia and cloud computing, with a strong emphasis on integer computation to complement the GPU's floating point computation.
From all these trends, it is reasonable to assume that in the near future we will see new architectures that combine features of all current architectures, such as many-core processors where each core has a CPU core and stream processors as co-processors. Such systems would provide tremendous computing power per processor and would cause major changes in the field of computing.
1.2 Programming tools for scientific computing on personal desktop systems
Traditionally, most scientific computing tasks have been done on clusters. However, with the advent of modern hardware that provides a great level of parallelism, many small to medium-sized tasks can now be run on a single high-end desktop computer in reasonable time. Such systems are called "personal supercomputers". Although they have varying configurations, most today employ multicore CPUs with multiple GPUs. An example is the Fastra II desktop supercomputer [3] at the University of Antwerp, Belgium, which can achieve 12 Tflops of computing power. The Fastra II contains six NVIDIA GTX295 dual-GPU cards and one GTX275 single-GPU card, with a total cost of less than six thousand euros. The real processing speed of this system can equal that of a cluster with thousands of CPU cores.
Although these systems are more cost-effective, consume less power and provide greater convenience for their users, they pose serious problems for software developers.
Traditional programming tools and algorithms for cluster computing are not appropriate for exploiting the full potential of multi-core CPUs and GPUs. There are many kinds of interaction between components in such heterogeneous systems. The link between the CPU and the GPUs is the PCI Express bus, and the GPU has to go through the CPU to access system memory. The inside of a multicore CPU is an SMP system. As each GPU has separate graphics memory, the relationship between GPUs is like that in a distributed-memory system. As these systems are in the early stages of development, programming tools do not provide all the functionality programmers need and many tasks still have to be done manually. Algorithms also need to be adapted to the limitations of current hardware and software tools.
In the following parts, we present some programming tools for desktop systems with multi-core CPUs and multiple GPUs that we consider useful for exploiting parallelism in scientific computing. The grouping is only for easy comparison between similar tools, as some tools provide more than one kind of parallelization.
1.2.1 CPU Thread-based Tools: OpenMP, Intel Threading Building Blocks, and Cilk++
Windows and Linux (and other Unixes) provide APIs for creating and manipulating operating system threads using WinAPI threads and POSIX threads (Pthreads), respectively. These threading approaches may be convenient when there is a natural way to functionally decompose an application - for example, into a user interface thread, a compute thread or a render thread.
However, for more complicated parallel algorithms, manually creating and scheduling threads can lead to more complex code, longer development time and suboptimal execution.
The alternative is to program atop a concurrency platform - an abstraction layer of software that coordinates, schedules, and manages the multicore resources.
Using thread pools is a parallel pattern that can provide some improvement. A thread pool is a strategy for minimizing the overhead associated with creating and destroying threads and is possibly the simplest concurrency platform. The basic idea of a thread pool is to create a set of threads once and for all at the beginning of the program. When a task is created, it executes on a thread in the pool and returns the thread to the pool when finished. A problem arises when a task is created and the pool has no thread available: the pool then suspends the task and wakes it up when a thread becomes available. This requires synchronization, such as locks, to ensure atomicity and avoid concurrency bugs. Thread pools are common in the server-client model, but for other tasks, scalability and deadlocks still pose problems. A minimal sketch of such a pool is given below.
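The following is a minimal sketch of this idea using the C++0x/C++11 standard thread library (our own illustration; the class and method names such as ThreadPool and submit are hypothetical). Worker threads are created once and repeatedly take tasks from a shared queue protected by a mutex and a condition variable:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] { worker_loop(); });
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_all();
        for (auto& t : workers) t.join();          // wait for workers to finish
    }
    void submit(std::function<void()> task) {      // hand a task to the pool
        { std::lock_guard<std::mutex> lock(m); tasks.push(std::move(task)); }
        cv.notify_one();
    }
private:
    void worker_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return done || !tasks.empty(); });
                if (done && tasks.empty()) return; // shut down once drained
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();                                // run the task outside the lock
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};

A caller constructs, for example, ThreadPool pool(4); and then calls pool.submit(...) with any callable; real implementations add futures for results, task priorities and careful shutdown handling.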
This calls for concurrency platforms with higher levels of abstraction that provide more scalability, productivity and maintainability. Some examples are OpenMP, Intel Threading Building Blocks, and Cilk++.
OpenMP (Open Multi-Processing) [25] is an open concurrency platform with support for multithreading through compiler pragmas in C, C++ and Fortran. It is an API specification, and compilers can provide different implementations. OpenMP is governed by the OpenMP Architecture Review Board (ARB). The first OpenMP specification came out in 1997 with support for Fortran, followed by C/C++ support in 1998. Version 2.0 was released in 2000 for Fortran and in 2002 for C/C++. Version 3.0, released in 2008, is the current API specification and contains many major enhancements, especially the task construct. Most recent compilers have added some level of support for OpenMP. Programmers inspect the code to find places that require parallelization and insert pragmas to tell the compiler to produce multithreaded code. The resulting code follows a fork-join model: when a parallel section has finished, all the threads join back into the master thread. Workloads in a loop are distributed to threads using work-sharing. There are four kinds of loop workload scheduling in OpenMP:
Static scheduling: each thread is given an equal chunk of iterations.
Dynamic scheduling: iterations are assigned to threads as the threads request them. A thread executes a chunk of iterations (controlled through the chunk size parameter), then requests another chunk until there are no more chunks to work on.
Guided scheduling: almost the same as dynamic scheduling, except that for a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. For a chunk size of k (k > 1), the size of each chunk is determined in the same way, with the restriction that chunks do not contain fewer than k iterations.
Runtime scheduling: if this schedule is selected, the decision regarding the scheduling kind is made at run time. The schedule and (optional) chunk size are set through the OMP_SCHEDULE environment variable.
Beside scheduling clauses, OpenMP also has clauses for data sharing attributes, synchronization, IF control, initialization, data copying, reduction and other concurrent operations.
A typical OpenMP parallelized loop may look like:
#pragma omp for schedule(dynamic, CHUNKSIZE)   // assumes an enclosing #pragma omp parallel region
for (int i = 2; i <= N - 1; i++)
    for (int j = 2; j <= i; j++)
        for (int k = 1; k <= M; k++)
            do_work(i, j, k);                  // placeholder; the loop body is omitted in the original
Figure 9 OpenMP fork-join model.
Intel's Threading Building Blocks (TBB) [4] is an open source C++ template library developed by Intel for writing task-based multithreaded applications, with ideas and models inherited from many previous languages and libraries. While OpenMP uses the pragma approach to parallelization, TBB uses the library approach. The first version came out in August 2006, and since then TBB has seen widespread use in many applications, especially game engines such as Unreal. TBB is available under both a commercial license and an open source license. The latest version, 2.2, was introduced in August 2009. The library has also received a Jolt Productivity award and an InfoWorld OSS award.
It is a library based on generic programming, requires no special compiler support, and is processor and OS independent. This makes TBB ideal for parallelizing legacy applications. TBB has support for Windows, Linux, OS X, Solaris, PowerPC, Xbox, QNX and FreeBSD, and can be compiled using Visual C++, Intel C++, gcc and other popular compilers.
TBB is not a thread-replacement library but provides a higher level of abstraction. Developers do not work directly with threads but with tasks, which are mapped to threads by the library runtime. The number of threads is managed automatically by the library or set manually by the user, just as with OpenMP. Beside the basic loop-parallelizing parallel_for construct, TBB also has parallel patterns such as parallel_reduce, parallel_scan and parallel_do; concurrent data containers including vectors, queues and hash maps; a scalable memory allocator and synchronization primitives such as atomics and mutexes; and pipelining. TBB parallel algorithms operate on the concept of a blocked_range, which represents the iteration space. Work is divided between threads using work stealing, in which the range is recursively divided into smaller tasks. A thread works on the tasks it meets depth first and steals tasks breadth first, following the principles:
take tasks from its own queue;
steal a task from another thread's queue when its own queue is empty.
This can be applied to one, two or three-dimensional ranges, allowing for effective blocking on data structures with more than one dimension. Task stealing also enables better cache utilization and avoids false sharing as much as possible. The figures below illustrate the mechanism behind task stealing.
Figure 10 TBB's range split
Figure 11 TBB’s task stealing illustration
Normally, TBB calls using function objects can be quite verbose, as parallelization is done through libraries. However, with the new C++0x lambda syntax, TBB code can be much shorter and easier to read.
Below is an example TBB call using C++0x lambdas:
void ParallelApplyFoo(float a[], size_t n) {
    parallel_for(blocked_range<size_t>(0, n),
        [=](const blocked_range<size_t>& range) {
            for (size_t i = range.begin(); i != range.end(); ++i)
                Foo(a[i]);
        },
        auto_partitioner());
}
Another platform for multi-threading development is Cilk++. Cilk++ extends the C++ programming language with three new keywords (cilk_spawn, cilk_sync and cilk_for) and with Cilk++ hyperobjects, which can act as global variables yet avoid data races. Parallelized code using Cilk++ can therefore be quite compact.
void matrix_multiply(matrix_t A, matrix_t B, matrix_t C) {
    cilk_for (int i = 0; i < A.rows; ++i) {
        cilk_for (int j = 0; j < B.cols; ++j) {
            for (int k = 0; k < A.cols; ++k)
                C[i][j] += A[i][k] * B[k][j];
        }
    }
}
Matrix multiplication implementation in Cilk++
However, this requires the Cilk++ compiler, which extends a standard C++ compiler such as Visual C++ or GCC. Cilk++ also uses work stealing like TBB for scheduling threads, but with its own modifications; both are based on the work stealing method of the MIT Cilk project. Intel has recently acquired the Cilk++ company, and Cilk++ will be integrated into Intel's line of parallel programming tools together with TBB and other tools.
1.2.2 GPU programming with CUDA
The first GPU platform we present is CUDA [16]. CUDA is Nvidia's parallel computing architecture that allows programmers to write GPGPU applications for Nvidia's graphics card families, including GeForce, Quadro and Tesla. CUDA was introduced in late 2006 and received a lot of attention when released, especially from high performance computing communities. It is now the basis for all other kinds of parallel programming tools on Nvidia hardware. CUDA is currently the most popular tool for programming GPGPU applications, with various research papers, libraries and commercial development tools. Before CUDA, GPGPU was done using graphics APIs. Although the speedup achieved this way was great, there were several drawbacks that limited the growth of GPGPU. Firstly, the programmer was required to possess in-depth knowledge of graphics APIs and GPU architecture. Secondly, problems had to be expressed in terms of vertex coordinates, textures and shader programs, which resulted in greatly increased program complexity, low productivity and poor code maintenance. Thirdly, basic programming features such as random reads and writes to memory were not supported, greatly restricting the programming model and the algorithms available. CUDA has to some extent made GPU programming more accessible to programmers, with a more intuitive programming model and a good collection of libraries for popular tasks.
Figure 12 CUDA processing flow [16]
As CUDA is used for an architecture that supports massively data parallel programming, programming with CUDA is also different from traditional programming on the CPU. The processing flow for CUDA applications is shown in Figure 12.
As the GPU does not have direct access to main memory, data to be processed must be copied from main memory to GPU memory through the PCI Express lanes. The GPU is also not a fully self-controlled processor and needs the CPU to initiate the processing. The GPU then executes the data processing in parallel on its cores using its hardware scheduler. When the processing is done, the processed data is copied back to main memory. All the memory copy operations are much more expensive than