
IBM CELL/B.E. CHALLENGE 2007

Brain Circuit Bottom-Up Engine Simulation and Acceleration using IBM CELL Processor for Vision Applications

Introduction

Humans outperform computers on many natural tasks, including vision. Given the human ability to recognize objects rapidly and almost effortlessly, it is pragmatically sensible to study and attempt to imitate the algorithms used by the brain. Analysis of the anatomical structure and physiological operation of brain circuits has led to the derivation of novel algorithms that, in initial studies, successfully address issues of known difficulty in visual processing. These algorithms are slow on conventional uniprocessors, but, as might be expected of algorithms designed for highly parallel brain architectures, they are intrinsically parallel and lend themselves to efficient implementation across multiple processors. This paper presents an implementation of such parallel algorithms on a CELL processor and demonstrates the potential to support the massive parallelism inherent in these algorithms by exploiting the CELL's parallel instruction set and by further extending it to low-cost clusters built from Sony PlayStation 3 (PS3) consoles. The results show that a parallel implementation can achieve more than a 100x performance improvement on a cluster of PS3s. These initial findings, while highly promising in their own right, also provide a new platform enabling extended investigation of much larger-scale brain circuit models. Early prototyping of such large-scale models has yielded evidence of their efficacy in recognizing time-varying, partially occluded, scale-invariant objects in arbitrary scenes.

Background

The modeling exercises performed in our work focus on simulating a region sensitive to particular shapes made of three line segments (line triples), since preliminary studies suggest their utility, although two or four line segments could just as easily have been used (and indeed provide a subject for future study). These regions represent the crossover from line segment extraction (a subject well studied in contemporary machine vision) to detecting conjunctions of line segments, which are the first levels of the hierarchical representations of the neocortex. Although our understanding of these hierarchies is still in its infancy, they are a crucial milestone on the way to a successful implementation of the further downstream brain areas that operate on abstract concepts and perform decision making.

The best understood encoding scheme used in the brain is the sparse population code (SC, also called LSH). In this scheme, two shapes are only as similar as the number of activated axons they share in their neural activation pattern. Because the code is sparse, most neurons are inactive, and the pattern can therefore be represented as a bit-vector with mostly zero entries. This scheme has been observed to represent information such as joint position, movement intention, location, and, most importantly here, the angle of visual shape orientation.
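To make the shared-active-bits notion concrete, the following is a minimal C sketch (not the authors' code) of comparing two activation patterns packed into 64-bit words; the 20-word size assumes the 1280-bit codes described with Figure 1, and all names are illustrative.

    #include <stdint.h>

    #define SC_WORDS 20   /* a 1280-bit sparse code packed into 20 x 64-bit words */

    /* Similarity of two sparse population codes: the number of active
     * neurons (set bits) that the two activation patterns share.       */
    static unsigned sc_similarity(const uint64_t a[SC_WORDS],
                                  const uint64_t b[SC_WORDS])
    {
        unsigned shared = 0;
        for (int i = 0; i < SC_WORDS; i++)
            shared += (unsigned)__builtin_popcountll(a[i] & b[i]);
        return shared;
    }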
The neocortical computational module whose operation enjoys the most agreement is termed the competitive network. The model is derived from the canonical local circuit of the neocortex. Neurons activate in response to input stimuli falling within their receptive fields (RF), which are simply the specific shapes to which the neurons respond. Roughly speaking, visual stimuli are first sent to excitatory neurons, each of which activates at a level determined by its synapses (memory). These neurons represent the most basic elements, such as arrangements of line triples that are common to very many objects, and they behave as shape detectors. After a training period, these neurons no longer need to learn. This training period corresponds to child development, a process during which half of all synapses die off, and presumably only the strongest remain. Hence, synapses can be modeled as either present or absent, and each neuron's level of activation can be modeled as a bit-vector. In our model, we use 8192 neurons to represent this set and term it RF1. Fig. 1 shows how this component (RF1 Compute) is utilized in the system. Next, inhibitory neurons suppress some of these excitatory neuron activations and can choose the best K neural matches, those with the highest activation levels. This operation is performed by the K-Winner-Take-All (K-WTA) method, which our model incorporates (see the K-WTA block in Fig. 1). Finally, these K winners activate an additional set of excitatory neurons, whose activations can be used in top-down processing to determine what types of objects are present in the field of view. In our version, we utilize 1024 of these neurons, and term them RF2, to define a particular class of objects (see Fig. 1's RF2 Compute block). As new objects are viewed, these neurons learn. Additional classes can be defined by additional sets of neurons. In the actual brain, many classes of objects are recognizable, even new ones, and the number of neurons used is much larger.
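The K-WTA selection can be realized cheaply with a histogram of match values. The sketch below shows the idea in plain C, assuming 8192 RF1 match values bounded by 160 as in Figure 1 and K = 512; the function and variable names are illustrative rather than taken from the actual implementation.

    #include <stdint.h>

    #define N_RF1     8192
    #define MAX_MATCH 160    /* largest possible match value (Fig. 1) */
    #define K_WINNERS 512

    /* Pick a threshold so that roughly K_WINNERS of the N_RF1 match values
     * survive: build a histogram of match values, then walk it from the
     * highest value downwards until at least K entries are covered.       */
    static int kwta_threshold(const uint16_t match[N_RF1])
    {
        uint16_t hist[MAX_MATCH + 1] = { 0 };
        for (int i = 0; i < N_RF1; i++)
            hist[match[i]]++;

        unsigned covered = 0;
        int t = MAX_MATCH;
        while (t > 0 && covered < K_WINNERS)
            covered += hist[t--];
        return t;    /* neurons with match[i] > t are the winners */
    }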
Application Analysis

In this section we describe our systematic approach to mapping the application onto the CELL processor. A simple functional model of the application on the IBM CELL is given in Figure 3 (see the Appendix for the pseudocode and Figure 1 for a detailed description of each stage). The bottom-up engine was first executed on a 2.13 GHz Intel Core2 (E6400) CPU. A fractional breakdown of execution time into functional bottom-up code blocks is shown in Figure 4. Approximately 1.89 ms was required to execute the BU computation for a single line segment triple, which corresponds to a BU computation throughput of about 526 line segment triples per second (526 LST/sec). From Figure 4 we can observe that the RF1 vector computations and RF2 vector computations take more than 95% of the overall execution time; these are the critical functions that need to be optimized and parallelized to increase the throughput of the BU computation. From these results, the runtime for processing an entire video frame with 30,000 line triples can be estimated at approximately one minute (30,000 × 1.89 ms ≈ 57 s), crucially limiting the experiments that can be run on the system and restricting study of the downstream brain areas. To enable experiments using interactive robots, the recognition time of humans (approximately 500 ms) would need to be approached, requiring a speedup of about 120x.

Optimization

Beyond the higher-level parallel models on CELL, we implemented various programming optimizations to improve performance, namely:

• DMA alignment optimization: DMA operations on the CELL can have a size of 1, 2, 4, 8, or 16 bytes, or multiples of 16 bytes. If a particular transaction's address crosses a 128-byte boundary, the transfer requires additional DMA transactions. Hence, by carefully aligning the important data that is communicated regularly, the overall communication bandwidth required by the application can be significantly reduced (see the sketch after this list). If DMA alignment optimization is carried out on too many data sets, however, it results in significant wastage of precious SPE LS memory.

• Mailbox vs. signaling mechanism optimization: The CELL allows various means of communication and synchronization, such as regular DMA operations, mailboxes, and signaling mechanisms. The mailbox mechanism allows 32-bit communication but takes about 0.1 us (taking into account setup and various application code overheads) for each 32-bit transaction. The signal communication mechanism allows 1-bit synchronization between the SPE and the PPE at much higher speed; selecting this mechanism therefore results in lower synchronization cost.

• Branch optimization: The SPE does not have a branch prediction unit and assumes sequential program flow, so pipeline stalls due to mispredicted branches can be very costly (on the order of 18 to 19 cycles). Techniques such as function inlining, loop unrolling, and branch-hinting mechanisms help to reduce the branch misprediction overhead.

• SIMD optimization: After the above optimizations, and using special 128-bit SIMD instructions such as absdiff, abs_sumb, spu_add, and spu_cntb (see IBM, 2005), one RF1 inner-loop computation takes on average 31.2 cycles (the desktop PC version takes 164 cycles) and one RF2 inner-loop computation takes about 246 cycles (the desktop PC version takes about 2879 cycles). Thus, with the superior 128-bit SIMD instruction set and SPE instruction fine-tuning, impressive speedups are possible.
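As an illustration of the DMA alignment point above, the following SPE-side sketch keeps a regularly transferred buffer 128-byte aligned in the local store and fetches it with a single tag-synchronized mfc_get. The buffer size, the effective address in_ea, and all names are illustrative assumptions, not the project's actual code; see IBM (2005) for the full MFC interface.

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK_BYTES 2048    /* a multiple of 128 bytes */

    /* Keeping regularly transferred data 128-byte aligned in the local store
     * means a transfer never straddles a 128-byte boundary, so no extra DMA
     * transactions are generated for it.                                     */
    static volatile uint8_t chunk[CHUNK_BYTES] __attribute__((aligned(128)));

    static void fetch_chunk(uint64_t in_ea)   /* in_ea: 128-byte-aligned effective address */
    {
        const unsigned tag = 0;
        mfc_get(chunk, in_ea, CHUNK_BYTES, tag, 0, 0);   /* start DMA from main memory */
        mfc_write_tag_mask(1 << tag);                    /* wait only on this tag      */
        mfc_read_tag_status_all();                       /* block until the DMA completes */
    }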
Performance

• It should be noted that no special optimizations using SSE2 or MMX instructions were carried out on the desktop PC running the Intel Core2.

TABLE 1: Actual performance of BDV on the desktop PC versus a PS3 with a single CELL. Times are per line-segment triple, in microseconds; the final row is a unitless speedup.

    Function name                   Serial model on 2.13 GHz       Optimized parallel model on
                                    Intel Core2 (E6400) (us)       a single CELL (PS3) (us)
    SC                                       0.89                           0.80
    RF1 Process Input                      603.53                          13.78
    Partial to Global Histogram              0.00                           2.71
    Generate Partial MidVector              13.34                           2.48
    Partial to Global MidVector              0.00                           1.91
    RF2 Process Input                     1276.68                          12.91
    Dump Output                              0.60                           0.75
    Effective time per LST                1895.04                          35.34
    Speedup w.r.t. serial                    1.00                          53.63

Figure 1: The process of converting line-triple representations into detected shapes, with vectors depicted as circles and operations as horizontal lines with a centered symbol. The first step is the output of the thalamocortical and corticocortical loop processes (see text), which convert line segments into a somewhat scale-invariant and translation-invariant bit-vector representation. The representation is a sparse encoding, 160 on-bits out of 1280 bits, representing the angles to the line segment endpoints from some reference point for the line triple (e.g., a corner of its bounding box). Slight changes in the orientation of the line segments, or movement of a line endpoint, lead to decreasing similarity between the bit-vectors derived from the original and modified shapes. The next row depicts the approximately 8,192 vectors of 1280 bits each (RF1 vectors), each of which was previously and maximally trained on a single input. The following rows show the dot-product between the input vector and all RF1 vectors and the resulting matches (8,192 match values, each between 0 and 160). A threshold is then applied for the K-WTA operation, with K=512, yielding a 512-hot 8,192-bit vector that provides the input into the next set of 1,000 vectors (the RF2 vectors, 1024-2048-hot of 8,192, shown in Row 8, trained on multiple inputs and performing more complex shape detection). The final rows show the dot-product operation between this input and all RF2 vectors and the resulting match values, each between 0 and 512, indicating how well each RF2 vector matched the input. These match values were cached to memory for each line triple in a frame, typically numbering 20,000-90,000 for frames with 300-600 line segments. Depending on downstream activity, different match values will suffice to support backward processes searching for particular shapes.

Figure 2: The hierarchical organization of shape detectors operating on the optical character for the number eight. At the first level, line segments are detected, similar to the early visual areas. The second level is the portion of the model most studied in this work; it converts line triples into bit-vectors via an approximation of thalamocortical and corticocortical loops, followed by two layers of feedforward competitive networks. The higher layers are under preliminary study and were implemented such that viewing a recognizable object allowed the activity of shape detectors in one level to be predicted from the activity of shape detectors in the other. Each individual grid square represents a shape processor. Only a few are depicted as active in this example, but all of them could be active simultaneously, performing the high degree of parallel processing responsible for the fast human visual recognition speed. Each shape processor within a level was modeled as detecting the same set of shapes, reducing memory requirements.

Figure 3: Functional model of the Bottom-Up Engine without task-level parallelism. Its stages are (A) Compute SC, (B) RF1 Process Input, (C) RF1 Activate Pattern, (D) RF2 Process Input, and (E) Output Stage.

Figure 4: Performance comparison between the different function blocks of the bottom-up engine on an Intel Core2 E6400 running at 2.13 GHz.

Figure 5: Parallel model for the brain-derived vision algorithm, in which the RF1 Process Input work is replicated across six parallel blocks while Compute SC, global histogram generation, RF1 Activate Pattern, Partial to Global Activation Pattern, and RF2 Process Input remain as the surrounding stages. The shaded box corresponds to the serial portion of the code.

Figure 6: Sample output from the recognition of a stool using the WORD database, running on the PS3/IBM CELL.
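On the PS3, six SPEs are available to applications, which is why the parallel model of Figure 5 replicates the RF1 Process Input block six times. A hypothetical PPE-side launcher for such a layout, using libspe2 with one POSIX thread per SPE, might look as follows; bu_spu_program and the per-SPE work descriptor are placeholder names, not the project's actual symbols.

    #include <libspe2.h>
    #include <pthread.h>

    #define NUM_SPES 6    /* SPEs available to applications on the PS3 */

    extern spe_program_handle_t bu_spu_program;   /* embedded SPE image (placeholder name) */

    static void *run_spe(void *arg)
    {
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        unsigned int entry = SPE_DEFAULT_ENTRY;

        spe_program_load(ctx, &bu_spu_program);
        /* arg would carry this SPE's slice of the line-triple work list */
        spe_context_run(ctx, &entry, 0, arg, NULL, NULL);
        spe_context_destroy(ctx);
        return NULL;
    }

    int main(void)
    {
        pthread_t thread[NUM_SPES];
        for (int i = 0; i < NUM_SPES; i++)
            pthread_create(&thread[i], NULL, run_spe, NULL /* per-SPE work descriptor */);
        for (int i = 0; i < NUM_SPES; i++)
            pthread_join(thread[i], NULL);
        return 0;
    }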
References

1. Ambros-Ingerson, J., Granger, R., and Lynch, G. (1990). Simulation of paleocortex performs hierarchical clustering. Science, 247: 1344-1348.
2. Duda, R., Hart, P., and Stork, D. (2000). Pattern Classification. John Wiley & Sons, Inc.
3. Carmichael, O. (2003). Discriminative Techniques for the Recognition of Complex-Shaped Objects. PhD Thesis, The Robotics Institute, Carnegie Mellon University, Technical Report CMU-RI-TR-03-34.
4. Carmichael, O., and Hebert, M. WORD: Wiry Object Recognition Database. Carnegie Mellon University; http://www.cs.cmu.edu/~owenc/word.htm, retrieved December 20, 2006.
5. Coultrip, R., Granger, R., and Lynch, G. (1992). A cortical model of winner-take-all competition via lateral inhibition. Neural Networks, 5: 47-54.
6. de Cock, E. (2000). YAPI: application modeling for signal processing systems. DAC 2000.
7. Eichenberger, A., O'Brien, K., et al. (2005). Optimizing Compiler for a CELL Processor. IEEE PACT 2005.
8. Gschwind, M., Hofstee, H. P., et al. (2006). Synergistic Processing in CELL's Multicore Architecture. IEEE Micro, 26(2): 10-24.
9. IBM (2005). CELL BE Programming Handbook and Programming Tutorial.
10. Lee, E., and Parks, T. (1995). Dataflow Process Networks. Proceedings of the IEEE, May 1995.
11. Kovesi, P. D. MATLAB and Octave Functions for Computer Vision and Image Processing. School of Computer Science & Software Engineering, The University of Western Australia; http://www.csse.uwa.edu.au/~pk/research/matlabfns/, retrieved December 2006.
12. Perrett, D., and Oram, M. (1993). Neurophysiology of shape processing. Image and Vision Computing, 11: 317-333.
13. Rodriguez, A., Whitson, J., and Granger, R. (2004). Derivation and analysis of basic computational operations of thalamocortical circuits. Journal of Cognitive Neuroscience, 16: 856-877.
14. Wallis, G., and Rolls, E. (1997). A model of invariant object recognition in the visual system. Progress in Neurobiology, 51: 167-194.
15. Wiskott, L., Fellous, J., Kruger, N., and Malsburg, C. (1997). Face recognition by elastic bunch graph matching. In G. Sommer, K. Daniilidis, and J. Pauli, editors, Proc. 7th Intern. Conf. on Computer Analysis of Images and Patterns (CAIP'97), Kiel, number 1296, pages 456-463. Heidelberg: Springer-Verlag.

Appendix: Pseudo-code

Initialize:
    load RF1 with 8192 entries of size 120 bits (ROM)
    load RF2 with 1024 entries of size 8192 bits (RAM)

Capture Video Frame:
    extract 400 to 600 line segments (32 bits each)
    group 20,000 to 100,000 line triples (30 bits each) depending on scene complexity

Calculate LSH:
    let lshVector be a locality sensitive hash (LSH) based on frame and line data

RF1 Compute:
    // Computes the dot product of the RF1 neurons with a 120-bit vector based
    // on the input frame data; additionally computes a threshold so that
    // about 512 RF1 neurons are active
    popRAM[] = array of 8192 elements
    sums[]   = array of 121 elements of 13 bits each
    for i = 1 ... 8192
        popCount = 0
        for j = 1 ... 20
            popCount = popCount + Max(0, Abs(lshVector - RF1[i]))
            advance lshVector and RF1[i] to the next field (shift right)
        popRAM[i] = popCount
        sums[popCount] = sums[popCount] + 1

K-WTA Compute:
    // Computes a vector indicating which RF1 neurons are receptive
    totalSum = 0
    i = 120
    while totalSum < 512 and i > 0
        totalSum = totalSum + sums[i]
        i = i - 1
    threshold = i
    threshold2 = popCount / ...
    for i = 1 ... 8192
        if popRAM[i] > threshold
            midVector[i] = 1
        else
            midVector[i] = 0

RF2 Compute:
    // Computes dot products between the resulting midVector and all RF2
    // neurons; outputs the RF2 index if the population count of the dot
    // product is over threshold2
    popCountROM[] = array of 2^8192 elements of 13 bits each
                    // (the ROM can be distributed to reduce its size)
    currentMax = 0
    for i = 1 ... 1024
        popCount = popCountROM[midVector & RF2[i]]
        if popCount > currentMax
            currentMax = popCount
        if popCount >= threshold2
            output index i
    output threshold2
    output currentMax

Evaluate Learning:
    input the results from RF2 Compute
    analyze highly receptive neurons to determine whether they should learn
    modify RF2 neurons (to be more similar to the frame data) based on learning

RF2 Update:
    if applicable, send updates to RF2 as a result of learning
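For reference, the RF2 Compute block above can be rendered in plain C roughly as follows, with the 8192-bit vectors packed into 64-bit words and a builtin popcount standing in for the conceptual popCountROM lookup. This is a sketch of the same logic, not the optimized CELL code, and the names follow the pseudocode rather than the actual sources.

    #include <stdint.h>
    #include <stdio.h>

    #define N_RF2     1024
    #define RF2_WORDS 128    /* an 8192-bit vector packed into 128 x 64-bit words */

    /* For every RF2 neuron, count how many active RF1 winners (bits of
     * midVector) it shares, report the neurons above threshold2, and
     * track the best match.                                            */
    static unsigned rf2_compute(const uint64_t midVector[RF2_WORDS],
                                const uint64_t rf2[N_RF2][RF2_WORDS],
                                unsigned threshold2)
    {
        unsigned currentMax = 0;
        for (int i = 0; i < N_RF2; i++) {
            unsigned popCount = 0;
            for (int w = 0; w < RF2_WORDS; w++)
                popCount += (unsigned)__builtin_popcountll(midVector[w] & rf2[i][w]);
            if (popCount > currentMax)
                currentMax = popCount;
            if (popCount >= threshold2)
                printf("RF2 neuron %d matched with count %u\n", i, popCount);
        }
        return currentMax;
    }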
