IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, VOL. 6, NO. 3, JUNE 2012 257

BioThreads: A Novel VLIW-Based Chip Multiprocessor for Accelerating Biomedical Image Processing Applications

David Stevens, Vassilios Chouliaras, Vicente Azorin-Peris, Jia Zheng, Angelos Echiadis, and Sijung Hu, Senior Member, IEEE

Abstract—We discuss BioThreads, a novel, configurable, extensible system-on-chip multiprocessor and its use in accelerating biomedical signal processing applications such as imaging photoplethysmography (IPPG). BioThreads is derived from the LE1 open-source VLIW chip multiprocessor and efficiently handles instruction, data and thread-level parallelism. In addition, it supports a novel mechanism for the dynamic creation and allocation of software threads to uncommitted processor cores by implementing key POSIX Threads primitives directly in hardware, as custom instructions. In this study, the BioThreads core is used to accelerate the calculation of the oxygen saturation map of living tissue in an experimental setup consisting of a high speed image acquisition system, connected to an FPGA board and to a host system. Results demonstrate near-linear acceleration of the core kernels of the target blood perfusion assessment with an increasing number of hardware threads. The BioThreads processor was implemented on both standard-cell and FPGA technologies; in the first case and for an issue width of two, full real-time performance is achieved with cores whereas on a mid-range Xilinx Virtex6 device this is achieved with 10 dual-issue cores. An 8-core LE1 VLIW FPGA prototype of the system executed 240 times faster than the scalar Microblaze processor, demonstrating the scalability of the proposed solution relative to a state-of-the-art, FPGA-vendor-provided soft CPU core.

Index Terms—Biomedical image processing, field programmable gate arrays (FPGAs), imaging photoplethysmography (IPPG), microprocessors, multicore processing.

I. INTRODUCTION AND MOTIVATION

BIOMEDICAL
in-vitro and in-vivo assessment relies on the real-time execution of signal processing codes as a key to enabling safe, accurate and timely decision-making, allowing clinicians to make important decisions and perform medical interventions as these are based on hard facts, derived in real-time from physiological data [1], [2].

Manuscript received December 20, 2010; revised May 03, 2011; accepted August 15, 2011. This work was supported by Loughborough University, U.K. Date of publication November 04, 2011; date of current version May 22, 2012. This paper was recommended by Associate Editor Patrick Chiang.
D. Stevens, V. Chouliaras, and V. Azorin-Peris are with the Department of Electrical Engineering, Loughborough University, Leicestershire LE11 3TU, U.K.
J. Zheng is with the National Institute for the Control of Pharmaceutical and Biological Products (NICPBP), China, No.2, Tiantan Xili, Chongwen District, Beijing 100050, China.
A. Echiadis is with Dialog Devices Ltd., Loughborough LE11 3EH, U.K.
S. Hu is with the Department of Electronic and Electrical Engineering, Loughborough University, Leicestershire LE11 3TU, U.K. (e-mail: s.hu@lboro.ac.uk).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TBCAS.2011.2166962

In the area of biomedical image processing, a number of imaging methods have been proposed over the past few years including laser Doppler [3], optical coherence tomography [4] and more recently, imaging photoplethysmography (IPPG) [5], [6]. However, none of these techniques can attain their true potential without a real-time biomedical image processing system based on very large scale integration (VLSI) systems technology. For instance, the quality and availability of physiological information from an IPPG system is directly related to the frame size and frame rate used by the system. From a user perspective, the extent to which such a system can run in real-time is a key factor
in its usability, and practical implementations of the system ultimately aim to be standalone and portable to achieve its full applicability. This is an area where advanced computer architecture concepts, routinely utilized in high-performance consumer and telecoms systems-on-chip (SoC) [7], can potentially provide the required data streaming and execution bandwidth to allow for the real-time execution of algorithms that would otherwise be executed offline (in batch mode) using more established techniques and platforms (e.g., sequential execution on a PC host). A quantitative comparison in this study (results and discussion) illustrates the foreseen performance gains, showing that a scalar embedded processor is six times slower than the single-core configuration of our research platform.

Such SoC-based architectures typically include scalar embedded processor cores with a fixed instruction-set-architecture (ISA) which are widely used in standard-cell (ASIC) [8] and reconfigurable (FPGA)-based embedded systems [9]. These processors present a good compromise for the execution of general-purpose codes such as the user interface, low-level/bandwidth protocol processing, the embedded operating system (eOS) and occasionally, low-complexity signal processing tasks. However, they fall considerably short in high-throughput execution and high-bandwidth data movement, as is often required by the core algorithms in most signal processing application domains. An interesting comparison of the capabilities of three such scalar engines targeting field-programmable technologies (FPGAs) is given in [10]. To relieve this constraint, scalar embedded processors have been augmented with DSP coprocessors in both tightly-coupled [11] and loosely-coupled configurations [12] to target performance-critical inner loops of DSP algorithms. A side-effect of this approach is the lack of homogeneity in the SoC platform programmer's model, which itself necessitates the use of complex 'mailbox-type' [13] communications and the programmer-managed use of multiple address spaces, coherency issues and DMA-driven data flows, typically under the control of the scalar CPU.

1932-4545/$26.00 © 2011 IEEE

Another architectural alternative is the implementation of the core DSP functionality using custom (hardwired) logic. Using established methodologies (register-transfer-level design, RTL) this task involves long development and verification times and results in systems that offer high performance yet are tuned only to the task at hand. Also, these solutions tend to offer little or no programmability, making it difficult to modify them to reflect changes in the input algorithm. In the same architectural domain, the synthesis of such hardwired engines from high level languages (ESL synthesis) is an area of active research in academia [14], [15] (academic efforts targeting ESL synthesis of Ada and C-descriptions). Industrial tools in this area have matured [16]–[18] (commercial offerings targeting C++, C and UML&C++) to the point of competing favorably with hand-coded RTL implementations, at least for certain types of designs [19].

A potent solution to high performance VLSI systems design is provided by configurable, extensible processors [20]. These CPUs allow the extension of their architecture (programmer model and ISA), and microarchitecture (execution units, streaming engines, coprocessors, local memories) by the system architect. They typically offer high performance, full programmability and good post-fabrication adaptability to evolving algorithms through the careful choice of the custom ISA and execution/storage resources prior to committing to silicon. High performance is achieved through the use of custom instructions which collapse data flow graph (DFG) sub-graphs (especially those repeated many times [21]) into one or more multi-input, multi-output (MIMO) instruction nodes. At the
same time, these processors deliver better power efficiency compared to non-extensible processors, via the reduction in the dynamic instruction count of the target application and the use of streaming local memories instead of data caches.

All the solutions to develop high-performance digital engines for consumer, and in this case, biomedical image processing mentioned so far suffer from the need to explicitly specify the software/hardware interface and schedule communications across that boundary. This research proposes an alternative, all-software solution, based on a novel, configurable, extensible VLIW chip-multiprocessor (CMP) based on an open-source VLIW core [22]–[24] and targeting both FPGA and standard-cell (ASIC) silicon. The VLIW architectural paradigm was chosen as such architectures efficiently handle parallelism at the instruction (ILP) and data (DLP) levels. ILP is exploited via the static (compile-time) specification of independent RISC-ops (referred to as "syllables" or RISCops) per VLIW instruction whereas DLP is exploited via the compiler-directed unrolling and pipelining of inner loops (kernels). Key to this is the use of advanced compilation technology such as Trimaran [25] for fully-predicated EPIC architectures or VEX [26] for the partially-predicated LE1 CPU [27], the core element of the BioThreads CMP used in this work. A third form of parallelism, thread level parallelism (TLP), can be explored via the instantiation of multiple such VLIW cores, operating in a shared-memory ecosystem. The BioThreads processor addresses all three forms of parallelism and provides a unique hardware mechanism with which software threads are created and allocated to uncommitted LE1 VLIW cores via the use of custom instructions implementing key POSIX Threads (PThreads) primitives directly in hardware.

A. Multithreaded Processors

Multithreaded programming allows for the better utilization of the underlying multiprocessor system by splitting up sequential tasks such that
they can be performed concurrently on separate CPUs (processor contexts), resulting in a reduction of the total task execution time and/or the better utilization of the underlying silicon. Such threads are disjoint sections of the control flow graph that can potentially execute concurrently subject to the lack of data dependencies. Multithreaded programming relies on the availability of multiple CPUs capable of running concurrently in a shared-memory ecosystem (multiprocessor) or as a distributed memory platform (multicomputer). Both multiprocessors and multicomputers fall into two major categories depending on how threads are created and managed: a) programmer-driven multithreading (potentially with OS software support), known as explicit multithreading, and b) hardware generated threads (implicit multithreading). A very good overview of explicit multithreaded processors is given in [28].

1) Explicit Multithreading: Explicit multithreaded processors are categorized into: a) interleaved multithreading (IMT), in which the CPU switches to another hardware thread at instruction boundaries, thus effectively hiding long-latency operations (memory accesses); b) blocked multithreading (BMT), in which a thread is active until a long-latency operation is encountered; c) simultaneous multithreading (SMT), which relies on a wide (ILP) pipeline to dynamically schedule operations across multiple hardware threads. Explicit multithreaded architectures have the form of either chip multiprocessors (shared-memory ecosystem) or multicomputers (distributed memory ecosystem). In both cases, special OS thread libraries (APIs) control the creation of threads and make use of the underlying multicore architecture (if one is provided) or time-multiplex the single CPU core. Examples of such APIs are POSIX Threads (PThreads) for shared-memory multithreading and MPI for distributed memory multicomputers. PThreads in particular allows for explicit creation, termination, joining and detaching of multiple
threads and provides further support services in the form of mutex and condition variables. Notable machines supporting IMT include the HEP and the Cray MTA. More recent explicit-multithreading VLIW CMPs include amongst others the SiliconHive HIVEFLEX CSL2500 Communications processor (multicomputer architecture) [29] and the Fujitsu FR1000 VLIW media multicore (multiprocessor architecture) [30]. In the academic world, the most notable offerings in the re-configurable/extensible VLIW domain include the tightly-coupled VLIW/datapath architecture [31] and the ADRES architecture [32]. In the biomedical signal processing domain very few references can be found; a CMP architecture based on a commercial VLIW core was used for the real-time processing of 12-lead ECG signals in [33].

2) Implicit Multithreading: Prior research in hardware managed threads (implicit multithreading) includes the SPSM and WELD architectures [34]–[36]. The single-program speculative multithreading (SPSM) method uses fork and merge operations to reduce execution time. Extra work by the compiler is required to find code blocks which are data independent; when such blocks are found, the compiler inserts extra instructions to inform the hardware to run the data independent code concurrently. When the executing thread (master) reaches a fork instruction, a second thread is started at another location in the program. Both threads then execute and when the master thread reaches the location in the program from which the second thread started, the two threads are merged together. The WELD architecture uses branch prediction as a method of reducing the impact of pipeline restarts due to control flow changes. In modern pipelined processors, a taken branch requires the pipeline to be restarted and the instructions in the branch shadow to be squashed, resulting in wasted issue slots. A way around this inefficiency is to run
two or more threads concurrently, one executing the code for the taken branch and the other the code for the not-taken branch (thus following both control flow paths). Later on, when it is discovered whether a branch is definitely taken or not taken, the correct speculative thread is chosen (and becomes definite) whereas the incorrect thread is squashed. This removes the need to re-fill the pipeline with the correct instructions as both branch paths are concurrently executed. This method requires extra work by the compiler, which introduces extra instructions (fork/bork) to inform the processor that it needs to run both branch paths as separate threads.

B. The BioThreads CMP

The BioThreads VLIW CMP is termed a hardware-assisted, explicit multithreaded architecture (software threads are user specified, thread management is hardware based) and is differentiated from other offerings in that area via a) its hardware PThreads primitives and b) its massive scalability, which can range from a single-thread, dual-issue core to a theoretical maximum of 4 K (256 contexts × 16 hypercontexts) shared memory hardware threads in each of the maximum 256 distributed memory multicomputers, for a theoretical total of 1 M threads, on up to 256-wide (VLIW issue slots) cores. Clearly these are theoretical maxima as in such massively-parallel configurations, the latency of the memory system (within the same shared memory multiprocessor) is substantially increased, potentially resulting in sub-optimal single-thread performance unless aggressive compiler-directed loop unrolling and pipelining is performed.

C. Research Contributions

The major contributions of this research are summarized as follows:
a) A configurable, extensible, chip-multiprocessor has been developed based on the open-source LE1 VLIW CPU, capable of performing key PThreads primitives directly in hardware. This feature is unique to the LE1 (and BioThreads) engine and differentiates it from other key research such as hardware primitives for remote memory access [37]. In that respect, the BioThreads core can be thought of as a hybrid between an OS and a collection of processors, delivering services (execution bandwidth and thread handling) to a higher order system and moving towards the real-time execution of compute-bound biomedical signal processing codes.
b) The use of such a complex processing engine is advocated in the biomedical signal processing domain, for tasks such as the real-time blood perfusion calculation. Its inherent, multiparallel scalability allows for the real-time calculation of key computational kernels in this domain.
c) A unified, software-hardware flow has been developed so that all algorithm development takes place in the MATLAB environment, followed by automatic C-code generation and its introduction to the LE1 tool chain. This is a well encapsulated process which ensures that the biomedical engineer is not exposed to the intricacies of real-time software development for a complex, multicore, SoC platform; at the same time this methodology results in a working embedded system directly implementing the algorithmic functionality specified in the MATLAB input description with minimum user guidance.

II. THE BIOTHREADS ENGINE

The BioThreads CMP is based on the LE1 open-source processor, which it extends with execution primitives to support high speed image processing and dynamic thread allocation and mapping to uncommitted CPU cores. The BioThreads architecture specifies a hybrid, shared-memory multiprocessor/distributed memory multicomputer. The multiprocessor aspect of the BioThreads architecture falls in between the two categories (explicit and implicit) as it requires the user to explicitly identify the software threads in the code but at the same time, implements hardware support for the creation/management/synchronization/termination of such threads. The thread management in the LE1 provides full hardware support for key PThread primitives such as pthread_create/join/exit and pthread_mutex_init/lock/trylock/unlock/destroy.
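
To make the software view of these primitives concrete, the following is a minimal sketch using the standard POSIX Threads API on a conventional host; it is not LE1 tool-chain code and says nothing about how the hardware services the calls. The names `slice_t`, `sum_slice`, `parallel_sum` and `MAX_THREADS` are hypothetical and introduced only for illustration. Each thread sums one slice of an array and adds its partial result to a shared total under a mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16   /* hypothetical upper bound for this sketch */

/* Hypothetical work descriptor: one thread sums data[start..end). */
typedef struct {
    const int *data;
    size_t start, end;
    long *total;             /* shared accumulator */
    pthread_mutex_t *lock;   /* protects *total */
} slice_t;

/* Thread entry point: sum the slice locally, then update the shared total. */
static void *sum_slice(void *arg)
{
    slice_t *s = (slice_t *)arg;
    long partial = 0;
    for (size_t i = s->start; i < s->end; i++)
        partial += s->data[i];

    pthread_mutex_lock(s->lock);
    *s->total += partial;
    pthread_mutex_unlock(s->lock);
    return NULL;             /* same effect as pthread_exit(NULL) */
}

/* Sum n integers with n_threads threads. Returns 0 on success, -1 on error. */
int parallel_sum(const int *data, size_t n, int n_threads, long *out)
{
    assert(n_threads > 0 && n_threads <= MAX_THREADS);

    pthread_t tid[MAX_THREADS];
    slice_t job[MAX_THREADS];
    pthread_mutex_t lock;
    long total = 0;
    int created = 0, rc = 0;

    if (pthread_mutex_init(&lock, NULL) != 0)
        return -1;

    for (int t = 0; t < n_threads; t++) {
        /* Contiguous, non-overlapping slices that together cover [0, n). */
        job[t] = (slice_t){ data, n * t / n_threads,
                            n * (t + 1) / n_threads, &total, &lock };
        if (pthread_create(&tid[t], NULL, sum_slice, &job[t]) != 0) {
            rc = -1;
            break;
        }
        created++;
    }

    /* Wait for every thread that was actually started. */
    for (int t = 0; t < created; t++)
        pthread_join(tid[t], NULL);

    pthread_mutex_destroy(&lock);
    if (rc == 0)
        *out = total;
    return rc;
}
```

Calling `parallel_sum(data, n, 4, &result)` exercises create, join and the mutex init/lock/unlock/destroy calls named in the text; on a general-purpose host these are library and OS services, which is the overhead a hardware implementation of the same primitives would aim to avoid.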
This is achieved with a hardware block, the thread control unit (TCU), whose purpose is to service these custom hardware calls and to start and stop the execution of multiple LE1 cores. The TCU is an explicit serialization point for which multiple contexts (cores) compete; PThreads command requests are internally serialized and the requesting contexts are served in turn. The use of the TCU removes the overhead of an operating system for the LE1, as low-level PThread services are provided in hardware; a typical pthread_create instruction completes in less than 20 clocks. This is a unique feature of the LE1 VLIW CMP and the primary differentiator from other VLIW multicore engines.

Fig. depicts a high level overview of the BioThreads engine. The main components are the scalar platform, consisting of a service processor (the Xilinx Microblaze, 5-stage pipeline 32-bit CPU), its subsystem based on the CoreConnect [38] bus architecture, and finally, the LE1 chip multiprocessor (CMP) which executes the signal processing kernels. Fig. depicts the internal organization of a single LE1 context.

Fig. BioThreads Engine showing LE1 cores, memory subsystem and overall architecture.

Fig. Open-source LE1 Core pipeline organization.

• The CPU consists of the instruction fetch engine (IFE), the execution core (LE1_CORE), the pipeline controller (PIPE_CTRL) and the load/store unit (LSU). The IFE can be configured with an instruction cache or alternatively, a closely-coupled instruction RAM (IRAM). These are accessed every cycle and return a long instruction word (LIW) consisting of multiple RISCops for decode and dispatch. The IFE controller handles interfacing to the external memory for ICache refills and provides debug capability into the ICache/IRAM. The IFE can also be configured with a branch predictor unit, currently based on the 2-bit saturating counter scheme (Smith predictor) in both set-associative and fully-associative (CAM-based) organization.
• The LE1_CORE block includes the main execution datapaths of the CPU. There are a configurable number of clusters, each with its own register set. Each cluster includes an integer core (SCORE), a custom instruction core (CCORE) and optionally, a floating point core (FPCORE). The integer and floating-point datapaths are of unequal pipeline depth; however, they maintain a common exception resolution point to support a precise exception programmer's model.
• PIPE_CTRL is the primary control logic. It is a collection of interlocked, pipelined state machines, which schedule the execution datapaths and monitor the overall instruction flow down the processing and memory pipelines. PIPE_CTRL maintains the decoding logic and control registers of the CPU and handshakes the host during debug operations.
• The LSU is the primary path of the LE1_CORE to the system memory. It allows for up to ISSUE_WIDTH (VLIW architectural width) memory operations per cycle and directly communicates with the shared data memory (STRMEM). The latter is a multibank, or 3-stage pipelined cross-bar architecture which scales reasonably well (in terms of speed and area) for up to 8 clients, 8 banks (8 × 8), as shown in Table I and Table II. Note that the number of such banks and the number of LSU clients (LSU_CHANNELS) are not necessarily equal, allowing for further microarchitecture optimizations. This STRMEM block organization is depicted in Fig.
• Finally, to allow for the exploitation of shared-memory TLP, multiple processing cores can be instantiated in a CMP configuration as shown in Fig. The figure depicts a dual-LE1, single-cluster BioThreads system interfacing to the common streaming data RAM.

III. THE THREAD CONTROL UNIT

Thread management and dynamic allocation to hardware contexts takes place in the TCU. This is a set of hierarchical state machines, responsible for the management of software threads and their allocation to execution resources (HC, hypercontexts). It
accepts PThreads requests from either the host or any of the executing hypercontexts. It maintains a series of hardware (state) tables, and is a point of synchronization amongst all executing hypercontexts. Due to the need to directly control the operating mode of every hypercontext (HC) while having direct access to the system memory, the TCU resides in the DEBUG_IF (Fig. 1), where it makes use of the existing hardware infrastructure to stop and start hypercontexts, read/modify/write memory and communicate with the host. A critical block in thread management is the Context TCU, which manages locally (per context, in the PIPE_CTRL block) the distribution of PThreads instructions to the centralized TCU. Each clock, one of the active HCs in a context arbitrates for the use of the context TCU. When granted access, the command requested is passed on to the TCU residing in the DBG_IF for centralized processing. Upon completion of the PThreads command, the Context TCU returns (to the requesting HC) the return values, as specified by that command. Fig. depicts the thread control organization in the context of a single shared-memory system. The figure depicts a system containing contexts (0 through ). For simplicity, each context contains two hypercontexts (HC0, HC1) and has direct access to the system-wide STRMEM for host-initiated DMA transfers and/or recovering the argument for void pthread_exit(void *value_ptr). The supported commands are listed in Table III.

TABLE I. BIOTHREADS REAL-TIME PERFORMANCE (DUAL-ISSUE LE1 CORES, FPGA AND ASIC)

TABLE II. BIOTHREADS REAL-TIME PERFORMANCE (QUAD-ISSUE LE1 CORES, FPGA AND ASIC)

Fig. Multibank streaming memory subsystem of the BioThreads CMP.

IV. BIOMEDICAL SIGNAL PROCESSING APPLICATION: IPPG

The application area selected to deploy the BioThreads VLIW-CMP engine for real-time biomedical signal processing was photoplethysmography (PPG), which is the measurement of blood volume changes in living tissue
using optical means. PPG is primarily used in Pulse Oximetry for the point-measurement of oxygen saturation. In this application, PPG is implemented from an area measurement. The basic concept of this implementation, known as imaging PPG (IPPG), is to illuminate the tissue with a homogeneous, nonionizing light source and to detect the reflected light with a 2D sensor array. This yields a sequence of images (frames) from which a map (over the illuminated area) of the blood volume changes can be generated, for subsequent extraction of physiological parameters. The use of multiple wavelengths in the light source enables the reconstruction of blood volume changes at different depths of the tissue (due to the different penetration depth of each wavelength), yielding a 3D map of the tissue function. This is the principle of operation of real-time IPPG [39]. Such functional maps have numerous applications in clinical diagnostics, including the assessment of the severity of skin burns or wounds, of cardiovascular surgical interventions and of overall cardiovascular function.

Fig. BioThreads CMP, two LE1 cores connected via the streaming memory system.

Fig. Multibank streaming memory subsystem of the BioThreads CMP.

Fig. Imaging PPG System Architecture.

TABLE III. LE1 PTHREADS HARDWARE SUPPORT

The overall IPPG system architecture is depicted in Fig. A reflection-mode IPPG setup was deployed for the validation experiment in this investigation, the basic elements of which are a ring-light illuminator with arrays of red (660 nm) and infrared (880 nm) LEDs, a lens and a high sensitivity camera [40] as the detecting element, and the target skin tissue as defined by its optical coefficients and geometry. The use of the fast digital camera enables the noncontact measurement at a sufficiently high sampling rate to allow PPG signal reconstruction from a large and homogeneously illuminated field of view at more than one
wavelength, as shown in Fig. The acquisition hardware synchronizes the illumination unit with the camera in order to perform multiplexed acquisition of a sequence of images of the area of interest. For optimum signal quality during acquisition, the subject is seated comfortably and asked to extend their hand evenly onto a padded surface, and ambient light is kept to a minimum.

Fig. Schematic diagram of IPPG setup including the dual wavelength LED ring light, lens, CMOS camera and subject hand.

The IPPG system typically consists of three processing stages (pre, main, post) as shown in Fig. Preprocessing comprises an optional image stabilization stage, whose use depends largely on the quality of the acquisition setting. It is typically performed on each whole raw frame prior to the storage of raw images in memory and it can be implemented using a region of interest of a fixed size, meaning that its processing time is a function of frame rate but is independent of raw image size. The main processing stage comprises the conversion of raw data into the frequency domain, which requires a minimum number of frames (i.e., samples). Conversion to the frequency domain is performed once for every pixel position in the raw image, and the number of data points per second of time-domain data to convert is determined by the frame rate, meaning that the processing time for this stage is a function of both frame rate and frame size. The postprocessing stage comprises the extraction of application-specific physiological parameters from time or frequency domain data and consists of operations such as statistical calculations and unit conversions (scaling and offsetting) of the processed data, which require relatively low processing power. The ultimate scope of this experiment was to evaluate the performance of the BioThreads engine as a signal processing platform for IPPG, which was achieved by simplifying the processing workflow of the optophysiological assessment system [39].
Having established that image stabilization is easily scalable as it is performed frame-by-frame on a fixed-size segment of the data, the preprocessing stage was disregarded for this study. The FFT cannot be performed point-by-point, and thus poses the most significant constraint to the scalability of the system. The main processing stage was thus targeted in this study as the representative process of the system, and the resultant workflow consisted of the transformation of detected blood volume changes in living tissue to the frequency domain via FFT followed by extraction of physiological parameters for blood perfusion mapping. By employing the principles relating to photoplethysmography (PPG), blood perfusion maps were generated from the power of the PPG signals in the frequency domain.

Fig. Complete signal processing workflow in IPPG setup.

The acquired image frames were processed as follows:
a) Two seconds' worth of image data (60 frames of size 64 × 64 pixels at 8-bit resolution) were recorded with the acquisition system.
b) The average fundamental frequency of the PPG signal was manually extracted (1.4 Hz).
c) Data were streamed to the BioThreads platform and the 64-point fast Fourier transform (FFT) of each pixel was calculated. This was done by taking the pixel values of all image frames for a particular pixel position to form a pixel value vector in the time domain.
d) The power of the FFT tap corresponding to the PPG fundamental frequency was copied into a new matrix at the same coordinates as the pixel (or pixel cluster) under processing. In the presence of blood volume variation at that pixel (or pixel cluster) the power would be larger than if there was no blood volume variation.

Repeating (d) for all the remaining pixels (clusters) provides a new matrix (image) whose elements (pixels) depend on the detected blood volume variation power. This technique allows the generation of a blood perfusion
map, as a high PPG power can be attributed to high blood volume variation and ultimately to blood perfusion. Fig. illustrates in a simplified diagrammatic representation the algorithm discussed above, and Fig. 10 illustrates the output frame after the full image processing stage.

Fig. High-level view of the signal processing algorithm.

Fig. 10. (A) Original image. (B) Corresponding ac map (Mean AC). (C) Corresponding ac power map at heart rate (HR).

The algorithm was prototyped in the MATLAB environment and subsequently translated to C using the embedded MATLAB compiler (emlc) for compilation on the VLIW CMP. The emlc enables the output of C from MATLAB functions. Using emlc, the required function is run with the example data and C code is generated by MATLAB. In this example the data was a full dataset of two seconds' worth of images in a one dimensional array. The generated C code is a function which can be compiled and run in a handheld (FPGA-based) system. Alongside this function a driver function was written to set up the input data and call the function. The function computes the whole frame. To be able to split this over multiple LE1 cores, the code was modified to include a start and an end value. This was a simple change which involved altering the loop variables within the C code. In MATLAB there was an issue exporting a function which used loop variables that were passed to the function. These values are therefore computed by the driver function and passed to the generated function.

Example of code (pseudocode):

Generated by MATLAB:
    autogen_func(inputArray, outputArray);

Altered to:
    autogen_func(inputArray, outputArray, start, end);

Driver code (the loop bounds and the thread-creation step are assumptions, as only the loop skeleton is legible):
    main() {
        for (loop = 0; loop < number_of_threads; loop++) {
            start = /* first index handled by this thread */;
            end = /* last index handled by this thread */;
            /* create a thread that runs autogen_func(inputArray, outputArray, start, end) */
        }
    }

This way the number_of_threads constant can be easily altered and the code does not need to be rewritten or re-exported. Both the MATLAB generated function and the driver function are then compiled using the LE1 tool-chain to create the machine code to run in simulation as well as on silicon. The
system front-end of Fig. 11 is implemented in LabVIEW and executes on the host system. The acquisition panel is used to control the acquisition of frames and to send the raw imaging data to the FPGA-resident BioThreads engine for processing. Upon completion, the processed frame is returned for display in the analysis panel, where the raw data is also accessible for further analysis in the time-domain.

Fig. 11. LabVIEW-based host application for hardware control (left) and analysis (right).

V. RESULTS AND DISCUSSION

This section presents the results of a number of experiments when using the BioThreads platform for the real-time blood-volume change calculation. These results are split into two major sections: A) performance (real-time) results, pertaining to the real-time calculation of the blood perfusion map, and B) SoC platform results. The latter include data such as area and maximum frequency when targeting a Xilinx Virtex6 LX240T FG1156 [41] FPGA and a 0.13 µm, 1-poly, 8-metal (1P8M) standard-cell process. It should be noted that the FPGA device is on a near-state-of-the-art silicon node (40 nm, TSMC) whereas the available standard-cell library in our research lab is rather old. As such, the performance differential (300 MHz for the standard-cell compared to the 100 MHz for the FPGA target in Tables I and II) is certainly not representative of that expected when targeting a standard cell process at an advanced silicon node (40 nm and below).

A. Performance Results

The 60 frames were streamed onto the target platform and the processors started executing the IPPG algorithms. Upon completion, the computed frame was returned to the host system for display. Table I shows the real execution time for the 2-wide LE1 system (VLIW CMP consisting of dual-static-issue cores) of the FPGA platform; ASIC results were obtained from simulation. The results shown in the tables are arranged in columns under the following
headings:
• Config: The macroarchitecture of the BioThreads engine, given as LE1_CORES × MEMORY_BANKS.
• LE1 Cores: The number of execution cores in the LE1 subsystem.
• Data Memory Banks: The number of memory banks as depicted in Fig (and thus, the maximum number of concurrent load/store operations supported by the streaming memory system). This plays a major role in the overall system performance, as will be shown below.

The remaining three parameters were measured on an FPGA platform (100 MHz LE1 subsystem and service processor, as shown in Fig. 1) or derived from RTL simulations (ASIC implementation). Both sets of results were obtained without and with custom instructions for accelerating the FFT calculation.
• Cycles: The number of clock cycles taken to execute the core signal processing algorithm.
• Real time (sec): The real time taken to execute the algorithm (measured by the service processor for FPGA targets, calculated by RTL simulation for the ASIC target).
• Speedup: The relative speedup of a BioThreads configuration compared to the degenerate case of a single-context, single-bank FPGA solution without the FFT custom instructions.

Fig. 12. Speedup of BioThreads performance for 2-wide LE1 subsystem (FPGA and ASIC).

A previous study on signal processing kernel acceleration (FFT) on the LE1 processor [42] concluded that the best performance is achieved with user-directed function in-lining, compiler-driven loop unrolling and custom instructions. Using these methods, an 87% cycle reduction was achieved, making it possible to execute the IPPG algorithm steps in real-time.

A subset of the speedup results of Table I (dual-issue performance) is plotted in Fig. 12. Speedup was calculated relative to the 1 × 1 reference configuration with no optimizations. As shown, there is a maximum speedup of 125 on the standard-cell target and 41 on the FPGA target (the ASIC implementation is faster than the FPGA), with full optimizations and custom
instructions. With respect to the number of memory channels, best performance is achieved when the number of cores in the BioThreads engine equals the number of memory banks. This is expected, as memory bank conflicts increase when the cores outnumber the banks.

Fig. 13 shows the speedup values for the quad-issue configurations (subset of results from Table II). The shape of the graph follows a trend similar to that of the dual-issue results, and the same dependency on the number of memory banks is clearly seen.

The “Real Time (sec)” values highlighted in grey in Tables I and II identify the BioThreads configurations whose processing time is less than the acquisition time. These 14 configurations are instances of the BioThreads processor that achieve real-time performance. On an FPGA target, real-time is achieved with quad-core LE1’s and a 4/8-bank memory system. The FPGA device chosen (Virtex6 LX240T) can accommodate only such LE1 cores (along with the service processor system) and thus cannot meet the real-time constraint. For the dual-issue configuration, however, near-real-time performance is achieved with cores and a or 8-bank streaming memory (acquisition time of 2.00 s, processing of 2.85 s and 2.16 s respectively). Finally, full real-time is achieved cores memory banks. These configurations can be accommodated on the target FPGA and thus are the preferred configurations of the BioThreads engine for this study.

TABLE IV. LE1 COMPARISON WITH THE VEX VLIW/DSP

Fig. 13. Speedup of BioThreads performance for 4-wide LE1 subsystem (FPGA and ASIC).

On a rather old standard-cell technology, both dual- and quad-issue configurations achieve the required post-layout speed of 300 MHz. In the first case, cores and a 2-bank memory system are sufficient; for quad-issue cores a core by 1-bank system is sufficient. Configurations around the acquisition time of 2 sec could be considered as viably ‘real-time’ as the subsecond time difference may not be
noticeable in the clinical environment or be important in the final use of the processed data.

An interesting comparison for the FPGA targets is with the Microblaze processor provided by the FPGA vendor. The latter is a 5-stage, scalar processor (similar to the ARM9 in terms of microarchitecture) with 32 K instruction and data caches. A write-through configuration was selected for the data cache to reduce the PLB interconnect traffic. The code was recompiled for the Microblaze using the gcc toolchain with -O3 optimizations and run in single-thread mode, as that processor is not easily scalable; running it in PThreads mode involves substantial OS intervention, so the total runtime is longer than that for a single thread. The application took 864 sec (0.07 fps) to execute on a 62.5 MHz FPGA board, which is 9.6 times slower than the dual-issue reference 100 MHz single-core LE1 (Table I, 1 × 1 configuration, 90.17 s). Extrapolating this figure to the LE1 clock of 100 MHz and assuming no memory subsystem degradation (an optimistic assumption) shows that the Microblaze processor is times slower than the reference LE1 configuration (0.11 fps). Finally, compared to the maximal dual-issue (configuration 8) of Table I with full optimizations, the scalar processor proved to be more than 240 times slower (525.00 s versus 2.16 s).

A final evaluation of the performance of the BioThreads platform was carried out against the (simulated) VEX VLIW/DSP. The VEX ISA is closely related to the Multiflow TRACE architecture, and a commercial implementation of the ISA is the ST Microelectronics ST210 series of VLIW cores. The VLIW ISA has good DSP support since it includes 16/32-bit multiplications, local data memory, instruction and data prefetching and is supported by a direct descendant of the Multiflow trace scheduling compiler. The simulated VEX processor configuration included a 32 K, 2-way ICache and a 64 K, 4-way, write-through data cache whereas the BioThreads configuration used included a
direct instruction and data memory system. Both processors executed the single-threaded version of the workload, as there was no library/OS support in VEX for the PThreads primitives included in the BioThreads core. Table IV depicts the comparison, in which it is clear that the local instruction/data memory system of the BioThreads core plays an important role in the latter achieving 62.00% (2-wide) and 54.21% (4-wide) better clock-cycle counts compared to the VEX CPU.

As with any engineering study, the chosen system very much depends on tradeoffs between speed and area costs. These are detailed in the next section.

1) VLSI Platform Results: This section details the VLSI implementation of a number of configurations of interest of the BioThreads engine on both a standard-cell process (TSMC 0.13 µm, 1-poly, 8-metal, high-speed process) and a 40 nm state-of-the-art FPGA device (Xilinx Virtex6 LX240T-FF1156-1).

2) Xilinx Virtex6 LX240T-FF1156-1: To provide further insight into the use of advanced system FPGAs when implementing a CMP platform, the BioThreads processor was targeted to a midrange device from the latest Xilinx Virtex6 family. The following configurations were implemented:
a) Dual-issue BioThreads configurations (2 IALUs, IMULTs, LSU_CHANNELS per LE1 core, 4-bank STRMEM): 1, 2, and LE1 cores, each with a private 128 KB IRAM and a shared 256 KB DRAM, for a total memory of 384 KB, 512 KB, 1024 KB, and 1.5 MB respectively.
b) Quad-issue BioThreads configurations (4 IALUs, IMULTs, LSU_CHANNEL per LE1 core, 8-bank STRMEM): 1, and LE1 cores. In this case, the same amount of IRAM and DRAM was chosen, for a total memory of 384 KB, 512 KB, and 1024 KB respectively.

Both dual- and quad-issue configurations included a Microblaze 32-bit scalar processor system (5-stage pipeline, no FPU, no MMU, K ICache and DCache/write-through, acting as the service processor) with a 32-bit single PLB backbone, to interface to the camera, stream in frames, initiate execution on the BioThreads processor and extract
the processed results (oxygen map) upon completion. Note that the service processor in these implementations is lower-spec than the standalone Microblaze processor used in the performance comparison at the end of the previous section, as the Microblaze does not participate in the computations in this case. The Xilinx Kernel RTOS (Xilkernel) was used to provide basic software services (stdio) to the service processor and thus to the rest of the BioThreads engine. Finally, the calculated oxygen map was returned to the host system for display purposes in the LabVIEW environment.

TABLE V. 2-ISSUE BIOTHREADS CONFIGURATIONS ON VIRTEX6 LX240T

TABLE VI. 4-ISSUE BIOTHREADS CONFIGURATIONS ON VIRTEX6 LX240T

The dual-issue configurations achieved the requested operating frequency of 100 MHz, which is the limit at which both the Microblaze platform and the BioThreads engine can operate with the external DDR3. The 4-wide configurations, however, exposed a critical path at that target frequency, resulting in a lower overall system speed of 83.3 MHz. For the case where only the on-board block RAM is used, the operating frequency is significantly increased.

Table V depicts the post-synthesis results for the dual-issue platforms whereas Table VI depicts the synthesis results for the quad-issue platforms. The tables include the number of LE1 cores in the BioThreads engine (Contexts), the absolute and relative (for the given device) number of LUTs, flops and block RAMs used, and the number of LUTs and flops per the total number of issue slots for the VLIW CMP. The latter two metrics are used to compare the hardware efficiency of the 2-wide and 4-wide configurations. A number of interesting conclusions are drawn from the data. The 2-wide LE1 core is 27.12% to 29.60% more efficient (LUTs per issue slot) compared to the 4-wide core. This is expected and is attributed to the overhead of the pipeline bypass network. The internal
bypassing for one operand is shown in Fig. 14; however, the LSU bypass paths are not depicted in the figure for clarity. The quad-issue configurations exhibit substantial multiplexing overhead as the number of source operands is doubled ( 32-bit operands per VLIW core), with each tapping to ISSUE_WIDTH * (IALU) + ISSUE_WIDTH * (IMULT) + ISSUE_WIDTH * 2 (LSU) result buses.

Fig. 14. Pipeline bypass taps as a function of IALUs and IMULTs.

In terms of absolute silicon area, the Virtex6 device can accommodate up to 11 dual-issue LE1 cores (22 issue slots per cycle) and up to quad-issue cores ( issue slots per cycle), with the service processor subsystem present in both cases; the quad-issue system, however, achieves 83 MHz instead of the 100 MHz required (by Table I) for real-time operation. This suggests that the lower issue-width configurations are more beneficial for applications exhibiting good TLP (many independent software threads) while exploiting a lower amount of ILP. As shown in the performance (real-time) section, the IPPG workload exhibits substantial TLP, thus making the 2-wide configurations more attractive.

3) Standard-Cell (TSMC0.13LV Process): For the standard-cell (ASIC) target, dual-issue (2 IALUs, IMULTs, LSU_CHANNELS, memory banks) and quad-issue (4 IALUs, IMULTs, LSU_CHANNELS, memory banks) configurations were used in 1, 2, and cores organizations. To facilitate collection of results after long RTL-simulation runtimes, a scripting mechanism was used to automatically modify the RTL configuration files, execute RTL (presynthesis) regression tests, perform front-end synthesis (Synopsys dc_shell, nontopographical mode) and execute postsynthesis simulations. Fig. 15 shows the postroute (real silicon) area of the configurations studied. In this case, the 4-wide is only 15.59% larger than the 2-wide processor, with that number increasing to 18.96%. The postroute areas of a 4-core, dual-issue and a 3-core, quad-issue system are nearly identical, and these are the configurations
depicted in Fig. 16.

Fig. 15. VLSI campaign area (µm sq.).

Fig. 16. TSMC 0.13 µm VLSI results. (A) instances of a 2-wide LE1 core. (B) instances of a 4-wide LE1 core.

VI. CONCLUSIONS

This work discussed the methodology and evaluated the feasibility of using a novel, configurable VLIW CMP for accelerating biomedical signal processing codes. Experts in the biomedical signal processing domain require advanced processing capabilities without having to resort to the expertise routinely utilized in the consumer electronics and telecommunications domains, where the silicon platform is designed by one team, benchmarked and optimized by a second team and programmed by a third. A deliberate choice was made to stay within the reach of tools routinely utilized by biomedical signal processing practitioners, such as MATLAB and LabVIEW; the BioThreads tools infrastructure was designed to facilitate C-level programmability in the particular application domain.

Use of the configurable, extensible BioThreads engine was demonstrated for computing, in real-time/near-real-time, blood perfusion of living tissue, using algorithms developed in the Embedded MATLAB subset. The autogenerated C code was passed on to the toolchain, which compiled it into an application binary and performed coarse architecture space evaluation to identify the best BioThreads configurations that achieve the required level of performance. Following that, the code was loaded onto the CMP, residing on the FPGA, and datasets were streamed from the host system (via the LabVIEW front-end) for accelerated calculation.

This work presented a custom signal processing engine capable of a high data-processing throughput at a significantly lower operating frequency than general purpose processors. In the context of imaging PPG, real-time capability was demonstrated for frequency-domain processing on 64 × 64 pixel images at 30 frames per second. The system is easily scalable via its parameterization and allows for real-time execution on larger frame sizes and/or faster frame rates. Further work will investigate the introduction of single-instruction, multiple-data (SIMD) processor state and custom image processing vector instructions and their effect on accelerating algorithmic kernels from the same domain.

REFERENCES

[1] K. Rajan and L. M. Patnaik, “CBP and ART image reconstruction algorithms on media and DSP processors,” Microprocess. Microsyst., vol. 25, pp. 233–238, 2001.
[2] O. Dandekar and R. Shekhar, “FPGA-Accelerated deformable image registration for improved target-delineation during CT-guided interventions,” IEEE Trans. Biomed. Circuits Syst., vol. 1, no. 2, pp. 116–127, 2007.
[3] K. Wardell and G. E. Nilsson, “Duplex laser Doppler perfusion imaging,” Microvasc. Res., vol. 52, pp. 171–182, 1996.
[4] S. Srinivasan, B. W. Pogue, S. D. Jiang, H. Dehghani, C. Kogel, S. Soho, J. J. Gibson, T. D. Tosteson, S. P. Poplack, and K. D. Paulsen, “Interpreting haemoglobin and water concentration, oxygen saturation, and scattering measured in vivo by near infrared breast tomography,” Proc. Natl. Academy Sciences USA, vol. 100, no. 21, pp. 12349–12354, 2003.
[5] S. Hu, J. Zheng, V. A. Chouliaras, and R. Summers, “Feasibility of imaging photoplethysmography,” in Proc. Conf. BioMedical Engineering and Informatics, Sanya, China, 2008, pp. 72–75.
[6] P. Shi, V. Azorin Peris, A. Echiadis, J. Zheng, Y. Zhu, P. Y. S. Cheang, and S. Hu, “Non-contact reflection photoplethysmography towards effective human physiological monitoring,” J. Med. Biol. Eng., vol. 30, no. 30, pp. 161–167, 2010.
[7] V. A. Chouliaras, J. L. Nunez, D. J. Mulvaney, F. Rovati, and D. Alfonso, “A multi-standard video accelerator based on a vector architecture,” IEEE Trans. Consum. Electron., vol. 51, no. 1, pp. 160–167, 2005.
[8] ARM Cortex M3 Processor Specification, Sep. 2010. [Online]. Available: http://www.arm.com/products/processors/cortex-m/cortexm3.php
[9] Microblaze Processor Reference Guide, Doc. UG081 (v10.3), Oct. 2010. [Online]. Available:
http://www.xilinx.com
[10] D. Mattson and M. Christensson, “Evaluation of Synthesizable CPU Cores,” Master’s thesis, Dept. Computer Engineering, Chalmers Univ. Technology, Goteborg, Sweden, 2004.
[11] V. A. Chouliaras and J. L. Nunez, “Scalar coprocessors for accelerating the G723.1 and G729A speech coders,” IEEE Trans. Consum. Electron., vol. 49, no. 3, pp. 703–710, 2003.
[12] N. Vassiliadis, G. Theodoridis, and S. Nikolaidis, “The ARISE reconfigurable instruction set extensions framework,” in Proc. Intl. Conf. Emb. Computer Systems: Architectures, Modelling and Simulation, Jul. 16–19, 2007, pp. 153–160.
[13] Xilinx XPS Mailbox V1.0a Data Sheet, Oct. 2010. [Online]. Available: http://www.xilinx.com
[14] M. F. Dossis, T. Themelis, and L. Markopoulos, “A web service to generate program coprocessors,” in Proc. 4th IEEE Int. Workshop Semantic Media Adaptation and Personalization, Dec. 2009, pp. 121–128.
[15] B. Gorjiara, M. Reshadi, and D. Gajski, “Designing a custom architecture for DCT using NISC technology,” in Proc. Design Automation, Asia and South Pacific Conf., 2006, pp. 24–27.
[16] V. Kathail, S. Aditya, R. Schreiber, B. Ramakrishna Rau, D. Cronquist, and M. Sivaraman, “PICO: Automatically designing custom computers,” IEEE Comput., vol. 35, pp. 39–47, 2002.
[17] R. Thomson, S. Moyers, D. Mulvaney, and V. A. Chouliaras, “The UML-based design of a hardware H.264/MPEG AVC video decompression core,” in Proc. 5th Int. UML-SoC Workshop (in Conjunction with 45th DAC), Anaheim, CA, Jun. 2008, pp. 1–6.
[18] The AutoESL AutoPilot High-Level Synthesis Tool, May 2010. [Online]. Available: http://www.autoesl.com/docs/bdti_autopilot_final.pdf
[19] Y. Guo and J. R. Cavallaro, “A low complexity and low power SoC design architecture for adaptive MAI suppression in CDMA systems,” J. VLSI Signal Process. Syst. Signal Image Video Technol., vol. 44, pp. 195–217, 2006.
[20] S. Leibson and J. Kim, “Configurable processors: A new era in chip design,” IEEE Comput., vol. 38, no. 7, pp. 51–59, 2005.
[21] N. T. Clark, H. Zhong, and S. A. Mahlke, “Automated custom
instruction generation for domain-specific processor acceleration,” IEEE Trans. Comput., vol. 54, no. 10, pp. 1258–1270, 2005.
[22] D. Stevens and V. A. Chouliaras, “LE1: A parameterizable VLIW chip-multiprocessor with hardware PThreads support,” in Proc. IEEE Symp. VLSI, Jul. 2010, pp. 122–126.
[23] LE1 VLIW Core Source Code, Oct. 2010. [Online]. Available: http://www.vassilios-chouliaras.com/le1
[24] V. A. Chouliaras, G. Lentaris, D. Reisis, and D. Stevens, “Customizing a VLIW chip multiprocessor for motion estimation algorithms,” in Proc. 2nd Workshop Parallel Programming and Run-Time Management Techniques for Many-Core Architectures, Lake Como, Italy, Feb. 23, 2011, pp. 178–184.
[25] “Trimaran 4.0 Manual,” Oct. 2010. [Online]. Available: http://www.trimaran.org
[26] J. A. Fischer and P. Faraboschi, Embedded Computing: A VLIW Approach to Architectures, Compilers and Tools. San Mateo, CA: Morgan Kaufmann, 2005.
[27] R. P. Colwell, O. P. Nix, J. J. O’Donell, D. B. Papworth, and P. K. Rodman, “A VLIW architecture for a trace scheduling compiler,” IEEE Trans. Comput., vol. 37, no. 8, pp. 967–979, 1988.
[28] T. Ungerer, B. Robic, and J. Silc, “A survey of processors with explicit multithreading,” ACM Comput. Surv., vol. 35, no. 1, pp. 29–63, 2003.
[29] HiveFlex CSP2500 Series Digital RF Processor Databrief, Apr. 2011. [Online]. Available: http://www.siliconhive.com/Flex/Site/Page.aspx?PageID=13326
[30] A. Suga and S. Imai, “FR-V Single chip multicore processor: FR1000,” Fujitsu Sci. Tech. J., vol. 42, no. 2, pp. 190–199, Apr. 2006.
[31] A. K. Jones, R. Hoare, D. Kusic, J. Stander, G. Mehta, and J. Fazekas, “A VLIW processor with hardware functions: Increasing performance while reducing power,” IEEE Trans. Circuits Syst., vol. 53, no. 11, pp. 1250–1254, 2006.
[32] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins, “ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix,” in Proc. Field Programmable Logic, 2003, pp.
61–70.
[33] I. Al Khatib, F. Poletti, D. Bertozzi, L. Benini, M. Bechara, H. Khalifeh, A. Jantsch, and R. Nabiev, “A multiprocessor system-on-chip for real-time biomedical monitoring and analysis: Architectural design space exploration,” in Proc. IEEE/ACM Design Automation Conf., Sep. 2006, pp. 125–130.
[34] P. K. Dubey, K. O’Brien, K. M. O’Brien, and C. Barton, “Single-program speculative multithreading (SPSM) architecture: Compiler-assisted fine-grained multithreading,” in Proc. Int. Conf. Parallel Architecture and Compilation Techniques, Jun. 1995, pp. 109–121.
[35] E. Ozer, T. M. Conte, and S. Sharma, “Weld: A multithreading technique towards latency tolerant VLIW processors,” in Processors, Lecture Notes in Computer Science. Berlin, Germany: Springer, 2001.
[36] E. Ozer and T. M. Conte, “High-performance and low-cost dual-thread VLIW processor using Weld architecture paradigm,” IEEE Trans. Parallel Distrib. Syst., vol. 16, no. 12, pp. 1132–1142, 2005.
[37] S. G. Ziavras, A. V. Gerbessiotis, and R. Bafnaa, “Coprocessor design to support MPI primitives in configurable multiprocessors,” Integration, VLSI J., vol. 40, pp. 235–252, 2007.
[38] CoreConnect Architecture, Oct. 2010. [Online]. Available: http://www.xilinx.com
[39] J. Zheng, S. Hu, V. Azorin-Peris, A. Echiadis, V. A. Chouliaras, and R. Summers, “Remote simultaneous dual wavelength imaging photoplethysmography: A further step towards 3-D mapping of skin blood microcirculation,” in Proc. BiOS SPIE, 2008, vol. 6850, pp. 68500S–1.
[40] MC1311 – High-Speed 1,3 MPixel Colour CMOS Imaging Farbkamera, Oct. 2010. [Online]. Available: http://www.mikrotron.de
[41] FG1156 FPGA Datasheet, Oct. 2010. [Online]. Available: http://www.xilinx.com
[42] D. Stevens, N. Glynn, P. Galiatsatos, V. A. Chouliaras, and D. Reisis, “Evaluating the performance of a configurable, extensible VLIW processor in FFT execution,” in Proc. Int. Conf. Environmental and Computer Science, 2009, pp. 771–774.

David Stevens was born in Melton Mowbray, U.K., in 1985. He received the B.Eng. degree in computer systems engineering
from Loughborough University, Leicestershire, U.K., in 2007. He began working toward the Ph.D. degree at Loughborough University in 2008, investigating methods of acceleration for a VLIW processor. His research focuses on both the implementation and verification of a tool-chain for an open source VLIW processor incorporating various methods of acceleration, including hardware threading. Currently, he is a Research Associate at Loughborough University working on a project to develop an integrated framework for embedded system tools.

Vassilios Chouliaras was born in Athens, Greece, in 1969. He received the B.Sc. degree in physics and laser science from Heriot-Watt University, Edinburgh, Scotland, the M.Sc. degree in VLSI systems engineering from the University of Manchester Institute of Science and Technology, Manchester, U.K., and the Ph.D. degree from Loughborough University, Leicestershire, U.K., in 1993, 1995, and 2006, respectively. He has worked as an ASIC Design Engineer for INTRACOMSA and as a Senior R&D Engineer/Processor Architect for ARC International. Currently, he is a Senior Lecturer in the Department of Electronic and Electrical Engineering, Loughborough University, where he is leading the research in CPU architecture and microarchitecture, SoC modeling, and ESL methodologies. He is the architect of the BioThreads platform and a Founder of Axilica Ltd.

Vicente Azorin Peris was born in Valencia, Spain, in 1981. He received the B.Eng. degree in electronic and electrical engineering and the Ph.D. degree in biomedical photonics engineering from Loughborough University, Leicestershire, U.K., in 2004 and 2009, respectively. For the past five years, he has worked on functional biomedical sensing and imaging techniques, including finite-element simulation of the interaction between light and human tissue, as a Ph.D. student and as a Research Associate in the Photonics Engineering and Health Technology Research Group at Loughborough University. He is currently employed by Ansivia
Solutions Ltd., Loughborough, U.K., where he is involved in research and development of in-vitro diagnostic point-of-care testing instrumentation.

Jia Zheng was born in Heilongjiang, China, in 1980. She received the B.Sc. degree in electrical engineering from the Beijing Institute of Technology, Beijing, China, in 2003, and the M.Sc. degree in digital communication systems and the Ph.D. degree in biomedical photonics engineering from Loughborough University, Leicestershire, U.K., in 2005 and 2010, respectively. During her Ph.D. studies, she specialized in imaging photoplethysmography and opto-physiological monitoring. Recently, she joined the National Institute for the Control of Pharmaceutical and Biological Products (NICPBP), Beijing, China, as a Researcher for the regulation of in-vivo physiological assessment devices and instrumentation.

Angelos Echiadis was born in Orestiada, Greece, in 1977. He received the 1st Class B.Eng. degree in electrical and electronic engineering with distinction from Heriot-Watt University, Edinburgh, Scotland, and the Ph.D. degree in photonics and health technology from Loughborough University, Leicestershire, U.K., in 2003 and 2007, respectively. Currently, he works as a biomedical systems engineering professional at Dialog Devices Ltd., Loughborough, U.K. His research interests include multidimensional signal acquisition and characterization of live tissues, as well as photoplethysmography in pulse oximetry and in other medical applications.

Sijung Hu (SM’10) was born in Shanghai, China, in 1960. He received the B.Sc. degree in chemical engineering from Donghua University, Shanghai, China, the M.Sc. degree in environmental monitoring from the University of Surrey, Guildford, U.K., in 1995, and the Ph.D. degree in fluorescence spectrophotometry from Loughborough University, Leicestershire, U.K., in 2000. He was appointed as a Senior Scientist at Kalibrant Ltd., Loughborough, U.K., for R&D of medical instruments from 1999 to 2002. Currently, he is a Senior
Research Fellow and Leader of Photonics Engineering Research Group in the Department of Electronic and Electrical Engineering, Loughborough University, and Visiting Professor of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China, with numerous contributions in opto-physiological assessment, biomedical instrumentation and signal and image processing.