Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2006, Article ID 81309, Pages 1–13
DOI 10.1155/ES/2006/81309

FPGA-Based Communications Receivers for Smart Antenna Array Embedded Systems

Constantin Siriteanu,1,2 Steven D. Blostein,1 and James Millar3

1 Department of Electrical and Computer Engineering, Queen's University, Kingston, ON, Canada K7L 3N6
2 Communications Signal Processing Laboratory, Department of Electrical and Computer Engineering, Hanyang University, Seoul, Korea
3 CMC Microsystems, Kingston, ON, Canada K7L 3N6

Received 15 December 2005; Revised 7 May 2006; Accepted 2 June 2006

Field-programmable gate arrays (FPGAs) are drawing ever increasing interest from designers of embedded wireless communications systems. They outpace digital signal processors (DSPs), through hardware execution of a wide range of parallelizable communications transceiver algorithms, at a fraction of the design and implementation effort and cost required for application-specific integrated circuits (ASICs). In our study, we employ an Altera Stratix FPGA development board, along with the DSP Builder software tool, which acts as a high-level interface to the powerful Quartus II environment. We compare single- and multibranch FPGA-based receiver designs in terms of error rate performance and power consumption. We exploit FPGA operational flexibility and algorithm parallelism to design eigenmode-monitoring receivers that can adapt to variations in wireless channel statistics, for high-performing, inexpensive, smart antenna array embedded systems.

Copyright © 2006 Constantin Siriteanu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Conventional wireless communications systems employ a single receiving antenna.
Enhanced antenna array receivers employing beamforming (BF) and maximal-ratio combining (MRC) can generate antenna and diversity gain, that is, increased average and instantaneous (with respect to channel fading) receiver signal-to-noise ratio (SNR) [1–4]. Although beneficial in terms of performance, these enhanced, multibranch algorithms can require much larger computational volumes than the conventional, single-branch receiver. Recent analytical and simulation studies [1–4] of a hybrid algorithm entitled maximal-ratio eigencombining (MREC) claimed efficient performance-complexity tradeoffs for smart antenna arrays.

Receiver algorithms have traditionally been deployed on general-purpose, sequential digital signal processors (DSPs), or on application-specific integrated circuits (ASICs). Enhanced receiver algorithms, which are generally highly parallelizable, and higher data transmission rates can burden DSPs beyond their capacity for real-time processing. Time-critical, highly parallelizable applications are common in areas ranging from modern communications [5–7] to image [6] and speech [8] processing, and even bioinformatics [9]. ASICs are hardwired for specific tasks. Although fast (sometimes several orders of magnitude faster than DSPs, through hardware parallelism) and power-efficient, implemented designs are inflexible [7]. More importantly, ASIC design and production are time-consuming and extremely expensive for chips produced in small numbers, due to very high nonrecurring engineering cost.

Unlike ASICs, field-programmable gate arrays (FPGAs) are reconfigurable, that is, their internal structure is only partially fixed at fabrication, leaving to the application designer the wiring of the internal logic for the intended task. This can significantly shorten design and production, and thus time to market, for FPGA-based embedded systems.
Although FPGAs tend to be slower and to consume more power than ASICs [7], FPGA reconfigurability can benefit platform longevity (which is extremely important in an era of fast-changing wireless communications standards) by allowing design changes and upgrades even in systems already in operation. This flexibility can be effectively exploited for rapid prototyping of advanced communications signal processing, such as the Bell Labs Layered Space-Time (BLAST) multi-input multi-output (MIMO) architecture for the third-generation Universal Mobile Telecommunications System (UMTS) [5]. Furthermore, an FPGA can, for example, implement MRC branches either sequentially, or in parallel, or anywhere in between, depending on required speed, available chip resources, and power constraints. FPGA-based implementations concurrently operating several hardware modules can outpace their processor-based counterparts many times over [6, 9]. An insightful DSP, FPGA, and ASIC implementation comparison for a four-antenna orthogonal frequency-division multiplexing (OFDM) receiver can be found in [7].

FPGAs are especially well suited for embedded systems (e.g., cellular system base station line cards, or mobile stations) because, besides an area of reconfigurable logic elements, they can also incorporate large amounts of memory, high-speed DSP blocks, clock management circuitry, high-speed input/output (I/O), as well as support for external memory, and high-speed networking and communications bus standards. For a small share of the resources, processors can be included within the FPGA fabric as well [9].

Power consumed in embedded systems is, in general, strictly limited. Otherwise, line-powered designs would require special and/or expensive power sources and heat sinks or may not operate reliably, while portable devices would quickly deplete the battery [10, 11].
Although FPGA chips are judiciously manufactured for power efficiency, application designers also need to carefully consider this issue because a consistently underutilized design wastes static and dynamic power [10–13].

The objective of this paper is to investigate FPGA suitability for efficient smart antenna array embedded receivers. In the process, we overview an Altera FPGA-based design environment, and implement conventional and enhanced (BF, MRC, MREC) receiver algorithms. It is demonstrated that FPGA implementations of eigenmode-based combining adapted to the slow variations in channel statistics can yield near-optimum bit error rate (BER) performance, for affordable power budgets.

The paper is organized as follows. Section 2 presents the received signal model, and overviews BF, MRC, and MREC. Section 3 describes the Altera software and hardware employed to design, simulate, analyze, and implement these receiver algorithms. Comparative performance and cost results are provided in Section 4.

2. SIGNAL MODEL AND COMBINING METHODS

2.1. Received-signal model

Consider a source transmitting a BPSK signal through a frequency-flat Rayleigh fading channel, and an $L$-element receiving antenna array. After demodulation, matched filtering, and symbol-rate sampling, the complex-valued received signal vector is given by [4]

$$\tilde{\mathbf{y}} = \sqrt{E_s}\, b\, \tilde{\mathbf{h}} + \tilde{\mathbf{n}}, \tag{1}$$

where dependence on the sampling time is not explicit, to simplify notation. The $L$ elements $\tilde{y}_i$, $i = 1{:}L \triangleq 1, \ldots, L$, of the received signal vector $\tilde{\mathbf{y}} = [\tilde{y}_1\ \tilde{y}_2 \cdots \tilde{y}_L]^T$ are called branches, and the elements $\tilde{h}_i$, $i = 1{:}L$, of the channel vector $\tilde{\mathbf{h}} = [\tilde{h}_1\ \tilde{h}_2 \cdots \tilde{h}_L]^T$ are called channel gains. In (1), $E_s$ is the energy transmitted per symbol, and $b$ is the transmitted BPSK symbol, with $|b|^2 = 1$ ($b = 1$ for transmitted bit 0, $b = -1$ for transmitted bit 1).
We assume that the channel vector $\tilde{\mathbf{h}}$ and the noise vector $\tilde{\mathbf{n}}$ are complex-valued, mutually independent, zero-mean Gaussian, with $\tilde{\mathbf{h}} \sim \mathcal{CN}(\mathbf{0}, \mathbf{R}_{\tilde{h}})$ and $\tilde{\mathbf{n}} \sim \mathcal{CN}(\mathbf{0}, N_0 \mathbf{I}_L)$, respectively. Further assumptions are that channel fading [14] is frequency-flat with unit variance on each branch, the noise is temporally white, and the received signal is interference-free. This signal model is simple, yet sufficient for basic performance evaluations [15]. Current-standard wireless communications signaling is beyond the scope of this work.

2.2. Azimuth angle spread model

Due to radio-wave scattering, transmitted signals are received with azimuthal dispersion [14, 16]. Without loss of generality, numerical results presented herein assume a truncated Laplacian power azimuth spectrum (p.a.s.) [4] because it accurately models empirical results [16]. The p.a.s. root second central moment is denoted as azimuth spread (AS) [16]. Analytical expressions for the elements of $\mathbf{R}_{\tilde{h}}$, obtained through straightforward calculations in [4] for a uniform linear array (ULA), indicate that antenna correlation (and thus receiver BER performance [1, 2]) is a function of p.a.s. type, azimuth spread, average angle of arrival (which is assumed to be zero with respect to the broadside, for all the results shown later), and normalized interelement distance $d_n$ (i.e., the ratio between the physical interelement distance and half of the carrier wavelength).

The azimuth spread depends on the environment and antenna array location and height, and is variable [16]. Radio channel measurements for suburban and urban scenarios [16] showed that base station azimuth spread is well modeled as a log-normal random variable [16, equation (9)]. For typical urban scenarios [16, Table I], these measurements found that base-station azimuth spread correlation decreases exponentially with the distance traveled by the mobile [16, equation (14)].
The azimuth spread decorrelation distance, that is, the distance over which the azimuth spread correlation decreases by a factor of two, was determined as $d_{\mathrm{AS}} = 50$ m [16]. Comparing $d_{\mathrm{AS}}$ with the fading coherence distance [17, equation (4.40.b)] $d_c$ computed for the typical system parameter values from Table 1, we conclude that the azimuth spread variation is much slower (by about 3 orders of magnitude) than the fading. Furthermore, for this typical urban scenario, it was found in [16] that $\Pr(1^\circ < \mathrm{AS} < 20^\circ) \approx 0.8$, that is, azimuth spread is small to moderate, producing significant (greater than 0.5) correlations between adjacent elements of a compact ULA, for example, $d_n = 1$ [1, 3].

Table 1: Mobile, channel, and receiver (channel estimation) parameters.

Parameter                                          Value
Mobile speed                                       $v = 60$ km/h
Transmitted BPSK symbol rate                       $f_s = 10$ ksps
Carrier frequency                                  $f_c = 1.8$ GHz
Pilot symbol period [18, Section III.C]            $M_s = 7$
Maximum Doppler frequency                          $f_D = 100$ Hz
Normalized maximum Doppler frequency               $f_m = f_D / f_s = 0.01$
Channel coherence time [17, equation (4.40.b)]     $T_c \approx 1.8$ ms
Channel coherence distance                         $d_c = v\, T_c \approx 30$ mm
Interpolator size [18, Section III.D]              $T = 11$

2.3. MRC and BF

For perfectly known channel (p.k.c.), the optimum (maximum-likelihood) receiver linearly combines the received signal vector with the channel vector, that is, it computes $\tilde{\mathbf{h}}^H \tilde{\mathbf{y}}$, and then detects the BPSK symbol as

$$\hat{b} = \operatorname{sign}\{\Re[\tilde{\mathbf{h}}^H \tilde{\mathbf{y}}]\}. \tag{2}$$

This approach is also known as maximal-ratio combining (MRC) [19] because it maximizes the SNR (instantaneous, i.e., conditioned on the channel gains) at the combiner's output. MRC with $L = 1$ reduces to the conventional, single-branch receiver.

In actual systems, with imperfectly known channel (i.k.c.), knowledge of the channel gains is acquired through estimation [1, 18]. The received symbol can then be detected as $\hat{b} = \operatorname{sign}\{\Re[\tilde{\mathbf{g}}^H \tilde{\mathbf{y}}]\}$, where $\tilde{\mathbf{g}} = [\tilde{g}_1\ \tilde{g}_2 \cdots \tilde{g}_L]^T$, and $\tilde{g}_i$, $i = 1{:}L$, are the channel gain estimates.
This combining approach has often been employed and studied [1, 3, 15, 19], although it is suboptimal when the channel gains are not independent and identically distributed (non-i.i.d.) [3].

MRC is known to provide full diversity gain [19] (that is, the greatest performance improvement, averaging over fading and noise, compared to a single-branch system) for i.i.d. branches. This requires either widely spaced elements, which are unfeasible for pocket-size mobile stations, or rich scattering, which is unlikely at base stations [16].

For narrow azimuth spread, received signals are highly correlated [1, 2] and the received signal energy, proportional to $\operatorname{tr}(\mathbf{R}_{\tilde{h}}) \triangleq \sum_{i=1}^{L} (\mathbf{R}_{\tilde{h}})_{i,i} = \sum_{i=1}^{L} \lambda_i$, where $\lambda_i$, $i = 1{:}L$, are the eigenvalues of $\mathbf{R}_{\tilde{h}}$, is concentrated within the first few eigenmodes. Then, the channel is said to be spatially nonselective, and the available diversity gain is small [20–22]. Enhanced performance can then be obtained by taking advantage of antenna gain using maximum average SNR beamforming (BF), that is, by combining the received signal vector with the dominant eigenvector of $\mathbf{R}_{\tilde{h}}$ [1–4]. Increasing azimuth spread decreases antenna correlation, that is, the channel becomes spatially more selective and higher diversity gain becomes available [1–4]. In subsequent sections, we show how to exploit available antenna and diversity gains within complexity and power constraints.

2.4. Eigencombining method

BF has traditionally been applied in scenarios with very small azimuth spread. Otherwise, MRC has been employed. However, it was recently claimed that a unifying approach, called maximal-ratio eigencombining (MREC) and described below, can adapt to channel correlation (i.e., azimuth spread) variation [1–4, 20]. Our analytical and simulation results have shown that MREC may thus outperform MRC and BF in terms of BER performance and complexity [1–4].
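To make the two combiners of Section 2.3 concrete, the following is a minimal sketch in plain Python, assuming a perfectly known channel and complex samples stored as Python `complex` numbers. The function names are hypothetical and are not part of the FPGA design. The closed-form energy fraction holds only for the assumed special case of two unit-variance branches with real correlation `rho`.

```python
def inner(w, y):
    """Return w^H y for two equal-length sequences of complex numbers."""
    return sum(wi.conjugate() * yi for wi, yi in zip(w, y))


def mrc_detect(h, y):
    """MRC detection: sign of Re(h^H y); +1 encodes bit 0, -1 encodes bit 1."""
    return 1 if inner(h, y).real >= 0 else -1


def bf_detect(e1, h, y):
    """BF detection: project onto the dominant eigenvector e1, then detect
    coherently using the scalar effective gain e1^H h."""
    y_bf = inner(e1, y)
    h_bf = inner(e1, h)
    return 1 if (h_bf.conjugate() * y_bf).real >= 0 else -1


def dominant_energy_fraction(rho):
    """Assumed special case: L = 2 unit-variance branches with real
    correlation rho. The eigenvalues of [[1, rho], [rho, 1]] are 1 + |rho|
    and 1 - |rho|, so the dominant eigenmode holds (1 + |rho|) / 2 of the
    total energy tr(R) = 2."""
    return (1 + abs(rho)) / 2
```

For a correlation of 0.9, for instance, `dominant_energy_fraction(0.9)` shows that about 95% of the received energy lies in the dominant eigenmode, which is the situation in which BF is attractive.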
The channel correlation matrix $\mathbf{R}_{\tilde{h}}$ has real nonnegative eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_L \geq 0$ and orthonormal eigenvectors $\mathbf{e}_i$, $i = 1{:}L$, and can be decomposed as $\mathbf{R}_{\tilde{h}} = \mathbf{E}_L \boldsymbol{\Lambda}_L \mathbf{E}_L^H$, where $\boldsymbol{\Lambda}_L \triangleq \operatorname{diag}\{\lambda_i\}_{i=1}^{L}$ is a diagonal matrix, and $\mathbf{E}_L \triangleq [\mathbf{e}_1\ \mathbf{e}_2 \cdots \mathbf{e}_L]$ is a unitary matrix. Hereafter, $\mathbf{R}_{\tilde{h}}$, $\boldsymbol{\Lambda}_L$, and $\mathbf{E}_L$ are assumed perfectly known because, in practice, enough independent channel samples would be available for an accurate estimation. Actual MREC could employ computationally insignificant low-rate eigenstructure updating [20].

MREC of order $N$ consists of the following steps [1–4]:

(i) Karhunen-Loève transformation (KLT) [22] of the received signal vector from (1) with the full-column rank matrix $\mathbf{E}_N \triangleq [\mathbf{e}_1\ \mathbf{e}_2 \cdots \mathbf{e}_N]$; the elements of the transformed signal vector, $\mathbf{y} = \mathbf{E}_N^H \tilde{\mathbf{y}} = \sqrt{E_s}\, b\, \mathbf{E}_N^H \tilde{\mathbf{h}} + \mathbf{E}_N^H \tilde{\mathbf{n}} = \sqrt{E_s}\, b\, \mathbf{h} + \mathbf{n}$, are denoted as eigenbranches;

(ii) MRC of the $N$ eigenbranches.

The components of the transformed channel gain vector $\mathbf{h} = \mathbf{E}_N^H \tilde{\mathbf{h}}$ are further referred to as channel eigengains. They are mutually uncorrelated, with zero mean, and variances $\sigma_{h_i}^2 \triangleq E\{|h_i|^2\} = \lambda_i$, that is, $\mathbf{R}_h \triangleq E\{\mathbf{h}\mathbf{h}^H\} = \boldsymbol{\Lambda}_N = \operatorname{diag}\{\lambda_i\}_{i=1}^{N}$, for any channel gain distribution [21]. From the initial assumptions on fading and noise we obtain $\mathbf{h} \sim \mathcal{CN}(\mathbf{0}, \boldsymbol{\Lambda}_N)$ and $\mathbf{n} = \mathbf{E}_N^H \tilde{\mathbf{n}} \sim \mathcal{CN}(\mathbf{0}, N_0 \mathbf{I}_N)$, so that the eigengains are independent, which supports straightforward MREC analysis [1–4]. Of all possible transforms, the KLT packs the largest amount of energy from the original, $L$-dimensional signal vector $\tilde{\mathbf{y}}$ into the transformed, $N$-dimensional signal vector $\mathbf{y}$ [22], which is desirable for dimension (i.e., complexity) reduction. Note also that MREC of order $N = 1$ is in fact BF, while it can be shown that full MREC, that is, MREC of order $N = L$, is equivalent to MRC [1–4].

2.5. Order selection for MREC

A simple criterion for optimal MREC order selection is [21]

$$\min_{N = 1:L} \left\{ E_s \sum_{i=N+1}^{L} \lambda_i + N_0\, N \right\}, \tag{3}$$

better known as the bias-variance tradeoff criterion (BVTC) [3, 4] because (3) balances the loss incurred by removing the weakest $(L - N)$ intended-signal contributions (the first term) against the residual-noise contribution (the second term). Computer evaluations found the BVTC effective for MREC adaptation to channel conditions [3, 4]. Note however that, since the BVTC disregards the MREC complexity, it can overload limited resources.

[Figure 1: FPGA development system hardware/software diagram. An IBM PC runs MATLAB and Simulink (native floating-point blocks for signal processing, communications, channels, and noise; DSP Builder fixed-point blocks for arithmetic, gate, control, rate change, storage, I/O, and bus functions, plus the signal compiler and HIL blocks) and Quartus II (compiler, simulator, synthesizer and fitter, timing analyzer, PowerPlay analyzer, chip programmer). The Altera Stratix EP1S80 DSP development board carries the Altera Stratix EP1S80B956C6 FPGA: 1.5 V, 0.13 μm SRAM process; 956 chip pins; 79 040 programmable logic elements; up to 176 embedded 9 × 9-bit multipliers in DSP blocks; up to 16 global clocks and 12 real-time reconfigurable PLLs; approximately 7.5 Mb of RAM; DDR/SDR DRAM, RapidIO, Ethernet, and PCI interfaces.]

A different MREC adaptation criterion is described next. Assume that signals received (independently) from $N_u$ mobile stations require processing at a base station with only $N_e < N_u L$ available eigenbranch processing modules. Then, a control algorithm determines the largest (dominant) $N_e$ eigenmodes among all transmitting mobiles, and allocates available resources accordingly.
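The two selection rules just described can be sketched as follows. This is a minimal illustration in plain Python, not the control algorithm used in the designs; the function names and the numerical eigenvalues in the usage note are hypothetical.

```python
def bvtc_order(eigenvalues, es, n0):
    """Return the order N in 1..L that minimizes the criterion (3):
    es * (sum of the L - N smallest eigenvalues) + n0 * N."""
    lam = sorted(eigenvalues, reverse=True)
    costs = {n: es * sum(lam[n:]) + n0 * n for n in range(1, len(lam) + 1)}
    # min() returns the smallest N among tied costs (dict keeps insertion order).
    return min(costs, key=costs.get)


def evtc_allocate(eigs_per_user, n_modules):
    """Return, per user, how many eigenbranch modules are assigned when the
    n_modules largest eigenvalues over all users are selected."""
    pool = [(lam, user)
            for user, eigs in enumerate(eigs_per_user)
            for lam in eigs]
    pool.sort(key=lambda pair: pair[0], reverse=True)
    counts = [0] * len(eigs_per_user)
    for _, user in pool[:n_modules]:
        counts[user] += 1
    return counts
```

With the assumed eigenvalues `[3.2, 0.6, 0.15, 0.05]` and `es = 1`, `bvtc_order` selects order 2 for `n0 = 0.5` and order 3 for `n0 = 0.1`, that is, more eigenbranches are retained as the noise level drops. The allocation function only counts modules per user; a real controller would also have to track which eigenvectors are loaded into each module.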
For instance, if a receiving antenna array system with $L = 4$ elements has only $N_e = 3$ available eigenbranch processing modules while $N_u = 2$, the available resources are allocated as follows: if the 3 largest eigenvalues (out of $N_u L = 8$) are such that two correspond to User 1 and one to User 2, then two eigenbranch processing modules are allocated to process the received signal vector from User 1, and the other available eigenbranch is allocated to User 2. This approach to selecting eigenbranches for MREC is hereafter denoted as the eigenvalue-based tradeoff criterion (EVTC), while MREC adapted based on EVTC is referred to as EVTC MREC.

2.6. Channel estimation using pilot-symbol-aided modulation (PSAM)

In PSAM, the transmitter periodically inserts known pilot symbols $b_p$ of energy $E_p$ ($= E_s$ for results shown herein) into the information-encoding symbol stream, and the receiver interpolates the pilot samples acquired across several slots to estimate the channel during data symbols [1–4, 18]. The notation $(t, m)$ is used below to denote temporal indexing, where $t = -T_1{:}T_2$ is the time slot index, and $m = 0{:}M_s - 1$ is the symbol index within the slot of length $M_s$. Here $t = 0$ refers to the slot in which estimation takes place, $m = 0$ corresponds to pilot symbols, and $m = 1{:}M_s - 1$ corresponds to data-encoding symbols; $T = T_1 + T_2 + 1$ slots (in general, $T_1 = T_2$) are used for interpolation. The estimate of the $i$th eigengain at the $m$th data symbol position in the current slot can be written as

$$g_i(0, m) = \mathbf{v}_i^H(m)\, \mathbf{r}_i, \tag{4}$$

where $\mathbf{v}_i(m)$ is the interpolation filter and

$$\mathbf{r}_i \triangleq \frac{1}{\sqrt{E_p}\, b_p} \left[ y_i(-T_1, 0), \ldots, y_i(T_2, 0) \right]^T \tag{5}$$

contains the samples taken during pilot symbols.
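The estimator in (4) and (5) amounts to picking one pilot sample per slot, normalizing by the known pilot, and forming an inner product with the filter taps. A minimal sketch in plain Python is given below. The helper names are hypothetical, and the uniform averaging taps used in the usage note are only a stand-in (an assumption for illustration), not the interpolation filter employed in this paper.

```python
import math


def extract_pilots(samples, ms):
    """Return the samples at symbol index m = 0 of each slot of length ms,
    assuming the stream starts on a slot boundary."""
    return samples[::ms]


def pilot_vector(pilot_samples, ep, bp):
    """Form r_i as in (5): pilot samples divided by sqrt(ep) * bp."""
    scale = math.sqrt(ep) * bp
    return [y / scale for y in pilot_samples]


def estimate_gain(v, r):
    """Compute the estimate g = v^H r as in (4)."""
    return sum(vk.conjugate() * rk for vk, rk in zip(v, r))
```

For a channel that stays constant over the $T = 11$ slots and in the absence of noise, taps equal to `1 / 11` return the channel gain exactly; any filter whose taps sum to one has this property, which is a convenient sanity check before moving to fixed-point hardware.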
The interpolation filter chosen for the numerical results shown later is the filter with brick-wall-type frequency response, which is optimum in the absence of noise; we will refer to this filter, with impulse response tapered by a raised-cosine window [1, 2], as the SINC filter, and to the corresponding estimation approach as SINC PSAM. The interpolator coefficients, given by

$$[\mathbf{v}(m)]_{t+T_1+1} = \operatorname{sinc}\!\left(\frac{m}{M_s} - t\right) \frac{\cos[\pi \beta (m/M_s - t)]}{1 - [2 \beta (m/M_s - t)]^2}, \tag{6}$$

where $\beta$ is the rolloff factor of the raised-cosine window, enter the FPGA-based receiver designs from Section 4. Note that channel estimation is among the most demanding receiver functions resource-wise [5].

3. FPGA HARDWARE AND SOFTWARE

3.1. FPGA system description

CMC Microsystems provided the system shown in Figure 1. The Altera DSP Development Kit Stratix Professional Edition, which comprises the Stratix EP1S80 DSP development board, is built around the Stratix EP1S80B956C6 FPGA chip, and comes with the DSP Builder interface to the Quartus II design flow.

Quartus II provides a comprehensive design, synthesis, and analysis environment for system-on-a-programmable-chip (SoPC) applications. DSP Builder helps create the hardware representation of the required digital signal processing functions using the MATLAB and Simulink user-friendly algorithm-development environments, for shorter design and implementation cycles. MATLAB functions and native Simulink blocks can be combined with Altera DSP Builder library blocks (see Figure 1) to create FPGA designs which can be simulated under Simulink. For automated design flow, the "signal compiler" block, which is at the core of DSP Builder, can generate hardware description language (HDL) code, and scripts for Quartus II-based synthesis and fitting from within Simulink. Furthermore, the DSP Builder "hardware in the loop" (HIL) block enables chip programming for hardware-software cosimulation.

3.2. Power usage considerations

Power loss in FPGA devices can be categorized as static and dynamic [10–13]. Static (standby) power is consumed by the chip when no input signals are exercised [10]. This loss occurs due to transistor leakage, which is frequency-independent, but highly dependent on junction temperature and transistor size. Static power has been increasing (exponentially, at processes below 0.25 μm [11]) with each finer semiconductor technology, to become the dominant loss component in current chips. This is a concern for designers of portable embedded systems which spend long intervals in standby mode [10]. Dynamic power is consumed in normal operation, due to the charging and discharging of the internal capacitive loads, and is proportional to gate output load, square of the supply voltage, clock frequency, and gate switching activity [10–13]. Although the supply voltage has decreased significantly in newer process technologies, high operating frequencies can still yield significant dynamic power losses [10]. A tight power budget may thus limit clock speed.

Line-powered embedded systems are more competitive when they require less expensive power supplies and cooling devices [10]. Designs for portable products should aim for the longest possible battery life. Moreover, devices operating at high temperatures can become unreliable, emphasizing the importance of minimizing power consumption in embedded systems.

FPGA structure is judiciously designed to minimize power losses [10–12, 23]. Nonetheless, power-aware application design can also increase efficiency, for example, by using gated clock signals, and thus virtually turning off unnecessary chip sections [10, 12, 23]. Gating as close as possible to the clock source is a good practice since clock signal trees are important dynamic power consumers [12]. On the other hand, static power consumption can be reduced by adaptive distribution of available FPGA resources, as shown in Section 4.3.
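The proportionality stated above for dynamic power, and the effect of clock gating, can be illustrated with a short sketch in plain Python. The unit proportionality constant, the function names, and the numerical values in the usage note are assumptions for illustration only; actual figures for a design come from the vendor power analysis tools.

```python
def dynamic_power(c_load, v_supply, f_clk, activity):
    """Dynamic power model: activity * C * V^2 * f, with an assumed
    proportionality constant of one."""
    return activity * c_load * v_supply ** 2 * f_clk


def gated_dynamic_power(per_module_power, clock_enabled):
    """Sum the dynamic power of the modules whose clock is not gated off.
    Idealized: a gated module is assumed to draw no dynamic power."""
    return sum(p for p, on in zip(per_module_power, clock_enabled) if on)
```

Under this model, halving the clock frequency halves the dynamic power, while halving the supply voltage reduces it to one quarter. Gating the clocks of two out of four identical modules, for example `gated_dynamic_power([10.0] * 4, [True, True, False, False])`, halves the total under the stated idealization.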
For the designs described further below, we relied on Quartus II reports on resource usage, for example, the number of logic elements (LEs), chip pins, and dedicated 9 × 9-bit DSP blocks. Static and dynamic power losses were estimated using the Quartus II PowerPlay analyzer (dynamic power was estimated for default toggle rates of 12.5%).

4. FPGA-BASED WIRELESS COMMUNICATIONS RECEIVERS

For the system shown in Figure 1, we focus on FPGA-based receiver algorithm implementation, assuming availability of digitized received signals. The transmitted signal and channel/receiver impairments, that is, noise and temporally and spatially correlated fading, are generated in MATLAB and Simulink. Various receiver algorithms were simulated and run from the FPGA, through DSP Builder HIL. Computer simulations and the corresponding hardware/software HIL cosimulations were found to perform identically. Computations done in MATLAB or with native Simulink blocks are very precise, due to floating-point number representation. On the other hand, DSP Builder relies on fixed-point representation, which can limit the dynamic range and can introduce quantization noise.

As listed in Table 1, we consider a scenario with Doppler spread $f_D = 100$ Hz and transmission rate $f_s = 10$ ksps, that is, normalized Doppler spread $f_m = 0.01$. PSAM with slot length $M_s = 7$ (1 pilot symbol followed by 6 information-encoding symbols) is combined with SINC interpolation over $T = 11$ slots ($T_1 = T_2 = 5$), for channel estimation as in (4)–(6). A ULA with $d_n = 1$ is assumed to provide the received signals for the enhanced receivers.

4.1. Conventional, single-branch versus enhanced, multibranch MRC receivers

In this section, a conventional, single-branch receiver, and an enhanced MRC receiver, with $L = 2$ i.i.d. branches, are considered. We employ the well-established Jakes' model [14] for temporal channel fading correlation, with parameters given in Table 1.
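The derived entries of Table 1 can be cross-checked with a few lines of plain Python. The sketch below assumes the standard Doppler relation $f_D = v f_c / c$ and the coherence-time rule $T_c \approx 9 / (16 \pi f_D)$; the latter is an assumption that reproduces the 1.8 ms listed in Table 1 and may differ in form from the cited equation. Function names are hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def max_doppler_hz(speed_kmh, carrier_hz):
    """Maximum Doppler frequency f_D = v * f_c / c, with v given in km/h."""
    return (speed_kmh / 3.6) * carrier_hz / SPEED_OF_LIGHT


def coherence_time_s(f_doppler_hz):
    """Assumed rule of thumb: T_c ~ 9 / (16 * pi * f_D)."""
    return 9.0 / (16.0 * math.pi * f_doppler_hz)


def coherence_distance_m(speed_kmh, f_doppler_hz):
    """Distance traveled by the mobile during one coherence time."""
    return (speed_kmh / 3.6) * coherence_time_s(f_doppler_hz)
```

For $v = 60$ km/h and $f_c = 1.8$ GHz this gives a Doppler frequency of about 100 Hz, a normalized Doppler frequency of about 0.01 at 10 ksps, a coherence time of about 1.8 ms, and a coherence distance of about 30 mm, in line with Table 1.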
For BPSK, receiver BERs were computed for perfectly known channel (p.k.c.), as well as imperfectly known channel (i.k.c.) for SINC PSAM. We verified that BER expressions derived in [1] and the corresponding MATLAB simulation results agree closely for p.k.c. as well as for i.k.c. Then, for i.k.c., FPGA-based designs were simulated as well as hardware-software (HIL) cosimulated. For HIL cosimulation, the receiver design is compiled and then downloaded into the FPGA chip. Afterwards, received signals emulated using MATLAB are processed online by the programmed FPGA.

In terms of numerical representation precision within the FPGA for the computer-generated received signal $\tilde{\mathbf{y}}$, two cases are compared next: (1) 8 bits for the integer part and 8 bits for the fractional part (denoted further as 8.8); (2) the 4.4 case. Finally, the channel gain estimation root mean-square error (RMSE) is determined from theory [4], simulations, and HIL implementations.

The upper part of Figure 2 shows the Simulink/DSP Builder design involved in channel gain estimation for one branch, while the lower part details our "SINC interpolator" design. (Symbols appear without the tilde due to Simulink editing limitations.)

[Figure 2: Simulink model detail with DSP Builder blocks implementing channel gain estimation (through SINC interpolation) for MRC. Notes from the diagram: the received signal $\tilde{y}_1$, fixed-point represented as 8.8, is left-shifted 8 positions to obtain a 16-bit integer representation, because the SINC interpolator requires integers; the SINC interpolator for the imaginary part is not shown; the actual SINC interpolator coefficients needed a left-shift by 10 binary positions to obtain the "sum of products" coefficients; an 11-tap shift register feeds the "sum of products" blocks, whose outputs are combined by a parallel adder/subtractor and selected per symbol position by an n-to-1 multiplexer; the interpolator output is rounded to a 12.8 format.]

The upper "shift taps" DSP Builder blocks delay the received signal by $(T_1 + 1) M_s = 42$ samples, while the "multiply-add" block computes $\Re(\tilde{g}_1^{*} \tilde{y}_1)$, used as test variable for symbol detection. Since the DSP Builder "sum of products" blocks in the "SINC interpolator" design require integer input and coefficients, binary shifting of the received signal and interpolator coefficients (computed from [1, Table 1]) is required. The "SINC interpolator" "shift taps" block outputs $\Re(\tilde{\mathbf{r}}_1)$, see (5), while the "parallel adder/subtractor" outputs $\Re(\tilde{g}_1)$, see (4). The interpolator output is then used for combining. Notice that channel estimation can be very demanding resource-wise, especially for multibranch receivers.

The RMSE subplot in Figure 3 indicates that 4.4 and 8.8 fixed-point FPGA computation does not visibly degrade channel estimation accuracy compared to floating-point (computer) computation. Nevertheless, the lower subplots show that fixed-point computation with a narrow word (i.e., poor precision, narrow dynamic range) can significantly degrade BER performance, an effect which accumulates with more branches.
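The difference between the 8.8 and 4.4 formats can be made tangible with a small quantizer sketch in plain Python. It assumes a signed two's-complement word in which the integer-part bit count includes the sign bit, and it assumes saturation on overflow; neither detail is stated in the text, and the function name is hypothetical.

```python
def quantize(x, int_bits, frac_bits):
    """Round x to a signed fixed-point value with int_bits integer bits
    (sign included, an assumption) and frac_bits fractional bits,
    saturating at the ends of the representable range."""
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    max_code = (1 << (total_bits - 1)) - 1
    min_code = -(1 << (total_bits - 1))
    # Python's round() resolves exact .5 ties to the nearest even integer.
    code = max(min_code, min(max_code, round(x * scale)))
    return code / scale
```

Under these assumptions, a sample of value 0.3 becomes 0.30078125 in 8.8 and 0.3125 in 4.4, so the rounding error grows from about 0.0008 to 0.0125. The 4.4 format also saturates just below 8, whereas 8.8 reaches just below 128, which illustrates the narrower dynamic range mentioned above.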
Figure 3 also indicates that the performance degradation (i.e., about 3.4 dB) which occurs for a conventional receiver due to i.k.c. can be successfully compensated for by an FPGA-based dual-branch MRC, due to its diversity gain. Confidence intervals for all these results are very tight, since 10 000 slots, that is, 60 000 data symbols, were detected.

[Figure 3: (a) RMSE for channel gain estimates. (b) and (c) Performance of the conventional, single-branch receiver ($L = 1$), and of the dual-branch MRC receiver ($L = 2$ i.i.d. branches) for various computer- and FPGA-based implementations. Fixed-point results correspond to both DSP Builder-based simulations and HIL implementations. Setting: $f_m = 0.01$; i.k.c. SINC PSAM with $M_s = 7$, $T = 11$. Horizontal axis: $E_s/N_0$ (dB), from 0 to 8. Vertical axis: RMSE in (a), BER (logarithmic) in (b) and (c). Curves: fixed point 4.4, fixed point 8.8, and floating point in (a); fixed point i.k.c. 4.4, fixed point i.k.c. 8.8, floating point i.k.c., and floating point p.k.c. in (b) and (c).]

For designs shown hereafter, we settled for an 8.8 representation, since it was found to offer a fair compromise between representation accuracy and dynamic range (i.e., receiver performance) and FPGA resource utilization. Furthermore, we instructed DSP Builder to allocate hard-wired DSP circuitry embedded into the reconfigurable FPGA fabric, which yields effective and efficient chip utilization [7]. Then, Quartus II reports on FPGA resource usage, maximum allowable clock frequency (CF), and dynamic power (DP) usage, as shown in Table 2. Estimated static power loss is 1.395 W. Note that for the BER advantage shown in Figure 3 over the conventional receiver, dual-branch MRC nearly doubles resource requirements and dynamic power loss. Since the MRC performance gradient diminishes with increasing number of branches [4], implementation and operational costs can be minimized either with tightly matched chips, or through clock gating of excess resources.

Table 2: Resource usage for 8.8 implementations of MRC, BF, and adaptive MREC, for up to $L = 4$ branches. Percentages are relative to the totals given in the column headers.

Method                    LEs (79 040)      Pins (692)     DSP (176)      CF (MHz)   DP (mW)
MRC, L = 1                13,227 (16.73%)   43 (6.21%)     16 (9.09%)     41.06      69.35
MRC, L = 2                26,478 (33.49%)   83 (11.99%)    32 (18.18%)    38.56      119.67
MRC, L = 3                39,731 (50.27%)   123 (17.77%)   48 (27.27%)    38.35      169.78
MRC, L = 4                55,983 (70.83%)   167 (24.13%)   64 (36.36%)    36.74      221.62
BF, L = 4                 13,457 (17.02%)   259 (37.43%)   48 (27.27%)    40.57      74.95
BVTC MREC, L = 4, N = 1   13,458 (17.02%)   262 (37.86%)   48 (27.27%)    41.15      74.95
BVTC MREC, L = 4, N = 2   26,940 (34.08%)   358 (51.73%)   96 (54.54%)    39.73      130.89
BVTC MREC, L = 4, N = 3   40,423 (51.14%)   454 (65.60%)   144 (81.81%)   39.09      186.64
BVTC MREC, L = 4, N = 4   55,847 (70.66%)   550 (79.48%)   176 (100%)     38.82      244.64
EVTC MREC, L = 4, N = 1   13,561 (17.16%)   424 (61.27%)   48 (27.27%)    41.09      75.67
EVTC MREC, L = 4, N = 2   27,372 (34.63%)   524 (75.72%)   96 (54.54%)    39.14      132.95
EVTC MREC, L = 4, N = 3   40,983 (51.85%)   624 (90.17%)   144 (81.81%)   35.43      189.23

In the above MRC receiver design, channel gains on different branches were considered statistically independent, for simplicity. However, this is rarely the case in practice [16]. Although scattering is richer around the mobile than around the base station, mobile antenna array size limitations can still lead to large interbranch correlation, that is, scarce diversity gain availability. Then, adaptive MREC [3, 4] may provide more suitable tradeoffs between performance and resource/power utilization, as shown next.

4.2. Enhanced MREC receiver designs: the case of a single user processed per FPGA chip

We extended the previously discussed FPGA-based MRC receiver design to support $L = 4$ branches, and also designed the BF and the BVTC adaptive MREC receivers.
See Table 2 for the resource and power usage report. Note that a standalone BF implementation takes about as many resources as order-1 MREC takes in the BVTC MREC implementation since these two designs are almost identical. Furthermore, MRC can be obtained from an MREC design by bypassing the KLT. Thus, an MREC design can easily be reconfigured (even during operation, on the fly) to implement BF or MRC instead. Implementation details are provided in Figure 4, for the case when the receiver implements BVTC adaptive MREC.

[Figure 4 block diagram. The transmitter, channel, and order-selection blocks are MATLAB functions and scripts or native Simulink blocks (floating point); the receiver is built from DSP Builder Simulink blocks (fixed point). Transmitter: data source (random 0/1), slot generator (PSAM, M_s = 7, pilot p = 0), BPSK modulator (input 0 gives b = +1, input 1 gives b = -1). Channel: azimuth spread generator, fading (f_m = 0.01), noise. BVTC MREC order selector with clock gating emulation (clk_i = clk if i <= N, clk_i inactive if i > N). Receiver: KLT, delay and storage, interpolation, channel estimation, combining, BPSK demodulator (b > 0 gives output 0, b < 0 gives output 1), data sink.]

Figure 4: Transmitter, channel, and FPGA-based BVTC MREC receiver diagram.

For resource/power usage and performance evaluation, we model a typical urban scenario for realistic channel conditions from the base station perspective [16], and apply the conventional and enhanced receiver combining algorithms (after estimating channel gains and eigengains as in Section 2.6) to detect the transmitted symbols. Using MATLAB and Simulink, the actual log-normal distributed, time-correlated azimuth spread is simulated and then employed to compute the spatial correlation matrix, for realistic Laplacian power azimuth spectrum (p.a.s.)
[16] (see Figure 4). In an actual embedded receiver, the channel correlation matrix and its eigenvalue decomposition could be updated by a processor (e.g., Altera’s soft-core FPGA-based Nios II). We selected a correlation update period of 0.14 second (denoted further as a frame, corresponding to a distance of roughly 2.3 m traveled by the mobile) since the azimuth spread remains relatively constant over this interval [16], providing the processor with sufficient time and uncorrelated samples for eigenstructure updating [3, 4]. The computed correlation matrix R_h is fed into a customized Simulink “multipath Rayleigh fading channel” block to simulate L = 4 correlated branches.

The top subplot in Figure 5 depicts an azimuth spread sequence generated using the model described in Section 2.2. The predominantly small-to-moderate azimuth spread values indicate that we should often expect significant spatial correlation [1, 3], that is, small available diversity gain. Performance enhancement can then arise from BF antenna gain. Occasionally, however, the azimuth spread can also become fairly large, but then the available diversity gain cannot benefit BF performance. On the other hand, significant diversity gain may be available too infrequently to justify permanent use of an MRC receiver. As we will see, an FPGA-based MREC receiver can provide, for a channel with slowly varying statistics, flexibility that yields affordable performance.

The main benefit of an FPGA-based BVTC adaptive MREC receiver is that unnecessary eigenbranches can be virtually turned off using the clock gating technique [12] to reduce dynamic power loss, while necessary eigenbranches can be implemented to run in parallel, for high speed. Exempting weak eigenbranches can also benefit performance [1]. Furthermore, as mentioned earlier, an MREC implementation can easily be reduced to standalone BF or MRC implementations, if required, either at system setup or during operation.
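The effect of turning eigenbranches off can be illustrated with a short Python sketch. It is not the DSP Builder design: it assumes that a receiver running with N active eigenbranches draws the dynamic power reported in Table 2 for the nonadaptive order-N BVTC MREC design, and it applies the gating rule of Figure 4 (eigenbranch i is clocked only when i <= N). The function names and the per-frame order sequence are hypothetical, chosen only to exercise the code.

```python
# Dynamic power (mW) per MREC order N, copied from Table 2 (BVTC MREC, L = 4).
DP_MW = {1: 74.95, 2: 130.89, 3: 186.64, 4: 244.64}


def active_clocks(n_selected, n_implemented=4):
    """Emulated clock gating: eigenbranch i is clocked only if i <= N."""
    return [i <= n_selected for i in range(1, n_implemented + 1)]


def average_dynamic_power(orders):
    """Average dynamic power (mW) over frames, one selected order per frame.

    Assumption: with N active eigenbranches, the gated design draws the
    Table 2 dynamic power of the nonadaptive order-N design.
    """
    return sum(DP_MW[n] for n in orders) / len(orders)


if __name__ == "__main__":
    # Hypothetical per-frame orders (illustrative, not measured data).
    orders = [1, 1, 2, 1, 3, 2, 1, 1]
    print(active_clocks(2))
    print(round(average_dynamic_power(orders), 2))
```

For an order sequence dominated by N = 1 and N = 2, as in this hypothetical example, the average stays well below the 244.64 mW of a permanently active order-4 design.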
[Figure 5 plots. Typical urban scenario: v = 60 km/h, d_AS = 50 m. ULA: L = 4, d_n = 1; E_s/N_0 = 5 dB; f_m = 0.01; SINC PSAM; M_s = 7; T = 11. Top subplot: azimuth spread (degrees) versus distance (m). Middle subplot: N from BVTC versus time (s). Lower subplot: average BER for MRC with L = 1, BF, BVTC MREC, and MRC with L = 4.]

Figure 5: Azimuth spread, MREC order selected with the BVTC, and BER performance (averaging over trial) for BF, MRC, and BVTC MREC.

Altera documentation states that clock gating is available only through lower-level (Quartus II) design. Therefore, clock gating was only emulated in DSP Builder, for the BVTC MREC implementation shown in Figure 4. First, nonadaptive MREC designs with N = 1:4 eigenbranches were compiled to determine their resource usage (shown in Table 2). Then, after each eigenstructure update during the BVTC MREC simulation, we stored the selected MREC orders and disconnected unused eigenbranches from the active structure. Finally, average resource usage was computed. Figure 5 shows in the middle subplot the MREC order selected adaptively using the BVTC, and in the lower subplot the BER averaged over the trial. Notice that for L = 4, MRC and BVTC adaptive MREC slightly outperform BF, and greatly outperform the single-branch receiver.

For the same typical urban scenario and system parameters, Figure 6 shows resource usage, as a percentage of the total available, and dynamic power consumption, averaged over 8 trials. In each trial, the azimuth spread samples are correlated, as described in Section 2.2, but the azimuth spread sequences are independent between trials. Note that BF and BVTC MREC require a significantly smaller share of the FPGA programmable fabric, that is, LEs, compared to MRC (for L = 4), but more dedicated DSP blocks, due to KLT. The upper-right subplot appears to imply a higher chip pin demand for BF and MREC, because a MATLAB/Simulink-computed eigenvector matrix E_N is input to the FPGA.
Nevertheless, eigenstructure updating is possible with a soft processor, from within the FPGA.

Figure 7 shows performance and total (dynamic + static) power used by a cellular operator’s large network of base stations similar to the one described in [11]. The single-branch receiver consumes the least power but performs poorly. For performance similar to BF and BVTC MREC, MRC (with L = 4) doubles the dynamic power loss (see also Figure 6(d)). Thus, BF and BVTC MREC appear to provide a better tradeoff. Recall, however, that a compact ULA with d_n = 1 is considered. For larger interelement distances (feasible at base stations), MREC with more than one eigenbranch can significantly outperform BF [4]. Note that significant branch correlation can occur even at mobile stations, due to limited antenna spacing, so that an FPGA-based BVTC MREC implementation employing clock gating can efficiently achieve near-optimum performance.

Notice from Figure 5(b) that, frequently, only one or two (out of the four implemented) eigenbranches were actually employed for MREC for that particular azimuth spread sequence. Similar results were obtained in other trials for independent azimuth spread sequences. This suggests that adaptive FPGA chip resource allocation among several active users may significantly increase base station user processing capacity, or, equivalently, reduce the required number of FPGA chips per base station, lowering both hardware cost and static power losses. A possible path towards such implementations is described next.

4.3. Enhanced MREC receiver designs: the case of two users processed per FPGA chip

EVTC-based adaptive MREC, described in Section 2.5, can provide more consistent use of the FPGA chip, compared to BVTC MREC. We propose to efficiently exploit a total of 3 eigenbranch processing modules, which fit into our FPGA, to process concurrently the signals received with L = 4 branches from two mobiles (without interference).
Rather than permanently allotting chip processing resources to a certain user (which may or may not need to use them, depending on channel conditions and required performance), herein we will adaptively deploy these resources to simultaneously detect the symbols transmitted from two mobiles.

Resource usage information for EVTC MREC when N = 1:3 eigenbranches are selected can be found in Table 2. Note that the BVTC and EVTC MREC implementations differ significantly only in the required number of chip pins. The larger number of pins required for EVTC MREC (to input the received signals from two mobiles) limits to 3 the possible number of implemented eigenbranches. Larger N_e leads to unsuccessful compilation.

Mutually independent azimuth spread sequences for the signals arriving at the base station from the two mobile stations were simulated, as shown in the top subplots of Figure 8. The MREC orders selected with the EVTC for each of the users are shown in the middle subplots. The lower subplots indicate that EVTC MREC can perform remarkably close to the enhanced receivers discussed previously.

Figure 9(a) indicates that our FPGA would not fit concurrent four-branch MRC implementations for the two users. On the other hand, the successfully compiled two-user EVTC MREC implementation with N_e = 3 requires about half of the dynamic power consumed by MRC, for similar performance.

[Figure 6 bar charts for MRC with L = 1, BF, BVTC MREC, and MRC with L = 4. Panels: logic elements (%), pins (%), DSP blocks (%), dynamic power (mW).]

Figure 6: Average resource and dynamic power usage for BF, BVTC MREC, and MRC, over 8 trials with mutually independent azimuth spread sequences.
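The idea of sharing N_e = 3 eigenbranch processing modules between two users can be illustrated with a small Python sketch. This is not the EVTC of Section 2.5, whose criterion is not reproduced here. It is a hypothetical greedy rule, assumed only for illustration: each user first receives its dominant eigenbranch, and each remaining module goes to the user whose next unused eigenvalue is largest. The function name and the eigenvalue lists are made up for the example.

```python
def allocate_eigenbranches(eigenvalues_per_user, total_modules=3):
    """Greedy split of eigenbranch modules among users (illustrative only).

    eigenvalues_per_user: one list of channel correlation eigenvalues per user.
    Returns the MREC order (number of eigenbranches) assigned to each user.
    """
    # Sort each user's eigenvalues, strongest first.
    sorted_eigs = [sorted(e, reverse=True) for e in eigenvalues_per_user]
    n_users = len(sorted_eigs)
    if total_modules < n_users:
        raise ValueError("need at least one module per user")

    # Every user gets its dominant eigenbranch.
    orders = [1] * n_users

    for _ in range(total_modules - n_users):
        # Users that still have an unused eigenbranch.
        candidates = [u for u in range(n_users) if orders[u] < len(sorted_eigs[u])]
        if not candidates:
            break
        # Give the module to the user with the largest next eigenvalue.
        best = max(candidates, key=lambda u: sorted_eigs[u][orders[u]])
        orders[best] += 1
    return orders


if __name__ == "__main__":
    # Hypothetical eigenvalues for L = 4 branches (each list sums to 4).
    user_a = [3.6, 0.3, 0.08, 0.02]   # narrow azimuth spread, high correlation
    user_b = [2.0, 1.4, 0.5, 0.1]     # wider azimuth spread, more diversity
    print(allocate_eigenbranches([user_a, user_b]))
```

In this hypothetical case, the user with the wider azimuth spread receives the spare module, while the highly correlated user keeps a single eigenbranch.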
Furthermore, since EVTC MREC allows for effective concurrent processing of two users on a single FPGA, it yields a twofold reduction in static power consumption or a doubling of the base station user processing capacity. Thus, both implementation and operational costs can be drastically reduced with EVTC MREC.

Ideally, an FPGA-based embedded base station receiver would comprise: (1) a number of FPGAs programmed for KLT, channel estimation, signal combining, and symbol detection; (2) an embedded processor monitoring each user’s channel conditions (i.e., eigenmodes). At the beginning of each frame, the embedded processor browses a user hierarchy, and allocates the FPGA resources so as to achieve desired performance for minimum resource/power consumption [3, 4]. Thus, it is possible that for a certain period, several users whose respective received signals are highly correlated will share the resources of a single FPGA because none of them will demand a large number of eigenbranches. If the azimuth spread for one of these users later widens significantly (yielding more available diversity gain) or if its SNR degrades (while a certain steady performance level is imposed), a larger share of the FPGA resources can be allocated accordingly. An FPGA-based embedded system for performance- and power-aware antenna array receivers can thus be flexibly implemented.

5. CONCLUSIONS

We have described and implemented adaptive techniques that enhance the performance and reduce the power consumption for Altera-FPGA-based embedded wireless receivers. We found that smart antenna array receiver algorithms, for example, beamforming (BF) and maximal-ratio combining (MRC), outperform the conventional, single-branch receiver, but the performance gain may not always justify the additional implementation and operational costs.
Tracking the slowly varying dominant channel eigenmodes, and using maximal-ratio eigencombining (MREC) is found to benefit more than BF and MRC from the parallelism and flexibility of FPGA-based implementation. For similar performance, a twofold increase in user processing capacity or decrease in power consumption is found possible over MRC, for a typical urban scenario and 4 receiving antennas. Adaptive MREC outperforms BF, for slightly higher resource consumption. FPGA flexibility and wide range of on-chip resources can thus yield very efficient embedded implementations of adaptive receivers for current and future generations of wireless communications systems.

[Figure 7 caption, partially recovered: ... consumption for BF, BVTC MREC, and MRC, over 8 independent azimuth spread trials.]

[Figure 8 plots, one column per user. Top subplots: azimuth spread versus distance (m). Middle subplots: N from EVTC versus time (s). Lower subplots: BER for MRC with L = 1, BF, BVTC, EVTC, and MRC with L = 4.]

Figure 8: Azimuth spread, EVTC MREC order, and average BER performance, for two users.

[Figure 9 caption, partially recovered: ... usage (in percentage of total available) and dynamic power consumption for all discussed receiver algorithms, for two independent users.]

ACKNOWLEDGMENT

The Altera Stratix FPGA development...

REFERENCES

... Siriteanu and S. D. Blostein, “Maximal-ratio eigencombining: a performance analysis,” Canadian Journal of Electrical and Computer Engineering, vol. 29, no. 1, pp. 15–22, 2004.
[2] C. Siriteanu and S. D. Blostein, “Smart antenna arrays for correlated and imperfectly-estimated Rayleigh fading channels,” in Proceedings of IEEE International Conference on Communications (ICC ’04), vol. 5, pp. 2757–2761, Paris, France, June 2004.
[3] C. Siriteanu and S. D. Blostein, “Maximal-ratio eigencombining for smarter antenna arrays,” to appear in IEEE Transactions on Wireless Communications.
[4] C. Siriteanu, “Maximal-ratio eigen-combining for smarter antenna arrays,” Ph.D. dissertation, Queen’s University, Kingston, ON, Canada, September 2006.
[5] M. Guillaud, A. Burg, M. ...
... algorithm implementation for speech recognition systems,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’01), vol. 2, pp. 1217–1220, Salt Lake City, Utah, USA, May 2001.
[9] T. S. T. Mak and K. P. Lam, “Embedded computation of maximum-likelihood phylogeny inference using platform FPGA,” in Proceedings of IEEE Computational Systems Bioinformatics Conference (CSB ...
... Jakes, Ed., Microwave Mobile Communications, John Wiley & Sons, New York, NY, USA, 1974.
[15] J. Proakis, Digital Communications, McGraw-Hill, Boston, Mass, USA, 4th edition, 2001.
[16] A. Algans, K. I. Pedersen, and P. E. Mogensen, “Experimental analysis of the joint statistical properties of azimuth spread, delay spread, and shadow fading,” IEEE Journal on Selected Areas in Communications, vol. 20, no. 3, ...
... Rappaport, Wireless Communications Principles and Practice, Prentice-Hall, Upper Saddle River, NJ, USA, 1996.
[18] J. K. Cavers, “An analysis of pilot symbol assisted modulation for Rayleigh fading channels,” IEEE Transactions on Vehicular Technology, vol. 40, no. 4, pp. 686–693, 1991.
[19] M. K. Simon and M.-S. Alouini, Digital Communication over Fading Channels: A Unified Approach to Performance Analysis, John ...
... long-term channel properties in space and time: eigenbeamforming concepts for the BS in WCDMA,” European Transactions on Telecommunications, vol. 12, no. 5, pp. 365–378, 2001, special issue on Smart Antennas, http://www.chrisbrunner.org
[21] J. Jelitto and G. Fettweis, “Reduced dimension space-time processing for multi-antenna wireless systems,” IEEE Wireless Communications, vol. 9, no. 6, pp. 18–25, 2002.
[22] F. ...

... Computer Engineering, Queen’s University, Kingston, Canada. His Ph.D. research has been in adaptive signal processing for smart antenna array receivers, with a focus on performance-complexity tradeoffs based on channel statistics. Between 2004 and 2006, he has also been a Course Instructor for the 4th year undergraduate Electrical/Computer Engineering Project Course at Queen’s University.

Steven D. Blostein ...

... Canadian Institute for Telecommunications Research. He has also been a Consultant to industry and government in the areas of image compression and target tracking, and was a Visiting Associate Professor in the Department of Electrical Engineering at McGill University in 1995. His current interests lie in the application of signal processing to wireless communications systems, including smart antennas, MIMO ...