Software Radio Architecture: Object-Oriented Approaches to Wireless Systems Engineering
Joseph Mitola III
Copyright © 2000 John Wiley & Sons, Inc.
ISBNs: 0-471-38492-5 (Hardback); 0-471-21664-X (Electronic)

10 Digital Processing Tradeoffs

This chapter addresses digital hardware architectures for SDRs. A digital hardware design is a configuration of digital building blocks. These include ASICs, FPGAs, ADCs, DACs, digital interconnect, digital filters, DSPs, memory, bulk storage, I/O channels, and/or general-purpose processors. A digital hardware architecture may be characterized via a reference platform, the minimum set of characteristics necessary to define a consistent family of designs of SDR hardware. This chapter develops the core technical aspects of digital hardware architecture by considering the digital building blocks. These insights permit one to characterize the architecture tradeoffs. From those tradeoffs, one may derive a digital reference platform capable of embracing the necessary range of digital hardware designs. The chapter begins with an overview of digital processing metrics and then describes each of the digital building blocks from the perspective of its SDR architecture implications.

I. METRICS

Processors deliver processing capacity to the radio software. The measurement of processing capacity is problematic. Candidate metrics for processing capacity are shown in Table 10-1. Each metric has strengths and limitations. One goal of architecture analysis is to define the relationship between these metrics and achievable performance of the SDR. The point of view employed is that one must predict the performance of an unimplemented software suite on an unimplemented hardware platform. One must then manage the computational demands of the software against the benchmarked capacities of the hardware as the product is implemented.
Finally, one must determine whether an existing software personality is compatible with an existing hardware suite. Consistent use of appropriate metrics assures that these tasks can be accomplished without unpleasant surprises.

TABLE 10-1 Processing Metrics

MIPS        Millions of Instructions per Second
MOPS        Millions of Operations per Second
MFLOPS      Millions of Floating Point Operations per Second
Whetstone   Supercomputing MFLOPS Benchmark
Dhrystone   Supercomputing MIPS Benchmark
SPECmark    SpecINT, SpecFP Instruction Mix Benchmarks (92 and 95)

1. Differentiating the Metrics

MIPS, MOPS, and MFLOPS are differentiated by logical scope. An operation (OP) is a logical transformation of the data in a designated element of hardware in one clock cycle. Processor architectures typically include hardware elements such as arithmetic and logic units (ALUs), multipliers, address generators, data caches, and instruction caches, all operating in parallel at a synchronous clock rate. MOPS are obtained by multiplying the number of parallel hardware elements by the clock speed. If multiple operations are required to complete a machine instruction (e.g., a floating-point multiply), then

$$\mathrm{MIPS} = \alpha \cdot \mathrm{MOPS}, \qquad \alpha < 1$$

If, on the other hand, the processor has a very long instruction word (VLIW), $\alpha$ may be greater than 1.

Suppose, for example, that a processor includes a “smart” cache, an ALU, and two parallel multiplier units with a 250 MHz system clock. One could characterize this processor in terms of the operations of the ALU and multipliers. If $\alpha = 1$, then it can deliver 250 × 3 or 750 MIPS, maximum. If the multipliers accomplish one 32-bit floating-point multiply on every clock cycle, then the processor provides 500 MFLOPS. Thus, one may characterize such a device as capable of a peak of 750 MIPS/500 MFLOPS. This notation means “750 MIPS of which up to 500 may be MFLOPS.” Digital filtering takes more floating-point operations than, say, protocol processing or FEC algorithms.
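The peak-rating arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the bookkeeping: the function name and parameters are hypothetical, the cache is left out of the unit count as in the worked example, and each multiplier is assumed to complete one floating-point multiply per clock.

```python
def peak_rates(clock_mhz, alu_units, multiplier_units, alpha=1.0):
    """Return (peak MIPS, peak MFLOPS) for a simple parallel-unit model."""
    # MOPS: number of parallel hardware elements times the clock rate in MHz.
    mops = clock_mhz * (alu_units + multiplier_units)
    # MIPS = alpha * MOPS; alpha < 1 when an instruction needs several operations.
    mips = alpha * mops
    # Assumption: every multiplier finishes one floating-point multiply per clock.
    mflops = clock_mhz * multiplier_units
    return mips, mflops


# The 250 MHz processor with one ALU and two multipliers from the text.
peak_mips, peak_mflops = peak_rates(clock_mhz=250, alu_units=1, multiplier_units=2)
print(peak_mips, peak_mflops)  # 750.0 500
```

Passing a value of `alpha` below 1 models instructions that need several operations each, while a value above 1 would correspond to the VLIW case.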
If the SDR application uses a mix of 50% ALU and 50% floating-point operations, then the processor delivers a maximum of 0.5 × 250 ALU operations plus 0.5 × 500 MFLOPS for a total of 125 + 250 = 375 MIPS. Clearly, processing capacity realized is a function of instruction mix.

Alternatively, one could consider just the memory cache operations, attributing 250 MOPS of memory operations (MEOPS). If the memory cache operates fast enough so that the ALU and multipliers are never waiting for data or instructions, then the memory cache is not a bottleneck. If, however, there are states in which it must wait, then the potential 750 MIPS will not be realized. In this case, since MEOPS < MIPS, then the peak of 750 MIPS cannot be sustained beyond the capacity of the cache. For extremely computationally intensive operations like digital filtering, one may in fact realize the maximum capacity because all the data is resident in cache. Cache misses then degrade performance.

2. Processor-Memory Interplay

The execution of an instruction requires accessing memory for instructions and data or accessing local registers. Processors that are more complex may fill a pipeline with instructions to be executed concurrently. Pipelines produce no results until the pipeline is full. Thereafter, pipelines produce a result per clock cycle. Newer architectures may employ set-associative cache coherency and other schemes to yield a higher number of instruction executions for a given clock speed. In addition, there is statistical structure to the application, which will determine whether the data and instruction necessary at the next step will be in the cache (cache hit) or not (cache miss). Statistical structure is also present in the mix of input/output, data movement in memory, logical (e.g., masking and finding patterns), and arithmetic needed by an application.
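The dependence of realized capacity on instruction mix, as in the 50/50 example earlier in this section, can be sketched as a weighted sum. The function and dictionary keys below are hypothetical names chosen for illustration; the per-unit rates are the 250 MOPS ALU and 500 MFLOPS multiplier figures from the text.

```python
def mix_capacity(mix, unit_rates):
    """Weight each unit's peak rate by its share of the instruction mix."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(fraction * unit_rates[kind] for kind, fraction in mix.items())


# Peak rates per class of operation, in millions per second (from the text).
rates = {"alu": 250.0, "float": 500.0}
# 50% ALU operations and 50% floating-point operations.
mix = {"alu": 0.5, "float": 0.5}
print(mix_capacity(mix, rates))  # 375.0
```

Changing the fractions in `mix` shows how quickly the usable figure moves away from the 750 MIPS peak as the application becomes less arithmetic-heavy.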
Some applications like FFTs are very computationally intensive, requiring a high proportion of arithmetic instructions. Others such as supporting display windows require more copying of data from one part of memory to another. And support of virtual memory requires the copying of pages of physical memory to hard disk or other large-capacity primary storage. This gives the programmer the illusion that physical memory is relatively unlimited (e.g., 32 gigabytes) within a physically confined space of, say, 128 Mbytes of physical memory.

3. Standard Benchmarks

Consequently, MIPS are hard to define. Often, the popular literature attributes MIPS based on a nonstatistical transformation of MOPS into instructions that could be executed in an ideal instruction mix. This approach makes the chip look as fast as it possibly could be. Since most manufacturers do this, the SDR engineer learns that achievable performance on the given application will be significantly less than the nominal MIPS rating. The manufacturer’s MIPS estimate is useful because it defines an upper bound to realizable performance. Most chips deliver 30 to 60% of such nominal MIPS as usable processing capacity in a realistic SDR mix.

In the 1970s, scientists and engineers concerned with quantifying the effectiveness of supercomputers developed the Whetstone, Dhrystone, and other benchmarks consisting of standard problem sets against which each new generation of supercomputer could be assessed. These benchmarks focused on the central processor unit (CPU) and on the match between the CPU and the memory architecture in keeping data available for the CPU. But they did not address many of the aspects of computing that became important to prospective buyers of workstations and PCs. The speed with which the display is updated is a key parameter of graphics applications, for example. The SPECmarks evolved during the 1990s to better address the concerns of the early-adopter buying public.
Consequently, SPECmarks are informative, but they too are not the ideal SDR metric, in that they do not generally reflect the mix of instructions employed by SDR applications. Turletti [293], however, has benchmarked a complete GSM base station using SPECmarks, as discussed further below.

4. SDR Benchmarks

At this point, the reader may be expecting some new “SDR benchmark” to be presented as the ultimate weapon in choosing among new DSP chips. Unfortunately, one cannot define such a benchmark. First of all, the radio performance depends on the interaction among the ASICs, DSP, digital interconnect, memory, mass storage, and the data-use structure of the radio application. These interactions are more fully addressed in Chapter 13 on performance management. It is indeed possible to reliably estimate the performance that will be achieved on the never-before-implemented SDR application. But the way to do this is not to blindly rely on a benchmark. Instead, one must analyze the hardware and software architecture (using the tools described later). One may then accurately capture the functional and statistical structure of the interactions among hardware and software. This systems analysis proceeds in the following steps:

1. Identify the processing resources.
2. Characterize the processing capacity of each class of digital hardware.
3. Characterize the processing demands of the software objects.
4. Determine how the capacity of the hardware supports the processing demands of the software by mapping the software objects onto the significant hardware partitions.

Figure 10-1 Identify processing resources.

There is a trap in identifying the hardware processor classes. ASICs and DSPs are easily identified as processing modules. But one must traverse each signal processing path through the system to identify buses, shared memory, disks, general-purpose CPUs, and any other component that is on the path from source to destination (outside the system).
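As a rough illustration of the four analysis steps listed above, the sketch below sums the demands of software objects mapped onto each hardware resource and reports the fraction of capacity used. Everything here is hypothetical: the resource names, the software objects, and the demand figures are invented for illustration, and each demand is assumed to be expressed in the same unit as the capacity of the resource it is mapped to (for example MIPS for a DSP, MByte/sec for a bus).

```python
from collections import defaultdict


def utilization(capacity, demand, mapping):
    """Return the fraction of each resource's capacity that is consumed."""
    load = defaultdict(float)
    # Step 4: map each software object onto its hardware partition.
    for software_object, resource in mapping.items():
        load[resource] += demand[software_object]
    return {name: load[name] / limit for name, limit in capacity.items()}


# Steps 1 and 2: identified resources and their capacities (illustrative values).
capacity = {"dsp": 375.0, "vme_bus": 40.0}
# Step 3: demands of the software objects (illustrative values).
demand = {"channel_filter": 150.0, "demodulator": 75.0, "adc_stream": 3.0}
mapping = {"channel_filter": "dsp", "demodulator": "dsp", "adc_stream": "vme_bus"}

report = utilization(capacity, demand, mapping)
overloaded = [name for name, share in report.items() if share > 1.0]
print(report, overloaded)
```

Note that the bus appears in the table of resources alongside the DSP, so that interconnect on the signal path is accounted for in the same way as the processors.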
Each such path is a processing thread. Each such processor has its own processing demand and priority structure against which the needs of the thread will be met. One then abstracts the block diagram into a set of critical resources, as illustrated in Figure 10-1.

This chapter begins the process of characterizing the capacity of SDR hardware. It summarizes the tradeoffs among classes of processor, functional architecture, and special instruction sets. Other source material describes how to program them for typical DSP applications [294]. The extensive literature available on the web pursues detailed aspects of processors further [295–298]. The popular press provides product highlights (e.g., [299–303]). This text, on the other hand, focuses on characterizing the processors with respect to the support of SDR applications. This is accomplished by the derivation of a digital processing platform model that complements the RF platform developed previously.

TABLE 10-2 Mapping of Segments to Hardware Classes

Segment   Module        Typical Performance    Illustrative Manufacturers
RF        RF/IF         HF, VHF, UHF           Watkins Johnson, Steinbrecher
IF        ADC           1 to 70 Msa/sec        Analog Devices (AD), Pentek
IF        Digital Rx    30.72 MHz Filters      Harris Semiconductor, Graychip, Sharp
IF        Memory        64 MB at 40 MHz        Harris, TRW
IF, BB    DSP           4 × 400 MFLOPS         TI, AMD, Intel, Mercury, AD, Sky
BS, SC    Bus Host      M68k, Pentium          Motorola, Force, Intel
SC        Workstation   50–100 SPECmark 92     Sun, HP, DEC, Intel

Legend: BB = baseband; BS = bitstream; SC = source coding.

II. HETEROGENEOUS MULTIPROCESSING HARDWARE

Segment boundaries among antennas, RF, IF, baseband, bitstream, and source segments defined in the earlier chapters make it easy to map multiband, multimode, multiuser SDR personalities to parallel, pipelined, heterogeneous multiprocessing hardware.

A. Hardware Classes

Some design strategies map radio functions to affordable open-architecture COTS hardware.
In one example, the VME or PCI chassis hosts the RF, IF, baseband, and bitstream segments as illustrated in Table 10-2. The workstation hosts the OA&M, systems management, or research tools including the user interface, development tools, networking, and source coding/decoding. Each module shown in the table represents a class of hardware. The parameters of these modules that assure that a software personality will work properly are defined in the digital processing reference platform.

Consider the roles of these hardware classes. The bus host serves as systems control processor. The DSPs support the real-time channel-processing stream, sometimes configured as one DSP per N subscriber channels, where N typically ranges from 1 to 16. The path from the ADC to the first filtering/decimation stage may use a dedicated point-to-point mezzanine interconnect such as DT Connect™ (Data Translation). Customized FibreChannel and Transputer links have also been used. Synchronization of the block-by-block transfers across this bus with the point-by-point operations of the first filtering and decimation stage introduces inefficiencies that reduce throughput. Fan-out from IF processing to multiple baseband-processing DSPs also may be accomplished via a dedicated point-to-point path such as a mezzanine bus. Alternatively, an open-architecture high-data-rate bus might be used.

Figure 10-2 Alternative processing modules and interconnect.

Instead of configuring such a heterogeneous multiprocessor at the board level, one might use a preconfigured system. Mercury™, for example, has offered a mix of SHARC 21060 [304] (Analog Devices), PowerPC RISC, and Intel i860 chips with Raceway interconnect [305–307]. Raceway I had nominally three paths at 160 MByte/sec interconnect capacity. Arrays of WE32’s were used in AT&T’s DSP-3 system. Arrays of i860’s were available from Sky Computer [308], CSPI [309], and others.
Of particular note is UNISYS’ militarized TOUCHSTONE processor, which was also based on the i860 [310]. Although the i860 is no longer a supported Intel product, the architectures are illustrative.

System-on-a-chip level architectures also employ ASIC functions, shared memory, programmable logic arrays, and/or DSP cores. The physical packaging of these functions may be organized in point-to-point connections, buses, pipelines, or meshes. In each case, digital interconnect intervenes between functional building blocks and memory. Threads are traced from RF stimuli to analog and digital responses. Often in handsets, there is no ADC or DAC. Instead, RF ASICs perform channel modem functions to yield an alternative functional flow. Figure 10-2 contrasts these complementary views of interconnect and other hardware classes. The boundaries of the digital flow are the external interface components. These include the display drivers, audio ASICs, and I/O boards that access the PSTN. Tradeoffs among internal interconnect are addressed in the next section.

B. Digital Interconnect

Digital interconnect in systems-on-a-chip architectures is an emerging area. Over time, standards may emerge because of the need to integrate IP from a mix of suppliers on a single chip. Macroscale digital interconnect has a longer history of product evolution, and that is the focus of this discussion. These macroscale architectures may serve as precursors to future nanoscale on-chip interconnect.

Illustrative approaches to digital interconnect for open-architecture processing nodes are the dedicated interconnect, wideband bus, and shared memory (Figure 10-3).

Figure 10-3 Illustrative classes of digital interconnect.

1. Dedicated Interconnect

Dedicated interconnect is typically available from subsystem suppliers like Pentek [311]. Pentek provides 70 MHz ADC boards and Harris or Graychip digital receiver boards.
Its MIX™ bus interconnects these cards efficiently. In addition, if the set of boards and interconnect does not work, the vendor resolves the issues. This approach leverages COTS products, with low cost and low risk. For applications with relatively small numbers of IF channels, it represents a solid engineering approach.

2. Wideband Bus

The next step up in technical sophistication is the wideband bus. The SCI bus [312], for example, has been used in supercomputer systems for several years. It is becoming available in turnkey formats including interface chip sets. The gigabyte-per-second capacity of the SCI bus could continue to increase with the underlying device technology. In addition, the design scales up easily to 8 × 140 MBps channels. The MIX bus, DT Connect, Raceway, SkyChannel [313], and other lower-capacity designs may be configured in parallel to attain high aggregate rates. This requires the hardware components to be appropriately partitioned. Other high-speed bus technologies are emerging, such as Vertical Laser at 115 GHz [314, 315].

3. Shared Memory

Shared memory can deliver the ultimate in interconnect bandwidth. Bulk memory of 64 MBytes easily has 16- to 64-bit paths. Scaling to 128 or 256 bits is feasible. Clock rates of 25 to 250 MHz are within reach. Thus, aggregate throughput of 3.2 to 64 gigabytes per second is becoming practicable with 4-ported shared memory. As the number of ports increases above 4, clock contention drives throughput down. But the switching, blocking, and routing of data streams need not degrade throughput if the shared memory is supported by programmable direct memory access (DMA) or equivalent hardware. If only two very wideband input streams and two output streams need to be interconnected simultaneously (possibly out of a choice of 4 or 8), the shared memory architecture may be the best choice.

Figure 10-4 Wideband ADC rate versus interconnect complexity.
Shared memory historically has the greatest performance, design/development cost, and risk of these approaches to digital interconnect.

4. SDR Applications

As illustrated in Figure 10-4, the ADC drives the digital interconnect architecture. Considering only the ADC’s output data rate (in millions of bytes per second) and the nominal capacity of typical buses, the figure shows the relationship between aggregate ADC rate and number of buses. One 40 MByte per second VME bus can support a 3 MByte per second ADC stream using less than 1/10 of its capacity. As data rates increase, multiple buses and/or buses of greater bandwidth must be used to support the data rate. The 600 MByte per second ADC rate represents two bytes of resolution at 300 MHz, while the 500 MHz ADC has only one byte of resolution in this example.

Interconnect efficiency is usually a function of the size of the data blocks being transferred. DMA transfers require setup, an overhead task that detracts from overall throughput. Buses also have bus-associated handshaking that constitutes overhead. Most buses experience low throughput for small block sizes. Mercury characterizes the performance of its products thoroughly. The maximum sustainable transfer rate of Raceway I varies as a function of DMA block length as illustrated in Figure 10-5. Although the peak rate of 160 MB/sec is not sustainable, it is approached with block sizes above 4096 bytes. Some devices (e.g., ADCs) may have short on-board buffers, constraining blocks to smaller sizes. In addition, algorithm constraints may prescribe smaller block sizes. A 0.5 ms GSM frame, digitized at 500 k samples per second, for example, may be processed with a block size of 250 samples (500 Bytes). If presented to Raceway in that format, the sustainable throughput would fall between 80 and 120 MB/sec as shown in the figure.

Figure 10-5 Interconnect efficiency.
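The data-rate and block-size arithmetic in this subsection can be checked with a short sketch. The helper names are hypothetical, the bus count uses nominal bus capacity only (it ignores the setup and handshaking overhead discussed above, so it is a lower bound), and two bytes per sample is assumed for the GSM block because the text equates 250 samples with 500 Bytes.

```python
import math


def adc_rate_mbytes(sample_rate_mhz, bytes_per_sample):
    """ADC output data rate in MByte/sec."""
    return sample_rate_mhz * bytes_per_sample


def buses_needed(adc_mbytes_per_s, bus_mbytes_per_s):
    """Minimum number of buses at nominal capacity, ignoring overhead."""
    return math.ceil(adc_mbytes_per_s / bus_mbytes_per_s)


def block_size(frame_s, sample_rate_hz, bytes_per_sample):
    """Samples and bytes in one frame-sized block."""
    samples = round(frame_s * sample_rate_hz)
    return samples, samples * bytes_per_sample


print(adc_rate_mbytes(300, 2))        # 600 MByte/sec: two bytes at 300 MHz
print(adc_rate_mbytes(500, 1))        # 500 MByte/sec: one byte at 500 MHz
print(buses_needed(3, 40))            # 1 VME bus for a 3 MByte/sec stream
print(block_size(0.5e-3, 500e3, 2))   # (250, 500): samples and bytes per block
```

Comparing the block size in bytes against a measured throughput curve such as the one for Raceway I is then a lookup rather than a calculation, since the sustainable rate at each block length comes from the vendor's characterization.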
If this is understood, then a constraint can be established between the algorithm and Raceway as an interconnect module. Constraint-management software can then assure that the capacity of the interconnect is not exceeded when instantiating a waveform into such hardware. In a more representative example, the entire bandwidth of the GSM allocation could be sampled at 50 M samples/sec, yielding 25.5 k samples per GSM frame, or over 50 kBytes. This data could be efficiently transferred to digital filter ASICs in 8 kByte blocks.

5. Architecture Implications

The physical format of digital interconnect (e.g., PCI, VME, etc.) need not be incorporated into an open-architecture standard for SDR. The less specific standard encourages competition and technology insertion by not unnecessarily constraining the implementations. On the other hand, such an architecture must recognize the fact that each class of physical interconnect entails implementation-specific constraints. An open architecture that supports multivendor product integration therefore must characterize those constraints to assure that software is installed on hardware with the necessary interconnect capabilities. Otherwise, interconnect capacity may become the system bottleneck that causes the node to fail or degrade unexpectedly.

An architecture standard used by a large enterprise to establish product migration paths, on the other hand, should specify the digital interconnect (e.g., PCI) and its migration from one physical realization to others as technology matures.

III. APPLICATIONS-SPECIFIC INTEGRATED CIRCUITS (ASICs)

The next step in the digital flow from the ADC to the back-end processors in a base station is typically a pool of ASICs. ASICs particularly suited to software radios include digital filters, FEC, and hybrid analog-digital RF-transceiver modules with programmable capabilities.
Waveform-specific ASICs are exhibiting increased programmability, mixing the capabilities of digital filters, FEC, and general-purpose processors for new classes of waveform (e.g., W-CDMA). In addition, DSP cores with custom on-chip capabilities are ASICs, but for clarity, they are addressed in the section on DSP architectures.

A. Digital Filter ASICs

Base station architectures need digital frequency translation and filtering for hundreds of simultaneous users. Minimum distortion and nonlinearities are required in the base-station receiver architecture to meet near–far requirements. Digital-filter ASICs therefore extract weak signals in the presence of strong signals. The architecture for such ASICs is illustrated in Figure 10-6. The frequency and phase of the ASIC is set so that the complex multiply-accumulator chip (CMAC) translates the wideband input to a programmable baseband. For first-generation cellular applications, the decimating digital filters (DDFs) yielded 25 or 30 kHz narrowband voice channels through computationally intensive filtering.

Hogenauer realized that adjustment of the integrator, comb, and decimator parameters reduces aliasing as illustrated in Figure 10-7 [316]. Aliasing bands are folded into baseband at the complex sampling frequency. Choice of decimation rate and comb filter parameters places a deep null in the band of interest, achieving 90 dB of dynamic range using limited-precision integer arithmetic. The Hogenauer filter thus facilitated the efficient realization of the Harris ASICs. The product line evolved to the HSP series now owned by Intersil.

Oh [317] has proposed the use of interpolated second-order polynomials as an improvement over the Hogenauer filter. Graychip has also been develop- [...]...
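The integrator, comb, and decimator structure attributed to Hogenauer above is commonly known as a cascaded integrator-comb (CIC) decimator. The following is a minimal sketch of the textbook form of that structure, not a model of the Harris or Graychip devices: the stage count, differential delay, and function name are assumptions, and Python's unbounded integers stand in for the wrap-around fixed-width registers used in hardware.

```python
def cic_decimate(samples, rate, stages=3, delay=1):
    """Decimate integer samples by `rate` with an N-stage CIC filter."""
    data = list(samples)
    # Integrator section: cascaded running sums at the input sample rate.
    for _ in range(stages):
        acc = 0
        summed = []
        for x in data:
            acc += x
            summed.append(acc)
        data = summed
    # Decimator: keep every `rate`-th sample.
    data = data[rate - 1::rate]
    # Comb section: y[n] = x[n] - x[n - delay], run at the reduced rate.
    for _ in range(stages):
        data = [x - (data[i - delay] if i >= delay else 0)
                for i, x in enumerate(data)]
    return data


# A constant input settles to the DC gain (rate * delay) ** stages = 4 ** 3.
print(cic_decimate([1] * 20, rate=4, stages=3)[-1])  # 64
```

Only additions and subtractions are used, which is why this structure suits limited-precision integer arithmetic; the decimation rate and differential delay together set where the nulls of the response fall.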
... of the shift register). All of the constraints may be enforced without user intervention if the computational demands of the radio application are compatible with the resources of the hardware platform. But the satisfaction of such constraints is only the first step in addressing potential conflicts between the personality and the platform. Some INFOSEC design rules, for example, preclude the use of one...

... appropriate data-access topologies. A well-conceived software-radio architecture therefore must support the insertion of FPGA technology as opportunities emerge. FPGAs may be accessed via tunneling as described above. In addition, however, SDR architecture must include FPGAs with multiple personalities. Srikanteswara et al. [331, 332] envision such soft radios as structured into the four layers illustrated in...

... the soft radio interface layer. This layer also returns processed data and error messages to the applications layer. This uppermost layer controls the architecture parameters, provides data from the ADC, and delivers results to the host processor, user, etc. This stack forms a subset of the layered virtual machine architecture as illustrated in Figure 10-18. In this model of architecture, the radio applications...

... digital input stream into a request to the FPGA-specific soft radio interface (SRI) layer. The SRI then behaves according to the packet-driven layering described by Srikanteswara et al. The SRI translates the Digital Filter( ) call into a bit pattern

FIELD-PROGRAMMABLE GATE ARRAYS (FPGAS)

Figure 10-17 Layered architecture for FPGA-based “soft radios.”

Figure 10-18 Layered virtual machine architecture embeds...
... architecture includes programmable INFOSEC [437].

VII. HOST PROCESSORS

UNIX VME hosts have been used in SPEAKeasy and other software radio technology pathfinders. In addition, the Motorola 601 and 604 series of general-purpose processors have been used for both DSP and general-purpose radio applications. In E-Systems’ CellTap™ series of first-generation cellular law-enforcement products, the IBM PC Industry...

... architecture. The continued validity of Moore’s Law will continue to transform radios into computers with antennas. Although the layering represented in Figure 10-22 is excruciatingly computationally intensive, MIPS are becoming cheap at an exponential rate. The layered architecture presented in the figure therefore insulates radio applications from the rapid hardware evolution. In the near-term, these...

... rate is reduced by 1/2 between successive stages. Each lattice requires two multipliers and two adders, so two such stages can be implemented in parallel in each of the two processors (4 : 1 parallelism potential). Since all but the first stage is decimated by multiples of 1/2, the last seven stages can be hosted on a single pair of multiply-accumulator resources in a processor. With an input rate of 7...

... scope. To reconfigure the FPGA to do other tasks while waiting for the disk is possible. This approach runs the risk that the processor will be configured the wrong way when the disk returns the data. A radio control algorithm, for example, could access any data in the system. The uncertainty about the arrival of control tasks puts a premium on processing interrupts efficiently. General-purpose processors...
... data-instruction combinations before repeating. Such algorithms therefore place high reconfiguration demands on FPGAs. A research breakthrough seems necessary to change this situation. In a limited-scope digital radio application, one could reconfigure the processor to filter the signal, then again to demodulate it, and then again to perform FEC. An FPGA should perform well within such a limited scope of processing...

... pruned from the population. After sufficient training, the survivors could be robust and nearly optimal. One advantage of this approach is that it substitutes machine learning for labor-intensive design, potentially saving time and cost. One disadvantage is the large number of data sets that must be processed by a large community of competing modems before the winner(s) emerge. If that disadvantage can be...