Real-Time Digital Signal Processing - Chapter 2: Introduction to TMS320C55x Digital Signal Processor

Real-Time Digital Signal Processing. Sen M Kuo, Bob H Lee. Copyright © 2001 John Wiley & Sons Ltd. ISBNs: 0-470-84137-0 (Hardback); 0-470-84534-1 (Electronic)

Digital signal processors with architecture and instructions specifically designed for DSP applications have been launched by Texas Instruments, Motorola, Lucent Technologies, Analog Devices, and many other companies. DSP processors are widely used in areas such as communications, speech processing, image processing, biomedical devices and equipment, power electronics, automotive, industrial electronics, digital instruments, consumer electronics, multimedia systems, and home appliances. To design and implement DSP systems efficiently, we must have a solid knowledge of DSP algorithms as well as a basic understanding of processor architecture. In this chapter, we introduce the architecture and assembly programming of the Texas Instruments TMS320C55x fixed-point processor.

2.1 Introduction

Wireless communications, telecommunications, medical, and multimedia applications are developing rapidly. Increasingly, traditional analog devices are being replaced with digital systems. The fast growth of DSP applications is not surprising given the commercial advantages of DSP: potentially fast time to market, flexibility for upgrades to new technologies and standards, and the low design cost offered by various DSP devices. Rising demand, from digital handheld devices in the consumer market to digital networks and communication infrastructures, coupled with emerging internet applications, is the driving force for DSP applications.

In 1982, Texas Instruments introduced its first general-purpose fixed-point DSP device, the TMS32010, to the consumer market. Since then, the TMS320 family has extended into two major classes: fixed-point and floating-point processors. The TMS320 fixed-point family consists of C1x, C2x, C5x, C2xx, C54x, C55x, C62x, and
C64x. The TMS320 floating-point family includes C3x, C4x, and C67x. Each generation of the TMS320 series has a unique central processing unit (CPU) with a variety of memory and peripheral configurations. In this book, we chose the TMS320C55x as an example for real-time DSP implementations, applications, and experiments.

The C55x processor is designed for low power consumption, optimum performance, and high code density. Its dual multiply-accumulate (MAC) architecture provides twice the cycle efficiency in computing vector products, the fundamental operation of digital signal processing, and its scalable instruction length significantly improves code density. In addition, the C55x is source-code compatible with the C54x. This greatly reduces the migration cost from the popular C54x-based systems to C55x systems. Some essential features of the C55x device are listed below:

- Upward source-code compatible with all TMS320C54x devices.
- 64-byte instruction buffer queue that works as a program cache and efficiently implements block repeat operations.
- Two 17-bit by 17-bit MAC units that can execute dual multiply-and-accumulate operations in a single cycle.
- A 40-bit arithmetic and logic unit (ALU) that performs high-precision arithmetic and logic operations, with an additional 16-bit ALU performing simple arithmetic operations in parallel with the main ALU.
- Four 40-bit accumulators for storing computational results in order to reduce memory access.
- Eight extended auxiliary registers for data addressing plus four temporary data registers to ease data processing requirements.
- Circular addressing mode that supports up to five circular buffers.
- Single-instruction repeat and block repeat operations for supporting zero-overhead looping.

Detailed information about the TMS320C55x can be found in the manufacturer's manuals listed in references [1-6].

2.2 TMS320C55x Architecture

The C55x CPU consists of four processing units: an instruction buffer
unit (IU), a program flow unit (PU), an address-data flow unit (AU), and a data computation unit (DU). These units are connected to 12 different address and data buses as shown in Figure 2.1.

[Figure 2.1: Block diagram of the TMS320C55x CPU. The IU, PU, AU, and DU are connected by the 24-bit program-read address bus (PAB), the 32-bit program-read data bus (PB), three 24-bit data-read address buses (BAB, CAB, DAB), three 16-bit data-read data buses (BB, CB, DB), two 24-bit data-write address buses (EAB, FAB), and two 16-bit data-write data buses (EB, FB).]

2.2.1 TMS320C55x Architecture Overview

Instruction buffer unit (IU): This unit fetches instructions from the memory into the CPU. The C55x is designed for optimum execution time and code density. The instruction set of the C55x varies in length. Simple instructions are encoded using eight bits (one byte), while more complicated instructions may contain as many as 48 bits (six bytes). In each clock cycle, the IU can fetch four bytes of program code via its 32-bit program-read data bus. At the same time, the IU can decode up to six bytes of program. After four program bytes are fetched, the IU places them into the 64-byte instruction buffer. At the same time, the decoding logic decodes an instruction of one to six bytes previously placed in the instruction decoder, as shown in Figure 2.2. The decoded instruction is passed to the PU, the AU, or the DU.

[Figure 2.2: Simplified block diagram of the C55x instruction buffer unit. The program-read data bus (PB) delivers 32 bits (a 4-byte opcode fetch) to the 64-byte instruction buffer queue, which passes up to 48 bits (1-6 byte opcodes) to the instruction decoder feeding the PU, AU, and DU.]

The IU improves the efficiency of program execution by maintaining a constant stream of instruction flow between the four units within the CPU. If the IU is able to hold a segment of the code within a loop, the
program execution can be repeated many times without fetching additional code. Such a capability not only improves the loop execution time, but also saves power by reducing program accesses from the memory. Another advantage is that the instruction buffer can hold multiple instructions that are used in conjunction with conditional program flow control. This can minimize the overhead caused by program flow discontinuities such as conditional calls and branches.

Program flow unit (PU): This unit controls DSP program execution flow. As illustrated in Figure 2.3, the PU consists of a program counter (PC), four status registers, a program address generator, and a pipeline protection unit. The PC tracks the C55x program execution every clock cycle. The program address generator produces a 24-bit address that covers 16 Mbytes of program space. Since most instructions are executed sequentially, the C55x uses a pipeline structure to improve its execution efficiency. However, instructions such as branches, calls, returns, conditional execution, and interrupts cause a non-sequential program address switch. The PU uses a dedicated pipeline protection unit to protect program flow from any pipeline vulnerabilities caused by non-sequential execution.

Address-data flow unit (AU): The address-data flow unit serves as the data access manager for the data-read and data-write buses. The block diagram in Figure 2.4 shows that the AU generates the data-space addresses for data reads and data writes. It also shows that the AU consists of eight 23-bit extended auxiliary registers (XAR0-XAR7), four 16-bit temporary registers (T0-T3), a 23-bit extended coefficient data pointer (XCDP), and a 23-bit extended stack pointer (XSP). It has an additional 16-bit ALU that can be used for simple arithmetic operations. The temporary registers may be used to improve compiler efficiency by minimizing the need for memory access. The AU allows two address registers and a coefficient
pointer to be used together for processing dual data and one coefficient in a single clock cycle. The AU also supports up to five circular buffers, which will be discussed later.

[Figure 2.3: Simplified block diagram of the C55x program flow unit. The PU contains the program counter (PC), the status registers (ST0, ST1, ST2, ST3), the address generator, and the pipeline protection unit, and drives the 24-bit program-read address bus (PAB).]

[Figure 2.4: Simplified block diagram of the C55x address-data flow unit. The AU contains the temporary registers T0-T3, the extended auxiliary registers XAR0-XAR7, XCDP, XSP, a 16-bit ALU, and a 24-bit data address generator unit, connected to data memory space through the CB, DB, EB, and FB data buses and the BAB, CAB, DAB, EAB, and FAB address buses.]

Data computation unit (DU): The DU handles data processing for most C55x applications. As illustrated in Figure 2.5, the DU consists of a pair of MAC units, a 40-bit ALU, four 40-bit accumulators (AC0, AC1, AC2, and AC3), a barrel shifter, and rounding and saturation control logic. There are three data-read data buses that allow two data paths and a coefficient path to be connected to the dual-MAC units simultaneously. In a single cycle, each MAC unit can perform a 17-bit multiplication and a 40-bit addition or subtraction operation with a saturation option. The ALU can perform 40-bit arithmetic, logic, rounding, and saturation operations using the four accumulators. It can also be used to perform two 16-bit arithmetic operations in the upper and lower portions of an accumulator at the same time. The ALU can accept immediate values from the IU as data and communicate with other AU and PU registers. The barrel shifter may be used to perform a data shift in the range of $2^{-32}$ (shift right 32-bit) to $2^{31}$ (shift left 31-bit).

[Figure 2.5: Simplified block diagram of the C55x data computation unit. The DU contains the accumulators AC0-AC3, two MAC units, the 40-bit ALU, the barrel shifter, and overflow and saturation logic, connected to the 16-bit BB, CB, DB, EB, and FB data buses.]

2.2.2 TMS320C55x Buses

As illustrated in Figure 2.1, the TMS320C55x has one 32-bit program data bus,
five 16-bit data buses, and six 24-bit address buses. The program buses include a 32-bit program-read data bus (PB) and a 24-bit program-read address bus (PAB). The PAB carries the program memory address used to read code from the program space. Program addresses are in units of bytes. Thus the addressable program space is in the range 0x000000-0xFFFFFF (the prefix 0x indicates that the following number is in hexadecimal format). The PB transfers four bytes of program code to the IU each clock cycle.

The data-read buses consist of three 16-bit data-read data buses (BB, CB, and DB) and three 24-bit data-read address buses (BAB, CAB, and DAB). This architecture supports three simultaneous data reads from data memory or I/O space. The C and D buses (CB and DB) can send data to the PU, AU, and DU, while the B bus (BB) can only work with the DU. The primary function of the BB is to connect memory to the dual-MAC units, so that some specific operations, such as fetching two data values and one coefficient, can use all three data buses.

The data-write operations are carried out using two 16-bit data-write data buses (EB and FB) and two 24-bit data-write address buses (EAB and FAB). For a single 16-bit data write, only the EB is used. A 32-bit data write uses both the EB and FB in one cycle. The data-write address buses (EAB and FAB) have the same 24-bit addressing range. Since data accesses use word (2-byte) units, the data memory space is 23-bit word addressable, from address 0x000000 to 0x7FFFFF.

The C55x architecture is built around these 12 buses. The program buses carry the instruction code and immediate operands from program memory, while the data buses connect the various units. This architecture maximizes processing power by maintaining separate memory bus structures for full-speed execution.

2.2.3 TMS320C55x Memory Map

The C55x uses a unified program, data, and I/O memory configuration. All 16 Mbytes of memory are available as program
or data space. The program space is used for instructions, and the data space is used for general-purpose storage and CPU memory-mapped registers. The I/O space is separate from the program/data space and is used for duplex communication with peripherals. When the CPU fetches instructions from the program space, the C55x address generator uses the 24-bit program-read address bus. The program code is stored in byte units. When the CPU accesses data space, the C55x address generator masks the least significant bit (LSB) of the data address, since data is stored in memory in word units. The 16-Mbyte memory map is shown in Figure 2.6. Data space is divided into 128 data pages (0-127). Each page has 64 K words. The memory block from address 0 to 0x5F in page 0 is reserved for memory-mapped registers (MMRs).

[Figure 2.6: TMS320C55x program space and data space memory map.

  Region             Data space (word address, hex)   Program space (byte address, hex)
  MMRs (reserved)    00 0000-00 005F                  00 0000-00 00BF
  Page 0             00 0060-00 FFFF                  00 00C0-01 FFFF
  Page 1             01 0000-01 FFFF                  02 0000-03 FFFF
  Page 2             02 0000-02 FFFF                  04 0000-05 FFFF
  ...
  Page 127           7F 0000-7F FFFF                  FE 0000-FF FFFF]

2.3 Software Development Tools

The manufacturers of DSP processors typically provide a set of software tools for the user to develop efficient DSP software. The basic software tools include an assembler, linker, C compiler, and simulator. As discussed in Section 1.4, DSP programs can be written in either C or assembly language. Developing C programs for DSP applications requires less time and effort than developing the same applications in assembly language. However, the run-time efficiency and the code density of C programs are generally worse than those of assembly programs. In practice, high-level language tools such as MATLAB and C are used in early development stages to verify and analyze the functionality of the algorithms. Due to real-time constraints and/or memory
limitations, part (or all) of the C functions have to be replaced with assembly programs.

In order to execute the designed DSP algorithms on the target system, the C or assembly programs must first be translated into binary machine code and then linked together to form executable code for the target DSP hardware. This code conversion process is carried out using the software development tools illustrated in Figure 2.7.

[Figure 2.7: TMS320C55x software development flow and tools. Blocks shown: macro source files, archiver, macro library, C source files, C compiler, assembly source files, assembler, COFF object files, archiver, library of object files, library-build utility, run-time support libraries, linker, COFF executable file, hex converter, EPROM programmer, debugger, absolute lister, cross-reference lister, and the TMS320C55x target.]

The TMS320C55x software development tools include a C compiler, an assembler, a linker, an archiver, a hex conversion utility, a cross-reference utility, and an absolute lister. The debugging tools can be either a simulator or an emulator. The C55x C compiler generates assembly code from the C source files. The assembler translates assembly source files, either hand-coded by engineers or generated by the C compiler, into machine language object files. The assembly tools use the common object file format (COFF) to facilitate modular programming. Using COFF allows the programmer to define the system's memory map at link time. This maximizes performance by enabling the programmer to link the code and data objects into specific memory locations. The archiver allows users to collect a group of files into a single archive file. The linker combines object files and libraries into a single executable COFF object module. The hex conversion utility converts a COFF object file into a format that can be downloaded to an EPROM programmer.

In this section, we briefly describe the C compiler, assembler, and linker. A full description of these tools can be found in the user's guides [2,3].

2.3.1 C Compiler

As mentioned in Chapter 1, C is the most popular high-level language for evaluating DSP algorithms and developing real-time software for practical applications. The TMS320C55x C compiler first translates the C source code into TMS320C55x assembly source code. The assembly code is then passed to the assembler to generate machine code. The C compiler can generate either mnemonic assembly code or algebraic assembly code. Table 2.1 gives an example of the mnemonic and algebraic assembly code generated by the C55x compiler. In this book, we will introduce only the widely used mnemonic assembly language.

Table 2.1 An example of C code and the C55x compiler generated assembly code

  C code: in_buffer[i] = sineTable[i];

  Mnemonic assembly code      Algebraic assembly code
  mov *SP(#0), AR2            AR2 = *SP(#0)
  add #_sineTable, AR2        AR2 = AR2 + #_sineTable
  mov *SP(#0), AR3            AR3 = *SP(#0)
  add #_in_buffer, AR3        AR3 = AR3 + #_in_buffer
  mov *AR2, *AR3              *AR3 = *AR2

The C compiler package includes a shell program, code optimizer, and C-to-ASM interlister. The shell program supports automatic compiling, assembling, and linking of modules. The optimizer improves the run-time efficiency and code density of the C source files. The C-to-ASM interlister inserts the original C source statements as comments into the compiler's output assembly code, so the user can view the corresponding assembly instructions generated by the compiler for each C statement.

The C55x compiler supports American National Standards Institute (ANSI) C and its run-time support library. The run-time support library, rts55.lib, includes functions to support string operations, memory allocation, data conversion, trigonometry, and exponential manipulations.

The CCS introduced in Section 1.5 has made using DSP development tools (compiler, assembler, and linker) easier by providing default parameter settings and prompting for options. It is still beneficial for the user to
understand how to use these tools individually and to set parameters and options from the command line correctly. We can invoke the C compiler from a PC or workstation shell by entering the following command:

  cl55 [-options] [filenames] [-z [link_options] [object_files]]

The filenames can be one or more C program source files, assembly source files, object files, or a combination of these files. If we do not supply an extension, the compiler assumes the default extension .c, .asm, or .obj. The -z option enables the linker, while the -c option disables the linker. The link_options set up the way the linker processes the object files at link time. The object_files are additional object files for the linker to add to the target file at link time.

The compiler options fall into the following categories:

- Options that control the compiler shell, such as the -g option that generates symbolic debug information for debugging code.
- Options that control the parser, such as the -ps option that sets the strict ANSI C mode for C.
- Options that are C55x specific, such as the -ml option that sets the large memory model.
- Options that control the optimization, such as the -o0 option that sets the register optimization.
- Options that change the file naming conventions and specify the directories, such as the -eo option that sets the default object file extension.
- Options that control the assembler, such as the -al option that creates assembly language listing files.
- Options that control the linker, such as the -ar option that generates a relocatable output module.

There are a number of options in each of the above categories. Refer to the TMS320C55x Optimizing C Compiler User's Guide [3] for detailed information on how to use these options. The options are preceded by a hyphen and are not case sensitive. All single-letter options can be combined; for example, setting the options -g, -k, and -s is the same as setting the
compiler options as -gks. Two-letter options can also be combined if they have the same first letter. For example, setting the three options -pl, -pk, and -pi is the same as setting -plki.

The C language lacks specific DSP features, especially the fixed-point data operations that are necessary for many DSP algorithms. To improve compiler efficiency for real-time DSP applications, the C55x compiler provides a method for adding in-line assembly language routines directly into the C program. This allows the programmer to write highly efficient assembly code for the time-critical sections of a program. Intrinsics are another improvement: they let users replace DSP arithmetic operations with assembly intrinsic operators. We will introduce more compiler features in Section 2.7 when we present the mixing of C and assembly programs. In this chapter, we emphasize assembly language programming.

2.3.2 Assembler

The assembler translates processor-specific assembly language source files (in ASCII text) into binary COFF object files for specific DSP processors. Source files can contain assembler directives, macro directives, and instructions. Assembler directives are used to control various aspects of the assembly process, such as the source file listing format, data alignment, and section content. Binary object files contain separate blocks (called sections) of code or data that can be loaded into memory space.

Assembler directives are used to control the assembly process and to enter data into the program. They can be used to initialize memory, define global variables, set conditional assembly blocks, and reserve memory space for code and data. Some of the most important C55x assembler directives are described below:

.BSS directive: The .bss directive reserves space in the uninitialized .bss section for data variables. It is usually used to allocate data into RAM for run-time variables such as I/O buffers. For example,

  .bss xn_buffer, size_in_words

where the xn_buffer
points to the first location of the reserved memory space, and size_in_words specifies the number of words to be reserved in the .bss section. If we do not specify uninitialized data sections, the assembler will put all the uninitialized data into the .bss section.

.DATA directive: The .data directive tells the assembler to begin assembling the source code into the .data section, which usually contains data tables or pre-initialized variables such as sine-wave tables. The .data sections are word addressable.

.SECT directive: The .sect directive defines a section and tells the assembler to begin assembling source code or data into that section. It is often used to separate long programs into logical partitions. It can separate the subroutines from the main program, or separate constants that belong to different tasks. For example, ...

... locations in order to evaluate the real-time results using a DSP board. Emulators allow the DSP software to run at full speed in a real-time environment.

2.3.5 Assembly Statement Syntax

The TMS320C55x ...

... mode to variable x
auxiliary registers for dual data memory access. The coefficient data pointer (CDP) indirect mode uses the CDP to point to ...

... be used to initialize the memory using linker command files [2].

Table 2.2 Example of a linker command file used for the C55x simulator

/*

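As an illustration of the memory map described in Section 2.2.3, the relation between the word-addressed data space and the byte-addressed program space can be sketched in C. This is a minimal sketch for a host PC, not TI code: the function and macro names (all prefixed c55x_ or C55X_) are hypothetical, and the constants are taken only from the ranges quoted in the text, namely a 23-bit word address space, 128 data pages of 64 K words each, and MMRs at word addresses 0 to 0x5F.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants derived from the ranges quoted in the text. */
#define C55X_MAX_WORD_ADDR 0x7FFFFFu  /* last 23-bit data-space word address */
#define C55X_MAX_BYTE_ADDR 0xFFFFFFu  /* last 24-bit program-space byte address */
#define C55X_MMR_LAST      0x00005Fu  /* last MMR word address in page 0 */

/* Byte address of the first byte of a data word: one word is two bytes. */
uint32_t c55x_word_to_byte_addr(uint32_t word_addr)
{
    assert(word_addr <= C55X_MAX_WORD_ADDR);
    return word_addr << 1;
}

/* Word address containing a byte address: the LSB is dropped. */
uint32_t c55x_byte_to_word_addr(uint32_t byte_addr)
{
    assert(byte_addr <= C55X_MAX_BYTE_ADDR);
    return byte_addr >> 1;
}

/* Data page (0..127): the upper 7 bits of the 23-bit word address. */
unsigned c55x_data_page(uint32_t word_addr)
{
    assert(word_addr <= C55X_MAX_WORD_ADDR);
    return (unsigned)(word_addr >> 16);
}

/* Nonzero if the word address lies in the reserved MMR block of page 0. */
int c55x_is_mmr(uint32_t word_addr)
{
    return word_addr <= C55X_MMR_LAST;
}
```

For instance, c55x_word_to_byte_addr(0x000060) yields 0x0000C0, which matches the first non-reserved address in the data-space and program-space columns of Figure 2.6, and c55x_data_page(0x7F0000) yields 127, the last data page.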