Scalable Ternary Content Addressable Memory Implementation Using FPGAs

Weirong Jiang
Xilinx Research Labs
San Jose, CA, USA
weirongj@acm.org

ABSTRACT
Ternary Content Addressable Memory (TCAM) is widely used in network infrastructure for various search functions. There has been a growing interest in implementing TCAM using reconfigurable hardware such as Field Programmable Gate Arrays (FPGAs). Most existing FPGA-based TCAM designs are based on brute-force implementations, which result in inefficient on-chip resource usage. As a result, existing designs support only a small TCAM size even with large FPGA devices. They also suffer from significant throughput degradation in implementing a large TCAM, mainly caused by deep priority encoding. This paper presents a scalable random access memory (RAM)-based TCAM architecture aiming for efficient implementation on state-of-the-art FPGAs. We give a formal study of RAM-based TCAM to unveil the ideas and the algorithms behind it. To conquer the timing challenge, we propose a modular architecture consisting of arrays of small-size RAM-based TCAM units. After decoupling the update logic from each unit, the modular architecture allows us to share each update engine among multiple units. This leads to resource savings. The capability of explicit range matching is also offered to avoid range-to-ternary conversion for search functions that require range matching. Implementation on a Xilinx Virtex 7 FPGA shows that our design can support a large TCAM of up to 2.4 Mbits while sustaining high throughput of 150 million packets per second. The resource usage scales linearly with the TCAM size. The architecture is configurable, allowing various performance trade-offs to be exploited. To the best of our knowledge, this is the first FPGA design that implements a TCAM larger than 1 Mbit.

Categories and Subject Descriptors
C.1.4 [Processor Architectures]: Parallel Architectures; C.2.6 [Computer Communication Networks]: Internetworking

General Terms
Algorithms, Design, Performance

Keywords
FPGA; RAM; TCAM

1. INTRODUCTION
Ternary Content Addressable Memory (TCAM) is a specialized associative memory where each bit can be 0, 1, or "don't care" (i.e. "∗"). TCAM has been widely used in network infrastructure for various search functions including longest prefix matching (LPM), multi-field packet classification, etc. For each input key, TCAM performs a parallel search over all stored words and finds the matching word(s) in a single clock cycle. A priority encoder is needed to obtain the index of the matching word with the highest priority. In a TCAM, the physical location normally determines the priority, e.g. the top word has the highest priority.

Most current TCAMs are implemented as a standalone application-specific integrated circuit (ASIC). We call them native TCAMs. Native TCAMs are expensive, power-hungry, and not scalable with respect to clock rate or circuit area, especially compared with Random Access Memories (RAMs). The limited configurability of native TCAMs does not fit the requirements of some network applications (e.g. OpenFlow [1]) where the width and/or the depth of different lookup tables can be variable [2].

Various algorithmic solutions have been proposed as alternatives to native TCAMs, but none of them is exactly equivalent to TCAM. The success of algorithmic solutions is limited to a few specific search functions such as exact matching and LPM. For some other search functions such as multi-field packet classification, the algorithmic solutions [3, 4] employ various heuristics, leading to nondeterministic performance that often depends on the characteristics of the data set.

On the other hand, reconfigurable hardware such as the field-programmable gate array (FPGA) combines the flexibility of software with near-ASIC performance. State-of-the-art FPGA devices such as Xilinx Virtex-7 [5] and Altera Stratix-V [6] provide high clock rate, low power dissipation, rich on-chip resources and large amounts of embedded memory with configurable word width. Due to their increasing capacity, modern FPGAs have been an attractive option for implementing various networking functions [7, 8, 9, 10]. Compared with ASIC, FPGA technology is increasingly favorable because of its shorter time to market, lower development cost and the shrinking performance gap between it and ASIC.

Due to the demand for TCAM to be flexible to configure and easy to integrate, there has been a growing interest in employing FPGAs to implement TCAM or TCAM-equivalent search engines. While there exist several FPGA-based TCAM designs, most of them are based on brute-force implementations that mimic the native TCAM architecture. Their resource usage is inefficient, which makes them less interesting in practice. On the other hand, some recent work [11, 12, 13] shows that RAMs can be employed to emulate/implement a TCAM.
However, none of them gives a correctness proof or a thorough study of efficient FPGA implementation. Their architectures are monolithic and do not scale well in implementing large TCAMs. A goal of this paper is to advance FPGA-based TCAM designs by investigating both the theory and the architecture of RAM-based TCAM implementation. The main contributions include:

• We give an in-depth introduction to the RAM-based TCAM. We formalize the key ideas and the algorithms behind it. We thoroughly analyze the theoretical performance of the RAM-based TCAM and identify the key challenges in implementing a large RAM-based TCAM.

• We propose a modular and scalable architecture that consists of arrays of small-size RAM-based TCAM units. By decoupling the update logic from each unit, such a modular architecture enables each update engine to be shared among multiple units. Thus logic resources are saved.

• We share our experience in implementing the proposed architecture on a state-of-the-art FPGA. The post place and route results show that our design can support a large TCAM of up to 2.4 Mbits while sustaining high throughput of 150 million packets per second (Mpps). To the best of our knowledge, this is the first FPGA design that implements a TCAM larger than 1 Mbit.

• We conduct comprehensive experiments to characterize the various performance trade-offs offered by the configurable architecture. We also discuss the support of range matching without range-to-ternary conversion.

The rest of the paper is organized as follows. Section 2 gives a detailed introduction to the theoretical aspects of the RAM-based TCAM. Section 3 discusses the hardware architectures for scalable RAM-based TCAM. Section 4 presents the comprehensive evaluation results based on the implementation on a state-of-the-art FPGA. Section 5 reviews the related work on FPGA-based TCAM designs. Section 6 concludes the paper.

978-1-4799-1640-5/13/$31.00 ©2013 IEEE

2. RAM-BASED TCAM

2.1 Terminology
We first have the following definitions:

• The Depth of a TCAM (or RAM) is the number of words in the TCAM (or RAM), denoted as N.

• The Width of a TCAM (or RAM) is the width (i.e. the number of bits) of each TCAM (or RAM) word, denoted as W.

• The Size of a TCAM (or RAM) is the total number of bits of the TCAM (or RAM). It equals N × W.

• The address width of a RAM is the number of bits of the RAM address, denoted as d. Note that $N = 2^d$ for a RAM.

We describe the organization of a TCAM or RAM as Depth × Width, i.e., N × W. For example, a 2×1 RAM consists of 2 words where each word is 1-bit. We call a TCAM or RAM wide (or narrow) if its width is large (or small). We call a TCAM or RAM deep (or shallow) if its depth is large (or small). We also use the notation shown in Table 1.

Table 1: Notation
Notation   Description
k          An input key or a binary number
t          A ternary word
A          An alphabet for 1-bit characters
$A^n$      The set of all n-bit strings over A
|s|        The length of a string $s \in A^n$; |s| = n

2.2 Main Ideas
A TCAM can be divided into two logical areas: (1) TCAM words and (2) priority encoder. Each TCAM word consists of a row of matching cells attached to the same matching line. During lookup, each input key goes to all the N words in parallel and retrieves an N-bit match vector. The i-th bit of the match vector indicates whether the key matches the i-th word, i = 1, 2, ..., N. In this section, for ease of discussion, we consider a TCAM without a priority encoder. Thus the output of the considered TCAM is an N-bit match vector instead of the index of the matching word with the highest priority.

Looking up an N × W TCAM is basically mapping a W-bit binary input key into an N-bit binary match vector. The same mapping can be achieved by using a $2^W \times N$ RAM where the W-bit input key is used as the address to access the RAM and each RAM word stores an N-bit vector. Figure 1(a) shows a 1×1 TCAM and its corresponding RAM-based implementation. As the TCAM word stores a "don't care" bit, the match vector is always 1 no matter whether the input 1-bit key is 0 or 1.

2.2.1 Depth Extension
The depth of a native TCAM is increased by vertically stacking words with the same width. Correspondingly, in the RAM-based implementation, the depth of a TCAM is extended by increasing the width of the RAM. Each column of the RAM represents the match vector for a word. Figure 1(b) shows a 2×1 TCAM which adds a word to the TCAM shown in Figure 1(a). Correspondingly, the RAM-based implementation adds a column to the RAM shown in Figure 1(a). We see that the memory requirement of either the native TCAM or its RAM-based implementation is linear in the depth.

We can also view the depth extension as concatenating the match vectors from multiple "shallower" TCAMs. For instance, an N × W TCAM can be horizontally divided into two TCAMs: one is $N_1 \times W$ and the other $N_2 \times W$, where $N = N_1 + N_2$. Then there are two RAMs in the corresponding RAM-based implementation: one is $2^W \times N_1$ and the other $2^W \times N_2$. The outputs of the two RAMs are concatenated to obtain the final N-bit match vector. This is essentially equivalent to building a wider RAM by concatenating two RAMs with the same depth.
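The RAM emulation and the depth extension described above can be checked with a small software sketch. This is an illustration only, not the paper's implementation: the helper names `matches`, `build_ram` and `lookup` are hypothetical, match vectors are held as strings of '0'/'1' characters (one character per TCAM word), and the example words are made up.

```python
from itertools import product


def matches(key: str, word: str) -> bool:
    """Return True if binary string `key` matches ternary string `word` ('*' = don't care)."""
    return all(t in ("*", k) for k, t in zip(key, word))


def build_ram(words: list[str]) -> list[str]:
    """Emulate an N x W TCAM with a 2^W x N RAM.

    Entry ram[k] is the N-bit match vector (a '0'/'1' string, one character
    per TCAM word) that is read out when key k is used as the address.
    """
    width = len(words[0])
    ram = []
    for bits in product("01", repeat=width):  # addresses 0 .. 2^W - 1, in order
        key = "".join(bits)
        ram.append("".join("1" if matches(key, w) else "0" for w in words))
    return ram


def lookup(ram: list[str], key: str) -> str:
    """A lookup is a single RAM read: the key is the address."""
    return ram[int(key, 2)]


# Figure 1(a): a 1x1 TCAM storing '*' yields match vector '1' for both keys.
ram_1x1 = build_ram(["*"])

# Depth extension: split a 3x2 TCAM into a 2x2 and a 1x2 TCAM and
# concatenate the two match vectors read from the two RAMs.
upper, lower = ["0*", "11"], ["*1"]
ram_full = build_ram(upper + lower)
ram_upper, ram_lower = build_ram(upper), build_ram(lower)
for key in ("00", "01", "10", "11"):
    assert lookup(ram_full, key) == lookup(ram_upper, key) + lookup(ram_lower, key)
```

The loop at the end confirms, for this small example, that concatenating the outputs of the two shallower RAMs gives the same match vector as the single wider RAM.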
For the sake of simplicity, we consider a wide RAM built by concatenating multiple RAMs as a single RAM.

Figure 1: (a) Matching a 1-bit key with a 1×1 TCAM; (b) Matching a 1-bit key with a 2×1 TCAM; (c) Matching a 2-bit key with a 1×2 TCAM

2.2.2 Width Extension
A wider TCAM deals with a wider input key. When implementing the TCAM in a single RAM, a wider input key (which is used as the address to access the RAM) requires a wider address for the RAM. This results in a deeper RAM whose depth is $2^W$. Figure 1(c) shows a 1×2 TCAM which extends the width of the TCAM shown in Figure 1(a). As the width of the input key is increased by 1 bit, the depth of the RAM in the corresponding RAM-based TCAM doubles. Such a design cannot scale well for wide input keys.

An alternative solution is using multiple "narrow" TCAMs to implement a wide TCAM. For example, an N × W TCAM can be vertically divided into two TCAMs: one is $N \times W_1$ and the other $N \times W_2$, where $W = W_1 + W_2$. During lookup, a W-bit input key is divided into two segments accordingly: one is $W_1$-bit and the other $W_2$-bit. Each of the two "narrower" TCAMs matches the corresponding segment of the key and outputs an N-bit match vector. The two match vectors are then bitwise ANDed to obtain the final match vector. The two "narrow" TCAMs map to two "shallow" RAMs in the corresponding RAM-based implementation. The total memory requirement becomes $2^{W_1} + 2^{W_2}$ RAM words (each N bits wide) instead of $2^W = 2^{W_1} \cdot 2^{W_2}$. Figure 2 shows how a 1×2 TCAM is built based on two 1×1 TCAMs.

Figure 2: Building a 1×2 TCAM using two 1×1 TCAMs

2.2.3 Populating the RAM
Given a set of ternary words, we need to populate the RAMs so that the RAM-based implementation can fulfill the same search function as the native TCAM. As shown in Figure 1(a), it is easy to populate the RAM for the RAM-based implementation of a 1×1 TCAM. Table 2 shows the content of the 2×1 RAM populated for the 1×1 TCAM, where RAM[k] denotes the RAM word at the address k, k = 0, 1.

Table 2: Representing a ternary bit in RAM
Value of the ternary bit   RAM[0]   RAM[1]
0                          1        0
1                          0        1
don't care                 1        1

Principle 1 shows the principle in populating the 2×1 RAM to represent a 1-bit TCAM, where $k \in \{0, 1\}$ and $t \in \{0, 1, *\}$.

Principle 1. RAM[k] = 1 if and only if k matches t; otherwise RAM[k] = 0.

Theorem 1. The RAM populated following Principle 1 achieves the equivalent function as the TCAM that stores t.

Proof. In the TCAM, the output for an input k is 1 ⇔ k matches t. Otherwise, the output is 0. In the populated RAM, the output for an input k is 1 ⇔ RAM[k] = 1. Otherwise, the output is 0. Thus the populated RAM is equivalent to the represented TCAM.

Both Principle 1 and Theorem 1 are directly applicable to the case of a 1 × W TCAM, where $k \in A^W$ with $A = \{0, 1\}$ and $t \in A^W$ with $A = \{0, 1, *\}$. Principle 1 can be extended to the case of an N × W TCAM that is implemented in a $2^W \times N$ RAM. Let RAM[k][i] denote the i-th bit of the k-th word in the RAM. Let $t_i$ denote the i-th word in the TCAM, i = 1, 2, ..., N. So we have Principle 2 for populating the $2^W \times N$ RAM to represent an N × W TCAM, where $k \in A^W$ with $A = \{0, 1\}$ and $t_i \in A^W$ with $A = \{0, 1, *\}$.

Principle 2. RAM[k][i] = 1 if and only if k matches $t_i$; otherwise RAM[k][i] = 0, i = 1, 2, ..., N.

When a wide TCAM is built using multiple narrower TCAMs, the RAM corresponding to each narrow TCAM is populated individually by following Principle 2.

2.3 Algorithms and Analysis
This section formalizes the algorithms for using a RAM-based TCAM that is built according to the discussion in Section 2.2.

2.3.1 General Model
Based on the discussion in Section 2.2.2, an N × W TCAM can be constructed using P narrow TCAMs, P = 1, 2, ..., W. The size of the i-th TCAM is $N \times W_i$, i = 1, 2, ..., P, and $W = \sum_{i=1}^{P} W_i$. Let $\mathrm{RAM}_i$ denote the RAM corresponding to the i-th narrow TCAM, i = 1, 2, ..., P. The size of $\mathrm{RAM}_i$ is $2^{W_i} \times N$, i = 1, 2, ..., P. Hence the N × W TCAM can be implemented using these P RAMs.
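As a rough illustration of the general model, the sketch below populates one RAM per key segment by applying Principle 2 to each segment, and counts the RAM bits for different partitions. It is a minimal sketch under stated assumptions: the function names (`matches`, `split`, `populate`, `ram_bits`) and the example words are hypothetical, words are indexed from 0 rather than from 1, and each N-bit match vector is stored in a Python integer.

```python
def matches(key: str, word: str) -> bool:
    """True if binary string `key` matches ternary string `word` ('*' = don't care)."""
    return all(t in ("*", k) for k, t in zip(key, word))


def split(s: str, widths: list[int]) -> list[str]:
    """Cut `s` into consecutive segments of widths W_1, ..., W_P."""
    segments, pos = [], 0
    for w in widths:
        segments.append(s[pos:pos + w])
        pos += w
    return segments


def populate(words: list[str], widths: list[int]) -> list[list[int]]:
    """Populate P RAMs following Principle 2, one RAM per key segment.

    rams[i][k] is an N-bit vector held in an int; bit n is 1 if and only if
    address k matches the i-th segment of word n (words are 0-indexed here).
    """
    rams = []
    for i, w in enumerate(widths):
        segs = [split(t, widths)[i] for t in words]
        ram = []
        for k in range(2 ** w):
            key = format(k, f"0{w}b")
            vector = 0
            for n, seg in enumerate(segs):
                if matches(key, seg):
                    vector |= 1 << n
            ram.append(vector)
        rams.append(ram)
    return rams


def ram_bits(n_words: int, widths: list[int]) -> int:
    """Total RAM bits: N times the sum over i of 2^{W_i}."""
    return n_words * sum(2 ** w for w in widths)


words = ["0*1*", "1100", "**01"]        # a 3 x 4 TCAM (made-up example)
rams = populate(words, [2, 2])          # P = 2, W_1 = W_2 = 2
single = ram_bits(len(words), [4])      # one 16 x 3 RAM
halves = ram_bits(len(words), [2, 2])   # two 4 x 3 RAMs
```

For this example the single-RAM organization needs 48 bits while the two-segment organization needs 24, which mirrors the $2^{W_1} + 2^{W_2}$ versus $2^W$ comparison in Section 2.2.2.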
2.3.2 Lookup
Algorithm 1 shows the algorithm to search a key over the RAM-based TCAM. It takes O(1) time to access each RAM. Since the P RAMs are accessed in parallel, the overall time complexity for lookup is O(1).

Algorithm 1 Lookup
Input: A W-bit key k
Input: {RAM_i}, i = 1, 2, ..., P
Output: An N-bit match vector m
  Divide k into P segments: k → {k_1, k_2, ..., k_P}, |k_i| = W_i, i = 1, 2, ..., P
  Initialize m to be all 1's: m ← 11···1
  for i ← 1 to P do {bitwise AND}
    m ← m & RAM_i[k_i]
  end for

2.3.3 Update
Updating a TCAM can be either adding or deleting a specific TCAM word. Algorithm 2 shows the algorithm to add or delete the n-th word of the TCAM in the RAM-based implementation, where n = 1, 2, ..., N. It takes $O(2^{W_i})$ time to update $\mathrm{RAM}_i$, i = 1, 2, ..., P. As the P RAMs are updated in parallel, the overall time complexity for update is determined by the RAM that takes the longest time to update, which is $O(\max_{i=1}^{P} 2^{W_i}) = O(2^{\max_{i=1}^{P} W_i})$.

Algorithm 2 Updating a TCAM word
Input: A W-bit ternary word t
Input: The index of t: n
Input: The update operation: op ∈ {add, delete}
Output: Updated {RAM_i}, i = 1, 2, ..., P
  Divide t into P segments: t → {t_1, t_2, ..., t_P}, |t_i| = W_i, i = 1, 2, ..., P
  for i ← 1 to P do {Update each RAM}
    for k ← 0 to 2^{W_i} − 1 do
      if k matches t_i and op == add then
        RAM_i[k][n] = 1
      else
        RAM_i[k][n] = 0
      end if
    end for
  end for

2.3.4 Space Analysis
The size of $\mathrm{RAM}_i$ is $2^{W_i} \times N$, i = 1, 2, ..., P. Hence the overall memory requirement is $\sum_{i=1}^{P} (2^{W_i} \times N) = N \sum_{i=1}^{P} 2^{W_i}$. To minimize the overall memory requirement, we formulate the problem as:

$$\min_{P \in \{1, \dots, W\}} \; \min_{\{W_1, W_2, \dots, W_P\}} \; \sum_{i=1}^{P} 2^{W_i} \qquad (1)$$

$$\text{subject to} \quad \sum_{i=1}^{P} W_i = W \qquad (2)$$

For a given P, $\min_{\{W_1, W_2, \dots, W_P\}} \sum_{i=1}^{P} 2^{W_i} = P \cdot 2^{W/P}$ when $W_i = W/P$, i = 1, 2, ..., P (by the inequality of arithmetic and geometric means, the sum is at least $P \cdot 2^{W/P}$, with equality when all $W_i$ are equal). Hence the overall memory requirement is minimum when all the P RAMs have the same address width, denoted as $w = W/P$. The depth of each RAM is $2^w$. Then the overall memory requirement is

$$\sum_{i=1}^{P} (2^{W_i} \cdot N) = \sum_{i=1}^{W/w} (2^{w} \cdot N) = \frac{W}{w} 2^{w} N = \frac{2^{w}}{w} N W \qquad (3)$$

We define the RAM/TCAM ratio as the number of RAM bits needed to implement a TCAM bit. According to Equation (3), the RAM/TCAM ratio is $2^w / w$ when all the RAMs employ the same address width of w. Basically a larger w results in a larger RAM/TCAM ratio, which indicates lower memory efficiency. The minimum RAM/TCAM ratio is 2, obtained when w = 1 (P = W) or w = 2 (P = W/2). In other words, when the depth of each RAM is 2 (w = 1) or 4 (w = 2), the overall memory requirement achieves the minimum, which is 2NW, i.e. twice the size of the corresponding native TCAM.

2.3.5 Comparison with Native TCAM
Table 3 summarizes the difference between the native TCAM and its corresponding implementation using P RAMs, with respect to time and space complexities. Here we assume that all the RAMs employ the same address width (w), so that both the update time and the space complexities achieve the optimum for the RAM-based TCAM (as discussed in Sections 2.3.3 and 2.3.4).

Table 3: Native TCAM vs RAM-based TCAM
              Native TCAM   RAM-based TCAM
Lookup time   O(1)          O(1)
Update time   O(1)          $O(2^w)$
Space         NW            $\frac{2^w}{w} NW$

3. HARDWARE ARCHITECTURE
We are interested in implementing the RAM-based TCAM on FPGA. While the theoretical discussion in Section 2 excludes the priority encoder, the hardware architecture of the RAM-based TCAM must consider the priority encoder.

3.1 Basic Architecture
The theoretical model of the RAM-based TCAM implementation (discussed in Section 2.3.1) can be directly mapped to the hardware architecture shown in Figure 3. An N × W TCAM is implemented using P RAMs where the size of the i-th RAM is $2^{W_i} \times N$ and $\sum_{i=1}^{P} W_i = W$. As illustrated in Algorithm 1, a lookup is performed by dividing the input W-bit key into P segments where the length of the i-th segment is $W_i$, i = 1, 2, ..., P. Then each segment of the key is used as the address to access the corresponding RAM. Each RAM outputs an N-bit vector. The P N-bit vectors are then bitwise ANDed to generate the final match vector.
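The lookup datapath above, together with the update procedure of Algorithm 2, can be modelled in software as follows. This is a behavioural sketch only, not the hardware design: the class name `RamTcam` and its method names are hypothetical, word indices are 0-based rather than 1-based, and each N-bit vector is stored in a Python integer with bit n standing for word n.

```python
class RamTcam:
    """Software model of an N x W TCAM built from P RAMs (Algorithms 1 and 2)."""

    def __init__(self, depth: int, widths: list[int]):
        self.depth = depth      # N, the number of TCAM words
        self.widths = widths    # [W_1, ..., W_P]; their sum is W
        # rams[i][k] is the N-bit vector stored at address k of RAM_i.
        self.rams = [[0] * (2 ** w) for w in widths]

    def _segments(self, s: str) -> list[str]:
        """Cut a W-character string into P segments of widths W_1, ..., W_P."""
        out, pos = [], 0
        for w in self.widths:
            out.append(s[pos:pos + w])
            pos += w
        return out

    def update(self, word: str, n: int, op: str) -> None:
        """Algorithm 2: add or delete ternary `word` at index n (0-based here)."""
        for ram, seg, w in zip(self.rams, self._segments(word), self.widths):
            for k in range(2 ** w):             # visit every address of this RAM
                key = format(k, f"0{w}b")
                hit = all(t in ("*", b) for b, t in zip(key, seg))
                if hit and op == "add":
                    ram[k] |= 1 << n            # set bit n
                else:
                    ram[k] &= ~(1 << n)         # clear bit n

    def lookup(self, key: str) -> int:
        """Algorithm 1: AND the vectors read from the P RAMs."""
        m = (1 << self.depth) - 1               # all 1's
        for ram, seg in zip(self.rams, self._segments(key)):
            m &= ram[int(seg, 2)]
        return m


tcam = RamTcam(depth=4, widths=[2, 2])
tcam.update("0*1*", 0, "add")
tcam.update("0110", 1, "add")
both = tcam.lookup("0110")       # words 0 and 1 match: bits 0 and 1 are set
```

In this model the inner loop of `update` touches all $2^{W_i}$ addresses of each RAM, which is where the $O(2^w)$ update time comes from, while `lookup` performs one read per RAM.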
The match vector is finally fed into the priority encoder to obtain the index of the matching word with the highest priority. A 1-bit Match signal is also generated to indicate if there is any match.

Figure 3: Basic architecture (without update logic)

We add update logic to the RAM-based TCAM so that it can complete any update by itself at run time. In accordance with Algorithm 2, Figure 4 shows the logic for updating the RAM-based TCAM, where $W_{max} = \max_{i=1}^{P} W_i$. We use two W-bit binary numbers, denoted as Data and Mask, to represent a W-bit ternary word t that is to be updated. The i-th bit of t is a "don't care" bit if and only if the i-th bit of Mask is set to 0, i = 1, 2, ..., W. For example, the 2-bit ternary word 0∗ can be represented by: Data = 00 or 01, and Mask = 10. Id specifies the index of the ternary word to be updated. Op indicates if the ternary word is to be added (Op = 0) or deleted (Op = 1).

Adding or deleting the Id-th ternary word t is accomplished by setting or clearing the Id-th bit of all the RAM words whose addresses match t. Meanwhile we must keep the rest of the bits of these RAM words unchanged. Hence we need to read the original content of the RAM words, change only the Id-th bit and then write the updated RAM words back to the RAM. This requires $2 \cdot 2^w$ clock cycles to update a single-port RAM whose address width is w. To reduce the update latency, we utilize a simple dual-port RAM and perform the read and the write in the same clock cycle. A simple dual-port RAM has two input ports and one output port. One input port is for Read only and the other is for Write only. At each clock cycle during update, the update logic writes the updated RAM word to the address k while reading the content of the RAM word at the address k + 1. Hence the update latency becomes $2^w + 1$ clock cycles, where the extra clock cycle is consumed to fetch the content of the first RAM word. Another part of the update logic is a state machine (not shown in Figure 4) that switches the state of the TCAM between lookup and update. During update, no lookup is permitted and any match result is invalid.

Figure 4: Update logic. "CMP": Compare. "CHG": Change.

3.2 Modular Architecture
In implementing a large-scale RAM-based TCAM on FPGA, there are two main challenges:

• Throughput: When the TCAM is deeper or wider, the logic and the routing complexities become larger, especially for bitwise-ANDing a lot of wide bit vectors and for priority encoding a deep match vector. This results in significant degradation in the achievable clock rate, which determines the maximum throughput of the RAM-based TCAM.

• Resource usage: The on-chip resource of an FPGA device is limited. Hence we must optimize the architecture to save the resource or use the resource efficiently. We need to find the best memory configuration based on the physical capability. It is also desirable to enable resource sharing between subsystems.

We propose a scalable and modular architecture that employs configurable small-size RAM-based TCAM units as building blocks. Both bit vector bitwise-ANDing and priority encoding are performed in a localized and pipelined fashion so that high throughput is sustained for large TCAMs. We decouple the update logic from each unit so that a single update engine can be shared flexibly by multiple TCAM units. On-chip logic resources are thus saved. Note that such resource sharing is only possible in a modular architecture.

3.2.1 Overview
The top-level design consists of a grid of units, which are organized in multiple rows. Figure 5 shows the top-level architecture with R rows, each of which contains L units. The TCAM words with higher priority are stored in the units with lower index. The units within a row are searched sequentially in a pipelined fashion. Priority is resolved locally within each unit. After each row outputs a matching result, a global priority encoder is needed to select the one with the globally highest priority.

Figure 5: Top-level architecture

3.2.2 Unit Design
A TCAM unit is basically a U × W TCAM implemented in RAMs, where U is the number of TCAM words per unit. Figure 6 depicts the architecture of a TCAM unit. Each unit performs the local TCAM lookup and combines the local match result with the result from the preceding unit. As the unit index determines the priority, a matching TCAM word stored in the preceding units always has a higher priority than the local matching one.

The U × W TCAM is constructed using P RAMs based on the basic architecture shown in Section 3.1. We use the same address width w for all the P RAMs to achieve the maximum memory efficiency as discussed in Section 2.3.4.

When W is large, there are many RAMs, each of which outputs a U-bit vector. The throughput may degrade in bitwise-ANDing a large number of bit vectors. We divide a unit into multiple pipelined stages. Let H denote the number of stages in a unit. Then each stage contains $P/H$ RAMs. Within each stage, the bit vectors generated by the $P/H$ RAMs are bitwise-ANDed. The resulting U-bit vector is combined with the bit vector passed from the previous stage and then passed to the next stage. The last stage of the unit performs the local priority encoding.

Figure 6: A Unit

3.2.3 Update Engine
We make the following observations:

• Updating a TCAM word involves updating only one unit.

• Update logic is identical for the units with the same memory organization.

To save logic resource, it is desirable to share the update logic between units. We decouple the update logic from units and build multiple update engines. Each update engine contains the update logic and serves multiple units. An update engine maintains a single state machine and decides which unit is to be updated based on the index (Id) of the TCAM word to be updated. A unit receives from its update engine the addresses and the write enable signals for its RAMs. The unit also interacts with its update engine to exchange the bit vectors to update each RAM word.

Due to the decoupling of the update logic from the units, the association between the lookup units (LUs) and the update engines (UEs) is flexible. The only constraint is that the units served by the same update engine must have the same memory organization (i.e. P and w). Figure 7 shows three different example layouts of the update engines in a 4-row, 4-unit-per-row architecture.

Figure 7: Example layouts of update engines (UEs): (a) Row; (b) Column; (c) Square. LU: lookup unit.

3.3 Explicit Range Match
In some network search applications such as access control list (ACL), a packet is matched against a set of rules. An ACL-like rule specifies the match condition on each of the multiple packet header fields. Some fields such as TCP ports are specified using ranges rather than ternary strings. Taking 5-field ACL as an example, the two 16-bit port fields are normally specified in ranges. The ranges must be converted into ternary strings so that such rules can be stored in a TCAM. However, a range may be converted into multiple ternary strings. An r-bit range can be expanded to 2(r − 1) prefixes or 2(r − 2) ternary strings. If there are D such fields in a rule, then this rule can be expanded to $(2r - 4)^D$ ternary words in the worst case. Such a problem is called "rule expansion" [14]. Various range encoding methods have been proposed to minimize rule expansion. Even with the optimal range encoding [14], it needs r ternary words to represent an r-bit range. In such a case, a rule with D range fields will occupy $O(r^D)$ TCAM words.

An attractive advantage of FPGA compared with ASIC is that we can reprogram the hardware on-the-fly to add customized logic. So for the ACL-like search problems, we adopt an idea similar to [15] and augment the TCAM design with explicit range match support instead of converting ranges into ternary strings. This is achieved by storing the lower and upper bounds of each range explicitly in registers. Hence, if there are N rules each containing D r-bit port fields, then we require in total $N \cdot D \cdot r \cdot 2$ bits of registers to store the lower and the upper bounds of all the ranges. On the other hand, the size of the TCAM that needs to be stored in RAMs is reduced to $N \times (W - D \cdot r)$.

3.4 Mapping to Physical Hardware
According to the theoretical analysis in Section 2.3.4, the RAM-based implementation of an N × W TCAM requires the minimum memory when employing shallow RAMs with the same depth of 2 or 4. However, real hardware has limitations on the minimum depth of physical RAMs. For example, each block RAM (BRAM) available on a Xilinx Virtex-7 FPGA can be configured as 512 × 72, 1K × 36, 2K × 18, 4K × 9, 8K × 4, 16K × 2, or 32K × 1, in simple dual-port mode. In other words, the minimum depth for a BRAM is $512 = 2^9$. Let $d_{min}$ denote the minimum address width of the physical RAM. An N × W (logical) RAM where $N \le 2^{d_{min}}$ will be mapped to a $2^{d_{min}} \times W$ physical RAM. Thus the RAM/TCAM ratio becomes $\frac{2^{\max(w, d_{min})}}{w}$ instead of $\frac{2^w}{w}$.

A trick that can be played is to map multiple shallow (logical) RAMs to a deep physical RAM. For example, two $2^d \times W$ (logical) RAMs can be mapped to a single $2^{d+1} \times W$ physical RAM. But the throughput will be halved unless the physical RAM has two sets of input/output ports used independently for the two (logical) RAMs. While some multi-port RAM designs [16] are available, they bring extra complications and are beyond the scope of this paper. Hence when implementing the RAM-based TCAM in real hardware, the address width of each RAM, i.e. w, should be carefully chosen based on the available physical configuration.

4. PERFORMANCE EVALUATION
We implement our modular RAM-based TCAM architecture on a Xilinx Virtex-7 XC7V2000T device with -2 speed grade. We evaluate the performance based on the post place and route results from the Xilinx Vivado 2013.1 development toolset. To recap, we list the key parameters of the architecture in Table 4. Note that $N = R \cdot L \cdot U$.

Table 4: Architectural parameters
Parameter   Description
N           TCAM depth
W           TCAM width
R           The number of rows
L           The number of units per row
U           The number of TCAM words per unit
H           The number of stages per unit
w           The address width of the RAM

4.1 Analysis and Estimation
Due to its pipelined architecture, our RAM-based TCAM implementation processes one packet every clock cycle. Thus the throughput is F million packets per second (Mpps) when the clock rate of the implementation achieves F MHz. During lookup, each packet traverses the R rows in parallel. It takes $L \cdot H$ clock cycles to go through each row. One clock cycle is needed for final priority encoding when the architecture consists of more than one row. Thus the lookup latency in terms of the number of clock cycles is

$$\text{latency} = \begin{cases} L \cdot H & \text{if } R = 1 \\ L \cdot H + 1 & \text{if } R > 1 \end{cases}$$

The address width of RAMs, i.e. w, is a critical parameter in our RAM-based TCAM. The update latency is $2^w + 1$ clock cycles while the memory requirement for implementing an N × W TCAM is $\frac{2^w}{w} N W$. To determine the optimal w, we examine the physical memory resource available on the FPGA device. There are two types of memory resources in Xilinx Virtex-7 FPGAs: distributed RAM and block RAM (BRAM). While BRAMs are provided as standalone RAMs, distributed RAM is coupled with logic resources. The basic logic resource unit of FPGA is usually called a Slice. Only a certain type of Slice, named SliceM, can be used to build the distributed RAM. As required by our architecture, we consider the RAMs only in simple dual-port (SDP) mode. Table 5 summarizes the total amount and the minimum address width ($d_{min}$) of the memory resource available on our target FPGA device.

The key performance metrics include the throughput, the memory requirement, the power consumption estimates, and the resource usage. In these experiments, the default parameter settings are L = 4, U = 64, H = 1, and w = . Each unit contains its own update logic.

First, we fix W = 150 and increase N by doubling R. Figure 8 shows the results, where the memory and the Slices results are drawn using a logarithmic scale. The throughput is measured in million packets per second (Mpps).

Figure 8: Increasing the TCAM depth (N). Panels: throughput (Mpps), memory (Kbits) with utilization, number of Slices with utilization, and power (Watts), each plotted against the number of words (N) for L = 4, 8, 16 and 32.

As expected, the throughput degrades for a deeper TCAM. This is because a larger R results in a deeper final priority encoder, which becomes the critical path. Also, with a larger TCAM, the resource utilization approaches 100%. This makes it difficult to route signals, which further lowers the achievable clock rate. Fortunately, because of the configurable architecture, we can trade the latency for throughput. Since
N = R ·L·U , we can increase L to reduce R for a given N while keeping other parameters fixed As shown in Figure 8, a larger L results in a higher throughput, though it is at the expense of a larger latency By tuning the latency-throughput trade-off, our design can sustain a 150 MHz clock rate for large TCAMs up to 16K × 150 bits = 2.4 Mbits Such a clock rate allows the design to process 150 million packets per second (Mpps) which translates to 100 Gbps throughput for minimum-size Ethernet packets Second, we fix the TCAM depth N = 4096 and increase the TCAM width W Figure shows that a larger TCAM width results in a lower throughput This is because there RAMs per unit where w = in the implementation are W w With a large W , it becomes time-critical to bitwise-AND a large number of bit vectors within each unit Again this can be amended by trading the latency for throughput We increase the number of stages per unit so that each stage handles a smaller number of RAMs As shown in Figure 9, the throughput is improved by increasing H by This on the other hand increases the latency by L = clock cycles In both the above experiments, the resource usage is linear with the TCAM size (N × W ) The estimated power con- Table 5: Memory resource on a XC7V2000T RAM type (in SDP mode) Total size (bits) dmin Distributed RAM 16550400 BRAM 47628288 Either distributed RAM or BRAM can be employed to implement the RAM-based TCAM architecture In either case, we set w=dmin of the employed RAM type to achieve the highest memory efficiency Based on the information from Table we can estimate the maximum size of the TCAM that can be implemented on the target device When the architecture is implemented using distributed RAM, the RAM/TCAM ratio is 25 and the maximum TCAM size is 16550400 =2586000 bits When using BRAM, the RAM/TCAM 32/5 ratio is 29 and the maximum TCAM size is 47628288 =837216 512/9 bits We can see that, though the total amount of BRAM bits is nearly the triple of that of 
distributed RAM bits, BRAM-based implementation supports a much smaller TCAM due to the higher RAM/TCAM ratio Moreover, the update latency of distributed RAM-based implementation is 33 clock cycles, while the update latency for BRAM-based implementation is 513 clock cycles Hence in most of our experiments, distributed RAMs (w = 5) are employed Also note that our architecture is modular where each unit may independently select the RAM type for TCAM implementation Thus the maximum TCAM size would be 3423216 bits when both distributed RAMs and BRAMs are utilized 4.2 Scalability We are interested in how the performance scales when the TCAM depth (i.e N ) or the TCAM width (W ) is increased 78 Throughput (Mpps) 190 170 150 130 110 90 70 50 Memory (Kbits) 8192 169 156 60% 153 143 134 50% 128 40% 4096 H=1 H=2 200 250 150 300 200 250 300 # Slices 262144 12 12 60% 50% 10 40% 131072 H=1 H=2 200 250 30% H=1 20% H=2 10% 65536 150 Utilization Word width (W) Power (Watts) H=2 0% Word width (W) 14 12 10 H=1 20% 10% 2048 150 30% Utilization 0% 150 300 200 250 300 Word width (W) Word width (W) Figure 9: Increasing the TCAM width (W ) 4.4 sumption is sublinear with the TCAM depth while is linear with the TCAM width 4.3 Impact of Unit Size Each TCAM unit in our architecture stores U TCAM words It is desirable to have a small U so that the local bit vector bitwise-ANDing and priority encoding within each unit not become the critical path On the other hand a smaller U leads to a larger L when R is fixed for a given N Thus we can tune the latency-throughput trade-off by changing U In this experiment, we fix R = 4, H = and vary U in implementing a 1024 × 150 TCAM As expected, Figure 10 shows that a larger U results in a lower throughput as well as a lower latency Such a trade-off can be exploited for some latency-sensitive applications where the latency is measured in terms of nanoseconds instead of the number of clock cycles Based on the results shown in Figure 10, when U is doubled 
from 64 to 128, the throughput is slightly degraded while the latency is reduced from ∗ = 30 ns to ∗ 5.27 = 21 ns The change of U has little impact on other performance metrics, which thus are not shown here 250 200 190 162 150 100 50 4.5 Latency Throughput 199 128 Impact of Update Engine Layout As discussed in Section 3.2.3, we can have flexible associations between lookup units and update engines by decoupling the update logic from each unit We conduct experiments to evaluate the impact of different update engine (UE) layouts on the performance of the architecture The evaluated update engine layouts include: Throughput (Mpps) Latency (# of clock cycles) 64 Distributed vs Block RAMs As discussed in Section 4.1, distributed RAMs are more efficient than BRAMs in implementing the RAM-based TCAM on the target FPGA But usually it is desirable to integrate the RAM-based TCAM with other engines (such as a packet parser) in a single FPGA device to comprise a complete packet processing system Then the choice of the RAM type may depend on not only the efficiency but also the resource budget BRAMs will be preferred to implement the RAM-based TCAM in case the other engines require a lot of Slices but few BRAMs Hence we conduct experiments to characterize the performance of the RAMbased TCAMs implemented using the two different RAM types In these experiments, W = 150, L = 4, U = 64, and H = Each TCAM unit contains its own update logic As shown in Table 6, distributed RAM-based implementations achieve higher clock rates and lower power consumption than BRAM-based implementations This is due to the fact that a BRAM is deeper and larger, and thus requires longer access time and dissipates more power than a distributed RAM Because distributed RAMs are based on Slices (SliceM), the distributed RAM-based implementations require much more logic resource (in terms of Slices) than BRAM-based implementations 256 Unit size (U) • All : Each unit contains its own update logic Figure 10: 
Increasing the unit size (U ) • Square: The four neighboring units forming a square share the same update engine (Figure 7(c)) 79 Table 6: Implementation results based on different RAM types TCAM size: N × W 1024×150 bits 2048×150 bits 4096×150 bits RAM type Distributed Block Distributed Block Distributed Block Throughput (Mpps) 199 139 178 125 156 125 # of Slices 20526 12138 40239 23560 80622 45632 (Utilization) (6.72%) (3.97%) (13.18%) (7.71%) (26.40%) (14.94%) # of BRAMs 272 544 1088 (Utilization) (0.00%) (21.05%) (0.00%) (42.11%) (0.00%) (84.21%) Estimated Power (Watts) 1.933 3.211 3.448 5.73 6.135 10.757 Throughput (Mpps) Memory (Kbits) 250 1200 199 200 194 186 195 200 990 990 All Square 990 990 Row Column None 800 150 600 400 100 200 50 All Square Row Column None UE layout 1.9 2.1 2.2 # Slices 32768 2.2 8% 20526 1.5 1.5 15253 16384 15343 6% 14961 8659 8192 4096 Square Row Column 0% All None 4% 2% 0.5 All 7% 6% 5% 4% 3% 2% 1% 0% UE layout Power (Watts) 2.5 990 1000 Square Row Column None UE layout UE layout Figure 11: Impact of the update engine (UE) layout • Row : The units in a same row share the same update engine (Figure 7(a)) layout has no effect on the memory requirement which is determined only by the lookup units • Column: The units in a same column share the same update engine (Figure 7(b)) 4.6 Cost of Explicit Range Matching As discussed in Section 3.3, we provide the capability to add explicit range matching logic to the TCAM architecture so that range-to-ternary conversion can be avoided for some search applications such as ACL Such explicit range matching logic is based on a heavy use of registers We conduct experiments to understand the performance cost of the explicit range matching logic We fix W = 150 and increase the number of 16-bit fields that are specified in ranges The other parameters are by default: N = 1024, R = 4, L = 4, U = 64, H = 1, and w = (distributed RAM) Each TCAM unit has its own update logic Table shows that adding the 
explicit range matching logic for every 16-bit range-based field requires 5K more Slices and 30K more registers The increased usage of logic also results in higher power consumption Whether to enable the explicit range matching should be based on the characteristics of the ruleset used in the search application Consider a ruleset whose expansion ratio (due to range-to-ternary conversion) is a while it requires b times more logic resource to add the explicit range matching logic Then it is better not to enable the explicit range matching if a < b • None: No update logic for any unit The TCAM is not updatable In these experiments, N = 1024, W = 150, R = 4, L = 4, U = 64, H = 1, and w = So the architecture consists of by units, basically the same as illustrated in Figure The implementation results are shown in Figure 11 Comparing the Slice results of the All and the None layouts, we can infer that the update logic accounts for more than half of the total logic usage of the architecture in the All layout In the Square, Row, and Column layouts, by sharing the update engine, the logic resource is reduced by roughly 25%, compared with the All layout These three layouts achieve the similar logic resource saving, because all of them have each update engine shared by four lookup units The costs of sharing the update engine include the slightly degraded throughput and the slightly increased power consumption Such costs are basically due to the wide mux/demux and the stretched signal routing between lookup units and update engines Higher throughput could be obtained by careful chip floor planning Also note that the update engine 80 Table 7: Adding explicit range matching # of range fields Throughput (Mpps) 199 202 201 # of Slices 20526 26157 31328 (Utilization) (6.72%) (8.56%) (10.26%) # of Registers 37556 68917 98245 (Utilization) (1.54%) (2.82%) (4.02%) Est Power (Watts) 1.933 2.108 2.474 CONCLUSION TCAMs are widely used in network infrastructure for various search functions 
There has been growing interest in implementing TCAMs using reconfigurable hardware such as FPGA. Such "soft" TCAMs are more flexible and easier to integrate than ASIC-based "hard" TCAMs. But existing FPGA-based TCAM designs can support only small-size TCAMs, mainly due to inefficient resource usage. This paper shares our efforts and experience in pushing the limit of implementing large TCAMs on a state-of-the-art FPGA. We formalize the ideas and the algorithms behind the RAM-based TCAM and analyze the performance thoroughly. After identifying the key challenges, we propose a scalable and modular architecture with multiple optimizations. We evaluate our design comprehensively to understand various performance trade-offs. The FPGA implementation results show that our design can support a large TCAM of 2.4 Mbits while sustaining high throughput of 150 Mpps.

RELATED WORK

Although various algorithmic solutions (including those using FPGAs) [4, 10] have been proposed as alternatives to TCAMs, their success so far is limited to a few particular applications such as exact matching and longest prefix matching. While they can efficiently exploit the characteristics of real-life data sets, these algorithmic solutions cannot provide the same deterministic performance (e.g. throughput, latency, storage requirement, etc.)
as TCAMs on searching over an arbitrary set of ternary words.

Most existing FPGA-based TCAM designs are brute-force implementations which map the native TCAM architecture directly onto FPGA logic. A straightforward method is to use two bits of registers to encode one TCAM bit. But such a design cannot scale well due to the limited amount of registers, which usually are heavily used for various other purposes such as pipelining. For example, the target FPGA device (XC7V2000T) in our experiments contains 2.4 Mbits of registers, which may be used to implement a TCAM of no larger than 1.2 Mbits. In reality the TCAM that can be implemented using registers would be much smaller as a result of routing and timing challenges. Locke [17] proposes a more efficient design based on the 16-bit Shift Register (SRL16). An SRL16 is used to build a 2-bit TCAM. Like the distributed RAM, an SRL16 is based on SliceM. A SliceM can be converted into either SRL16s or a 32-entry distributed RAM in simple dual-port mode. The largest TCAM that can be implemented using SRL16s on our target device (XC7V2000T) would be 0.6 Mbits.

Recently Ullah et al. [13] and Zerbini et al. [12] presented FPGA implementations of their respective RAM-based TCAMs. These designs share the basic idea of our design, which uses the search key as the address to access RAMs. However, neither of them gives a theoretical analysis or a correctness proof of the construction of TCAM using RAMs. Their architectures are monolithic, and could be viewed as a single large one-stage TCAM unit in our modular architecture. When implementing a large TCAM, their monolithic architectures would suffer from bitwise-ANDing many wide bit vectors and priority encoding the deep match vector. Due to the lack of thorough investigation of the optimal settings, their FPGA implementations are less efficient than our design. [13] implements a 512 × 36 TCAM using more than a megabit of BRAM on a Xilinx Virtex FPGA. When the priority encoder is added, the clock rate of their implementation is merely 22 MHz. The TCAM designs of [12] are implemented on high-end Altera FPGAs with the fastest speed grade. Even with these large-capacity FPGAs, their implementations can support a TCAM of no larger than 0.5 Mbits.

REFERENCES

[1] Openflow - enabling innovation in your network. http://www.openflow.org
[2] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz. Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN. In SIGCOMM '13: Proceedings of the ACM SIGCOMM 2013 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 99–110, August 2013.
[3] D. E. Taylor. Survey and taxonomy of packet classification techniques. ACM Comput. Surv., 37(3):238–275, Sept. 2005.
[4] F. Baboescu, S. Singh, and G. Varghese. Packet classification for core routers: Is there an alternative to CAMs? In INFOCOM '03: Proceedings of the 22nd Annual Joint Conference of the IEEE Computer and Communications Societies, volume 1, pages 53–63, March/April 2003.
[5] Xilinx Virtex-7 FPGA Family. http://www.xilinx.com/products/silicondevices/fpga/virtex-7.html
[6] Altera Stratix V FPGAs. http://www.altera.com/devices/fpga/stratixfpgas/stratix-v/stxv-index.jsp
[7] M. Becchi and P. Crowley. Efficient regular expression evaluation: theory to practice. In ANCS '08: Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, pages 50–59, 2008.
[8] M. Attig and G. J. Brebner. 400 Gb/s programmable packet parsing on a single FPGA. In ANCS '11: Proceedings of the 7th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, pages 12–23, 2011.
[9] G. Gogniat, T. Wolf, W. Burleson, J.-P. Diguet, L. Bossuet, and R. Vaslin. Reconfigurable hardware for high-security/high-performance embedded systems: the SAFES perspective. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(2):144–155, 2008.
[10] W. Jiang and V. K. Prasanna. Scalable packet classification on FPGA. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 20(9):1668–1680, 2012.
[11] W. Jiang and V. K. Prasanna. Field-split parallel architecture for high performance multi-match packet classification using FPGAs. In SPAA '09: Proceedings of the 21st Annual Symposium on Parallelism in Algorithms and Architectures, pages 188–196, 2009.
[12] C. A. Zerbini and J. M. Finochietto. Performance evaluation of packet classification on FPGA-based TCAM emulation architectures. In Globecom '12: Proceedings of the IEEE Global Communications Conference, pages 2766–2771, 2012.
[13] Z. Ullah, M. K. Jaiswal, Y. C. Chan, and R. C. C. Cheung. FPGA implementation of SRAM-based ternary content addressable memory. In IPDPSW '12: Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012.
[14] O. Rottenstreich, R. Cohen, D. Raz, and I. Keslassy. Exact worst-case TCAM rule expansion. IEEE Transactions on Computers, 62(6):1127–1140, 2013.
[15] E. Spitznagel, D. Taylor, and J. Turner. Packet classification using extended TCAMs. In ICNP '03: Proceedings of the 11th IEEE International Conference on Network Protocols, pages 120–131, 2003.
[16] C. E. LaForest and J. G. Steffan. Efficient multi-ported memories for FPGAs. In FPGA '10: Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 41–50, 2010.
[17] K. Locke. XAPP1151 - Parameterizable Content-Addressable Memory. 2011. http://www.xilinx.com/support/documentation/application_notes/xapp1151_Param_CAM.pdf
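
The lookup path of a RAM-based TCAM unit (the search key is cut into w-bit chunks, each chunk addresses one RAM, the U-bit vectors read out are bitwise-ANDed, and the result is priority encoded) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design evaluated in this paper: the function names `build_rams` and `lookup`, the encoding of ternary words as strings over '0', '1' and '*', and the use of Python integers as bit vectors are all hypothetical choices made for illustration. Word 0 is taken to have the highest priority.

```python
def build_rams(words, w):
    """Fill the P = W/w RAMs of one unit from a list of ternary words.

    Each RAM has 2**w entries. Entry `addr` of RAM p is a bit vector whose
    bit i is set when chunk p of word i matches the w-bit value `addr`.
    Assumes every word has the same length W and that W is a multiple of w.
    """
    W = len(words[0])
    P = W // w
    rams = [[0] * (1 << w) for _ in range(P)]
    for i, word in enumerate(words):
        for p in range(P):
            chunk = word[p * w:(p + 1) * w]
            for addr in range(1 << w):
                bits = format(addr, "0{}b".format(w))
                # A ternary chunk matches when every non-'*' bit agrees.
                if all(c == "*" or c == b for c, b in zip(chunk, bits)):
                    rams[p][addr] |= 1 << i
    return rams


def lookup(rams, key, w):
    """Return the index of the highest-priority matching word, or None."""
    vec = -1  # all ones: every word is a candidate before any RAM is read
    for p, ram in enumerate(rams):
        addr = int(key[p * w:(p + 1) * w], 2)  # key chunk used as RAM address
        vec &= ram[addr]                       # bitwise-AND of the bit vectors
    if vec == 0:
        return None
    # Priority encoding: the lowest set bit is the lowest word index.
    return (vec & -vec).bit_length() - 1


if __name__ == "__main__":
    rams = build_rams(["10**", "1***", "0101"], w=2)
    for key in ("1011", "1100", "0101", "0000"):
        print(key, lookup(rams, key, w=2))
```

Splitting the loop in `lookup` into H groups of P/H RAMs, with the partial vector handed from one group to the next, gives the same result; in hardware that grouping corresponds to the pipelined stages of a unit.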
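
The closed-form estimates used in Sections 3.3, 3.4 and 4.1 are easy to check numerically. The sketch below restates them as small functions; the function names are hypothetical, and the formulas are taken from the text as reconstructed here (RAM/TCAM ratio $2^{\max(w, d_{min})}/w$, update latency $2^w + 1$, lookup latency $L \cdot H$ plus one cycle when $R > 1$, worst-case rule expansion $(2r-4)^D$, and $2 \cdot N \cdot D \cdot r$ register bits for explicit range matching).

```python
from fractions import Fraction


def ram_tcam_ratio(w, d_min):
    """RAM bits needed per TCAM bit when a logical RAM with address width w
    is mapped to a physical RAM whose minimum address width is d_min."""
    return Fraction(2 ** max(w, d_min), w)


def max_tcam_bits(total_ram_bits, w, d_min):
    """Largest TCAM (in bits) that fits into the given amount of RAM."""
    return int(total_ram_bits / ram_tcam_ratio(w, d_min))


def update_latency(w):
    """Clock cycles needed to update one TCAM word."""
    return 2 ** w + 1


def lookup_latency(R, L, H):
    """Lookup latency in clock cycles: L*H per row, plus one cycle for the
    final priority encoder when there is more than one row."""
    return L * H + (1 if R > 1 else 0)


def worst_case_rule_expansion(r, D):
    """Worst-case number of ternary words for a rule with D r-bit ranges."""
    return (2 * r - 4) ** D


def range_register_bits(N, D, r):
    """Register bits holding the lower and upper bounds of all ranges."""
    return 2 * N * D * r


if __name__ == "__main__":
    dist = max_tcam_bits(16550400, w=5, d_min=5)   # distributed RAM, Table 5
    bram = max_tcam_bits(47628288, w=9, d_min=9)   # BRAM, Table 5
    print(dist, bram, dist + bram)
    print(update_latency(5), update_latency(9))
    print(lookup_latency(R=4, L=4, H=1))
    print(worst_case_rule_expansion(r=16, D=2), range_register_bits(1024, 1, 16))
```

With the Table 5 figures this reproduces the 2586000-bit and 837216-bit capacity estimates and the 33-cycle and 513-cycle update latencies quoted in Section 4.1. For one 16-bit range field and N = 1024 the register estimate is 32768 bits, which is in line with the roughly 30K additional registers per range field reported in Table 7.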