Abstract—Ternary content-addressable memory (TCAM) based search engines play an important role in networking routers. The search space demands of TCAM applications are constantly rising. However, existing realizations of TCAM on field-programmable gate arrays (FPGAs) suffer from storage inefficiency. This paper presents a multipumping-enabled multiported SRAM-based TCAM design on FPGA to achieve efficient utilization of SRAM memory. Existing SRAM-based solutions for TCAM reduce the impact of an increase in the traditional TCAM pattern width from exponential growth in memory usage to linear growth, using cascaded block RAMs (BRAMs) on FPGA. However, BRAMs on state-of-the-art FPGAs have a minimum depth limitation, which limits their storage efficiency for TCAM bits. Our proposed solution avoids this limitation by mapping the traditional TCAM table divisions to shallow sub-blocks of the configured BRAMs, thus achieving a memory-efficient TCAM design. The proposed solution operates the configured simple dual-port BRAMs as multiported SRAM using the multipumping technique, clocking them with a higher internal clock frequency so that the sub-blocks of each BRAM are accessed in one system cycle. We implemented our proposed design on a Virtex-6 xc6vlx760 FPGA device. Compared with existing FPGA-based TCAM designs, our proposed method achieves up to 2.85 times better performance per memory.

Index Terms—Block RAM (BRAM), field-programmable gate array (FPGA), memory architecture, multiported memory, multipumping, SRAM-based TCAM

I. INTRODUCTION

Ternary content-addressable memory (TCAM) compares an input word against its entire stored data in parallel and outputs the matched word's address. TCAM stores data in three states: 0, 1, and X (don't care). Traditional TCAMs are built as application-specific integrated circuits (ASICs) and offer high-speed search operations in deterministic time.
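The ternary match-and-prioritize behavior just described can be modeled in a few lines of Python. This is an illustrative software sketch with invented names and an invented rule set, not a hardware description; a real TCAM performs all comparisons in parallel in a single cycle.

```python
# Minimal sketch of TCAM match semantics: each stored word is over {0, 1, X},
# every word is compared with the input, and the address of the
# highest-priority matching word is returned.

def matches(stored, key):
    # An X position matches either input bit; 0/1 must match exactly.
    return all(s in ("X", k) for s, k in zip(stored, key))

def tcam_lookup(table, key):
    for addr, word in enumerate(table):   # priority = lowest address
        if matches(word, key):
            return addr
    return None                           # mismatch

table = ["10X1", "0XX0", "1001"]
print(tcam_lookup(table, "1001"))  # 0: "10X1" matches before "1001"
print(tcam_lookup(table, "0110"))  # 1: only "0XX0" matches
```

Note that two stored words can match the same key (here both "10X1" and "1001" match the key 1001); the priority encoder resolves such multiple matches by address order.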
TCAM is widely employed to design high-speed search engines and has applications in networking, artificial intelligence, data compression, radar signal tracking, pattern matching in virus detection, gene pattern searching in bioinformatics, image processing, and the acceleration of various database search primitives [1]–[3]. Internet-of-things and big-data processing devices employ TCAM as a filter when storing signature patterns, and achieve a substantial reduction in energy consumption by reducing wireless transmissions of invalid data to cloud servers [4], [5].

Field-programmable gate arrays (FPGAs) emulate TCAM using static random-access memory (SRAM) by addressing SRAM with TCAM contents. Each SRAM word corresponds to a specific TCAM pattern and stores information on its existence for all possible data of the TCAM table. An increase in the number of TCAM pattern bits results in exponential growth in memory usage. This exponential growth has been reduced to linear growth by cascading multiple SRAM blocks in the design of TCAM on FPGA in previous work [6], [7]. Contemporary FPGAs implement block RAM (BRAM) in the silicon substrate and offer high speed. For example, the Xilinx Virtex-6 xc6vlx760 FPGA contains 720 BRAMs of size 36 Kb [8] and provides operating frequencies greater than 500 MHz [9]. Designers utilize these high-speed SRAM blocks to design SRAM-based TCAMs on FPGA. In existing SRAM-based solutions, the storage capacity of a BRAM for TCAM bits is limited by its high SRAM/TCAM ratio of 2^9/9, because of its minimum depth limitation of 512 × 72 when configured in simple dual-port mode on FPGA [8].

(Corresponding author: JeongA Lee, jalee@chosun.ac.kr. Inayat Ullah and JeongA Lee are with the Department of Computer Engineering, Chosun University, Republic of Korea; e-mail: inayatmz@gmail.com. Zahid Ullah is with the Department of Electrical Engineering, CECOS University of IT and Emerging Sciences, Peshawar, Pakistan.)
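The reduction from exponential to linear memory growth described above can be illustrated with a small software model. All names and the example rule set below are invented for illustration: each cascaded block stores, for every possible value of its pattern slice, a bit-vector marking which rules match that value, and the per-block vectors are ANDed at search time.

```python
# Sketch of SRAM-based TCAM emulation: one presence bit per (address, rule).
# A single block over w pattern bits needs 2^w words; N cascaded blocks over
# w/N bits each need only N * 2^(w/N) words -- linear, not exponential, in w.

def build_block(rules, lo, hi):
    """Presence table for the pattern slice [lo:hi)."""
    width = hi - lo
    table = []
    for value in range(2 ** width):
        bits = format(value, f"0{width}b")
        vec = 0
        for i, rule in enumerate(rules):
            sl = rule[lo:hi]
            if all(r in ("X", b) for r, b in zip(sl, bits)):
                vec |= 1 << i            # rule i matches this slice value
        table.append(vec)
    return table

def search(blocks, slices, key):
    """AND the per-block match vectors; any surviving bit is a match."""
    result = ~0
    for table, (lo, hi) in zip(blocks, slices):
        result &= table[int(key[lo:hi], 2)]
    return result

rules = ["10XX", "0110", "X0X1"]         # TCAM rules with don't-care (X) bits
slices = [(0, 2), (2, 4)]                # two cascaded blocks of 2^2 words each
blocks = [build_block(rules, lo, hi) for lo, hi in slices]
print(bin(search(blocks, slices, "1011")))  # 0b101: rules 0 and 2 match
```

With one monolithic block the 4-bit pattern needs 2^4 = 16 words; with two cascaded blocks it needs 2 × 2^2 = 8, and the gap widens exponentially as the pattern grows.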
For example, the design methodologies proposed in [10], [11], and [12] require a total of 56, 40, and 40 BRAMs of size 36 Kb, respectively, to implement an 18 Kb TCAM. Excessive usage of BRAMs in the design of TCAM can leave a shortage of BRAMs for other parts of the system on FPGA. Furthermore, the limited amount of BRAM resources on FPGA can compel designers to implement TCAMs in distributed RAM using SLICEM, resulting in the consumption of many slices and a limitation on the maximum clock frequency of the design. This problem becomes more severe for the design of large-capacity TCAMs. The efficient utilization of SRAM memory is therefore imperative for the design of TCAMs on FPGAs.

The design of memory-efficient TCAMs requires shallow SRAM blocks on FPGAs. Multipumping-based multiported SRAM emulates the sub-blocks of a dual-port SRAM block as multiple shallow SRAM blocks by operating the SRAM with a higher-frequency clock, allowing access to all of its sub-blocks in one system cycle. Researchers have designed efficient multiported memories using BRAMs on FPGA [13]–[16].

Existing FPGA-based TCAM design methodologies offer lower operational frequencies. This is mainly because of the complex routing of wide signals between BRAMs and logic, resulting from the excessive usage of BRAMs, and the complex priority-encoding units synthesized in logic slices for deeper traditional TCAMs.

(2169-3536 (c) 2018 IEEE. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2018.2822311, IEEE Access.)
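The multipumping idea described above can be sketched as a behavioral model: a dual-port RAM clocked M times faster than the system clock looks, from the system's point of view, like a memory with 2M read ports, all serviced within one system cycle. The class names and parameters below are my own illustrative choices, not the paper's RTL.

```python
# Behavioral sketch of multipumping: requests that would need 2*M ports are
# issued to one dual-port RAM over M consecutive internal clock cycles, all
# of which fit inside a single (slower) system clock cycle.

class DualPortRAM:
    def __init__(self, depth):
        self.mem = [0] * depth

    def read2(self, addr_a, addr_b):          # one internal clock edge
        return self.mem[addr_a], self.mem[addr_b]

class MultipumpedRAM:
    """Presents 2 * pump_factor read ports per system cycle."""
    def __init__(self, ram, pump_factor):
        self.ram = ram
        self.pump = pump_factor

    def system_cycle_read(self, addrs):
        assert len(addrs) == 2 * self.pump
        out = []
        for i in range(self.pump):            # M internal cycles
            out.extend(self.ram.read2(addrs[2 * i], addrs[2 * i + 1]))
        return out

ram = DualPortRAM(depth=512)
ram.mem[7], ram.mem[130] = 0xA, 0xB
mp = MultipumpedRAM(ram, pump_factor=2)       # 2x internal clock, 4 reads/cycle
print(mp.system_cycle_read([7, 130, 7, 0]))   # [10, 11, 10, 0]
```

In hardware the pump factor is bounded by how much faster the BRAM can be clocked than the surrounding logic, which is why the >500 MHz BRAM ceiling quoted earlier matters.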
For example, the FPGA realizations of TCAM using BRAMs in [7] and [17] achieve operational frequencies of 139 MHz and 133 MHz to emulate 150 Kb and 89 Kb TCAMs, respectively. The highest operational frequency achieved in the previous studies [6], [10]–[12] is 202 MHz, for the implementation of an 18 Kb TCAM on FPGA. The demand for efficient utilization of SRAM memory in the design of TCAM, together with the speeds provided by existing FPGA-based TCAM solutions, makes the use of multipumping-based multiported SRAM practical for designing TCAM memory on FPGA. Our proposed TCAM design aims to achieve efficient memory utilization with high throughput. The contributions of this work are as follows:

• A novel multipumping-enabled multiported SRAM-based TCAM architecture, which achieves efficient memory utilization, is proposed.
• Our proposed approach presents a scalable and modular TCAM design on FPGA.
• The proposed design is more practical for large storage capacities, owing to the reduced routing complexity achieved by the use of fewer BRAMs and the reduced AND-operation complexity. The novel optimization technique of AND-accumulating SRAM words in the proposed TCAM memory units divides the overall AND-operation complexity of the design.
• The proposed design is implemented on a state-of-the-art FPGA. A detailed comparison of our proposed design with existing methods is performed with respect to performance per memory. Our proposed design achieves up to 2.85× higher performance per memory.

The remainder of this paper is organized as follows. Section II surveys related work. The proposed design is described in Section III. Section IV details the implementation setup and results of this work. The performance evaluation of the proposed design is presented in Section V. Section VI concludes this work. Table I describes the basic notations used in this paper.
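The AND-accumulation optimization named in the contributions above can be sketched abstractly as follows. This is a behavioral illustration with invented widths and values: instead of collecting all sub-block match words and combining them in one wide AND, each memory unit folds the word read in every internal (multipumped) cycle into a running accumulator, so the AND work is divided across units and cycles.

```python
# Sketch of AND-accumulation: fold each sub-block's match word into a
# running accumulator as it is read, one word per internal cycle, instead
# of performing a single wide AND over all words at the end.

def and_accumulate(subblock_words, width):
    acc = (1 << width) - 1            # start all-ones
    for word in subblock_words:       # one word per internal cycle
        acc &= word
    return acc

# Three sub-block reads of an 8-bit match vector within one system cycle:
print(bin(and_accumulate([0b1011_0110, 0b1001_1110, 0b1101_0111], 8)))
```

The surviving set bits in the accumulator mark the candidate match addresses handed to the priority encoder.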
UE-TCAM: An Ultra-Efficient SRAM-based TCAM

Zahid Ullah(1), Manish K. Jaiswal(2), Ray C. C. Cheung(3), and Hayden K. H. So(2)
(1) Department of Electrical Engineering, CECOS University of IT and Emerging Sciences, Peshawar, Pakistan
(2) Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong
(3) Department of Electronic Engineering, City University of Hong Kong, Hong Kong
Emails: zahidullah@cecos.edu.pk, manishkj@eee.hku.hk, r.cheung@cityu.edu.hk, hso@eee.hku.hk

Abstract—Ternary content-addressable memories (TCAMs) are high-speed memories; however, compared to static random-access memories (SRAMs), TCAMs suffer from low storage density, relatively slow access time, poor scalability, complexity in circuitry, and higher cost. To access the benefits of SRAM, several SRAM-based TCAMs, specifically on field-programmable gate array (FPGA) platforms, have been proposed. To further improve the performance of SRAM-based TCAMs, this paper presents UE-TCAM, which reduces memory requirement, latency, and power consumption, and improves speed. An example design of size 512 × 36 of UE-TCAM has been implemented on a Xilinx Virtex-6 FPGA. Performance evaluation confirms a significant improvement in the proposed UE-TCAM, which achieves a 100% reduction in 18K B-RAMs, a 74.67% reduction in slice registers (SRs), a 70.28% reduction in LUTs, a 75.76% reduction in energy-delay product, and a 60% reduction in latency, and improves speed by 70.85%, compared with the available SRAM-based TCAM.

I. INTRODUCTION

Ternary content-addressable memory (TCAM) provides access to stored data by contents (data word) rather than by an address, and outputs the match address. A CAM searches its entire memory concurrently to check whether a data word is stored anywhere in CAM memory, and returns a list of one or more storage addresses where the word was found. The fast search feature is the main influence behind using a CAM. The search operation can also be performed in regular random-access memory (RAM) by iteratively
reading and comparing all RAM entries for every search request. As a result, the search time using RAM is significantly longer than with CAM for the same search request. The high-speed search operation makes CAM an attractive choice for applications requiring high-speed search, such as local-area networks, database management, pattern recognition, and artificial intelligence [1]. Recent applications include real-time pattern matching in virus-detection and intrusion-detection systems, gene pattern searching in bioinformatics, data compression, and image processing [2].

(978-1-4799-8641-5/15/$31.00 ©2015 IEEE)

A. Problem statement

Although CAM technology presents the major advantage of a deterministic comparison in constant time over standard RAM, it also has shortcomings. For a parallel search operation, CAM needs comparison circuitry in each cell, which dictates that CAM density lags RAM density. A typical TCAM cell has two SRAM cells and a comparison circuitry. A table of size 2^n × w needs 2^(n+1) × w SRAM cells and 2^n × w comparison circuitries, one for each TCAM cell. For large values of n, the TCAM size increases, which results in prohibitive power consumption, size, and cost, thus nullifying its advantage of high-speed lookup. The comparison circuitry in each cell not only makes TCAM expensive but also adds complexity to the TCAM architecture. The extra logic and capacitive loading due to the massive parallelism lengthen the access time of TCAM, which is over 3.3 times longer than the access time of SRAM [3]. Furthermore, TCAM is not subjected to the intense commercial competition found in the RAM market [4] and has yet to gain a substantial market share. TCAMs are expensive not only due to their low memory-cell density but also due to their insignificant market demand, which means they are not produced in mass quantities to drive their cost down. The cost of TCAM is about 30 times more per bit of storage than that of SRAM [3]. In addition, inherited architectural barriers also limit
its total chip capacity. The complex integration of memory and logic also makes TCAM testing very time-consuming [2]. CAMs have limited pattern-retrieval capacity, and CAM technology does not evolve as fast as RAM technology. RAM technology is driven by many applications, particularly computers and consumer electronic products; hence, its cost per bit continuously decreases. CAM technology, in contrast, is considered specialized, and only a modest increase in bit capacity and a modest decrease in cost may be expected in the future [5]. TCAM does not scale well in terms of clock rate, power consumption, or chip density, whereas SRAM is scalable and less complex. The throughput of classical TCAMs is also limited by their relatively low speed [6].

B. Motivations and contributions

Field-programmable gate arrays (FPGAs) are widely used in different applications [7], such as image processing [8], [9], networking systems [10], [11], and cryptography computations [12], [13], owing to several benefits such as reconfigurability, massive hardware parallelism, and rapid prototyping capability. SRAM-based FPGAs such as the Xilinx Virtex-6 and Virtex-7 [14] provide a high clock rate and a large amount of on-chip dual-port memory with configurable word width. The Xilinx Virtex-7 2000T FPGA is ideally suited for application-specific integrated circuit (ASIC) prototyping: it provides equivalent capacity and performance to high-density ASICs, reduces board-space requirements and complexity, and furthermore reduces system-level power consumption. Current FPGA technology does not have hard IPs for classical TCAMs; however, it does for SRAMs. The benefits of SRAM over CAM and the feasibility of FPGA technology have motivated us to pursue innovative designs of TCAM. The performance of the RAM-based CAM in [4] degrades gracefully with an increase in the number of stored elements. Further, the method emulates binary CAM, not
the TCAM. The method in [18] also exploits a hashing technique for TCAM. Being based on hashing, it too suffers from collisions and bucket overflow, which need an additional overflow area. If the overflow area has many records, a search operation may not finish until many buckets have been searched. Furthermore, when stored keys contain don't-care bits in the bit positions used for hashing, such keys must be duplicated in multiple buckets, which results in a large memory; thus, the memory utilization is not efficient.

The proposed UE-TCAM architecture is built on the succession of the prior work on HP-TCAM [15], Z-TCAM [16], and E-TCAM [17]. The proposed work makes the following key contributions:

• The architecture of the proposed TCAM is much simpler; it consists primarily of SRAM units with simple additional logic, and is implemented on a state-of-the-art Xilinx FPGA.
• The proposed UE-TCAM brings an enormous reduction in resource utilization. Implementation results illustrate that UE-TCAM attains a 100% reduction in 18K B-RAMs, a 74.67% reduction in SRs, and a 70.28% reduction in LUTs, compared with the available SRAM-based TCAM.
• Energy/bit/search is a very useful performance metric for TCAM. Compared with the existing SRAM-based TCAM, the proposed TCAM achieves a 58.58% reduction in energy consumption.
• Latency is another important performance metric. UE-TCAM reduces latency by 60% compared with the available SRAM-based TCAM.
• Compared with the state-of-the-art SRAM-based TCAM design, UE-TCAM also improves speed by 70.85%.

Achieving higher throughput with a much simpler architecture is a strength of the proposed work. The proposed work may be used in network systems, web-enabled applications, and cloud computing. Other applications that can benefit from the proposed TCAM are data compression, image-recognition processors, voice-recognition processors, or any pattern-recognition system in general. We expect that CAM technology will become
mainstream for many applications in the near future. Thus, the use of CAM technology paves the way for our proposed work in emerging applications.

C. Paper organization

The rest of the paper is organized as follows. Section II discusses related work. Section III explains hybrid partitioning, which underlies the architectures of the SRAM-based TCAMs. Section IV presents the architecture of the proposed UE-TCAM. Section V explains UE-TCAM operations. Section VI elaborates the operations of the proposed TCAM with examples. Section VII provides the implementation and performance evaluation of UE-TCAM. Section VIII concludes the paper and also highlights our future work.

II. RELATED WORK

We surveyed the literature on RAM-based CAMs and, to the best of our knowledge, found very few works on the topic. The RAM-based CAM proposed in [4] uses a hashing technique and thus inherits the inborn disadvantages of hashing: collisions and bucket overflow. The number of stored elements also has a great impact on its performance. A hashing technique cannot provide deterministic performance due to potential collisions, and is inefficient in handling wild cards [19]. In contrast to hash-based CAMs, the proposed TCAM provides deterministic search performance and efficiently utilizes memory. SRAM-based pipelined CAMs also take multiple clock cycles to accomplish a search operation, and their memory utilization is not efficient [20]. In contrast, our proposed TCAM has a deterministic throughput of a single clock cycle and also provides better utilization of memory.

The RAM-based CAMs in [5] and [21] also have unavoidable shortcomings. The size of memory in both methods depends on the number of bits (nob) in the TCAM word. In [5], the required memory size would be 2^nob bits arranged in a column. The size increases exponentially with the number of bits in the TCAM word; for instance, a 36-bit word needs 64 GB of RAM. Such a huge memory results in prohibitive area, cost, and power consumption, making the method practically infeasible for an
arbitrarily large bit pattern, whereas the proposed design has a suitable partitioning scheme and efficiently supports arbitrarily large words. In [21], an increase in the number of bits in the CAM word exponentially increases the memory size to a prohibitive limit, as in [5]. Furthermore, the RAM-based CAM in [21] works only on data arranged in ascending order, which is against the norm of real applications, where data are totally random. To arrange data in ascending order, the original order of entries needs to be preserved, which is not considered in this method; if it were, the memory and power requirements would further increase. In contrast, our proposed TCAM supports an arbitrarily large bit pattern, preserves original addresses, and uses a suitable partitioning methodology.

The CAM in [22] integrates CAM and RAM to obtain overall CAM functionality, and thus inherits the inborn disadvantages of CAM. This scheme arranges the traditional TCAM table into groups based on some distinguishing bits in the TCAM words, so that each group can have at most one possible match. Since data in real applications are totally random, making such groups would be very time-consuming. On the contrary, the proposed method provides a generic TCAM and uses SRAM, not CAM, to emulate the overall TCAM functionality.

The state-of-the-art SRAM-based TCAMs HP-TCAM [15], Z-TCAM [16], and E-TCAM [17] were recently published. Our proposed UE-TCAM improves on them by lowering memory size, power consumption, and latency, and, more importantly, provides higher throughput.
[Fig. 1. Conceptual view of hybrid partitioning (HP). L: # of layers; N: # of vertical partitions. The input word of C bits is partitioned into N subwords, each of w bits.]

III. HYBRID PARTITIONING

We use hybrid partitioning (HP), shown in Fig. 1, to divide the conventional TCAM table horizontally and vertically to construct hybrid partitions. Vertical partitioning (VP) in HP divides the TCAM word of C bits into N subwords. Horizontal partitioning (HrP) in HP divides each vertical partition into L horizontal partitions by using the original address range of the conventional TCAM table. Thus, HP results in a total of L × N hybrid partitions. The dimensions of each hybrid partition are K × w, where K is a subset of the original addresses and w is the number of bits in a subword. VP is used to keep the memory as low as possible. HrP cannot be used alone because it would need a very large memory size; thus, HrP alone is not feasible owing to its inefficiency in terms of area, power, and cost. However, it nicely generates layers. Hybrid partitions that span the same address range are grouped in the same layer; for example, HP31, HP32, HP33, …, and HP3N are in layer 3.

IV. ARCHITECTURE OF UE-TCAM

A. Overall architecture

Fig. 2 shows the overall architecture, where each layer represents the layer architecture given in Fig. 3. UE-TCAM has L layers and a CAM priority encoder (CPE). The output of each layer is a potential match address (PMA). The PMAs are fed to the CPE, which selects the match address (MA) among the PMAs.

[Fig. 2. Architecture of UE-TCAM. L: # of layers; sw: subword; w: # of bits in a subword; C: # of bits in the input word; PMA: potential match address; MA: match address.]

B. Layer architecture

The layer architecture of the proposed TCAM is illustrated in Fig. 3. Its components include N SRAM units, a K-bit AND operation, and a layer priority encoder (LPE).

1) SRAM unit: Each SRAM unit has a size of 2^w words × K bits, where K is the subset of
original addresses from the conventional TCAM table. The maximum possible number of combinations of w bits is 2^w, where each combination represents a subword. In our proposed TCAM, each subword acts as an address to its corresponding SRAM unit and invokes its corresponding row of K bits. The composition of the SRAM unit in the proposed architecture is shown in Table I, where a 1 indicates the presence of a subword at an original address.

[Fig. 3. Architecture of a layer of UE-TCAM. N: # of subwords; LPE: layer priority encoder; K: width of an SRAM unit; sw: subword; w: # of bits in a subword; PMA: potential match address.]

[Table I. Composition of the SRAM unit in UE-TCAM: rows are the 2^w subword addresses (0 to 2^w − 1), columns are the K original address positions (0th to (K−1)th), and each cell holds 1 if the subword is present at that original address and 0 otherwise.]

2) K-bit AND operation: The K-bit rows read out by their corresponding subwords are bit-wise ANDed, and the result is forwarded to the LPE for further processing; a possible PMA is present among the set bits of the K-bit AND result.

3) Layer priority encoder: Since we emulate TCAM, and in TCAM multiple matches may occur [23], the LPE is used to select the PMA from the output of the K-bit AND operation.

V. UE-TCAM OPERATIONS

A. Data mapping operation

The traditional TCAM table is logically partitioned column-wise (vertically) and row-wise (horizontally) into TCAM sub-tables using hybrid partitioning [15]. A partition may contain an X bit, which is first expanded into binary bits (0 and 1). Each subword, acting as an address, is applied to its corresponding SRAM unit, and K bits are written at the memory location.

[Table II. A traditional 4 × 4 TCAM table with its four hybrid partitions; the entries are not legible in the extracted text.]

[Table III. Data mapping example: the contents of SRAM unit11, unit12, unit21, and unit22 after mapping Table II; the entries are not legible in the extracted text.]

B. Data searching operation

1) Data searching operation in a layer: Algorithm 1 describes the lookup operation in a layer of the proposed UE-TCAM. The N subwords act as
addresses and read out their memory locations from their respective SRAM units, which are then bit-wise ANDed. The LPE selects a PMA; otherwise, a mismatch occurs in the layer.

Algorithm 1: Search in a layer of UE-TCAM
Input: N subwords, each of w bits. Output: PMA, or a mismatch.
1: Read all SRAM units concurrently.
2: AND_K = K1 & K2 & K3 & … & KN.
3: A PMA is output, or a mismatch occurs.

2) Overall data searching operation: The overall search operation follows Algorithm 2. A search key is applied to UE-TCAM, which is then divided into N subwords to be searched in their corresponding SRAM units in all layers in parallel. Algorithm 2 uses Algorithm 1 at step 2. The CPE selects the MA among the PMAs; otherwise, a mismatch occurs.

Algorithm 2: Overall search in UE-TCAM
Input: Search key of C bits. Output: MA, or a mismatch.
1: Divide the search key into N subwords, each of w bits.
2: All layers use Algorithm 1 in parallel.
3: The MA is output, or a mismatch occurs.

VI. UE-TCAM EXAMPLE

We select N = 2, L = 2, K = 2, and w = 2; that is, we take a simple example of a 4 × 4 conventional TCAM table and divide it into four hybrid partitions, each of size 2 words × 2 bits.

A. Data mapping example

We use Table II to be mapped to the proposed UE-TCAM; Table II also shows its hybrid partitions HP11, HP12, HP21, and HP22. After the necessary processing, HP11, HP12, HP21, and HP22 are mapped to their corresponding SRAM units. The mapped memory units are shown in Table III. The mapped bits are high, while the remaining bits are low. For example, subword 00 is available at address 0 in the conventional TCAM table; it has been mapped in SRAM unit11, where its corresponding bit has been set high at address 0. In this way, all the memory units are mapped. A subword in a partition may be present at multiple locations; in that case, its original addresses are mapped to the corresponding bits in their respective memory units.

B. Data searching example

1) Match case: We search the memory units given in Table III. Table IV provides an example of the search operation in layers 1 and 2, where the lookup operation in each layer follows Algorithm 1. Table V provides the overall search operation in UE-TCAM, which follows Algorithm 2. We provide the input word 0011 for searching, divided into subword1 = 00 and subword2 = 11. UE-TCAM finds a match for the input word in both layers; thus, we have PMA1 from layer 1 and PMA2 from layer 2. The CPE selects PMA1 as the MA, considering that it has the highest priority.

[Table IV. Searching in layers 1 and 2 of UE-TCAM: each subword reads out its row from its SRAM unit, the rows are bit-wise ANDed, and the LPE selects the PMA.]

[Table V. Overall data search operation in UE-TCAM for search key 0011: subword1 = 00 and subword2 = 11 yield PMA1 and PMA2, and the CPE selects PMA1 as the MA.]

2) Mismatch case: During a search operation in a layer, a mismatch of the input word occurs when none of the bits is high after the K-bit AND operation. Table VI shows a mismatch case in layer 1.

[Table VI. Mismatch case in UE-TCAM: for subword1 = 01, the rows read out from SRAM unit11 and SRAM unit12 AND to 00; since the result of the K-bit AND operation is 0, a mismatch has occurred in layer 1.]

VII. IMPLEMENTATION RESULTS AND PERFORMANCE EVALUATION

A. Implementation results

A sample design of size 512 × 36 of the proposed UE-TCAM, and the available SRAM-based TCAMs HP-TCAM [15], Z-TCAM [16], and E-TCAM [17] with L = 4 and N = 4, was implemented on a Xilinx Virtex-6 FPGA. Table VII shows the implementation results of all the SRAM-based TCAMs. Dynamic power consumption for a lookup operation was measured with 1 V core voltage and a 100 MHz clock speed. We measured power consumption using the Xilinx XPower Analyzer [24]. We generated a switching activity
interchange format (SAIF) file for the power measurement.

[Table VII. Implementation results on Xilinx Virtex-6: B-RAMs (18K, 36K), slice registers (SRs), LUTs, speed (MHz), energy (fJ/bit/search), energy-delay product (ns·fJ/bit/search), and latency (clock cycles) for HP-TCAM [15], Z-TCAM [16], E-TCAM [17], and UE-TCAM. Only one column is legible in the extracted text, with 2057 SRs, 5326 LUTs, 118.1 MHz, (16, 48) B-RAMs, 102.17 fJ/bit/search, and an EDP of 865.07.]

[Figure: bar chart comparing resource utilization of HP-TCAM, Z-TCAM, E-TCAM, and UE-TCAM; values not legible in the extracted text.]
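Putting the pieces of Sections III–VI together, the following illustrative model reproduces the UE-TCAM flow in software: hybrid partitioning, SRAM-unit mapping with don't-care expansion, and the layered search of Algorithms 1 and 2. All function names are my own, and the 4 × 4 table is a hypothetical stand-in for Table II, whose exact contents are not legible in this copy.

```python
# End-to-end software model of UE-TCAM with N = L = K = w = 2:
# hybrid-partition the table, map each partition to a 2^w x K SRAM unit,
# then search layer by layer with the K-bit AND and priority encoders.

from itertools import product

def map_partition(partition, w):
    """Expand a K x w hybrid partition (with X bits) into a 2^w x K SRAM unit."""
    K = len(partition)
    unit = [[0] * K for _ in range(2 ** w)]
    for addr, subword in enumerate(partition):       # addr: original address
        choices = [("0", "1") if b == "X" else (b,) for b in subword]
        for bits in product(*choices):               # X expands to 0 and 1
            unit[int("".join(bits), 2)][addr] = 1
    return unit

def build_ue_tcam(table, L, N):
    """Hybrid-partition the table and map every partition to an SRAM unit."""
    K, w = len(table) // L, len(table[0]) // N
    layers = []
    for l in range(L):
        rows = table[l * K:(l + 1) * K]
        layers.append([map_partition([r[n * w:(n + 1) * w] for r in rows], w)
                       for n in range(N)])
    return layers, K, w

def search(layers, K, w, key):
    """Algorithms 1 and 2: per-layer K-bit AND, then CPE over the layer PMAs."""
    subwords = [int(key[i:i + w], 2) for i in range(0, len(key), w)]
    for l, units in enumerate(layers):               # CPE priority: layer order
        and_k = [1] * K
        for unit, sw in zip(units, subwords):        # Algorithm 1
            and_k = [a & b for a, b in zip(and_k, unit[sw])]
        if 1 in and_k:                               # LPE: first high bit
            return l * K + and_k.index(1)            # original address as MA
    return None                                      # overall mismatch

table = ["0X01", "1100", "10XX", "X011"]             # hypothetical 4 x 4 TCAM
layers, K, w = build_ue_tcam(table, L=2, N=2)
print(search(layers, K, w, "0001"))   # 0: matches rule "0X01" in layer 1
print(search(layers, K, w, "1011"))   # 2: matches rule "10XX" in layer 2
print(search(layers, K, w, "0110"))   # None: no rule matches
```

Note the priority behavior in the last lines: the key 1011 actually matches both 10XX and X011, but the AND result's lowest set bit, i.e., the lowest original address, wins, mirroring the LPE/CPE selection in the paper.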