DESIGN AND MANAGEMENT OF SCALABLE FPGA ARCHITECTURES

RIZWAN SYED
(M.Sc. Elect. & Comp. Engg, NUS)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2013

Declaration

I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis. This thesis has also not been submitted for any degree in any university previously.

RIZWAN SYED
DECEMBER 02, 2013

Acknowledgments

I have always regarded the journey as being more important than the destination itself. While the destination of a PhD is surely desired, the importance of the journey should not be underestimated. At the end of this long road, I am thankful to Allah (SWT) for helping me throughout my PhD. I would also like to express my sincere gratitude to all those who supported me through all these years and made this journey enjoyable. Without their help and support, this thesis would not have reached its current form.

First of all, I would like to thank Dr. Yajun Ha, my supervisor, for mentoring me throughout the journey of my PhD. He motivated me throughout my research and constantly made me think about how I could improve my ideas and apply them in a more practical way. His eye for detail helped me maintain the quality of my research. Despite being a very busy person, he always ensured that we had enough time for regular discussions. Whenever I needed something done urgently, whether it was feedback on a draft or filling in a form, he always gave it the utmost priority. He often worked on holidays and weekends to give me feedback on my work in time.

Further, I would like to thank A/P Bharadwaj Veeravalli for all the support and supervision during all the years of my PhD.
He was always an inspiration for me and always gave me useful insight into research methodology, and critical comments on my publications throughout my PhD project. I was very fortunate to have two supervisors who were both very hard working and motivating.

My thanks also extend to all my friends, most notably Xiaolei Chen, Zhao Wenfeng, Yu Heng, and Rajesh Panicker, for all the support, the nice discussions, and the constructive feedback that I got from them.

I would like to thank my family and friends for their interest in my project and the much needed support. I would especially like to thank my parents, without whom I would not have been able to achieve this result. Last but not least, I would like to thank my wife for her ongoing support and for enduring all the troubles during my PhD.

Rizwan Syed

Summary

Field Programmable Gate Arrays (FPGAs) have been demonstrated in various applications to deliver significant performance speed-ups compared to software-centric platforms, and in some cases with a significant improvement in efficiency as well. Although FPGAs are slower and less efficient than Application-Specific Integrated Circuits (ASICs), this disadvantage is offset by their in-field programmability, low non-recurring engineering costs, and fast turnaround times. However, designing larger FPGAs is getting more difficult even with the ongoing advancement in semiconductor manufacturing technology. Global wire delay has emerged as a problem for both ASICs and FPGAs. FPGAs conventionally implement global interconnects using long wires and switch blocks, which makes these interconnects slow and inefficient. Another issue is clock skew in the distribution of the global clock for larger FPGAs. Furthermore, the growing area overhead of the larger programmable routing interconnect needed to support increasing computation resources makes scalability an issue for existing FPGA architectures. Another issue is decreasing yields with increasing FPGA die sizes.
This adversely affects the manufacturing cost of the device.

On the other hand, as the resources in FPGAs increase, the runtime of the implementation tools also increases. This can be mitigated by adopting a hierarchical design methodology and by reusing implemented sub-components. The latter requires only the new components to be implemented and thus takes less time. However, this would require runtime resource management of the FPGA. Existing tools that support FPGA resource management impose various constraints and overheads that make them less useful in practice. Furthermore, to improve the reliability of the system, such a resource manager should be self-aware, measuring physical quantities such as the temperature profile of the FPGA die. This would allow the manager to perform thermal management by adjusting the performance of running tasks.

Envisioning that reconfigurable computing will be a part of mainstream computing, this thesis presents a scalable FPGA architecture to mitigate the scalability issues of single FPGAs and multi-FPGA systems. Furthermore, it proposes an abstract architecture for resource management of the scalable FPGA architecture that supports the abstraction of both computing and communication resources. This thesis makes the following contributions.

First, to solve the issues with scalability of FPGAs, we propose a novel scalable FPGA architecture and its design methodology. This architecture allows us to model both single FPGAs and multi-FPGA systems under a single architecture. In this architecture, we partition an FPGA into multiple tiles. We assume a tile to be a current-generation FPGA. These tiles use a hierarchical network for routing between them and thus separate local routing from global routing. Because the network is hierarchical, the architecture can be scaled simply by adding more tiles and higher levels.

Second, to enable efficient resource management on the scalable FPGA architecture, we initially develop a low-overhead framework that supports important features such as dynamically sized reconfigurable regions, abstraction in communication among hardware applications, clock network management, and in-circuit debugging for hardware applications. This allows us to abstract the computational and communication resources of our scalable FPGA architecture.

Third, to extend our framework to multi-FPGA systems, we develop a scalable AXI interconnect for existing FPGAs. The scalable AXI interconnect uses high-speed serial links for inter-FPGA communication. Packet switching is used to deliver packets from the source FPGA to the destination FPGA using one or more hops. The interconnect has low area and transport overheads while offering high bandwidth and low latency compared to other interconnects.

Finally, we implement a LUT-based temperature sensor that has a resolution of 0.5 °C and an accuracy of ±0.5 °C using two-point calibration. It uses 52% fewer resources than the state of the art. We use an array of such sensors to monitor the thermal profile of the FPGA die and then use it to adjust the performance of running applications as required through dynamic frequency scaling.

6. THERMAL MANAGEMENT FOR RECONFIGURABLE PLATFORMS

[...] can be augmented with runtime resource management to improve the reliability of a multi-FPGA platform.

CHAPTER 7
Conclusions and Future Work

In this chapter, we present the major conclusions of this thesis from the perspective of the scalability, productivity, and reliability of reconfigurable systems. Then we highlight several issues that remain to be solved.

As mentioned earlier, FPGAs have been demonstrated in various applications to deliver significant performance speed-ups compared to software-centric platforms, and in some cases with a significant improvement in efficiency as well.
Although FPGAs are slower and less efficient than ASICs, this disadvantage is offset by their in-field programmability, low non-recurring engineering costs, and fast turnaround times. However, designing larger FPGAs is getting more difficult even with the ongoing advancement in semiconductor manufacturing technology. Global wire delay has emerged as a problem for both ASICs and FPGAs. FPGAs conventionally implement global interconnects using long wires and switch blocks, which makes these interconnects slow and inefficient. Another problem is clock skew in the distribution of the global clock for larger FPGAs. Furthermore, the growing area overhead of the programmable routing interconnect (needed to support increasing computation resources) makes scalability an issue for existing FPGA architectures. Another issue is decreasing yields with increasing FPGA die sizes. This adversely affects the manufacturing cost of the device.

We addressed all these issues by designing the scalable FPGA architecture presented in Chapter 3. It uses the globally asynchronous, locally synchronous (GALS) paradigm to overcome the issues of long wire delays and clock skew. Communication within localized areas, called tiles, is synchronized to a local clock, while global communication is asynchronous and carried over a hierarchical network. By reusing interconnecting wires in the hierarchical interconnect, the area requirement of the interconnect is reduced. This has the added benefit of traffic localization. The architecture can be easily scaled by adding tiles and higher levels. A design methodology was then proposed to allow existing designs to be mapped to this architecture, and the architecture was successfully demonstrated using an emulation prototype. The experimental results and predicted values show the effectiveness of the new architecture.

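The traffic localization offered by a hierarchical interconnect can be illustrated with a small sketch. It is not the routing scheme of the thesis prototype. It assumes, purely for illustration, a complete tree in which every switch groups `arity` children and the tiles are numbered from 0 at the leaves. The function names and the `arity` parameter are hypothetical.

```python
def levels_to_common_switch(src: int, dst: int, arity: int) -> int:
    """Count the hierarchy levels a message climbs before src and dst share a switch.

    Assumption: tiles are leaves of a complete tree and tile i hangs under
    first-level switch i // arity, which hangs under second-level switch
    (i // arity) // arity, and so on.
    """
    if arity < 2:
        raise ValueError("arity must be at least 2")
    if src < 0 or dst < 0:
        raise ValueError("tile indices must be non-negative")
    levels = 0
    while src != dst:
        # Move both endpoints one level up the hierarchy.
        src //= arity
        dst //= arity
        levels += 1
    return levels


def switches_traversed(src: int, dst: int, arity: int) -> int:
    """Switches on the path: `levels` going up (including the shared one), levels - 1 going down."""
    levels = levels_to_common_switch(src, dst, arity)
    return 0 if levels == 0 else 2 * levels - 1


def tiles_supported(levels: int, arity: int) -> int:
    """Number of tiles reachable with a given number of hierarchy levels."""
    return arity ** levels
```

Under these assumptions and with an arity of 4, tiles 0 and 3 share a first-level switch, whereas tiles 0 and 16 meet only at the third level, so traffic between neighbouring tiles never loads the upper levels. Adding one level multiplies the number of supported tiles by the arity.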
The proposed architecture not only solves the issue of global wire delay but also addresses the reduction in yield with increasing die size. In the architecture, each tile can essentially be fabricated as a separate silicon die and integrated with the others on a larger substrate using 3D-IC fabrication. As proof of this concept, Virtex [5] uses the same methodology to integrate multiple dies onto a single substrate [17]. Thus, Virtex can be considered an instance of this architecture.

However, as reconfigurable systems scale, the runtime of the implementation tools also increases, which adversely affects productivity. This can be mitigated by adopting a hierarchical design methodology and by reusing implemented sub-components. The latter requires only the new components to be implemented and thus takes less time. However, this would require runtime resource management of the FPGA. Existing tools that support FPGA resource management impose various constraints and overheads that make them less useful in practice. To address this issue, we designed a practical and efficient resource management framework in Chapter 4. It supported the abstraction of both the computation and the communication resources of the FPGA. Communication among the sub-components was enabled through a deterministic network-on-chip. Furthermore, it supported practically important features such as dynamically sized regions, clock management, and per-application on-chip debug capability. To support this framework, a design methodology was also proposed that used existing implementation tools without any modification. The effectiveness of the framework was then demonstrated using a prototype. It was shown that the area overheads of such a framework were low, while the performance overhead depended on the application model. If a data-flow model is used, the overheads are negligible.
Otherwise, the overheads are significant. It was argued that this is a non-issue, as most well-designed applications are modular and can easily adopt a data-flow model.

Although the proposed framework enabled efficient runtime resource management, it was limited to a single FPGA. The problem of increasing implementation times is more pronounced for multi-FPGA systems. Therefore, in Chapter 5, we extended the framework to bring the benefits of runtime resource management to multi-FPGA platforms. A scalable interconnect was designed supporting the popular industry standards AXI4 and AXI4-Stream. It used a hybrid network comprising both circuit-switching and packet-switching networks. On-chip routing was enabled by a hierarchical circuit-switching network supporting deterministic timing in communication. On the other hand, off-chip communication was achieved through a packet-switching network using high-speed transceivers. Most scalable network topologies were supported by the interconnect. The interconnect was shown to be cost-effective and easier to scale than parallel interfaces and other interconnects. Furthermore, the interconnect abstracted not only the communication among hardware tasks but also the address space, so that different tasks can use the same address space without conflicts. This allowed the system to contain very large amounts of memory without requiring large addresses, by reusing the same address space. The interconnect was demonstrated in conjunction with AARP. The performance of the interconnect was comparable to that of the high-performance interconnect InfiniBand, while its overheads were comparable to those of existing commercial multi-FPGA solutions.

Although the solutions presented earlier addressed some of the major issues with the scalability and productivity of reconfigurable systems, reliability remained unaddressed.
As systems scale in terms of both process technology and size, reliability becomes a growing concern. Temperature is one of the factors that affect the reliability of a device, and its effect can be mitigated through thermal management. In the case of FPGAs, this is not possible due to the lack of multiple sensors to monitor the temperature profile of the die. Furthermore, runtime resource management makes the estimation of power dissipation impractical. Therefore, in Chapter 6, we developed a low-overhead temperature sensor that had low resource utilization and good accuracy. A comparative study showed that it provided comparable performance while using 52% fewer resources than the state of the art. The effects of resource utilization and placement on the sensor were also studied. Then, an array of such sensors was used to monitor the thermal profile of the FPGA. This information was used to control the operating frequencies of test applications through dynamic frequency scaling. It was demonstrated that the FPGA device could operate reliably with the proposed thermal management solution.

7.1 Future Work

While this thesis presents solutions to various problems in the analysis and design of scalable FPGA architectures and their resource management, a number of issues remain to be solved. Some of these are listed below.

1. Automated Tools: The work presented in this thesis is aimed at improving productivity in scalable reconfigurable computing. The proposed framework can be improved further by developing automated tools that allow designers to develop instances of the framework effortlessly. Furthermore, such tools could also guide a developer through the design methodologies presented in this thesis and assist in the development of hardware applications.

2. Thermal-Aware Resource Allocation: Although thermal management support was developed as part of this thesis, there is no existing allocator that can fully utilize its potential.
By extending resource allocators to support thermal management, we would allow the runtime manager to estimate the performance of applications under thermal constraints and thus to provide potentially better and feasible solutions to the runtime allocation problem.

3. Distributed/Scalable Resource Allocator: The work presented in this thesis focused on the implementation of a framework to support scalable runtime resource management. However, this also requires a distributed and scalable resource allocator that can schedule applications on multi-FPGA platforms efficiently at runtime while maximizing their performance. Such an allocator would improve the scalability of the overall system.

4. Support for Data Flow Graphs: Existing allocators for FPGAs typically take single tasks and schedule them on the platform. Therefore, they do not necessarily support mapping applications whose sub-tasks are connected as data flow graphs. The motivation for developing efficient resource management in this thesis was the reuse of implemented sub-components. Therefore, this problem will not be fully solved without an allocator that can allocate such applications on FPGAs.

5. Integration with Existing Operating Systems: Integration with existing operating systems will enable developers to write applications that are composed of software as well as hardware. This is analogous to the OpenCL and DX Compute APIs for graphics processing units (GPUs) in existing operating systems. This will increase the abstraction provided by the framework and thus improve productivity.

6. Support for Quality of Service: The interconnect presented in Chapter 5 routes off-chip traffic on a best-effort basis. Therefore, applications requiring communication guarantees can only be allocated if their traffic is routed locally within an FPGA. If the communication cannot be routed locally, such applications cannot be allocated.
By extending the interconnect to support quality of service, such applications could be allocated safely even if the traffic is not routed locally.

7. Support for Real-Time Applications: Since deadlines are not considered by most existing allocators, real-time tasks cannot be scheduled on such platforms. By supporting real-time constraints, we can extend the utility of such frameworks to many real-world problems requiring real-time guarantees.

The above are some of the issues that need to be solved to take scalable reconfigurable computing into the next era.

Bibliography

[1] I. Ferain, C. A. Colinge, and J.-P. Colinge, “Multigate transistors as the future of classical metal-oxide-semiconductor field-effect transistors,” Nature, vol. 479, no. 7373, pp. 310–316, 2011.
[2] Sematech, International Technology Roadmap for Semiconductors (ITRS2005). Sematech, 2005.
[3] I. Kuon and J. Rose, “Measuring the Gap Between FPGAs and ASICs,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 26, no. 2, pp. 203–215, Feb 2007.
[4] W. Carter, K. Duong, R. H. Freeman, H. Hsieh, J. Y. Ja, J. E. Mahoney, L. T. Ngo, and S. L. Sze, “A user programmable reconfiguration gate array,” in Custom Integrated Circuits Conference (CICC1986), IEEE Proceedings of, May 1986, pp. 233–235.
[5] Xilinx, Virtex FPGA Datasheet. Xilinx, 2014, vol. DS180.
[6] Xilinx, Zynq-7000 All Programmable SoC Datasheet. Xilinx, Sep. 2013, no. DS190.
[7] G. E. Moore, “Cramming more components onto integrated circuits, Reprinted from Electronics, volume 38, number 8, April 19, 1965, pp.114 ff.” Solid-State Circuits Society Newsletter, IEEE, vol. 11, no. 5, pp. 33–35, 2006.
[8] S. Kestur, J. Davis, and O. Williams, “BLAS Comparison on FPGA, CPU and GPU,” in VLSI (ISVLSI), 2010 IEEE Computer Society Annual Symposium on, July 2010, pp. 288–293.
[9] R. Kalarot and J.
Morris, “Comparison of FPGA and GPU implementations of real-time stereo vision,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, June 2010, pp. 9–15.
[10] Y. Zhang, Y. Shalabi, R. Jain, K. Nagar, and J. Bakos, “FPGA vs. GPU for sparse matrix vector multiply,” in Field-Programmable Technology, 2009. FPT 2009. International Conference on, Dec. 2009, pp. 255–262.
[11] S. Asano, T. Maruyama, and Y. Yamaguchi, “Performance comparison of FPGA, GPU and CPU in image processing,” in Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, Aug 2009, pp. 126–131.
[12] R. Spurzem, P. Berczik, G. Marcus, A. Kugel, G. Lienhart, I. Berentzen, R. Männer, R. Klessen, and R. Banerjee, “Accelerating astrophysical particle simulations with programmable hardware (FPGA and GPU),” Computer Science - Research and Development, vol. 23, no. 3-4, pp. 231–239, 2009. [Online]. Available: http://dx.doi.org/10.1007/s00450-009-0081-9
[13] C. Grozea, Z. Bankovic, and P. Laskov, “FPGA vs. Multi-core CPUs vs. GPUs: Hands-On Experience with a Sorting Application,” in Facing the Multicore-Challenge, ser. Lecture Notes in Computer Science, R. Keller, D. Kramer, and J.-P. Weiss, Eds. Springer Berlin Heidelberg, 2010, vol. 6310, pp. 105–117. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-16233-6_12
[14] D. Chen and D. Singh, “Using OpenCL to evaluate the efficiency of CPUs, GPUs and FPGAs for information filtering,” in Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, Aug 2012, pp. 5–12.
[15] D. Chen and D. Singh, “Fractal video compression in OpenCL: An evaluation of CPUs, GPUs, and FPGAs as acceleration platforms,” in Design Automation Conference (ASP-DAC), 2013 18th Asia and South Pacific, Jan 2013, pp. 297–304.
[16] Altera, Implementing FPGA Design with the OpenCL Standard. Nov. 2012, no. WP-01173.
[17] Xilinx, Xilinx Stacked Silicon Interconnect Technology. 2012, no. WP380.
[18] Xilinx, Virtex FPGA Datasheet. Xilinx, 2007, vol. DS301.
[19] Xilinx, Virtex FPGA Datasheet. Xilinx, 2010, vol. DS112.
[20] Xilinx, Virtex FPGA Datasheet. Xilinx, 2009, vol. DS100.
[21] Xilinx, Virtex FPGA Datasheet. Xilinx, 2012, vol. DS150.
[22] Altera. (2014) About Stratix Family High-End FPGAs and SoCs. Altera. [Online]. Available: http://www.altera.com/devices/fpga/stratix-fpgas/about/stx-about.html
[23] A. DeHon and R. Rubin, “Design of FPGA interconnect for multilevel metallization,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 12, no. 10, pp. 1038–1050, Oct 2004.
[24] S. Xydis, K. Pekmestzi, D. Soudris, and G. Economakos, “A High Level Synthesis Exploration Framework with Iterative Design Space Partitioning,” in VLSI 2010 Annual Symposium, ser. Lecture Notes in Electrical Engineering, N. Voros, A. Mukherjee, N. Sklavos, K. Masselos, and M. Huebner, Eds. Springer Netherlands, 2011, vol. 105, pp. 117–131. [Online]. Available: http://dx.doi.org/10.1007/978-94-007-1488-5
[25] A. Jantsch and H. Tenhunen, “Will Networks on Chip Close the Productivity Gap?” in Networks on Chip, A. Jantsch and H. Tenhunen, Eds. Springer US, 2003, pp. 3–18. [Online]. Available: http://dx.doi.org/10.1007/0-306-48727-6
[26] Xilinx, Hierarchical Design Methodology Guide. Xilinx, 2013, no. UG748.
[27] Texas Instruments, “Understanding Integrated Circuit Package Power Capabilities,” Texas Instruments, Tech. Rep., May 2004. [Online]. Available: http://www.ti.com/lit/an/snva509a/snva509a.pdf
[28] A. Gupte and P. Jones, “Hotspot Mitigation Using Dynamic Partial Reconfiguration for Improved Performance,” in Reconfigurable Computing and FPGAs, 2009. ReConFig ’09. International Conference on, Dec 2009, pp. 89–94.
[29] M. Pecht, Handbook of Electronic Package Design, ser. Dekker Mechanical Engineering. Taylor & Francis, 1991.
[30] Altera, FLEX 10K embedded programmable logic device family. Jan. 2003, no. DS-F10K-4.2.
[31] Altera, APEX 20K programmable logic device family data sheet. Altera, Mar. 2004, no. DS-APEX20K-5.1.
[32] R. Syed, X. Chen, Y. Ha, and B. Veeravalli, “sFPGA2 - A scalable GALS FPGA architecture and design methodology,” in Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, Aug 2009, pp. 314–319.
[33] R. Syed, Y. Ha, and B. Veeravalli, “A Low Overhead Abstract Architecture for FPGA Resource Management,” ACM SIGARCH Computer Architecture News/HEART’12, vol. 40, no. 5, pp. 28–33, Mar. 2012. [Online]. Available: http://doi.acm.org/10.1145/2460216.2460222
[34] R. Syed, Z. Wenfeng, Y. Ha, and B. Veeravalli, “A Low Overhead Temperature Sensor for Self-Aware Reconfigurable Platforms,” in Proceedings of the Workshop on Self-Awareness in Reconfigurable Computing Systems 2012 (SRCS), 2012, pp. 314–319.
[35] K. Andres, “MOS programmable logic arrays,” A Texas Instruments Application Report, vol. -, pp. 892–901, Oct. 1970.
[36] K. Shahookar and P. Mazumder, “VLSI Cell Placement Techniques,” ACM Computing Surveys, vol. 23, no. 2, pp. 143–220, June 1991. [Online]. Available: http://doi.acm.org/10.1145/103724.103725
[37] V. Gudise and G. Venayagamoorthy, “FPGA placement and routing using particle swarm optimization,” in VLSI, 2004. Proceedings. IEEE Computer society Annual Symposium on, Feb 2004, pp. 307–308.
[38] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson, Introduction to Algorithms, 2nd ed. McGraw-Hill Higher Education, 2001.
[39] I. Kuon, R. Tessier, and J. Rose, “FPGA Architecture: Survey and Challenges,” Foundations and Trends in Electronic Design Automation, vol. 2, no. 2, pp. 135–253, Feb. 2008. [Online]. Available: http://dx.doi.org/10.1561/1000000005
[40] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs. Norwell, MA, USA: Kluwer Academic Publishers, 1999.
[41] E. Ahmed and J.
Rose, “The effect of LUT and cluster size on deep-submicron FPGA performance and density,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 12, no. 3, pp. 288–298, March 2004.
[42] H. Schmit, “Extra-dimensional Island-Style FPGAs,” in Field Programmable Logic and Application, ser. Lecture Notes in Computer Science, P. Cheung and G. Constantinides, Eds., vol. 2778. Springer Berlin Heidelberg, 2003, pp. 406–415. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-45234-8_40
[43] J. Babb, R. Tessier, and A. Agarwal, “Virtual wires: overcoming pin limitations in FPGA-based logic emulators,” in FPGAs for Custom Computing Machines, 1993. Proceedings. IEEE Workshop on, Apr 1993, pp. 142–151.
[44] J. Babb, R. Tessier, M. Dahl, S. Hanono, D. Hoki, and A. Agarwal, “Logic emulation with virtual wires,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 16, no. 6, pp. 609–626, Jun 1997.
[45] A. Royal and P. Cheung, “Globally Asynchronous Locally Synchronous FPGA Architectures,” in Field Programmable Logic and Application, ser. Lecture Notes in Computer Science, P. Cheung and G. Constantinides, Eds., vol. 2778. Springer Berlin Heidelberg, 2003, pp. 355–364. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-45234-8_35
[46] X. Jia and R. Vemuri, “The GAPLA: a globally asynchronous locally synchronous FPGA architecture,” in Field-Programmable Custom Computing Machines, 2005. FCCM 2005. 13th Annual IEEE Symposium on, April 2005, pp. 291–292.
[47] X. Jia and R. Vemuri, “CAD tools for a globally asynchronous locally synchronous FPGA architecture,” in VLSI Design, 2006. Held jointly with 5th International Conference on Embedded Systems and Design., 19th International Conference on, Jan 2006.
[48] S. Fernando, X. Chen, and Y. Ha, “sFPGA - A scalable switch based FPGA architecture and design methodology,” in Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, Sept 2008, pp.
95–100.
[49] A. Ahmadinia, C. Bobda, S. Fekete, J. Teich, and J. van der Veen, “Optimal Free-Space Management and Routing-Conscious Dynamic Placement for Reconfigurable Devices,” Computers, IEEE Transactions on, vol. 56, no. 5, pp. 673–680, May 2007.
[50] G. Wigley and D. Kearney, “The Development of an Operating System for Reconfigurable Computing,” in Field-Programmable Custom Computing Machines, 2001. FCCM ’01. The 9th Annual IEEE Symposium on, April 2001, pp. 249–250.
[51] C. Steiger, H. Walder, and M. Platzner, “Operating systems for reconfigurable embedded platforms: online scheduling of real-time tasks,” Computers, IEEE Transactions on, vol. 53, no. 11, pp. 1393–1407, Nov 2004.
[52] K. Danne, R. Muhlenbernd, and M. Platzner, “Server-based execution of periodic tasks on dynamically reconfigurable hardware,” Computers Digital Techniques, IET, vol. 1, no. 4, pp. 295–302, July 2007.
[53] R. Pellizzoni and M. Caccamo, “Real-Time Management of Hardware and Software Tasks for FPGA-based Embedded Systems,” Computers, IEEE Transactions on, vol. 56, no. 12, pp. 1666–1680, Dec 2007.
[54] M. Majer, J. Teich, A. Ahmadinia, and C. Bobda, “The Erlangen Slot Machine: A Dynamically Reconfigurable FPGA-based Computer,” The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 47, no. 1, pp. 15–31, 2007. [Online]. Available: http://dx.doi.org/10.1007/s11265-006-0017-6
[55] E. Lübbers and M. Platzner, “ReconOS: Multithreaded Programming for Reconfigurable Computers,” ACM Transactions on Embedded Computing Systems (TECS), vol. 9, no. 1, pp. 8:1–8:33, Oct 2009. [Online]. Available: http://doi.acm.org/10.1145/1596532.1596540
[56] A. Oetken, S. Wildermann, J. Teich, and D. Koch, “A Bus-Based SoC Architecture for Flexible Module Placement on Reconfigurable FPGAs,” in Field Programmable Logic and Applications (FPL), 2010 International Conference on, Aug 2010, pp. 234–239.
[57] L. Benini and G.
De Micheli, “Networks on chips: a new SoC paradigm,” Computer, vol. 35, no. 1, pp. 70–78, Jan 2002.
[58] C. Bobda, A. Ahmadinia, M. Majer, J. Teich, S. Fekete, and J. van der Veen, “DyNoC: A dynamic infrastructure for communication in dynamically reconfigurable devices,” in Field Programmable Logic and Applications, 2005. International Conference on, Aug 2005, pp. 153–158.
[59] T. Pionteck, R. Koch, and C. Albrecht, “Applying Partial Reconfiguration to Networks-On-Chips,” in Field Programmable Logic and Applications, 2006. FPL ’06. International Conference on, Aug 2006, pp. 1–6.
[60] L. Devaux, S. Ben Sassi, S. Pillement, D. Chillet, and D. Demigny, “DRAFT: Flexible interconnection network for dynamically reconfigurable architectures,” in Field-Programmable Technology, 2009. FPT 2009. International Conference on, Dec 2009, pp. 435–438.
[61] S. Jovanović, C. Tanougast, C. Bobda, and S. Weber, “CuNoC: A dynamic scalable communication structure for dynamically reconfigurable FPGAs,” Microprocessors and Microsystems, vol. 33, no. 1, pp. 24–36, 2009, selected Papers from ReCoSoC 2007 (Reconfigurable Communication-centric Systems-on-Chip). [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0141933108000835
[62] B. Ahmad, A. Erdogan, and S. Khawam, “Architecture of a Dynamically Reconfigurable NoC for Adaptive Reconfigurable MPSoC,” in Adaptive Hardware and Systems, 2006. AHS 2006. First NASA/ESA Conference on, June 2006, pp. 405–411.
[63] M. Majer, C. Bobda, A. Ahmadinia, and J. Teich, “Packet Routing in Dynamically Changing Networks on Chip,” in Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, April 2005, pp. 154b–154b.
[64] M. Inagi, Y. Takashima, and Y. Nakamura, “Globally optimal time-multiplexing in inter-FPGA connections for accelerating multi-FPGA systems,” in Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, Aug 2009, pp. 212–217.
[65] J. D.
Davis, C. P. Thacker, and C. Chang, “BEE3: Revitalizing Computer Architecture Research,” Microsoft Research, Tech. Rep., 2009.
[66] O. Pell and V. Averbukh, “Maximum Performance Computing with Dataflow Engines,” Computing in Science Engineering, vol. 14, no. 4, pp. 98–103, July 2012.
[67] P. Jones, J. Moscola, Y. Cho, and J. Lockwood, “Adaptive Thermoregulation for Applications on Reconfigurable Devices,” in FPL 2007 Intl. Conf., Aug. 2007, pp. 246–253.
[68] P. Jones, Y. Cho, and J. Lockwood, “Dynamically Optimizing FPGA Applications by Monitoring Temperature and Workloads,” in VLSI Design, 2007. Held jointly with 6th International Conference on Embedded Systems., 20th International Conference on, Jan 2007, pp. 391–400.
[69] D. Atienza and E. Martinez, “Inducing Thermal-Awareness in Multicore Systems Using Networks-on-Chip,” in VLSI, 2009. ISVLSI ’09. IEEE Computer Society Annual Symposium on, May 2009, pp. 187–192.
[70] X. Zhang, W. Jouini, P. Leray, and J. Palicot, “Temperature-Power Consumption Relationship and Hot-Spot Migration for FPGA-Based System,” in Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int’l Conference on Int’l Conference on Cyber, Physical and Social Computing (CPSCom), Dec. 2010, pp. 392–397.
[71] P. Chen, M.-C. Shie, Z.-Y. Zheng, Z.-F. Zheng, and C.-Y. Chu, “A Fully Digital Time-Domain Smart Temperature Sensor Realized With 140 FPGA Logic Elements,” Circuits and Systems I: Regular Papers, IEEE Transactions on, vol. 54, no. 12, pp. 2661–2668, Dec 2007.
[72] P. Chen, S.-C. Chen, Y.-S. Shen, and Y.-J. Peng, “All-Digital Time-Domain Smart Temperature Sensor With an Inter-Batch Inaccuracy of 0.7C-0.6C After One-Point Calibration,” Circuits and Systems I: Regular Papers, IEEE Transactions on, vol. 58, no. 5, pp. 913–920, May 2011.
[73] C. Ruething, A. Agne, M. Happe, and C.
Plessl, “Exploration of ring oscillator design space for temperature measurements on FPGAs,” in Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, Aug 2012, pp. 559–562.
[74] Z. Chen, R. Nagesh, A. Reddy, and P. Schaumont, “Increasing the Sensitivity of On-Chip Digital Thermal Sensors with Pre-Filtering,” in VLSI, 2009. ISVLSI ’09. IEEE Computer Society Annual Symposium on, May 2009, pp. 304–309.
[75] C. Leiserson, “Fat-trees: Universal networks for hardware-efficient supercomputing,” Computers, IEEE Transactions on, vol. C-34, no. 10, pp. 892–901, Oct 1985.
[76] ISO, ISO/IEC 10918-1:1994: Information technology — Digital compression and coding of continuous-tone still images: Requirements and guidelines. Geneva, Switzerland: ISO, 1994.
[77] B. W. Kernighan and S. Lin, “An Efficient Heuristic Procedure for Partitioning Graphs,” Bell System Technical Journal, vol. 49, no. 2, pp. 291–307, 1970. [Online]. Available: http://dx.doi.org/10.1002/j.1538-7305.1970.tb01770.x
[78] Xilinx, XUPV2P User Guide. Xilinx, Apr. 2008, no. UG069.
[79] SATA-IO, Serial ATA 2.5 Specification. Serial ATA International Organization, 2005.
[80] W. J. Kim and Y.-B. Kim, “Wave Pipelined Circuits Synthesis,” in Instrumentation and Measurement Technology Conference, 2005. IMTC 2005. Proceedings of the IEEE, vol. 1, May 2005, pp. 32–36.
[81] Xilinx, XUPV5/ML505/ML506/ML507 User Guide. Xilinx, Oct. 2008, no. UG347.
[82] K. Xu, H.264/AVC Baseline Decoder. OpenCores, Oct. 2009.
[83] Xilinx, Fast Fourier Transform v7.1. Xilinx, Apr. 2003, no. DS260.
[84] ARM, AMBA® AXI and ACE Protocol Specification. ARM, Oct. 2011, no. IHI0022D.
[85] ARM, AMBA® AXI4-Stream Protocol. ARM, Mar. 2010, no. IHI0051A.
[86] Xilinx, ML605 User Guide. Xilinx, Jul. 2011, no. UG534.
[87] P. Mangalagiri, S. Bae, R. Krishnan, Y. Xie, and V. Narayanan, “Thermal-aware reliability analysis for Platform FPGAs,” in Computer-Aided Design, 2008. ICCAD 2008.
IEEE/ACM International Conference on, Nov 2008, pp. 722–727.

... the die of existing FPGAs and use this information to control the regional power dissipation of the FPGA die. However, the study of other effects on the reliability of FPGAs will be beyond the scope of this thesis.

1.6 Key Contributions and Thesis Overview

The main aim of this thesis is to study the problems associated with scalability in FPGAs and their resource management, and to propose practical and novel ...

... scalability of FPGAs and also improve the productivity of designers by providing better tools and frameworks. The following are some of the major contributions that have been achieved during the course of this research and have led to this thesis:
• A novel scalable FPGA architecture that significantly reduces the overheads associated with designing larger FPGAs. This work was published in [32].
• A design methodology ...

... [34]. The organization of the thesis is as follows. Chapter 2 gives an overview of existing FPGA architectures and explains the design flow for implementing hardware applications on FPGAs. It then discusses runtime resource management for reconfigurable platforms and presents related work and its shortcomings. In Chapter 3, we introduce our scalable FPGA architecture and its design methodology. We ...

... algorithms and design methodologies will be beyond the scope of this thesis. Third, to improve the reliability of reconfigurable systems, we will study existing work and understand the difficulties in real-time thermal monitoring of existing FPGAs. Then, we will design and develop a low-overhead temperature sensor using the resources within existing FPGAs. This will enable us to monitor the temperature profile of ...
... in scalable reconfigurable computing, and are addressed in this thesis:
• Scalability in FPGA Architecture: Design an FPGA architecture along with its interconnect to allow scalability without significant performance and area overheads (Scalability in FPGAs, Scalable Interconnect).
• Efficiency in Resource Management: Design and develop a framework to allow reuse of implemented sub-components across designs ...

... the merits and demerits of such an architecture through a case study. In Chapter 4, we present our framework for resource management of FPGAs and demonstrate its effectiveness experimentally using a prototype. Chapter 5 introduces our scalable off-chip interconnect and evaluates its performance and overheads on a multi-FPGA system. In Chapter 6, we present our LUT-based temperature sensor and compare ...

... the performance and efficiency of FPGAs as compared to ASICs [3]. However, scaling of FPGAs is not possible without increasing the routing resources of the FPGA. The increase in switching requirement is asymptotically bounded below by Eq. 1.1 [23] and is superlinear in the number of logic resources:

$N_{sw}(N) = B_{SW} \cdot N = \Omega\left(N^{p+0.5}\right)$, (1.1)

where $N_{sw}(N)$ is the number of switches, $N$ is the number of logic resources, ...

Fig. 1.4: The Design Productivity Gap [2]

... and is shown in Fig. 1.4. The productivity gap is a comparison between the manufacturing complexity (i.e., the number of transistors that we can manufacture) and the designer productivity (i.e., the number of transistors that we can design). Although this trend is for ASICs, we find a similar trend in the design of circuits on FPGAs. The complexity in designing digital ...

2.1 Architectures of different programmable devices
2.2 Simplified architecture of a modern-day FPGA
2.3 Generic Design Flow
2.4 Configuration Architectures of FPGAs
3.1 Routing Methodology
3.2 Proposed sFPGA2 Implementation
3.3 sFPGA2 Architecture Block Diagram
3.4 IO Transceiver Design ...
1.4.3 Scalable Interconnect

Our scalable FPGA architecture presented earlier solves the issue of scalability for single-FPGA systems. However, for multi-FPGA systems, that solution is too expensive in terms of the required off-chip infrastructure, because the hierarchical interconnect requires a large number of switching units. To solve this issue, we develop a scalable off-chip ...

... presents a scalable FPGA architecture to mitigate the scalability issues of single FPGAs and multi-FPGA systems. Furthermore, it proposes an abstract architecture for resource management of scalable ...

... scalability of FPGAs, we propose a novel scalable FPGA architecture and its design methodology. This architecture allows us to model both single FPGAs and multi-FPGA systems under a single architecture.
