<u>Projects in Algorithm Acceleration using GPUs, FPGAs and Custom ICs </u>

Algorithm Acceleration using GPUs, FPGAs and Custom ICs

Over the last few years, Graphics Processing Units (GPUs) have become increasingly flexible and powerful. GPUs are used to drive the display of desktop and laptop computers. Recent GPUs consist of a multitude (up to 240) of simple processors, which operate in lock-step. My group was the first to publish results on the use of GPUs to accelerate VLSI design algorithms. One of the first algorithms we accelerated on the GPU was circuit simulation. A circuit simulator (such as SPICE) is a software tool that is used to simulate the electrical behavior of a VLSI circuit before it is manufactured, in order to ensure that it meets speed and power budgets. Part of our work in this area was done with funding from Nascentric Inc., an Austin, Texas startup. Nascentric's fast circuit simulator was sped up by a further 2.5X as a result of our research. Our software was integrated into Nascentric's product line. Nascentric demonstrated this product at the Design Automation Conference in 2008.

In addition, we have accelerated fault simulation, Boolean Satisfiability, fault table generation as well as statistical static timing analysis (SSTA) on the GPU. We are currently working on an accelerated SAT solution approach on the GPU, as well as a automatic engine to generate GPU code from in-line C code. Other efforts include MIMO decoding on the GPU and radar signal processing on the GPU.

We have worked extensively on FPGA and custom-IC based algorithm accelerators as well. The algorithms we have targeted are SAT, sorting, comparison, signal processing for weather radar, WiMAX decoding, LDPC decoding, sphere decoding for MIMO, logarithm/antilogarithm computation and cryptokey generation.

Publications, patents and artefacts:

"Towards Acceleration of Fault Simulation using Graphics Processing Units", Gulati, Khatri. ACM/ EDAC/IEEE Design Automation Conference (DAC) 2008, June 8-13 2008, Anaheim, CA, pp. 822-827. We demonstrated a 35X speedup (estimated 240X speedup with a 8-GPU server) for fault simulation, over a leading commercial tool. This was the first published paper on the use of GPUs for CAD algorithm acceleration. Presentation slides.
"Fast Circuit Simulation on Graphics Processing Units", Gulati, Croix, Khatri, Shastry. IEEE/ACM Asia and South Pacific Design Automation Conference (ASP-DAC) 2009, Yokohama, Japan, Jan 19-22 2009, pp. 403-408. Using the code base of a leading fast SPICE startup, we demonstrated a 40X speedup for BSIM3 model card evaluation, compared to a CPU based implementation. The fast SPICE tool was sped up by a factor of 2.4X. Our work was integrated into this tool, and offered commercially. Presentation slides.
"Accelerating Statistical Static Timing Analysis Using Graphics Processing Units", Gulati, Khatri. IEEE/ACM Asia and South Pacific Design Automation Conference (ASP-DAC) 2009, Yokohama, Japan, Jan 19-22 2009, pp. 260-265. This paper reports our work on speeding up SSTA on the GPU. Our results show a 260X speedup (estimated 785X speedup with a Quad GPU server) over the CPU based implementation. Presentation slides.
"Fault Table Computation on GPUs", Gulati, Khatri. Journal of Electronic Testing: Theory and Applications (JETTA). Vol 26, number 2, April 2010. pp 195-209. In this paper, we utilize the GPU to perform fault table generation, with a speedup of 20X (projected 80X speedup with a 8-GPU server) compared to a CPU based implementation. This paper is also listed in the "GPUs for VLSI CAD" section. This paper also appeared in the International Test Synthesis Workshop (ITSW) 2009, where it received the Best student paper award.
"Boolean Satisfiability on a Graphics Processor", Gulati, Khatri. Great Lakes Symposium on VLSI (GLS-VLSI) 2010. Providence, RI. May 16-18, 2010. In this paper, we implement Boolean Satisfiability on a GPU, using a technique which achieves a significant speedup over MiniSAT, a leading CPU based SAT solver while retaining the completeness of the search.
"An Automated Approach for SIMD Kernel Generation for GPU based Software Acceleration", Gulati, Khatri. Symposium on Application Accelerators in High Performance Computing (SAAHPC) 2009, Urbana, IL. July 28-30, 2009. In this paper, we present an approach to automatically generate GPU code from in-line CPU code, for routines that are run repeatedly on independent data. Our GPU kernel generation engine quickly presents the user with a kernel configuration which typically is among the fastest realizations of the subroutine on the GPU.
"FPGA-Based Hardware Acceleration for Boolean Satisfiability", Gulati, Paul, Khatri, Patil, Jas. ACM Transactions on Design Automation of Electronic Systems (TODAES). Vol 14, number 2, Mar 2009. Among the top 10 downloaded papers for the journal in 2010. Nominated for best paper for the journal (2010). This paper presents an FPGA based scalable SAT solver. The SAT instance is partitioned up-front, and the FPGA solves one partition at a time, with the ability to do non-chronological backtrack within as well as across partitions, and operates about 17X faster than MiniSAT, a very powerful software based solver.
"An Efficient, Scalable Hardware Engine for Boolean Satisfiability and Unsatisfiable Core Extraction", Gulati, Waghmode, Khatri, Shi. IET Computers and Digital Techniques, vol. 2, number 3, May 2008, pp. 214-229. A shorter version of this paper appeared in the IEEE International Conference on Computer Design (ICCD) 2006 as well. This paper is a custom-IC based design of a SAT solver. The solution is designed to scale elegantly, and provides over three orders of magnitude speedup over software based approaches.
"Sorting Binary Numbers in Hardware - a Novel Algorithm and its Implementation", Alaparthi, Gulati, Khatri. International Symposium on Circuits and Systems (ISCAS) 2009, Taipei, Taiwan. May 24-27, 2009. In this paper, we present a fast special function unit for sorting, which is based on a column scan, and is significantly faster than the best known existing approach, with lower area (for larger numbers).
"CMOS Comparators for High-Speed and Low-Power Applications", Menendez, Maduike, Garg, Khatri. IEEE International Conference on Computer Design (ICCD), Oct 1-4, 2006, San Jose, CA, pp. 76-81. We present two novel ways to design hardware comparators, yielding about 37% improvement over competing approaches.
"A Generic Radar Processor Design Using Software Defined Radio", Brimeyer, Martin, Loew, Farquharson, Khatri, Paul. 33rd American Meteorology Society (AMS) Conference on Radar Meteorology, Aug 6-10, 2007, Cairns, Australia. In this paper, we present a highly configurable, power-efficient radar signal processor for weather applications.
"FPGA Based Signal Processing Platform For Weather Radar", Paul, Khatri, Martin, Brimeyer, Loew, Vivekanandan. International Geoscience and Remote Sensing Symposium (IGARSS) 2007, July 23-27, Barcelona, Spain. This paper reports the results from an FPGA based implementation of a weather radar processor that we conducted. This processor was driven by a PCI interface, and was verified to operate correctly.
Invited Paper "Highly Parallel Decoding of Space-Time Codes on Graphics Processing Units", Bollapalli, Wu, Gulati, Khatri, Calderbank. Annual Allerton Conference on Communication, Control and Computing, 2009, Urbana, IL. Sept 30 – Oct 2, 2009. In this paper, we implement the decoding tasks of a WiMAX base station on a GPU. We show that the GPU implementation is about 700X faster than a serial implementation. The results are significant, since they could dramatically reduce the cost of electronics in the base station.
"High-throughput VLSI Implementations of Iterative Decoders and Related Code Construction Problems". Nagarajan, Laendner, Jayakumar, Milenkovic, Khatri. Springer Journal of VLSI Signal Processing, vol 49, number 1, Oct 2007, pp 185-206. A shorter version of this paper appeared, in GlobeComm 2004, pp. 361-365. This effort reports the results of a custom-IC implementation of an LDPC decoder. The approach reduces routing congestion by careful code construction, and improves over standard cell based implementations in terms of power, delay as well as area.
"VLSI Implementation of a Staggered Sphere Decoder Design for MIMO Detection", Bhagawat, Ekambavanan, Das, Choi, Khatri. 45th Annual Allerton Conference on Communication, Control and Computing, Sept 26-28, 2007, Urbana, IL, pp. 228-235. A custom IC implementation of a sphere decoder is undertaken in this paper. Compared to existing approaches, our approach achieves significantly better throughput per unit power or throughput per unit area.
"A Novel Cryptographic Key Exchange Scheme using Resistors", Lin, Ivanov, Johnson, Khatri. IEEE International Conference on Computer Design (ICCD) 2011, Amherst, MA, Oct 2011. pp 451-452. In this paper, we report a practical FPGA based implementation of the Kish cipher, intended to use over the internet. Given a single shared secret between Alice and Bob, they are both able to generate a new shared secret (cryptographic key).
"VLSI Implementation of a Non-Linear Feedback Shift Register for High-Speed Cryptography Applications", Lin, Khatri. Great Lakes Symposium on VLSI (GLS-VLSI) 2010. Providence, RI May 16-18, 2010. This paper presents an extremely fast LFSR based cryptographic key generator, which can operate at rates which match OC-768 optical fiber communication rates.
"A Fast Hardware Approach for Approximate, Efficient Logarithm and Antilogarithm Computations", Paul, Jayakumar, Khatri. IEEE Transactions on Very Large Scale Integration Systems, vol. 17, number 2, Feb 2009, pp. 269-277. This paper reports an FPGA based log/antilog engine, which is based on table lookup followed by interpolation (without needing to perform multiplication or division).
"Manycore, Heterogenous, or Neither: Which One is the Way to Go for EDA?". Invited to serve as a panelist to discuss this topic, along with other leading researchers in the field of GPU based implementation of EDA algorithms. The panel was part of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Jose, CA, Nov 7-10, 2011.
"Introducton to GPU Programming for EDA", Croix, Khatri. This tutorial presents an overview of the GPU architecture and the constraints it poses on accelerating software. Acceleration guidelines are provided, with the help of several case studies. This tutorial was presented at the International Conference on Computer-Aided Design (ICCAD), 2009. Slides
"EDA Algorithm Acceleration with FPGAs, GPUs, and Custom ICs", Gulati, Khatri. Monograph published by Springer Publishers. 1st edition, 2010. 194p. ISBN 978-1-4419-0943-5. This monograph provides a comprehensive coverage of our work in accelerating CAD algorithms on the GPU.