Algorithm Acceleration using GPUs, FPGAs and Custom ICs
Over the last few years, Graphics Processing Units (GPUs) have become
increasingly flexible and powerful. GPUs are used to drive the display of
desktop and laptop computers. Recent GPUs consist of a multitude (up to
240) of simple processors, which operate in lock-step. My group was the
first to publish results on the use of GPUs to accelerate VLSI design
algorithms. One of the
first algorithms we accelerated on the GPU was circuit simulation. A
circuit simulator (such as SPICE) is a software tool that is used to
simulate the electrical behavior of a VLSI circuit before it is
manufactured, in order to ensure that it meets speed and power
budgets. Part of our work in this area was done with funding from
Nascentric Inc., an Austin, Texas startup. Nascentric's fast
circuit simulator was sped up by a further 2.5X as a result of our
research. Our software was integrated into Nascentric's product
line. Nascentric
demonstrated this product at the Design Automation Conference in 2008.
In addition, we have accelerated fault simulation, Boolean Satisfiability, fault table generation
as well as statistical static timing analysis (SSTA) on the GPU. We are
currently working on an accelerated SAT solution approach on the GPU, as
well as a automatic engine to generate GPU code from in-line C code.
Other efforts include MIMO decoding on the GPU and radar signal processing on
the GPU.
We have worked extensively on FPGA and custom-IC based algorithm
accelerators as well. The algorithms we have targeted are SAT, sorting,
comparison, signal processing for weather radar, WiMAX decoding, LDPC
decoding, sphere decoding for MIMO, logarithm/antilogarithm computation and
cryptokey generation.
Publications, patents and artefacts:
- "Towards Acceleration of Fault Simulation using Graphics Processing Units",
Gulati, Khatri. ACM/ EDAC/IEEE Design Automation Conference (DAC) 2008,
June 8-13 2008, Anaheim, CA, pp. 822-827. We demonstrated a 35X speedup
(estimated 240X speedup with a 8-GPU server) for fault simulation, over a
leading commercial tool. This was the first published paper on the use of
GPUs for CAD algorithm acceleration. Presentation slides.
- "Fast Circuit Simulation on Graphics Processing
Units",
Gulati, Croix,
Khatri, Shastry. IEEE/ACM Asia and South Pacific Design Automation
Conference (ASP-DAC) 2009, Yokohama, Japan, Jan 19-22 2009, pp. 403-408.
Using the code base of a leading fast SPICE startup, we demonstrated a 40X
speedup for BSIM3 model card evaluation, compared to a CPU based
implementation. The fast SPICE tool was sped up by a factor of 2.4X. Our
work was integrated into this tool, and offered commercially. Presentation slides.
- "Accelerating Statistical Static Timing Analysis Using Graphics Processing
Units", Gulati, Khatri. IEEE/ACM Asia and South Pacific Design Automation
Conference (ASP-DAC) 2009, Yokohama, Japan, Jan 19-22 2009,
pp. 260-265. This paper reports our work on speeding up SSTA on the
GPU. Our results show a 260X speedup (estimated 785X speedup with a Quad
GPU server) over the CPU based implementation. Presentation slides.
- "Fault Table Computation on GPUs", Gulati, Khatri. Journal of Electronic Testing: Theory and Applications (JETTA). Vol 26, number 2, April
2010. pp 195-209. In this paper, we utilize
the GPU to perform fault table generation, with a speedup of 20X (projected
80X speedup with a 8-GPU server) compared to a CPU based implementation.
This paper is also listed in the "GPUs for VLSI CAD" section. This paper
also appeared in the International Test Synthesis Workshop (ITSW) 2009,
where it received the Best student paper award.
- "Boolean Satisfiability on a Graphics Processor",
Gulati, Khatri. Great
Lakes Symposium on VLSI (GLS-VLSI) 2010. Providence, RI. May 16-18, 2010.
In this paper, we implement Boolean Satisfiability on
a GPU, using a technique which achieves a significant speedup over MiniSAT,
a leading CPU based SAT solver while retaining the completeness of the
search.
- "An Automated Approach for SIMD Kernel Generation for GPU based Software
Acceleration", Gulati,
Khatri. Symposium on Application Accelerators in
High Performance Computing (SAAHPC) 2009, Urbana, IL. July 28-30, 2009.
In this paper, we present an approach to
automatically generate GPU code from in-line CPU code, for routines that
are run repeatedly on independent data. Our GPU kernel generation engine
quickly presents the user with a kernel configuration which typically is
among the fastest realizations of the subroutine on the GPU.
- "FPGA-Based Hardware Acceleration for Boolean Satisfiability", Gulati,
Paul, Khatri, Patil, Jas. ACM Transactions on Design Automation of
Electronic Systems (TODAES). Vol 14, number 2, Mar 2009. Among the top 10
downloaded papers for the journal in 2010. Nominated for best paper for the
journal (2010). This paper presents an FPGA based scalable SAT solver.
The SAT instance is partitioned up-front, and the FPGA solves one partition
at a time, with the ability to do non-chronological backtrack within as
well as across partitions, and operates about 17X faster than MiniSAT, a
very powerful software based solver.
- "An Efficient, Scalable Hardware Engine for Boolean Satisfiability and
Unsatisfiable Core Extraction", Gulati, Waghmode, Khatri, Shi. IET
Computers and Digital Techniques, vol. 2, number 3, May 2008, pp. 214-229.
A shorter version of this paper appeared in the IEEE International Conference on Computer
Design (ICCD) 2006 as well. This paper is a custom-IC based design of a SAT
solver. The solution is designed to scale elegantly, and provides over
three orders of magnitude speedup over software based approaches.
- "Sorting Binary Numbers in Hardware - a Novel Algorithm and its
Implementation", Alaparthi, Gulati, Khatri. International Symposium on
Circuits and Systems (ISCAS) 2009, Taipei, Taiwan. May 24-27, 2009. In this paper, we present a fast special
function unit for sorting, which is based on a column scan, and is significantly faster than the best
known existing approach, with lower area (for larger numbers).
- "CMOS Comparators for High-Speed and
Low-Power Applications", Menendez, Maduike, Garg, Khatri. IEEE
International Conference on Computer Design (ICCD), Oct 1-4, 2006, San
Jose, CA, pp. 76-81. We present two novel ways to design hardware
comparators, yielding about 37% improvement over competing approaches.
- "A Generic Radar Processor Design Using Software Defined Radio", Brimeyer,
Martin, Loew, Farquharson, Khatri, Paul. 33rd American Meteorology Society
(AMS) Conference on Radar Meteorology, Aug 6-10, 2007, Cairns,
Australia. In this paper, we present a highly configurable,
power-efficient radar signal processor for weather applications.
- "FPGA Based Signal Processing Platform For Weather Radar", Paul, Khatri,
Martin, Brimeyer, Loew, Vivekanandan. International Geoscience and Remote
Sensing Symposium (IGARSS) 2007, July 23-27, Barcelona, Spain. This paper
reports the results from an FPGA based implementation of a weather radar
processor that we conducted. This processor was driven by a PCI
interface, and was verified to operate correctly.
- Invited Paper "Highly Parallel Decoding of Space-Time Codes on Graphics
Processing Units", Bollapalli, Wu, Gulati, Khatri, Calderbank. Annual
Allerton Conference on Communication, Control and Computing, 2009, Urbana,
IL. Sept 30 – Oct 2, 2009. In this paper, we implement the decoding tasks of a
WiMAX base station on a GPU. We show that the GPU implementation is about
700X faster than a serial implementation. The results are significant,
since they could dramatically reduce the cost of electronics in the base
station.
- "High-throughput VLSI Implementations of Iterative Decoders and Related
Code Construction Problems". Nagarajan, Laendner, Jayakumar, Milenkovic,
Khatri. Springer Journal of VLSI Signal Processing, vol 49, number 1, Oct
2007, pp 185-206. A shorter version of this
paper appeared,
in GlobeComm 2004, pp. 361-365. This effort reports the results of a
custom-IC implementation of an LDPC decoder. The approach reduces routing
congestion by careful code construction, and improves over standard cell
based implementations in terms of power, delay as well as area.
- "VLSI Implementation of a Staggered Sphere Decoder Design for MIMO
Detection", Bhagawat, Ekambavanan, Das, Choi, Khatri. 45th Annual Allerton
Conference on Communication, Control and Computing, Sept 26-28, 2007,
Urbana, IL, pp. 228-235. A custom IC implementation of a sphere decoder is
undertaken in this paper. Compared to existing approaches, our approach
achieves significantly better throughput per unit power or throughput per
unit area.
- "A Novel Cryptographic Key Exchange Scheme
using Resistors", Lin, Ivanov, Johnson, Khatri. IEEE International
Conference on Computer Design (ICCD) 2011, Amherst, MA, Oct 2011. pp
451-452. In this paper, we report a practical FPGA based implementation of the Kish
cipher, intended to use over the internet. Given a single shared secret
between Alice and Bob, they are both able to generate a new shared secret
(cryptographic key).
- "VLSI Implementation of a Non-Linear
Feedback Shift Register for High-Speed Cryptography Applications",
Lin, Khatri. Great Lakes Symposium on VLSI (GLS-VLSI) 2010. Providence,
RI May 16-18, 2010. This paper presents an extremely fast LFSR based
cryptographic key generator, which can operate at rates which match
OC-768 optical fiber communication rates.
- "A Fast Hardware Approach for Approximate,
Efficient Logarithm and Antilogarithm Computations", Paul,
Jayakumar, Khatri. IEEE Transactions on Very Large Scale Integration
Systems, vol. 17, number 2, Feb 2009, pp. 269-277. This paper reports an FPGA based log/antilog
engine, which is based on table lookup followed by interpolation
(without needing to perform multiplication or division).
- "Manycore, Heterogenous, or Neither: Which One is the Way to Go for
EDA?". Invited to serve as a panelist to discuss this topic, along with
other leading researchers in the field of GPU based implementation of EDA
algorithms. The panel was part of the IEEE/ACM International Conference
on Computer-Aided Design (ICCAD), San Jose, CA, Nov 7-10, 2011.
- "Introducton to GPU Programming for EDA", Croix, Khatri. This tutorial presents an overview of the GPU architecture and the constraints it poses on accelerating software. Acceleration guidelines are provided, with the help of several case studies. This tutorial was presented
at the International Conference on Computer-Aided Design (ICCAD), 2009.
Slides
- "EDA Algorithm Acceleration with FPGAs, GPUs, and Custom ICs", Gulati,
Khatri. Monograph published by Springer Publishers. 1st edition,
2010. 194p. ISBN 978-1-4419-0943-5. This monograph provides a comprehensive
coverage of our work in accelerating CAD algorithms on the GPU.