CAREER: Parallel CAD Algorithms on Emerging Multi-Core Platforms



Peng Li


National Science Foundation (NSF)

Directorate for Computer & Information Science & Engineering (CSE)

Division of Computer and Communication Foundations (CCF)

NSF Award #:



August 1, 2008 – July 31, 2014

NSF Project Site:




Principle investigators:

Prof. Peng Li



Tong Xu, Ph. D. (to join Cadence Design Systems)

Jimmy (Yingyezhe) Jin, Ph. D.

Qian Wang, Ph. D.

Tony (Ya) Wang, Ph. D.

Honghuang Lin, Ph. D.



Botang Shao, M.S., Freescale Semiconductor, Austin, Texas

Yong Zhang, Ph. D., Cadence Design Systems, San Jose, California

Suming Lai, Ph. D., Maxim, San Diego, California

Wei Dong, Ph. D., joined with Texas Instruments, Dallas, Texas

Xiaoji Ye, Ph. D., joined Intel Corporation, Hillsboro, Oregon

Srinath Narasimhan, M.S., joined Intel Corporation, Hillsboro, Oregon

Chang Zhao, M. S., joined Synopsys, Hillsboro, Oregon

Zhiyu Zeng, Ph. D., joined Cadence Design Systems, Austin, Texas

Ruicheng Dai, M. S., joined Nvidia Corporation, Santa Clara, California

Mingchao Wang, M. S., joined Mentor Graphics, Wilsonville, Oregon



The recent industry’s shift to multi-core processor technology has literally made every modern-day desktop, sever, and laptop a parallel computing system. With multiple processing elements, or cores, integrated on a single chip, multi-core processors offer unprecedented, ubiquitously available parallel compute power. The paradigm change in computing has profound impacts on computer-aided design (CAD) of today’s ever complex nanoscale integrated circuits. This CAREER project intends to address the new computing challenges and opportunities as the current serial CAD practices transition to the future massively parallel paradigms, as dictated by the need to keep up with the increase in design complexity. The central approach of this work is to bring indispensable domain knowledge to parallel algorithm design so as to facilitate the most efficient use of emerging parallel hardware. A rich set of application-level coarse-grained parallelisms will be exploited under the single and multiple algorithm contexts to gain good parallel processing efficiency on multi-core and many-core platforms. A range of parallel CAD algorithms, encompassing key analysis, modeling and optimization steps in the design flow, will be developed. This work will lead to massively parallel CAD frameworks and tools that are critical to the design of a wide spectrum of nanometer digital, analog and mixed-signal integrated applications.

The computationally intensive nature of VLSI CAD makes it an important application domain of multi-core computing. The proposed work addresses the urgent need for developing parallel software applications and the resulting parallel computing paradigms are likely to be applicable to other science and engineering fields. As integral parts of this CAREER plan, education and research efforts will be integrated to provide research experience to undergraduate and graduate students. Research participation from minority and underrepresented student groups as well as K-12 education outreach will be actively promoted through existing research, outreach programs and inter-university collaborations. The PI will integrate the research outcomes of this project into undergraduate and graduate curriculum development and broadly disseminate them to academic, governmental and industrial sectors. Multidisciplinary, international and industrial collaborations will be formed to broaden the horizon of engineering students. Close industry ties will be leveraged to provide students with real-life experiences and facilitate immediate impacts of this work in industrial practice.

Related Publications 


Note: Supervised students are delineated with an asterisk (*).


[DAC'14] *Honghuang Lin and Peng Li, “Parallel hierarchical reachability analysis for analog verification,” in Proc. of IEEE/ACM Design Automation Conference, June 2014.


[TODAES’13] *Zhiyu Zeng, *Suming Lai and Peng Li, “IC power delivery: voltage regulation and conversion, system-level co-optimization and technology implications,” in ACM Trans. on Design Automation of Electronic Systems, March 2013. 


[ICCAD'12]  Peng Li, “Design analysis of IC power delivery,” IEEE/ACM Conf. on Computer-Aided Design (invited paper), pp. 664-666, November 2012.


[ICDCT’12] *Suming Lai, Peng Li and Zhiyu Zeng, “Design and analysis of IC power delivery with on-chip voltage regulation,” International Conference on IC Design and Technology, pp. 1-4, May 2012 (invited).


[ICCAD’11] *Zhiyu Zeng, *Tong Xu, Zhuo Feng and Peng Li, “Fast static analysis of power grids: algorithms and implementations,” in Proc. of IEEE/ACM Int. Conf. on Computer-Aided Design, pp. 488-493, November 2011 (invited).


[TODAES’11] *Zhiyu Zeng, Zhuo Feng, Peng Li and Vivek Sarin, “Locality-driven parallel static analysis for power delivery networks,” in ACM Trans. on Design Automation of Electronic Systems, 16, 3, Article 28, 17 pages, June 2011.


[TCAD’11] *Xaioji Ye, *Wei Dong, Peng Li and Sani Nassif, “Hierarchical multialgorithm parallel circuit simulation,” in IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, issue 1, pp. 45-58,  January 2011.


[ICCAD’10] *Xiaoji Ye and Peng Li, “On-the-fly runtime adaptation for efficient execution of parallel multi-algorithm circuit simulation,” in Proc. of IEEE/ACM Int. Conf. on Computer-Aided Design, pp. 298-304, November 2010.


[DAC’10] *Xiaoji Ye and Peng Li, “Parallel program performance modeling for runtime optimization of multi-algorithm circuit simulation,” in Proc. of ACM/IEEE Design Automation Conf., pp. 561-566, June 2010.


[DAC10] *Xiaoji Ye and Peng Li, “Parallel program performance modeling for runtime optimization of multi-algorithm circuit simulation,” in Proc. of ACM/IEEE Design Automation Conf., June 2010 (acceptance rate 24.4%).


[TAU10] *Xiaoji Ye and Peng Li, “Performance modeling of a hierarchical multi-algorithm parallel circuit simulator,” in Proc. of ACM Int. Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, March 2010 (acceptance rate 50%).


[ICCAD09] *Wei Dong and Peng Li, “Final-value ODEs: stable numerical integration and its application to parallel circuit analysis,” in Proc. of IEEE/ACM Int. Conf. on Computer-Aided Design, pp. 403-409, November 2009 (acceptance rate 26.3%).


[ICCAD09] *Xiaoji Ye, *Srinath S. Narasimhan and Peng Li, “Leveraging efficient parallel pattern search for clock mesh optimization,” in Proc. of IEEE/ACM Int. Conf. on Computer-Aided Design, pp. 529-534, November 2009 (acceptance rate 26.3%).


[TCAD09] *Zhiyu Zeng and Peng Li, “Locality-driven parallel power grid optimization,” in IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 8, pp. 1190-1200, August 2009.


[DAC09]*Wei Dong and Peng Li, “Parallelizable stable explicit numerical integration for efficient circuit simulation,” in Proc. of IEEE/ACM Design Automation Conf., pp. 382-385, July 2009 (acceptance rate 21.7%).


[TCAD09] *Wei Dong and Peng Li, “A parallel harmonic balance approach to steady-state and envelope-following simulation of driven and autonomous circuits,” in IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 4, pp. 490-501, April 2009.


[ISQED09]*Zhiyu Zeng, Peng Li and *Zhuo Feng, “Parallel partitioning based on-chip power distribution network analysis using locality acceleration,” in Proc. of IEEE Int. Symp. on Quality Electronic Design, pp. 776-781, March 2009 (acceptance rate 29.0%).


[ICCAD08] *Xiaoji Ye, *Wei Dong, Peng Li and Sani R. Nassif, “MAPS: multi-algorithm parallel circuit simulation,” in Proc. of IEEE/ACM Int. Conf. on Computer-Aided Design, pp. 73-78, November 2008 (acceptance rate 26.6%).


[DAC08] *Wei Dong, Peng Li and *Xiaoji Ye, “WavePipe: parallel transient simulation of analog and digital circuits on multi-core shared-memory machines,” in Proc. of IEEE/ACM Design Automation Conf., pp. 238-243, June 2008 (acceptance rate 23.0%), ( Best paper award, two out of 639 submissions, 0.3%).



Research Highlights 

It is envisioned that the number of processing cores in multi- or many-core processors will continue to scale in the future. This will enable powerful yet affordable parallel computing systems for a range of applications. While it is relatively straightforward to employ processors with a large number of cores to process multiple independent tasks concurrently (e.g. multiple query requests to data centers), a much greater challenge exists when a single computationally intensive task is parallelized.

Multi-Algorithm Parallel circuit Simulation (MAPS)

Under the context of transistor-level circuit simulation, a key design and verification step in the modern VLSI design flow, novel parallel algorithms are being explored. To fully utilize the parallel computing power of present-day and future multi-core processors, it is essential to create and explore a rich set of independent parallelisms. Our recent work has demonstrated an unconventional parallel simulation framework: MAPS (Multi-Algorithm Parallel circuit Simulation, ICCAD’08), as illustrated in the following figure. Unlike conventional approaches where a single (parallel) algorithm is applied to solve a task, in MAPS, multiple algorithms with varying characteristics are launched to process the same simulation task. Parallel speedups are obtained by having these algorithms interact with each other in a cooperative manner on the fly. Our current implementation creates a set of algorithm candidates by paring various numerical integration methods with nonlinear iterative methods, creating a spectrum of algorithms consisting of slow-and-robust vs. fast-but-non-robust methods. A standard simulation algorithm, which is robust but potentially slow, is always chosen in the algorithm pool to guarantee the success of the simulation.

 MAPS has several novel characteristics. It embodies the notion of Inter-Algorithm Parallelism (v.s. more conventional intra-algorithm parallelism) in the sense that multiple algorithms are concurrently executed for a single task. The synchronization between the set of simultaneously running algorithms tends to very coarse grained, leading to ease in parallel implementation, debug, and (serial) code reuse. Inter-algorithm parallelisms are orthogonal to intra-algorithm parallelisms, opening up new avenues to exploiting massively parallel computing platforms. The multi-algorithm framework of MAPS allows the use of opportunistically fast but non-robust simulation methods, which cannot be employed in a single-algorithm simulation context. An algorithm doesn’t have to converge during the entire course of simulation – if it runs into trouble, its friends will help. On the other hand,   if the algorithm runs really fast for some parts of simulation – it will speed up the entire thing quite a bit! As a result, favorable, sometimes, even “superlinear” speedups are observed.  


Locality-Driven Parallel Power Grid Optimization

Large very large-scale-integration power/ground distribution networks are challenging to analyze and design due to the sheer network complexity. In this paper, a parallel sizing optimization approach is presented to minimize the wiring area of a power grid while meeting IR drop and electromigration constraints. Motivated by a proposed two-level hierarchical optimization, we develop a novel locality-driven partitioning scheme to allow for divide-and-conquer-based scalable optimization of large power grids, which is infeasible via flat optimization. Unlike existing partitioning-based strategies, the proposed method is very flexible in terms of choice of partitioning boundaries and sizes. Equally importantly, it allows for simultaneous sizing of multiple partitions, leading itself naturally to parallelization [Zeng and Li, TCAD’09]. 

We focus on C4 flip-chip type on-chips. We first reformulate the original (flat) constrained power grid optimization problem into a two-level optimization problem. This new two-level hierarchical formulation builds upon the essential idea of partitioning based optimization and has an appealing form that seemingly enables divide-and-conquer-based scalable optimization. Motivated by this hierarchical view, we develop practical solutions to address the fundamental limitations of the hierarchical approach, in terms of convergence and efficiency, leading to a locality-driven two-step optimization. One key feature of the proposed approach is that it is fully parallelizable since our algorithm construction permits simultaneous sizing of an arbitrary selection of partitions, including those that are adjacent to each other. This offers the important ability of utilizing increasing parallel computing hardware to address the challenges in power grid design. The performance of the proposed approach is largely independent of the choice of partitioning boundaries, hence not constrained by design hierarchy. As a result, the partitioning boundaries, size, and number of partitions can be flexibly chosen to tradeoff between runtime and memory requirement as well as to facilitate load balancing in parallelization.  The overall optimization flow is illustrated in the following figure, where the first level optimization problem is formulated in terms of unknown voltage and currents along the partitioning boundaries and the second level optimization consists of a set of individual sizing problems for all partitions for given boundary voltages and currents.



Parallel Pattern Search based Sizing Optimization for Large Clock Meshes

Mesh-based clock distribution network has been employed in many high-performance microprocessor designs due to its favorable properties such as low clock skew and robustness. Such clock distributions are usually highly complex. While the simulation of clock meshes is already time consuming, tuning such networks under tight performance constraints is a more daunting task. We address the challenging task of driver sizing optimization with a goal of skew minimization. The expensive objective function evaluations and difficulty in getting explicit sensitivity information make this problem intractable to standard optimization methods. We explore the recently developed asynchronous parallel pattern search (APPS) method for efficient driver size tuning. While being a search-based method, APPS not only provides the desirable derivative-free optimization capability, but also is amenable to parallelization and possess appealing theoretically rigorous convergence properties. We employ such method can lead to powerful parallel sizing optimization of large clock meshes with significant runtime and quality advantages over the traditional sequential quadratic programming (SQP) method. We also develop design-specific properties and speeding-up techniques can be exploited to make the optimization even more efficient while maintaining the convergence of APPS in a practical sense [Ye, Narasimhan and Li, ICCAD’09].

Acknowledgement and Disclaimer 


This material is based upon work supported by the National Science Foundation under Grant No. 0747423.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Copyright by Peng Li.