The project Meeting the Design Challenges of nano-CMOS Electronics (http://www.nanocmos.ac.uk) was funded by the Engineering and Physical Sciences Research Council to tackle the challenges facing the electronics industry caused by the decreasing scale of transistor devices, and the inherent variability that this exposes in devices and in the circuits and systems in which they are used. The project has developed a grid-based solution that supports the electronics design process, incorporating usage of large-scale high-performance computing (HPC) resources, data and metadata management and support for fine-grained security to protect commercially sensitive datasets. In this paper, we illustrate how the nano-CMOS (complementary metal oxide semiconductor) grid has been applied to optimize transistor dimensions within a standard cell library. The goal is to extract high-speed and low-power circuits which are more tolerant of the random fluctuations that will be prevalent in future technology nodes. Using statistically enhanced circuit simulation models based on three-dimensional atomistic device simulations, a genetic algorithm is presented that optimizes the device widths within a circuit using a multi-objective fitness function exploiting the nano-CMOS grid. The results show that the impact of threshold voltage variation can be reduced by optimizing transistor widths, and indicate that a similar method could be extended to the optimization of larger circuits.
Fundamental to the continued growth of the semiconductor industry is Moore’s Law, which states that the number of transistors integrated on a chip will double every 2 years, owing to the shrinking of devices through advances in technology. Recently, the scale of devices has approached the level where the precise placement of individual dopant atoms will affect the output characteristics of a device. As these intrinsic variations become more pronounced, higher failure rates and lower yields will be observed from conventional designs. Coping with intrinsic variability has been recognized as one of the major unsolved challenges faced by the semiconductor industry (Asenov 1999; Bernstein, K. et al. 2006).
One approach to reduce the effect of variability is to change the materials used in the construction, as implemented by Intel with the use of a metal gate with a high-κ gate oxide in their 45 nm complementary metal oxide semiconductor (CMOS) fabrication. An alternative approach is to use circuit topologies optimized to be variability tolerant within standard cell libraries (SCLs). In the past, evolutionary algorithms (EAs) have been used to optimize existing CMOS circuit designs for a number of criteria, such as delay (Salomon & Sill 2007), area (Noren & Ross 2001), power and yield (Takahashi et al. 2005). EAs have also been used to produce new unconventional designs at both the gate level (Miller et al. 2000) and device level (Streeter et al. 2003), as well as fault-tolerant designs (Djupdal & Haddow 2007). IBM have also been investigating the use of genetic algorithms (GAs) to optimize their cell-library designs (Bernstein, K. et al. 2006).
The UK e-Science pilot project Meeting the Design Challenges of nano-CMOS Electronics (nano-CMOS) has been funded by the Engineering and Physical Sciences Research Council to address the challenges facing the semiconductor industry caused by the decreasing atomistic dimensions of CMOS transistors. Addressing these challenges requires the integration of transistor variability, modelled by computationally expensive atomic-scale transistor simulations of ensembles of microscopically different devices, with families of circuit and system-level simulations. The management and exploitation, in a secure manner, of the large volume of heterogeneous data, and the associated metadata, created by the design process is of vital importance. This problem has been addressed by creation of the nano-CMOS grid.
In this paper, we present the nano-CMOS grid, and demonstrate how it supports a grid-based methodology for optimizing logic cells from an SCL for high-speed, low-power and variability-tolerant designs, using a multi-objective GA exploiting a range of distributed HPC resources to perform statistical circuit simulations.
The structure of this paper is as follows: §2 discusses the causes and impact of device variability, and outlines the methods used to extract accurate data models which incorporate random variations. Section 3 describes the nano-CMOS grid infrastructure. Section 4 describes the proposed approach for optimizing device sizes within SCL logic topologies. Section 5 provides details of the experimental results. The conclusions and proposals for future work are summarized in §6.
2. CMOS variability
CMOS devices form the backbone of almost all modern digital circuits. Integrated circuits are assembled from complementary pairs of p-type MOSFET (PMOS) and n-type MOSFET (NMOS) transistors optimized for high speed and low power consumption. For many years, the cyclical process of reducing transistor channel length has resulted in devices both faster and lower in power consumption than the previous generation, with modern microprocessors boasting in excess of 1 billion transistors and gate lengths of under 50 nm (Streetman & Banerjee 2000). The International Technology Roadmap for Semiconductors (ITRS) published by the Semiconductor Industry Association projects an annual reduction of 11 per cent in gate length, resulting in reduced operating voltages and a decrease in the gate delay of 10 per cent per year (Wyon 2002). This projected improvement is under threat from the problem of decreased yield caused by heightened variability as devices shrink.
(a) Causes of device variability
The precision of individual device and interconnect parameters has traditionally been dependent on constraints within the manufacturing process, and has been considered deterministic in nature. As channel lengths shrink below 50 nm, unavoidable stochastic variability owing to the actual location of individual dopant atoms within the device channel is becoming increasingly significant. This is illustrated to scale in figure 1a,b, which show that, as devices get smaller (22–4.2 nm), the ratio of device size to constituent-atom size becomes less favourable; therefore, the variable constitution at the atomic scale has an increased effect on device behaviour. Many advances have been made to reduce the loss of precision caused by the manufacturing process; however, the fundamental quantum-mechanical limitations cannot be overcome, and their impact will increase as the technology shrinks further (Asenov 1999).
Device variability occurs in both the spatial and temporal domains, and each includes both deterministic and stochastic fluctuations. Spatial variability occurs when the produced device shape differs from the intended design, including uneven doping profiles, non-uniformity in layer thickness and poly-crystalline surfaces. This variability is found at all levels: over the lifetime of a fabrication system, across a wafer of chips, between cells within a very large scale integration (VLSI) chip and between individual devices within that cell. Temporal variability includes the effects of electromigration, gate-oxide breakdown and the distribution of negative-bias temperature instability. Such temporal variability has been estimated, and can be combined to give an expected lifetime calculation for an individual device, or simulated to determine the compound effect across a whole chip (Rubio et al. 2004; Bernstein, J. B. et al. 2006). While deterministic variability can be accurately estimated using specific design techniques, intrinsic parameter fluctuations can only be modelled statistically and cannot be reduced with improvements in the manufacturing process (Asenov 2007; Mizuno & De 2007).
(b) Intrinsic parameter fluctuations
Intrinsic variability is caused by the atomic-level differences in devices that could be considered identical in layout, construction and environment. Summarized below are the principal sources of intrinsic variability, as illustrated in figure 1c.
— Random dopant fluctuations are unavoidable variations caused by the precise number and position of dopant atoms within the silicon lattice, which exist even with a tightly controlled implant and annealing process. This uncertainty results in substantial variability in the device threshold voltage, subthreshold slope and drive current, with the most significant variations caused by atoms near the surface and channel of the device (Asenov 1999).
— Line edge roughness (LER) is the deviation in the horizontal plane of a fabricated feature boundary from its ideal form. LER has both a deterministic nature, caused by imperfections in the mask-manufacturing, photo-resist and etching processes, and also a stochastic nature owing to the discrete nature of molecules used within the photo-resist layer, resulting in a random roughness on the edges of blocks etched onto the wafer (Asenov 2007).
— Surface roughness is the vertical deviation of the actual surface compared with the ideal form. The shrinking of surface layers, in particular the oxide layer, results in variations in the parasitic capacitances between terminals which can add to VT variations (Moroz 2007).
— Poly-silicon grain boundary variability is the variation due to the random arrangement of grains within the gate material owing to their polycrystalline structure. Implanted ions can penetrate through the poly-silicon and insulator into the device channel, resulting in localized stochastic variations (Eccleston 1999).
(c) Modelling intrinsic variability
To model accurately the effects of intrinsic parameter fluctuations, it is necessary to use statistical three-dimensional simulation methods with a fine-grained discretization. The device modelling group (DMG) within the University of Glasgow (Asenov 1999, 2007) has become one of the leading research centres for three-dimensional device modelling using their atomistic simulator, which adapts conventional three-dimensional device modelling tools to incorporate the intrinsic effects described above. To characterize a particular transistor, a large number of current–voltage (I–V) curves are extracted and then used to calibrate a subset of parameters to create a model library representing the device.
For the experiments described in this paper, a library of 200 different NMOS and PMOS models, based on a 35×35 nm Toshiba device, has been used. To use these models within an open source implementation of the Berkeley SPICE (Simulation Program with Integrated Circuit Emphasis; see http://bwrc.eecs.berkeley.edu/Classes/icbook/SPICE/) circuit simulator, known as NGSPICE (http://ngspice.sourceforge.net/), the DMG has developed a tool, RandomSpice, which replaces the transistors within a template netlist with models selected randomly from the library. To allow transistors with different widths to be simulated, subcircuits of random transistors connected in parallel are assembled. To estimate the impact of variability, RandomSpice creates a set of output netlists which are then processed by NGSPICE. RandomSpice can also create a single netlist in which only uniform 35 nm transistor models are used, without the parameter fluctuations, allowing the variable output to be compared with a uniform ideal output.
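The substitution scheme just described can be pictured with a minimal, hypothetical Python sketch. The `%NMOS%`/`%PMOS%` markers, the `instantiate` and `parallel_subcircuit` helpers and the netlist syntax details below are our own illustrative assumptions, not the real RandomSpice tool:

```python
import random

# Hypothetical sketch of the RandomSpice idea: each transistor marker in a
# template netlist is replaced by a model drawn at random from a library of
# statistically enhanced device models, and a transistor of width w units is
# assembled as w minimum-width (35 nm) devices connected in parallel.

def instantiate(template_lines, model_library, rng=random.Random(0)):
    """Return netlist lines with %NMOS%/%PMOS% markers replaced by models
    drawn at random from the library (marker names are assumptions)."""
    out = []
    for line in template_lines:
        if "%NMOS%" in line:
            out.append(line.replace("%NMOS%", rng.choice(model_library["nmos"])))
        elif "%PMOS%" in line:
            out.append(line.replace("%PMOS%", rng.choice(model_library["pmos"])))
        else:
            out.append(line)
    return out

def parallel_subcircuit(name, drain, gate, source, bulk, width_units,
                        model_library, kind, rng=random.Random(0)):
    """Model a transistor of width `width_units` as that many unit-width
    devices in parallel, each with its own randomly drawn model."""
    lines = [f".subckt {name} {drain} {gate} {source} {bulk}"]
    for i in range(width_units):
        model = rng.choice(model_library[kind])
        lines.append(f"M{i} {drain} {gate} {source} {bulk} {model} W=35n L=35n")
    lines.append(".ends")
    return lines
```

Generating a batch of such netlists, each with an independent random draw, and running them through NGSPICE yields the statistical ensemble used to estimate variability.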
3. The nano-CMOS grid
The infrastructure developed to support job creation and submission within the nano-CMOS project comprises a number of Web services, each of which provides a particular category of functionality, e.g. creation of a particular type of job or submission to a particular type of resource. These take the form of Apache Axis2 Web services (http://ws.apache.org/axis2), with Apache Rampart (http://ws.apache.org/rampart) providing message-level security—encryption, message signing and time-stamping—on all client–service communication. Various command-line client applications, distributed in one self-contained bundle, have been written to provide an easy way to interact with these services.
Running a job is a two-step process: an application service is invoked which has responsibility for creating the job; following this, the job is submitted to a particular HPC resource, either by way of a submission service or by direct Globus submission. This process is closely linked to the data management system, which itself exploits the Andrew File System (AFS) (Howard 1988) as described below.
The infrastructure currently includes three application services, supporting the atomistic device simulator and RandomSpice, as well as arbitrary simulations defined by the user. Although there are inevitable internal differences, because each service supports a different application, the external interface is common to all application services, and so the same sequence of commands can be used to create and run jobs regardless of type.
In order to create a job, a user passes an input file to the application service; this file is a modification of the input file passed to the underlying application, and contains additional information required to prepare the job for submission to an HPC resource, such as the desired number of sub-jobs (i.e. the number of constituent units of work, each of which could, depending on the job manager on the target resource, be executed in parallel), the number of simulations per sub-job, and the version of the application to run. Additionally, this file contains a reference to a template directory on AFS, which is created by the user, and contains any files that will be required by the simulation.
On receipt of a request to create a job, the application service duplicates the template directory once per sub-job, writing a customized input file to each; this input file is based on that provided by the user, but modified in order that each sub-job will perform its designated portion of the complete simulation. Finally, a new record is created in the data service, containing a description of the job written in the job submission description language (JSDL; Anjomshoaa et al. 2005), with information about sub-jobs encoded using the JSDL parameter sweep extension (Drescher et al. 2008). The uniform resource identifier (URI) of this record is returned to the user, and this becomes the identifier by which the job will forevermore be referenced.
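The fan-out of a job into per-sub-job template copies can be sketched briefly; the directory layout, the `input.conf` file name and the field names below are hypothetical illustrations, not the project's actual formats:

```python
import os
import shutil
import tempfile

# Hedged sketch of the sub-job fan-out described above: the application
# service copies the user's template directory once per sub-job, writing
# each copy an input file that covers its slice of the total simulation
# count, so the sub-jobs can later be executed in parallel.

def create_subjobs(template_dir, work_dir, n_subjobs, sims_per_subjob):
    """Duplicate template_dir n_subjobs times, each with a customized
    input file (file and field names here are assumptions)."""
    paths = []
    for i in range(n_subjobs):
        sub = os.path.join(work_dir, f"subjob_{i:04d}")
        shutil.copytree(template_dir, sub)
        with open(os.path.join(sub, "input.conf"), "w") as f:
            f.write(f"first_sim = {i * sims_per_subjob}\n")
            f.write(f"num_sims = {sims_per_subjob}\n")
        paths.append(sub)
    return paths
```

In the real system the template directory lives on AFS and the sub-job descriptions are recorded via the JSDL parameter sweep extension; the sketch only shows the duplication step.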
Following the successful creation of a job, a user can invoke the submission client application, passing it the job’s URI. The client application presents a list of supported execution resources to the user, from which the resource on which the job will be run is selected. Once a job has been submitted, additional information necessary to interact with the job in the future—including the resource to which it was submitted, and any resource-specific identifiers that may be required—is written to the job record. At present, submission to clusters running Sun Grid Engine is supported by means of a bespoke submission service, while the Java Commodity Grid Kit (http://dev.globus.org/wiki/CoGjglobus) is used to support submission to Globus resources, such as ScotGrid (http://www.scotgrid.ac.uk) and the UK’s National Grid Service (http://www.ngs.ac.uk). Work is on-going to add gLite- and SAGA-based resources such as TeraGrid (http://www.teragrid.org) to those which are available. Regardless of the chosen resource, all further interaction through the client (e.g. issuing status queries or cancelling jobs) is performed in a common manner.
Much time has been spent working with the end-users of the client applications in order to ensure that the software developed meets their needs. All the clients are provided in a single archive, which includes an installer to ease deployment. In order to further simplify matters, any requisite security steps, e.g. generation of appropriate globus or virtual organization management service proxy credentials, are completed by the client. It is not necessary to install Globus or any other similar package prior to using the clients. It is worth reiterating that these clients have a command-line interface; the original prototype included a Web portal as outlined in Sinnott et al. (2007), but feedback from the electronics researchers suggested that command-line solutions were preferred.
The project’s data management requirements are met by means of a file repository and a service for managing the associated metadata. The file repository uses the open-source implementation of AFS, known as OpenAFS (http://www.openafs.org), to provide a networked, distributed file system which maintains a single name-space across all clients. A capable role-based access control system, coupled with the optional encryption of AFS network traffic, helps to satisfy our security requirements. The principal advantage of AFS is that all machines used by the project can be presented with a unified view of a ‘nano-CMOS file store’ containing all necessary data and applications; this can be accessed almost as if it were a local file system, and thereby virtually eliminates the need to copy files back and forth. Furthermore, AFS was found to offer better performance for data management than other current solutions (Sinnott et al. 2008).
The data service provides the means to model both job and file records, which can be accessed programmatically (i.e. by an application), and also by way of a Web interface (i.e. by a user). Both types of record have a set of core metadata, specific to the type (e.g. file location on AFS for file records) and arbitrary application-specific metadata annotations, the format of which is controlled by the application which produces them. Because the annotations produced by each research application are specific to its application domain, this approach lets the scientists choose the metadata content and format they need, as opposed to a more rigid system whereby the e-science engineer would fix the structure when designing the system, leaving little space for the science application (often an evolving research product) to change in the future.
4. Optimizing logic cells
A system named MOTIVATED (multi-objective toolkit for intrinsic variability aware transistor-level evolutionary design) has been implemented which allows for the inclusion of RandomSpice within the evolutionary cycle of a multi-objective algorithm based on NSGA-II (Non-dominated Sorting GA II; Deb et al. 2002). MOTIVATED has previously been used to evolve CMOS topologies using a modified form of Cartesian genetic programming (CGP) and optimize the transistor widths of existing CMOS topologies using a GA (Walker et al. 2008, 2009; Hilder et al. 2009). In both cases, the main aim was to minimize the effect of intrinsic variability upon the designs produced. However, owing to the intensive time requirements needed to evaluate variability tolerance by performing statistical NGSPICE simulations, a two-stage process was adopted. First, the circuits were optimized by the GA for high speed and low power using the uniform transistor models. The final population was then passed to the second stage and optimized for variability tolerance using the variability-enhanced models. However, the two-stage approach is not ideal, as the initial population for the second stage is strongly biased towards high-speed and low-power designs, which may have to decrease in fitness in order to minimize variability. Therefore, it is possible that variability-tolerant designs may not be discovered. Also, the second stage of optimization is much shorter than the first stage (between 5% and 25%) owing to the computational overhead, which may also hinder the search for variability-tolerant designs. If the variability-enhanced models were used during the entire optimization procedure and an adequate number of variability evaluations were used per design, it would take approximately 35 CPU years on a single desktop machine.
In this paper, MOTIVATED has been used to optimize transistor widths of existing CMOS topologies for performance (i.e. high speed, low power) and variability tolerance in a single stage, which uses the variability-enhanced models throughout the evolutionary cycle. This is achieved by exploiting the parallelism of the design evaluation stage (i.e. the statistical NGSPICE simulations and objective calculations) through the use of HPC resources via the nano-CMOS grid. An example of the evolutionary cycle is shown in figure 2. In our previous work, the statistical NGSPICE simulations and objective calculations of potential solutions accounted for approximately 99 per cent of the overall CPU time. By parallelizing this part of the algorithm, it is possible to increase the overall performance and also perform more detailed and accurate statistical NGSPICE simulations. The goal is to extract designs for each test circuit that exhibit improved variability tolerance, alongside both higher speed and lower power, than the reference designs. This goal would not be possible without the use of the nano-CMOS grid.
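Because roughly 99 per cent of the CPU time sits in design evaluation, parallelizing that stage yields a near-linear speedup. A minimal sketch follows, with a placeholder `evaluate_design` standing in for the RandomSpice/NGSPICE batch and objective calculation; the real system distributes these evaluations as grid sub-jobs rather than local threads:

```python
from concurrent.futures import ThreadPoolExecutor

# By Amdahl's law, with fraction p = 0.99 of the runtime parallelizable,
# N workers give a speedup of 1 / ((1 - p) + p / N) — close to N for the
# modest worker counts used per generation.

def evaluate_design(genotype):
    """Placeholder: the real evaluator runs a batch of RandomSpice/NGSPICE
    simulations and returns the objective scores for this candidate."""
    return (sum(genotype), max(genotype) - min(genotype))

def evaluate_population(population, workers=4):
    """Evaluate all candidates of a generation concurrently. Threads
    suffice when each evaluation shells out to an external simulator."""
    with ThreadPoolExecutor(workers) as ex:
        return list(ex.map(evaluate_design, population))
```

The objective tuples returned per candidate then feed the NSGA-II non-dominated sorting step unchanged; only the evaluation loop is parallelized.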
(a) Representation and decoding process
The GA takes as its input a template file containing the transistor definitions from the netlist of the cell to optimize. The template file contains special marker strings, which denote the position of the transistor widths to be optimized. The algorithm works by replacing these strings with numerical values determined by the values in the genotype. Each marker string has a number of parameters; the first is the index, which indicates which value within the genotype will be used. This allows one gene to affect the width of multiple transistors. The remaining parameters are used to define the lower (rlow,i) and upper (rhigh,i) limits for the transistor width value, and a multiplication factor (mi), which is used to convert the transistor width from a unit width to the actual width. In this paper, a multiplication factor of 35 is used, as the transistors are assembled from subcircuits of 35×35 nm square transistors.
The genotype itself is constructed from a fixed-length list of integers, each of which is within the range [0, 2^32 − 1]. The length of the genotype is determined by the number of unique indexes in the template file. Each integer value, gi, in the genotype is decoded using the parameters from the template file and equation (4.1), and the width values (wi) are inserted into the transistor definitions from the template file:

wi = mi (rlow,i + (gi mod (rhigh,i − rlow,i + 1))). (4.1)
Once the genotype has been decoded, input, supply and load stages are added to the transistor definitions to form the complete netlist, as illustrated in figure 3. This arrangement allows the voltage and current at the inputs, supply and load to be measured, and allows realistic circuit loads to be connected to produce feasible results. The input signals for testing two-input logic circuits are created using a pair of synchronous pulse sources: one input is held logic low for two clock cycles then high for two clock cycles, while the other is held low for three clock cycles then high for two clock cycles. An NGSPICE transient analysis is used to observe the voltages and currents over a period of 21 clock cycles. The chosen signals ensure that all possible input state changes are evaluated through the course of the analysis.
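This stimulus can be generated with standard SPICE PULSE sources. The 2-low/2-high and 3-low/2-high patterns have periods of 4 and 5 clock cycles, so a transient of lcm(4, 5) + 1 = 21 cycles exercises every input state change. In the sketch below, the supply voltage, clock period and rise/fall times are illustrative assumptions:

```python
# Builds NGSPICE voltage sources using the standard syntax
# PULSE(V1 V2 TD TR TF PW PER). The clock period t_clk is in ns.

def pulse_source(name, node, vdd, t_clk, low_cycles, high_cycles,
                 t_rise="10p", t_fall="10p"):
    """Source held low for low_cycles, then high for high_cycles, repeating."""
    td = low_cycles * t_clk          # delay before the first rising edge
    pw = high_cycles * t_clk         # time spent high per period
    per = (low_cycles + high_cycles) * t_clk
    return (f"V{name} {node} 0 PULSE(0 {vdd} {td:g}n {t_rise} {t_fall} "
            f"{pw:g}n {per:g}n)")

# Assumed 1.0 V supply and 100 ns clock, purely for illustration.
stimuli = [pulse_source("A", "inA", "1.0", 100, 2, 2),
           pulse_source("B", "inB", "1.0", 100, 3, 2)]
```

A matching `.tran` line covering 21 clock cycles (2100 ns with these assumed values) would complete the test bench.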
(b) Fitness objectives
In order to incorporate variability into the fitness objective scores, a batch of NGSPICE simulations is performed for a cell using a randomized set of models from RandomSpice. The fitness objectives used for the optimization process are calculated using the data from the entire batch and are shown in table 1. The scores are all adjusted such that a lower score is considered preferable; any circuit that fails to pass the functionality test is penalized with the worst score.
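Batch scoring of this kind can be sketched as follows. The exact objective set is given in table 1; the four objectives below (worst-case delay, mean power, delay range and σ delay, all lower-is-better, with an infinite penalty on functional failure) are assumptions based on the objectives named elsewhere in this paper:

```python
import statistics

# Hedged sketch: scores are computed over the whole batch of randomized
# simulations, arranged so that lower is always better, and a batch
# containing any functional failure receives the worst possible score.

PENALTY = float("inf")

def score_batch(results):
    """results: one dict per RandomSpice/NGSPICE simulation in the batch,
    with 'delay', 'power' and 'functional' keys (names are assumptions)."""
    if not all(r["functional"] for r in results):
        return {"worst_delay": PENALTY, "mean_power": PENALTY,
                "delay_range": PENALTY, "delay_sigma": PENALTY}
    delays = [r["delay"] for r in results]
    powers = [r["power"] for r in results]
    return {"worst_delay": max(delays),
            "mean_power": statistics.mean(powers),
            "delay_range": max(delays) - min(delays),
            "delay_sigma": statistics.stdev(delays)}
```

Keeping every objective lower-is-better lets the non-dominated sorting in NSGA-II compare candidates without per-objective special cases.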
5. Experimental results
Four different logic cells (Buffer, NAND, OR and XOR), ranging in size from 4 to 10 transistors, were chosen from a commercial SCL to test the system. The cell layouts were translated from the commercial library to a 35 nm process in order to use the 35 nm variability-enhanced models. Each of the cells is converted to an NGSPICE netlist, in which each NMOS and PMOS transistor is assigned a width in the range [1,8] and [1,12] units, respectively, where each unit is a 35 nm wide transistor. These ranges were chosen to encompass the scaled sizings of the commercial cells, and to allow a degree of flexibility with respect to the allocated space.
MOTIVATED was run on each of the four test problems using the parameters shown in table 2. From the final set of promoted parents for each cell, the circuit which demonstrated a potential improvement over the reference design was chosen. This decision was made based on the scores for all of the objectives, but priority was given to the two delay variation objectives: range of delay and σ delay. Therefore, the selected circuit appears to be the most variability-tolerant design from the population. The extracted circuit and the reference design then undergo a further evaluation stage, where 100 000 RandomSpice simulations are performed for each design and the set of objective scores are once again calculated. This ensures that the statistical data have a greater degree of accuracy regarding the impact of intrinsic variability than the data from the MOTIVATED run. The results of the further evaluation stage for the four cells tested are shown in figure 4, which shows a comparison of the worst-case delay and power score for all 100 000 simulations of each design.
From figure 4, it can be seen that the MOTIVATED algorithm has managed to optimize the transistor widths for the NAND, Buffer, OR and XOR cells, resulting in improvements in both the worst-case delay and the power score. Only a marginal reduction in worst-case delay was achieved for the NAND cell, while more noticeable improvements were observed for the remaining cells. This could be attributed to the length of the transistor chain within the cell: the NAND cell has a chain of length 1 (i.e. the drain of a transistor is not connected to the gate of another transistor), whereas the other cells have a chain length of 2 or higher. This suggests that larger cells may benefit the most from optimization.
From figure 4, it can also be seen from the shapes of the scatter clouds and the distributions of the delay histograms that all four of the optimized cells show a greater degree of variability tolerance than the reference designs. In all four optimized cells, the spread of the delay distribution is much tighter and the tail of the distribution has been greatly reduced. This is important as it is the outliers in the distribution that will affect the overall timing yield of a circuit designed using these cells, and will undoubtedly determine whether the circuit will operate successfully.
The simulations performed as part of the research discussed herein required approximately 5 CPU years of processing time, making such an undertaking impractical without recourse to HPC resources; as suggested in the previous section, the time requirements would be even more onerous if a greater number of variability runs were used during the optimization process. The nano-CMOS grid infrastructure has been developed to provide simplified access to such resources, allowing researchers to take advantage of these without incurring additional costs. To date, the nano-CMOS grid infrastructure has delivered over 2.6 million CPU hours to the nano-CMOS virtual organization on ScotGrid alone, and work is currently ongoing to incorporate international resources such as TeraGrid. We note that, in addition to providing improved access to resources, the nano-CMOS grid provides benefits with regards to the management of data and metadata, without which the undertaking of such large simulation ensembles would be most cumbersome.
This paper has introduced the nano-CMOS grid and demonstrated a grid-based methodology to optimize the transistor widths within standard cell designs using statistical transistor models of future devices which contain intrinsic variability. The results show that, by optimizing the four circuits tested, it is possible to demonstrate performance (in terms of delay and power) beyond that which is achieved when using standard design methodologies. Also, the results show that the impact of intrinsic variability on circuit delay and power consumption can be minimized by the optimization of transistor widths. This in turn may potentially improve the overall yield of larger designs in future process technologies that were designed using the variability-tolerant standard cells.
None of the results presented in this paper would have been possible without the use of the nano-CMOS grid, which has provided a number of benefits that were not previously feasible. Firstly, distributing and parallelizing the cell design evaluation stage across a compute cluster allows the variability-enhanced I–V models to be used throughout the entire optimization stage. This makes it possible to discover optimized variability-tolerant standard cell designs, as the GA used in MOTIVATED then carries information about the intrinsic variability of each cell design. Secondly, the number of NGSPICE simulations performed for each cell design using the variability-enhanced I–V models can now be greatly increased, which significantly improves the accuracy and reliability of the results. Finally, the nano-CMOS grid provides easy and secure access to the data generated from the circuit simulations, along with the ability to securely share the optimized standard cells with partner universities in the project, who can use them to perform further simulations at a higher level of the design hierarchy.
The authors would like to thank all the partners in the nano-CMOS project, especially the Device Modelling Group at the University of Glasgow for providing the variability-enhanced models and the RandomSpice application. Nano-CMOS is an EPSRC-funded project (ref: EP/E001610/1).
One contribution of 16 to a Theme Issue ‘e-Science: past, present and future I’.
- © 2010 The Royal Society