Clocking in Multi-GHz Environment

Vojin G. Oklobdžija

Invited Paper

Abstract: An overview of clocking and the design of clocked storage elements is presented. The systematic design of a flip-flop is explained, as well as "time borrowing" and the absorption of clock uncertainties. We show how different clocked storage elements should be compared against each other. Issues related to power consumption and low-power design are also presented.

Keywords: Clocking, clock skew, clock jitter, clocked storage elements, flip-flop, master-slave, time borrowing.

1 Introduction

Deciding on the clocking strategy in a digital system is one of the most important design decisions. If taken lightly, it always turns out to be very costly afterwards.

The importance of clocking is growing as clock speed rises rapidly, doubling every three years as shown in Fig. 1. At today’s frequencies, the ability to absorb clock skew or to make the Clocked Storage Element (CSE) faster results in a direct performance improvement. Such improvements are very difficult to obtain through architectural or micro-architectural techniques. As the clock frequency reaches 5-10 GHz, traditional clocking techniques will reach their limit. New ideas and new ways of designing digital systems are required.

With the increase in the level of integration, followed by a speed increase, the number of logic levels in the critical path diminishes. In today’s high-speed processors, instructions are executed in one cycle, which is driven by a single-phase clock. In addition, the pipeline depth is increasing to 15 or 20 in order to accommodate the speed increase. Today 10 levels of logic in the critical path are more common, and this number is expected to decrease further, as illustrated in Fig. 2. Thus any overhead associated with the clock system and clocking mechanism directly and adversely affects the machine performance and is critically important.

---

Manuscript received March 11, 2002. An earlier version of this paper was presented at the 23rd International Conference on Microelectronics, MIEL 2002, May 12-15, 2002, Nis, Serbia.

The author is with Integration Corp., Berkeley, California, USA and the Department of Electrical Engineering, University of California, Davis (e-mail: vojin@ece.ucdavis.edu).

Fig. 2. Increase in the clock frequency and decrease in the number of logic levels in the pipeline (courtesy of Shekhar Borkar, Intel Corp.).

1.1 Clock distribution

The two most important timing parameters affecting the clock signal are: Clock Skew and Clock Jitter.

Clock Skew is a spatial variation of the clock signal as distributed through the system. It is caused by the differing RC characteristics of the clock paths to the various points in the system, as well as by different loading of the clock signal at different points on the chip. We can further distinguish global clock skew from local clock skew. Both are equally important in high-performance system design.

Clock Jitter is a temporal variation of the clock signal with regard to the reference transition (reference edge) of the clock signal, as illustrated in Fig. 3. Clock jitter represents edge-to-edge variation of the clock signal in time. Clock jitter can be further classified as long-term jitter and edge-to-edge clock jitter, the latter defining the clock signal variation between two consecutive clock edges. In the course of high-speed logic design we are more concerned with edge-to-edge clock jitter, because it is this phenomenon that affects the time available to the logic.

![Clock parameters](image)

Fig. 3. Clock parameters: period, width, clock skew and clock jitter.
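
As a rough illustration of the two definitions above, the following Python sketch computes skew from the arrival times of one clock edge at several points, and edge-to-edge jitter from the timestamps of consecutive edges at a single point. The function names, the use of max-minus-min as the skew measure, and all numbers are assumptions made for illustration; they do not come from a measured design.

```python
def clock_skew(arrival_times_ps):
    """Spatial variation: spread of arrival times of the same clock edge
    at different points on the chip (max minus min is assumed here)."""
    return max(arrival_times_ps) - min(arrival_times_ps)


def edge_to_edge_jitter(edge_times_ps, nominal_period_ps):
    """Temporal variation: deviation of each interval between two
    consecutive edges at one point from the nominal period."""
    return [
        (later - earlier) - nominal_period_ps
        for earlier, later in zip(edge_times_ps, edge_times_ps[1:])
    ]


# Hypothetical arrival times (ps) of one clock edge at four sinks.
arrivals = [102, 95, 110, 99]
skew = clock_skew(arrivals)

# Hypothetical timestamps (ps) of consecutive rising edges at one sink,
# for a nominal 1000 ps clock period.
edges = [0, 1004, 1998, 3001]
jitter = edge_to_edge_jitter(edges, 1000)

print("skew (ps):", skew)
print("edge-to-edge jitter (ps):", jitter)
```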

Typically the clock signal has to be distributed to several hundreds of thousands of clocked storage elements (also known as flip-flops and latches). Therefore, the clock signal has the largest fan-out of any node in the design, which requires several levels of amplification (buffering). As a consequence, the clock system by itself can use up to 40-50% of the power of the entire VLSI chip [1]. We must also ensure that every clocked storage element receives the clock signal at precisely the same moment in time.

There are several methods of on-chip clock signal distribution that attempt to minimize the clock skew and contain the power dissipated by the clock system [18]. The clock can be distributed in several ways, of which the two typical cases are: (a) an RC matched tree and (b) a grid, shown in Fig. 4.

If we had superior Computer Aided Design (CAD) tools, a perfect and uniform process, and the ability to route wires and balance loads with a high degree of flexibility, a matched RC delay clock distribution (a) would be preferable to a grid (b). However, none of these conditions holds. Therefore a grid is used when clock distribution on the chip has to be very precisely controlled, as is the case in high-performance systems.

An example of a clock distribution grid is shown in Fig. 5. The power consumed by the clock is also the highest when a grid arrangement is used. This is not difficult to understand, given that in a grid arrangement a high-capacitance plate is driven by buffers connected at various points.

1.2 Controlling the clock signal arrival uncertainties

Local variations in device geometry and supply voltage are an important component of the clock skew. Clock distribution more sophisticated than simple RC-matched or grid-based schemes is thus necessary. Active schemes with adaptive digital deskewing typically reduce the clock skew of simple passive clock networks by an order of magnitude, allowing tighter control of the clock period and higher clock rates. The digital deskewing circuit for clock distribution compensates for the static components of skew (load, interconnect, and device mismatches).

It also compensates for the dynamic variations of temperature and voltage gradients between the two spines during all phases of microprocessor active operation.

A conceptual diagram of active digital deskewing [3] is shown in Fig. 6. The deskewing circuit equalizes the insertion delay of the two clock distribution spines by compensating for the delay mismatch between the left and right spines of the microprocessor clock network. The circuit is composed of delay lines in both spines, a phase detection circuit, and a controller.

The clock distribution network in a 7.5 M 0.25 µm technology IA-32 P6 family microprocessor design [3] has ≈60 ps of skew from left to right with the deskewing circuit inactive. With the deskewing circuit active, the skew is reduced to 15 ps.
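
The loop formed by the delay lines, the phase detector and the controller can be sketched behaviorally as follows. This is a minimal model under stated assumptions, not the circuit of [3]: the one-step-per-comparison control policy, the 8 ps delay-line step, the spine delays, the function name `deskew` and the iteration count are all hypothetical. Only the 60 ps starting skew is taken from the text, and dynamic temperature and voltage variations are not modeled.

```python
def deskew(left_delay_ps, right_delay_ps, step_ps=8.0, iterations=20):
    """Behavioral sketch of digital deskewing between two clock spines.

    Each iteration the phase detector reports which spine is later, and
    the controller adds one delay-line step to the earlier spine.
    Returns the residual skew (left minus right) after each iteration.
    """
    left_added = 0.0   # programmable delay inserted in the left spine
    right_added = 0.0  # programmable delay inserted in the right spine
    history = []
    for _ in range(iterations):
        skew = (left_delay_ps + left_added) - (right_delay_ps + right_added)
        if abs(skew) <= step_ps / 2:
            # Within delay-line resolution: in this sketch the controller holds.
            history.append(skew)
            continue
        if skew > 0:
            right_added += step_ps   # left is late: slow down the right spine
        else:
            left_added += step_ps    # right is late: slow down the left spine
        history.append(
            (left_delay_ps + left_added) - (right_delay_ps + right_added)
        )
    return history


# Static mismatch of 60 ps between the spines (left later than right).
residual = deskew(left_delay_ps=560.0, right_delay_ps=500.0)
print("residual skew per iteration (ps):", residual[:8])
print("final |skew| (ps):", abs(residual[-1]))
```

In this model the residual skew settles within half of the delay-line step, so the achievable skew is set by the delay-line resolution rather than by the initial mismatch.
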
2 Clocked Storage Elements

The function of a clocked storage element, flip-flop or latch, is to capture the information at a particular moment in time and preserve it as long as it is needed by the digital system. It is not possible to define a storage element without defining its relationship to the clock.

2.1 Master-slave latch arrangement

In order to avoid the transparency associated with a single latch, an arrangement is made in which two latches are clocked back to back with two non-overlapping phases of the clock. In such an arrangement the first latch serves as a "Master" by receiving the values from the Data input and passing them to the "Slave" latch, which simply follows the "Master". This is known as a Master-Slave (M-S) Latch arrangement, or L1-L2 latch (at IBM), as shown in Fig. 7. This is not to be confused with the "flip-flop", though it seems that many practitioners today erroneously call the arrangement shown in Fig. 7(b) a flip-flop (F-F). We insist on terminology that distinguishes the flip-flop from the M-S Latch, and we explain the fundamental differences between the two in this paper.

In a Master-Slave arrangement the "Slave" latch can have two or more Masters, acting as an internal multiplexer with storage capabilities. The first "Master" is used for capturing the data input, while the second Master can be used for other purposes, such as a scan input for testing, and is clocked with a separate clock. One such arrangement, which utilizes two Masters, is the well-known IBM Level-Sensitive Scan Design [4].

2.2 Flip-flop

The Flip-Flop and the Latch operate on different principles. While the Latch is "level-sensitive", meaning that it reacts to the level (logical value) of the clock signal, the flip-flop is "edge-sensitive", meaning that the mechanism of capturing the data value on its input is related to the changes of the clock. The two are thus designed for different sets of requirements and consist of inherently different circuit topologies. Level sensitivity implies that the latch captures the data value during the entire period in which the clock is active (logic one), during which the latch is transparent. The capturing process in the flip-flop occurs only during the transition of the clock, so the flip-flop is non-transparent. However, even the flip-flop can have a small period of transparency associated with the narrow window during which the clock changes.

A general structure of the flip-flop is shown in Fig. 8. Note the difference between the flip-flop structure (Fig. 8) and that of the M-S Latch arrangement (Fig. 7). A flip-flop consists of two stages: (a) a Pulse Generator (PG) and (b) a Capturing Latch (CL). The pulse generator PG generates a negative pulse on either the $\overline{S}$ or the $\overline{R}$ line, which are normally held at the logic "one" level. This pulse is a function of the Data and Clock signals and should be of sufficient duration to be captured in the capturing latch CL. The duration of that pulse can be as long as half of the clock period or as short as one inverter delay. In contrast, the M-S Latch generally consists of two identical clocked latches, and its non-transparency is achieved by the non-overlapping clocks $\phi_1$ and $\phi_2$, clocking master latch $L_1$ and slave latch $L_2$. The relationship of the $S$ and $R$ signals to the Data ($D$) and Clock ($Clk$) signals can be expressed as

$$S_n = Clk\overline{R}(D + S) \quad \text{and} \quad R_n = Clk\overline{S}(\overline{D} + R)$$ (1)

The two equations in (1) form the basis for the derivation of a flip-flop structure.

![Flip-flop structure](image)

Fig. 8. General flip-flop structure.

Simply stated, the equation for $S_n$ tells us that the next state of this flip-flop will be set to "one" only when the clock becomes "one" (rising edge of the clock), the data at the input is "one", and the flip-flop is in the "steady state" (both $S$ and $R$ are "zero"). From the moment the flip-flop is set ($S=1$, $R=0$), no further change in the data input can affect the flip-flop state: the data input is "locked" to set by $(D+S)=1$, and reset $R_n$ is disabled (by $S=1$).

This assures the "edge sensitivity", i.e. after the transition of the clock and the setting of the $S$ or $R$ signal to its desired state, the flip-flop is "locked" against receiving new data.
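
The locking behavior can be checked by evaluating equations (1) directly. The Python sketch below treats $S$ and $R$ as active-high values, as in (1), and models the capturing latch as a simple set/reset latch. That latch model, the function names and the stimulus sequence are assumptions of this sketch, not part of the original derivation.

```python
def next_state(clk, d, s, r):
    """One evaluation of equations (1), with S and R as active-high values:
    S_n = Clk * not(R) * (D + S),  R_n = Clk * not(S) * (not(D) + R)."""
    s_next = clk and (not r) and (d or s)
    r_next = clk and (not s) and ((not d) or r)
    return bool(s_next), bool(r_next)


def simulate(stimulus):
    """Apply a list of (Clk, D) pairs; return the (S, R, Q) trace.

    The capturing latch is modeled as a simple set/reset latch (an
    assumption of this sketch): Q is set by S, reset by R, else held.
    """
    s, r, q = False, False, False
    trace = []
    for clk, d in stimulus:
        s, r = next_state(clk, d, s, r)
        if s:
            q = True
        elif r:
            q = False
        trace.append((int(s), int(r), int(q)))
    return trace


stimulus = [
    (0, 1),  # clock low: steady state, S = R = 0
    (1, 1),  # rising edge with D = 1: set
    (1, 0),  # D changes while the clock is high: no effect
    (0, 0),  # clock low: back to steady state, Q held
    (1, 0),  # rising edge with D = 0: reset
    (1, 1),  # D changes while the clock is high: no effect
]
for (clk, d), (s, r, q) in zip(stimulus, simulate(stimulus)):
    print(f"Clk={clk} D={d} -> S={s} R={r} Q={q}")
```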

It is interesting that it took engineers several attempts to arrive at the right circuit topology for this flip-flop. The third generation of the Digital Equipment Corp. 600 MHz Alpha processor [1] used a version of the flip-flop introduced by Madden and Bowhill, which was based on the static memory cell design [5]. This particular flip-flop is known as the sense amplifier flip-flop (SAFF). The development of the pulse generator block of this flip-flop is illustrated in Fig. 9. A substantial improvement in speed was achieved through the modification of the second stage by Stojanovic (US Patent No. 6,232,810) [6].

Fig. 9. Pulse generator stage of the sense amplifier flip-flop: (a) Madden and Bowhill [5], (b) improvement for floating nodes, Dobberpuhl [9], (c) improvement by proper design: second stage (Stojanovic, US Patent No. 6,232,810), first stage [10].

2.3 Time window based flip-flops

Digital circuits are based on discrete events. The time reference is a clock signal and/or a finite delay through one or more logic elements. To generate the needed time reference, a pulse created by re-convergent fan-out paths with unequal inversion parities is commonly used. This method is illustrated in Fig. 10 on the HLFF flip-flop introduced by Partovi [7]. The trailing edge of this short pulse is used as the time reference for shutting the flip-flop off. A short "time window" is thus created during which the flip-flop accepts data, which is the way of creating an "edge" in the digital world. Rigorous analysis of HLFF reveals a design incompleteness resulting in imperfections of the 1-1 transition, which was demonstrated later.

A flip-flop based on the same described principle was introduced by Klass [8], Fig. 11.

It uses a NAND gate to inhibit any further changes and to lock the existing ones after the time window has elapsed. It offers one of the highest performance levels but suffers from the same problem as HLFF: a floating output node, which is susceptible to glitches and to even the slightest mismatch of clock signals.

A systematic approach to deriving a single-ended flip-flop is shown in Fig. 12. This flip-flop has three time reference points: (a) the clock signal, $Cclk$; (b) the clock signal passed through three inverters, $Cclk_3$; and (c) the clock signal passed through two inverters, $Cclk_2$. The equation describing the pulse generator stage of this flip-flop is given by

$$\overline{S} = X = \frac{(Cclk + Cclk_2)(D \cdot Cclk_3 + \overline{X})}{Cclk_2 + Cclk_3}$$

The nMOS transistor section is a full realization of this equation. The pMOS section is somewhat abbreviated for performance reasons to

$$X = \frac{(Cclk + Cclk_2)(Cclk_3 + \overline{X})}{Cclk_2 + Cclk_3}$$

The second stage (capturing latch) is implemented as

$$Q = X(Cclk_2 + Q)$$

This systematically derived flip-flop [11] does not have hazards in the output stage and outperforms the HLFF [7] and SDFF [8] flip-flops.

3 Timing Parameters

The Data and Clock inputs of a clocked storage element need to satisfy basic timing restrictions to ensure correct operation of the flip-flop. The fundamental timing constraints between data and clock inputs are quantified by the setup and hold times, as illustrated in Fig. 13. Setup and hold times define the time intervals during which the input has to be stable to ensure correct flip-flop operation. The sum of the setup and hold times defines the "sampling window" of the clocked storage element.

3.1 Setup and hold time properties

Failure of the clocked storage element due to Setup and Hold time violations is not an abrupt process. This failing behavior is shown in Fig. 13. In considering how close to the locking event the data should be allowed to change, we encounter two opposing requirements: (a) the data change should be kept away from the failing region for the sake of design reliability; (b) it should be as close to the clock as possible in order to increase the time available for the logic operation. This is an obvious dilemma. In some designs an arbitrary number of 5-20% is used: Setup and Hold times are defined as the points in time at which the $Clk - Q$ ($t_{CQ}$) delay rises by that amount. We do not find this reasoning to be valid.

A redrawn picture, Fig. 14, in which the $D - Q$ ($t_{DQ}$) delay is plotted (instead of Clk-Q), provides more information. From this graph we see that, in spite of the Clk-Q delay rising, we are still gaining because the time taken from the cycle is reduced.
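
The argument can be made concrete with a small numerical sketch. The exponential Clk-Q model below, its parameters (100 ps nominal delay, 100 ps rise, 10 ps time constant) and the function names are purely hypothetical stand-ins for a measured curve; the point is only that the D-Q delay, which is the time actually taken from the cycle, reaches its minimum at a data arrival time where the Clk-Q delay has already started to rise.

```python
import math


def clk_to_q_ps(d_to_clk_ps, nominal_ps=100.0, rise_ps=100.0, tau_ps=10.0):
    """Hypothetical Clk-Q delay model: flat when data arrives early,
    rising steeply as the data edge approaches the clock edge."""
    return nominal_ps + rise_ps * math.exp(-d_to_clk_ps / tau_ps)


def d_to_q_ps(d_to_clk_ps):
    """Time taken from the cycle: D-to-Clk offset plus Clk-Q delay."""
    return d_to_clk_ps + clk_to_q_ps(d_to_clk_ps)


# Sweep the data arrival from 0 ps to 100 ps ahead of the clock edge.
offsets = [i * 0.1 for i in range(1001)]
best_offset = min(offsets, key=d_to_q_ps)

nominal = clk_to_q_ps(float("inf"))  # Clk-Q with data arriving very early
print(f"optimal D-Clk offset : {best_offset:.1f} ps")
print(f"Clk-Q at that point  : {clk_to_q_ps(best_offset):.1f} ps "
      f"(nominal {nominal:.1f} ps)")
print(f"D-Q at that point    : {d_to_q_ps(best_offset):.1f} ps")
print(f"D-Q with 60 ps setup : {d_to_q_ps(60.0):.1f} ps")
```

In this model the minimum of the D-Q delay lies where the slope of the Clk-Q curve equals -1, that is, where each additional picosecond of later data arrival costs exactly one picosecond of additional Clk-Q delay.
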
3.2 Time borrowing and absorption of clock uncertainties

The increase in delay from the storage element is still smaller than the amount of delay introduced to the cycle, thus allowing more time for the useful logic operation. This is known as "time borrowing", "cycle stealing" or "slack passing". In order to understand the full effects of delayed data arrival, we have to consider a pipelined design in which the data captured in the first clock cycle is used as input in the next clock cycle, as shown in Fig. 15.

As can be seen in Fig. 15, the "sampling window" moves along the time axis. As the data arrives closer to the clock, the size of the "sampling window" shrinks (up to the optimal point). Even though the sampling window is smaller, the data in the next cycle will still arrive later than in the case where the data in the previous cycle was ahead of the setup time. The amount of time by which TCR1 was augmented did not come for free. It was simply taken away ("stolen" or "borrowed") from the next cycle, TCR2. As a result of late data arrival in Cycle 1, there is less time available in Cycle 2. Thus the boundary between pipeline stages is somewhat flexible. This feature not only helps accommodate a certain amount of imbalance between the critical paths in various pipeline stages, but also helps in absorbing clock skew and jitter. Thus, "time borrowing" is one of the most important characteristics of today’s high-speed digital systems. Absorption of clock jitter in HLFF is shown in Fig. 16.

![Fig. 16. Clock jitter absorbing properties of HLFF [7].](image)

The maximal clock skew that a system can tolerate is determined by the clocked storage elements. If the clock-to-output delay of a clocked storage element is shorter than the required hold time and there is no logic between two storage elements, a race condition can occur. This imposes a minimum-delay restriction on the clock-to-output delay, given by:

\[
\tau_{CLK-Q} \geq \tau_{hold} + \tau_{skew}
\]

If this relation is satisfied, the system is immune to hold time violations. Otherwise, it is necessary to check that all the timing paths have some minimal delay, which assures that there is no hold time violation.
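
A minimal sketch of this check is given below. With no logic between the storage elements it evaluates the relation above directly. The optional `t_logic_min_ps` term, which adds the shortest logic delay of a path to the clock-to-output delay, follows standard min-delay reasoning and is an assumption of this sketch, as are the function names and all numbers.

```python
def hold_margin_ps(t_clk_q_ps, t_hold_ps, t_skew_ps, t_logic_min_ps=0.0):
    """Margin against a race between two clocked storage elements.

    With t_logic_min_ps = 0 this is the relation in the text:
    t_CLK-Q >= t_hold + t_skew. A non-zero minimum logic delay models
    the fallback case where each path is checked individually.
    """
    return t_clk_q_ps + t_logic_min_ps - (t_hold_ps + t_skew_ps)


def is_race_free(t_clk_q_ps, t_hold_ps, t_skew_ps, t_logic_min_ps=0.0):
    """True when the hold margin is non-negative."""
    return hold_margin_ps(t_clk_q_ps, t_hold_ps, t_skew_ps, t_logic_min_ps) >= 0


# Hypothetical numbers in picoseconds.
direct_ok = is_race_free(t_clk_q_ps=80, t_hold_ps=30, t_skew_ps=40)
direct_bad = is_race_free(t_clk_q_ps=50, t_hold_ps=30, t_skew_ps=40)
padded = is_race_free(t_clk_q_ps=50, t_hold_ps=30, t_skew_ps=40,
                      t_logic_min_ps=25)
print("fast skew, slow CSE, no logic  :", direct_ok)
print("faster CSE, no logic           :", direct_bad)
print("faster CSE, 25 ps minimum logic:", padded)
```
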
4 Characterization

4.1 Power and energy

It is important to emphasize the sources of power consumed in the clocked storage element (CSE) and the correct setup for characterization and comparison. Power consumed by a CSE comes from various sources, of which the power supply ($V_{DD}$) is only one. Using $V_{DD}$ as the sole point for measuring power consumption can be misleading. Some CSEs, characterized by low internal power consumption, represent a considerable load on the clock distribution network, thus taking a considerable amount of power from the clock. Power can be drawn from the Data input as well. Therefore the total power $P_{tot}$ should account for all the possible power sources supplying the CSE [12].

$$P_{tot} = P_{internal} + \sum_{inputs(D,CLK)} P_{driver}$$ (6)

Fig. 17. Sources of power consumption in a CSE.
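
The effect of leaving out the driver terms in (6) can be illustrated with two made-up CSEs. All power figures and names below are hypothetical and serve only to show how a comparison based on internal ($V_{DD}$) power alone can be reversed once the power drawn from the clock and data drivers is included.

```python
def total_power_uw(internal_uw, driver_uw):
    """P_tot = P_internal + sum of the power drawn from each input driver."""
    return internal_uw + sum(driver_uw.values())


# Hypothetical CSEs (microwatts): A looks better if only V_DD power is
# measured, but it loads the clock network heavily.
cse_a = {"internal": 20.0, "drivers": {"D": 2.0, "CLK": 25.0}}
cse_b = {"internal": 30.0, "drivers": {"D": 3.0, "CLK": 8.0}}

p_a = total_power_uw(cse_a["internal"], cse_a["drivers"])
p_b = total_power_uw(cse_b["internal"], cse_b["drivers"])
print("A:", p_a, "uW total,", cse_a["internal"], "uW internal")
print("B:", p_b, "uW total,", cse_b["internal"], "uW internal")
```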

4.2 Delay

In characterizing delay it is only appropriate to take into account the amount of time taken from the cycle $T$ due to the insertion of the CSE. This is the delay $t_{DQ}$ discussed in Section 3. The question is whether this delay should be $D - Q$, $D - \overline{Q}$ or the worse of the two. We strongly argue that it is most appropriate to characterize the CSE by the worse of the two delays, since the critical path in a design may impose that scenario. Another question is that of the output load. It is only reasonable that the load on the outputs \( Q, \overline{Q} \) be representative of the conditions existing in a real design. In our measurements we use 14 minimal-size inverters (in the same technology) as a representative load. Finally, the remaining question is whether we should load only the output producing the longer delay or both \( Q \) and \( \overline{Q} \). We performed our measurements by loading only the worse of the two. This is justified by the fact that the critical path can always be improved by duplicating the CSE and reducing the load to zero on the output that is not in the critical path. This is the approach taken by a reasonable designer, and by a synthesis tool as well.

4.3 Figure of merit

It is well known that power can always be traded for speed and that superior speed can always be obtained by allowing higher power consumption. Thus, it is hard to tell which of two CSEs being compared is better. Various figures of merit have been used in the past. One commonly used and grossly misleading figure is the Power-Delay Product (PDP). It is not difficult to show that PDP always favors a slower design, given that the energy consumed depends on the clock speed as well. It has been shown that a more appropriate figure of merit is the Energy-Delay Product (EDP) [16]. However, some recent results argue that $ED^2P$ is more appropriate [19]. In our measurements we use PDP at a fixed frequency, which represents EDP.
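
The last statement can be checked in one line. Assume, for illustration, that the measured power $P$ is the average energy consumed per clock cycle, $E$, multiplied by the clock frequency $f$, and write $t_D$ for the CSE delay (these symbols are introduced here and are not part of the original notation):

$$PDP = P \cdot t_D = (E \cdot f) \cdot t_D = f \cdot (E \cdot t_D) = f \cdot EDP$$

With $f$ held at the same value for all CSEs under comparison, ranking by PDP is therefore the same as ranking by EDP.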

5 Design for Low Power

The energy consumed in a clocked storage element is approximated by:

\[
E_{\text{switching}} = \sum_{i=1}^{N} \alpha_{0 \to 1}(i) \, C_i \, V_{\text{swing}}(i) \, V_{DD} \quad (7)
\]

where \( N \) is the number of nodes in the clocked storage element, \( C_i \) is the capacitance of node \( i \), \( \alpha_{0 \to 1}(i) \) is the probability that a transition occurs at node \( i \), and \( V_{\text{swing}}(i) \) is the voltage swing of node \( i \). Starting from (7), several commonly used techniques for minimizing energy consumption can be derived:

(a) Reducing the number of active nodes and assuring that when they are switching the capacitance is minimized,

(b) Reducing the voltage swing of the switching node,

(c) Reducing the voltage (technology scaling),

(d) Reducing the activity of the node.

The approaches listed in (a)-(d) result in several known techniques used in low-power applications. One of the most common is "clock gating", which assures that the storage elements in an inactive part of the processor are not switching. A thorough review of the common low-power techniques can be found in [13]. In this paper we describe some recent techniques applicable to the low-power design of clocked storage elements.
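
The sum in (7) and the effect of techniques (b) and (d) can be sketched numerically. The node list, capacitances, activities and the 1.8 V supply below are hypothetical values chosen for illustration, and the dictionary keys are names introduced for this sketch.

```python
def switching_energy_fj(nodes, vdd):
    """Sum over nodes of activity * capacitance * swing * V_DD.

    With capacitance in fF and voltages in V, the result is in fJ.
    """
    return sum(
        n["activity"] * n["cap_ff"] * n["swing_v"] * vdd for n in nodes
    )


VDD = 1.8  # volts, hypothetical

# Hypothetical internal nodes of a CSE.
baseline = [
    {"activity": 1.00, "cap_ff": 4.0, "swing_v": VDD},  # clocked node
    {"activity": 0.50, "cap_ff": 6.0, "swing_v": VDD},  # data-dependent node
    {"activity": 0.50, "cap_ff": 5.0, "swing_v": VDD},  # output node
]

# (b) halve the swing of the clocked node; (d) halve the data activity.
low_swing = [dict(baseline[0], swing_v=VDD / 2)] + baseline[1:]
low_activity = [baseline[0]] + [
    dict(n, activity=n["activity"] / 2) for n in baseline[1:]
]

for name, nodes in [("baseline", baseline),
                    ("low swing", low_swing),
                    ("low activity", low_activity)]:
    print(f"{name:12s}: {switching_energy_fj(nodes, VDD):.2f} fJ")
```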

5.1 Conditional capture flip-flop

The conditional capture technique attempts to minimize unnecessary switching of the CSE. One such structure is the CCFF [14], which operates on the principle of the J-K flip-flop: data can affect the flip-flop only if it will result in a change of the output. An improved version of the CCFF is presented in [15]; it reduces the overall Energy-Delay Product by up to 14% at 50% data activity, while the total power saving is more than 50% with quiet inputs (Fig. 18). CSEs equipped with conditional features have advantageous properties under low data activity. However, conditional techniques are suitable for high-performance circuits as well.
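
The principle can be illustrated with a purely behavioral count of capture events. The sketch below is not a model of the CCFF circuit of [14] or [15]; it only assumes that the internal capture is enabled when the sampled input differs from the stored output, and the function name and input sequences are hypothetical.

```python
def capture_events(data_bits, q_initial=0):
    """Count cycles in which a conditional-capture CSE would switch.

    Behavioral assumption of this sketch: the internal capture is
    enabled only when the sampled input differs from the stored output.
    """
    q = q_initial
    events = 0
    for d in data_bits:
        if d != q:      # capture would change the output: allow it
            q = d
            events += 1
        # otherwise the capture is suppressed and nothing switches
    return events


quiet = [0] * 16        # input never changes
toggling = [0, 1] * 8   # input changes every cycle
print("quiet input   :", capture_events(quiet), "captures in 16 cycles")
print("toggling input:", capture_events(toggling), "captures in 16 cycles")
```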

![Fig. 18. Conditional capture flip-flop [15].](image)

5.2 Conditional precharge flip-flop

The Conditional Precharge flip-flop (CPFF) [15] is shown in Fig. 19. It eliminates the power-consuming precharge operation in dynamic flip-flops when it is not required.

6 Conclusion

A review of some (but not all) of the techniques for high-performance and low-power CSE design is presented. For a complete analysis of representative CSEs please visit www.ece.ucdavis.edu/acsel, where an extensive database of comparative results exists. In the future we expect that pipeline boundaries will start to blur and that synchronous design will be possible only in limited domains on the chip.

Acknowledgement

I gratefully acknowledge contributions from my students: Nikola Nedović, Marko Aleksić, Bart Zeydel, Hoang Dao and Xiao-Yan Yu.

References


