
Carry-Lookahead Adder: Faster Than Ripple-Carry

Denny Denny
7 min read
[Figure: a fast carry-generation tree feeding multiple full-adder slices in parallel, with carry propagation paths highlighted.]

TL;DR: A carry-lookahead adder computes every carry bit in parallel from per-bit generate ($G_i = A_i \cdot B_i$) and propagate ($P_i = A_i \oplus B_i$) signals, dropping the addition's critical path from $O(n)$ to $O(\log n)$ at the cost of a wider, more complex carry network.

A 4-bit ripple-carry adder is straightforward: chain four full adders, route each $C_{out}$ into the next $C_{in}$. The trouble starts when you ask how fast it goes. The most-significant bit cannot finalize its sum until the carry from the LSB has rippled through every intermediate stage, one full-adder delay per stage. For a 64-bit ALU, that's $64 \times t_{FA}$ on the critical path, completely unacceptable for a modern processor running at gigahertz clock rates.

Carry-lookahead adders (CLAs) attack the problem at its source: they compute every carry in parallel from a small set of intermediate signals derived directly from the inputs.

What Makes Ripple-Carry Slow?

Recall the full-adder equations:

$$S_i = A_i \oplus B_i \oplus C_i$$
$$C_{i+1} = A_i B_i + (A_i \oplus B_i) C_i$$

The sum $S_i$ depends on $C_i$. The carry $C_{i+1}$ also depends on $C_i$. So the chain of dependencies is $C_0 \to C_1 \to C_2 \to \dots \to C_n$. If a single carry stage takes $t_c$ time, the worst-case total addition time is approximately:

$$T_{ripple} \approx n \cdot t_c + t_{sum}$$

Linear in word width. For 4-bit words this might be 8 nanoseconds, which is perfectly acceptable. For 32- or 64-bit words it dominates the cycle time of the entire processor. This is the kind of propagation-delay limit that has driven CPU architecture for fifty years.
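
To make the serialized path concrete, here is a small behavioral model in Python (illustrative only, not DigiSim code; the function names are mine). The loop cannot start stage $i$ until stage $i-1$ has produced its carry, which is exactly what the formula above counts:

```python
def full_adder(a, b, cin):
    """One full-adder slice: sum and carry-out from two bits plus a carry-in."""
    s = a ^ b ^ cin
    cout = (a & b) | ((a ^ b) & cin)
    return s, cout

def ripple_add(a, b, n, c0=0):
    """Add two n-bit numbers; the carry chain forces n sequential stages."""
    carry, total = c0, 0
    for i in range(n):  # stage i waits on stage i-1's carry
        ai, bi = (a >> i) & 1, (b >> i) & 1
        si, carry = full_adder(ai, bi, carry)
        total |= si << i
    return total, carry

s, cout = ripple_add(0b1011, 0b0110, 4)  # 11 + 6: sum 0b0001, carry-out 1
```

In hardware the loop body is a physical stage, so the iteration count translates directly into gate delays on the critical path.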

Generate and Propagate

Look again at the carry equation:

$$C_{i+1} = A_i B_i + (A_i \oplus B_i) C_i$$

The first term, $A_i B_i$, is true exactly when both operand bits are 1; in that case stage $i$ unconditionally produces a carry-out, regardless of what arrives at $C_i$. Define:

$$G_i = A_i \cdot B_i$$

This is the generate signal: stage $i$ generates a carry on its own.

The second term, $(A_i \oplus B_i) C_i$, says: if exactly one of the operand bits is 1, then whatever arrives at $C_i$ propagates straight through. Define:

$$P_i = A_i \oplus B_i$$

This is the propagate signal: stage $i$ passes any incoming carry to its output.

The carry equation collapses to:

$$C_{i+1} = G_i + P_i \cdot C_i$$

This is the AND gate and OR gate you already know, doing exactly the work their truth tables describe — but now combined into a recurrence that you can flatten.
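
As a behavioral sketch (Python, names mine, not a DigiSim component), the collapsed recurrence looks like this:

```python
def carries_by_recurrence(a, b, n, c0=0):
    """Compute [C0, C1, ..., Cn] via C[i+1] = G[i] | (P[i] & C[i])."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]  # generate: both bits 1
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # propagate: exactly one bit 1
    c = [c0]
    for i in range(n):
        c.append(g[i] | (p[i] & c[i]))
    return c

# The sum bits then fall out as S_i = P_i ^ C_i.
carries = carries_by_recurrence(0b1011, 0b0110, 4)
```

Written this way the recurrence is still sequential; flattening it is what removes the chain.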

Flattening the Carry Recurrence

Substitute the recurrence into itself. Starting from a known $C_0$:

$$C_1 = G_0 + P_0 C_0$$

$$C_2 = G_1 + P_1 C_1 = G_1 + P_1(G_0 + P_0 C_0) = G_1 + P_1 G_0 + P_1 P_0 C_0$$

$$C_3 = G_2 + P_2 C_2 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$$

$$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$$

The pattern is clean. Each carry $C_{i+1}$ is a sum-of-products over the generate and propagate signals from stages $0$ through $i$. The longest term in $C_4$ has five inputs ANDed together, but importantly, all four carries can be computed simultaneously because none of them depends on any of the others. They all depend only on $G_0 \dots G_3$, $P_0 \dots P_3$, and $C_0$, which are derived directly from the inputs in a single gate delay.
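
A direct transcription of the four flattened formulas makes the independence visible (a sketch; `lookahead_carries` is my name, not a DigiSim component). No right-hand side mentions another carry:

```python
def lookahead_carries(g, p, c0):
    """Flat sum-of-products carries; g and p are 4-tuples of 0/1 bits.
    No expression reads another carry, so hardware evaluates all four at once."""
    G0, G1, G2, G3 = g
    P0, P1, P2, P3 = p
    C1 = G0 | (P0 & c0)
    C2 = G1 | (P1 & G0) | (P1 & P0 & c0)
    C3 = G2 | (P2 & G1) | (P2 & P1 & G0) | (P2 & P1 & P0 & c0)
    C4 = (G3 | (P3 & G2) | (P3 & P2 & G1)
          | (P3 & P2 & P1 & G0) | (P3 & P2 & P1 & P0 & c0))
    return C1, C2, C3, C4
```

Each line corresponds one-to-one with a displayed equation above, which is why the carry network is just a two-level AND-OR structure.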

The 4-Bit CLA Architecture

Three logical layers:

  1. Generate/propagate layer. $n$ AND gates produce $G_i = A_i B_i$ in one delay; $n$ XOR gates produce $P_i = A_i \oplus B_i$ in one delay. These run in parallel.
  2. Carry network. A wide AND-OR tree computes $C_1, C_2, C_3, C_4$ from the formulas above. Two gate delays (one AND, one OR) regardless of how many bits.
  3. Sum layer. $n$ XOR gates produce $S_i = P_i \oplus C_i$ in one delay. (Note that $S_i = A_i \oplus B_i \oplus C_i = P_i \oplus C_i$ once $P_i$ is computed.)

Total critical path: 4 gate delays for any width, up to the point where the carry network itself blows out. Compare with roughly $2n$ for ripple-carry. For a 16-bit adder, that's roughly 4 versus 32, an 8x speedup.
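
The three layers compose into a complete 4-bit adder. A behavioral Python sketch, with the function name and loop structure mine:

```python
def cla_add4(a, b, c0=0):
    """4-bit carry-lookahead adder, organized as the three layers above."""
    # Layer 1: generate/propagate, all bits in parallel
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    # Layer 2: carry network, flat sum-of-products (no carry reads another)
    c = [c0, 0, 0, 0, 0]
    for i in range(4):
        sop = 0
        for j in range(i, -1, -1):       # term P_i .. P_{j+1} G_j
            term = g[j]
            for k in range(j + 1, i + 1):
                term &= p[k]
            sop |= term
        chain = c0                        # term P_i .. P_0 C_0
        for k in range(i + 1):
            chain &= p[k]
        c[i + 1] = sop | chain
    # Layer 3: sum bits S_i = P_i ^ C_i
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, c[4]

s, cout = cla_add4(0b1011, 0b0110)  # 11 + 6: sum 0b0001, carry-out 1
```

The nested loops build exactly the sum-of-products terms from the flattened equations; in silicon each term is one wide AND gate feeding one OR gate.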

Block Diagram of a 4-Bit CLA

A0,B0 ─┬── XOR ── P0 ──┐                           ┌── XOR ── S0
       └── AND ── G0 ──┤                           │
                       │   ┌─── carry network ───┐ │
A1,B1 ─┬── XOR ── P1 ──┤   │                     │ ├── XOR ── S1
       └── AND ── G1 ──┤   │   computes          │ │
                       ├──>│   C1, C2, C3, C4    │─┤
A2,B2 ─┬── XOR ── P2 ──┤   │   in 2 gate delays  │ ├── XOR ── S2
       └── AND ── G2 ──┤   │                     │ │
                       │   └─────────────────────┘ │
A3,B3 ─┬── XOR ── P3 ──┤                           ├── XOR ── S3
       └── AND ── G3 ──┘                           │
                                                   └────────── C4
C0 ──────────────────────>┘

Gate Count: The Hidden Cost

The carry network's gate count grows quadratically with width, and its widest gate grows linearly: $C_4$ needs a 5-input AND for the full propagation chain; $C_{16}$ would need a 17-input AND. Real silicon doesn't have arbitrary-input AND gates; wide gates decompose into trees, adding delay. So a "flat" CLA scales only up to about 4 to 8 bits before you start paying for tree depth.

The standard fix is hierarchy. Group bits into 4-bit blocks; each block computes its own block-level generate $G^*$ and propagate $P^*$ signals, and a second-level CLA combines those. For a 16-bit adder you get four 4-bit CLA blocks plus one 4-bit lookahead unit on the block carries. Critical path: roughly $\log_4 n$ levels of lookahead times a constant.
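
A behavioral sketch of the two-level scheme (names and structure mine). The block-level signals follow the flattened $C_4$ formula: $G^* = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0$ and $P^* = P_3 P_2 P_1 P_0$:

```python
def block_gp(a4, b4):
    """Per-bit g, p plus block-level G*, P* for one 4-bit slice."""
    g = [((a4 >> i) & (b4 >> i)) & 1 for i in range(4)]
    p = [((a4 >> i) ^ (b4 >> i)) & 1 for i in range(4)]
    Gs = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    Ps = p[3] & p[2] & p[1] & p[0]
    return g, p, Gs, Ps

def cla_add16(a, b, c0=0):
    """16-bit adder: four 4-bit blocks plus second-level lookahead on (G*, P*)."""
    gps = [block_gp((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF) for k in range(4)]
    # Second level: block carries via the same recurrence, now over G*/P*.
    # (Modeled sequentially here; hardware flattens it exactly as before.)
    bc = [c0]
    for (_, _, Gs, Ps) in gps:
        bc.append(Gs | (Ps & bc[-1]))
    total = 0
    for k, (g, p, _, _) in enumerate(gps):
        c = bc[k]                     # carry into this block, from level 2
        for i in range(4):
            total |= (p[i] ^ c) << (4 * k + i)
            c = g[i] | (p[i] & c)     # block-internal lookahead, modeled serially
    return total, bc[4]
```

Note that each block's carry-in comes from the second-level unit, not from its neighbor, which is what breaks the 16-stage ripple chain.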

Comparison Table

| Design | 16-bit critical path | Gate count |
| --- | --- | --- |
| Ripple-carry | ~32 gate delays | ~80 |
| Single-level 4-bit CLA blocks, ripple between blocks | ~16 gate delays | ~120 |
| Two-level 16-bit CLA | ~8 gate delays | ~200 |
| Kogge-Stone (full parallel-prefix) | ~5 gate delays | ~400 |
| Brent-Kung (sparse prefix) | ~7 gate delays | ~150 |

(Rough estimates — actual numbers depend on gate library and optimization.)

The takeaway: every halving of delay roughly doubles gate count. CPU designers pick the spot on the curve that matches their cycle-time budget: a modest-clock embedded core can afford a compact tree like Brent-Kung, while a high-frequency desktop core leans toward a full parallel-prefix network like Kogge-Stone.

Beyond CLA: Parallel-Prefix Adders

The carry-lookahead idea generalizes. The recurrence $C_{i+1} = G_i + P_i C_i$ can be recast as an associative operator on $(G, P)$ pairs, so you can combine adjacent positions, then pairs of pairs, in a tree. This is the parallel-prefix family:

  • Kogge-Stone — fastest: minimum depth and bounded fanout, at the cost of $O(n \log n)$ cells and dense wiring.
  • Brent-Kung — roughly half the cells of Kogge-Stone and minimal fanout, at the cost of nearly double the logic depth.
  • Han-Carlson, Ladner-Fischer, Sklansky — different trade-offs of fanout, wiring complexity, and depth.
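
The operator behind all of these fits in a few lines (illustrative Python, names mine; a real prefix adder arranges the combine steps as a tree rather than this left-to-right scan):

```python
def combine(hi, lo):
    """Associative prefix operator on (G, P) pairs:
    (G, P) o (G', P') = (G + P*G', P*P')."""
    gh, ph = hi
    gl, pl = lo
    return (gh | (ph & gl), ph & pl)

def prefix_carries(a, b, n, c0=0):
    """Carries via group signals: C[i+1] = G[0..i] | (P[0..i] & C0)."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    carries, acc = [c0], None
    for i in range(n):
        acc = gp[i] if acc is None else combine(gp[i], acc)
        G, P = acc                    # group generate/propagate of bits 0..i
        carries.append(G | (P & c0))
    return carries
```

Because `combine` is associative, the running scan can be regrouped into a $\log_2 n$-deep tree; how that tree is shaped is exactly what distinguishes Kogge-Stone from Brent-Kung and the rest.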

Modern high-end ALUs often use hybrids that mix these structures, spending gates and wiring where speed dominates and using sparser trees where area matters. The exact choice is a circuit-design specialty.

Implementing a CLA in DigiSim

You can build a 4-bit CLA from primitives — four AND gates for the generates, four XOR gates for the propagates, the carry tree from the equations above, and four XOR gates for the sums. The ADDER and ADDER_8BIT components in DigiSim use carry-lookahead internally to keep delay reasonable. To compare against the slow path, open the full-adder-with-carry template, chain four instances together as ripple-carry, and time the critical path against an equivalent CLA. The difference grows visibly as you push to 8 and 16 bits.

If you want to characterize the gate-delay scaling experimentally, the simulator's logic analyzer can tag the moment $A$ and $B$ change versus the moment $S_3$ and $C_4$ stabilize. Run the experiment with both architectures across word widths and you'll see the linear vs. logarithmic scaling directly.

Common Pitfalls

  • Forgetting that $P_i$ uses XOR, not OR. Either definition — $P_i = A_i + B_i$ or $P_i = A_i \oplus B_i$ — produces a correct carry, because the $A_i = B_i = 1$ case is already covered by $G_i$. The reason to prefer XOR is that the same wire feeds the sum stage: $S_i = P_i \oplus C_i$. With OR-defined $P_i$ that reuse breaks down and you need a separate XOR for the sum.
  • Underestimating fanin scaling. A “16-bit CLA” implies a 17-input AND somewhere unless you go hierarchical. Real designs always go hierarchical.
  • Assuming CLA is always faster. For very narrow words (4 bits or fewer) the constants in CLA exceed ripple-carry’s. The crossover is around 6 to 8 bits.
  • Mixing up sum and carry critical paths. The carry network’s depth dominates total delay, but the final sum XOR adds one more gate delay. Specifying delay as “carry-network depth” is a common error.
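
The first pitfall is easy to check empirically. This sketch (names mine) confirms that OR- and XOR-defined propagate give identical carries, while only the XOR version doubles as the sum input:

```python
def carries(a, b, n, c0, use_or):
    """Carry chain with either propagate definition: P = a|b or P = a^b."""
    c = [c0]
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g = ai & bi
        p = (ai | bi) if use_or else (ai ^ bi)
        c.append(g | (p & c[-1]))
    return c

# Identical carries either way: the a = b = 1 case is covered by G.
same = all(carries(a, b, 4, 0, True) == carries(a, b, 4, 0, False)
           for a in range(16) for b in range(16))

# But sum reuse needs XOR: with a = b = 1 and carry-in 0, the true sum bit
# is 0, while OR-propagate XOR carry-in gives (1|1) ^ 0 = 1.
```

So the OR form is not wrong, just wasteful: it costs an extra XOR per bit in the sum layer.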

What’s Next

The next post in this series, How an ALU Works: Arithmetic Logic Unit From Gates, integrates the CLA into a full multi-function arithmetic unit alongside subtraction, AND, OR, and shift operations. Following that, Booth Multiplier Explained (With Examples) applies a similar speedup philosophy to multiplication — using a different algorithmic trick to skip past long runs of zeros and ones.

To compare the two adder architectures hands-on, open the full-adder-with-carry template, chain four instances as a ripple-carry adder, and watch the bit-by-bit carry propagation on the logic analyzer. Then build the equivalent 4-bit CLA from the equations in this post and time both circuits — the difference is the entire reason high-speed ALUs exist.