
Carry-Lookahead Adder: Faster Than Ripple-Carry

Denny Denny
7 min read
[Figure: a fast carry-generation tree feeding multiple full-adder slices in parallel, with carry propagation paths highlighted.]

TL;DR: A carry-lookahead adder computes every carry bit in parallel from per-bit generate ($G_i = A_i \cdot B_i$) and propagate ($P_i = A_i \oplus B_i$) signals, dropping the addition's critical path from $O(n)$ to $O(\log n)$ at the cost of a wider, more complex carry network.

A 4-bit ripple-carry adder is straightforward: chain four full adders, route each $C_{out}$ into the next $C_{in}$. The trouble starts when you ask how fast it goes. The most-significant bit cannot finalize its sum until the carry from the LSB has rippled through every intermediate stage, one full-adder delay per stage. For a 64-bit ALU, that's $64 \times t_{FA}$ on the critical path, completely unacceptable for a modern processor running at gigahertz clock rates.

Carry-lookahead adders (CLAs) attack the problem at its source: they compute every carry in parallel from a small set of intermediate signals derived directly from the inputs.

What Makes Ripple-Carry Slow?

Recall the full-adder equations:

$$S_i = A_i \oplus B_i \oplus C_i$$
$$C_{i+1} = A_i B_i + (A_i \oplus B_i) C_i$$

The sum $S_i$ depends on $C_i$. The carry $C_{i+1}$ also depends on $C_i$. So the chain of dependencies is $C_0 \to C_1 \to C_2 \to \dots \to C_n$. If a single carry stage takes $t_c$ time, the worst-case total addition time is approximately:

$$T_{ripple} \approx n \cdot t_c + t_{sum}$$

Linear in word width. For 4-bit words this might be 8 nanoseconds, which is perfectly acceptable. For 32- or 64-bit words it dominates the cycle time of the entire processor. This is the kind of propagation-delay limit that has driven CPU architecture for fifty years.
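
To make the serialized path concrete, here is a small behavioral model in Python (illustrative only, not DigiSim code; the function names are mine). The loop cannot start stage $i$ until stage $i-1$ has produced its carry, which is exactly what the formula above counts:

```python
def full_adder(a, b, cin):
    """One full-adder slice: sum and carry-out from two bits plus a carry-in."""
    s = a ^ b ^ cin
    cout = (a & b) | ((a ^ b) & cin)
    return s, cout

def ripple_add(a, b, n, c0=0):
    """Add two n-bit numbers; the carry chain forces n sequential stages."""
    carry, total = c0, 0
    for i in range(n):  # stage i waits on stage i-1's carry
        ai, bi = (a >> i) & 1, (b >> i) & 1
        si, carry = full_adder(ai, bi, carry)
        total |= si << i
    return total, carry

s, cout = ripple_add(0b1011, 0b0110, 4)  # 11 + 6: sum 0b0001, carry-out 1
```

In hardware the loop body is a physical stage, so the iteration count translates directly into gate delays on the critical path.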

Generate and Propagate

Look again at the carry equation:

$$C_{i+1} = A_i B_i + (A_i \oplus B_i) C_i$$

The first term, $A_i B_i$, is true exactly when both operand bits are 1; in that case stage $i$ unconditionally produces a carry-out, regardless of what arrives at $C_i$. Define:

$$G_i = A_i \cdot B_i$$

This is the generate signal: stage $i$ generates a carry on its own.

The second term, $(A_i \oplus B_i) C_i$, says: if exactly one of the operand bits is 1, then whatever arrives at $C_i$ propagates straight through. Define:

$$P_i = A_i \oplus B_i$$

This is the propagate signal: stage $i$ passes any incoming carry to its output.

The carry equation collapses to:

$$C_{i+1} = G_i + P_i \cdot C_i$$

This is the AND gate and OR gate you already know, doing exactly the work their truth tables describe — but now combined into a recurrence that you can flatten.
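
As a behavioral sketch (Python, names mine, not a DigiSim component), the collapsed recurrence looks like this:

```python
def carries_by_recurrence(a, b, n, c0=0):
    """Compute [C0, C1, ..., Cn] via C[i+1] = G[i] | (P[i] & C[i])."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]  # generate: both bits 1
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # propagate: exactly one bit 1
    c = [c0]
    for i in range(n):
        c.append(g[i] | (p[i] & c[i]))
    return c

# The sum bits then fall out as S_i = P_i ^ C_i.
carries = carries_by_recurrence(0b1011, 0b0110, 4)
```

Written this way the recurrence is still sequential; flattening it is what removes the chain.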

Flattening the Carry Recurrence

Substitute the recurrence into itself. Starting from a known $C_0$:

$$C_1 = G_0 + P_0 C_0$$

$$C_2 = G_1 + P_1 C_1 = G_1 + P_1(G_0 + P_0 C_0) = G_1 + P_1 G_0 + P_1 P_0 C_0$$

$$C_3 = G_2 + P_2 C_2 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$$

$$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$$

The pattern is clean. Each carry $C_{i+1}$ is a sum-of-products over the generate and propagate signals from stages $0$ through $i$. The longest term in $C_4$ has five inputs ANDed together, but importantly, all four carries can be computed simultaneously because none of them depends on any of the others. They all depend only on $G_0 \dots G_3$, $P_0 \dots P_3$, and $C_0$, which are derived directly from the inputs in a single gate delay.
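
A direct transcription of the four flattened formulas makes the independence visible (a sketch; `lookahead_carries` is my name, not a DigiSim component). No right-hand side mentions another carry:

```python
def lookahead_carries(g, p, c0):
    """Flat sum-of-products carries; g and p are 4-tuples of 0/1 bits.
    No expression reads another carry, so hardware evaluates all four at once."""
    G0, G1, G2, G3 = g
    P0, P1, P2, P3 = p
    C1 = G0 | (P0 & c0)
    C2 = G1 | (P1 & G0) | (P1 & P0 & c0)
    C3 = G2 | (P2 & G1) | (P2 & P1 & G0) | (P2 & P1 & P0 & c0)
    C4 = (G3 | (P3 & G2) | (P3 & P2 & G1)
          | (P3 & P2 & P1 & G0) | (P3 & P2 & P1 & P0 & c0))
    return C1, C2, C3, C4
```

Each line corresponds one-to-one with a displayed equation above, which is why the carry network is just a two-level AND-OR structure.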

The 4-Bit CLA Architecture

Three logical layers:

  1. Generate/propagate layer. $n$ AND gates produce $G_i = A_i B_i$ in one delay; $n$ XOR gates produce $P_i = A_i \oplus B_i$ in one delay. These run in parallel.
  2. Carry network. A wide AND-OR tree computes $C_1, C_2, C_3, C_4$ from the formulas above. Two gate delays (one AND, one OR) regardless of how many bits.
  3. Sum layer. $n$ XOR gates produce $S_i = P_i \oplus C_i$ in one delay. (Note that $S_i = A_i \oplus B_i \oplus C_i = P_i \oplus C_i$ once $P_i$ is computed.)

Total critical path: 4 gate delays for any width, up to the point where the carry network itself blows out. Compare with roughly $2n$ for ripple-carry. For a 16-bit adder, that's roughly 4 versus 32, an 8x speedup.
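
The three layers compose into a complete 4-bit adder. A behavioral Python sketch, with the function name and loop structure mine:

```python
def cla_add4(a, b, c0=0):
    """4-bit carry-lookahead adder, organized as the three layers above."""
    # Layer 1: generate/propagate, all bits in parallel
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    # Layer 2: carry network, flat sum-of-products (no carry reads another)
    c = [c0, 0, 0, 0, 0]
    for i in range(4):
        sop = 0
        for j in range(i, -1, -1):       # term P_i .. P_{j+1} G_j
            term = g[j]
            for k in range(j + 1, i + 1):
                term &= p[k]
            sop |= term
        chain = c0                        # term P_i .. P_0 C_0
        for k in range(i + 1):
            chain &= p[k]
        c[i + 1] = sop | chain
    # Layer 3: sum bits S_i = P_i ^ C_i
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, c[4]

s, cout = cla_add4(0b1011, 0b0110)  # 11 + 6: sum 0b0001, carry-out 1
```

The nested loops build exactly the sum-of-products terms from the flattened equations; in silicon each term is one wide AND gate feeding one OR gate.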

Block Diagram of a 4-Bit CLA

A0,B0 ─┬── XOR ── P0 ──┐                           ┌── XOR ── S0
       └── AND ── G0 ──┤                           │
                       │   ┌─── carry network ───┐ │
A1,B1 ─┬── XOR ── P1 ──┤   │                     │ ├── XOR ── S1
       └── AND ── G1 ──┤   │   computes          │ │
                       ├──>│   C1, C2, C3, C4    │─┤
A2,B2 ─┬── XOR ── P2 ──┤   │   in 2 gate delays  │ ├── XOR ── S2
       └── AND ── G2 ──┤   │                     │ │
                       │   └─────────────────────┘ │
A3,B3 ─┬── XOR ── P3 ──┤                           ├── XOR ── S3
       └── AND ── G3 ──┘                           │
                                                   └────────── C4
C0 ──────────────────────>┘

Gate Count: The Hidden Cost

The carry network's gate count grows quadratically with width, and its widest gate grows linearly: $C_4$ needs a 5-input AND for the full propagation chain; $C_{16}$ would need a 17-input AND. Real silicon doesn't have arbitrary-input AND gates; wide gates decompose into trees, adding delay. So a "flat" CLA scales only up to about 4 to 8 bits before you start paying for tree depth.

The standard fix is hierarchy. Group bits into 4-bit blocks; each block computes its own block-level generate $G^*$ and propagate $P^*$ signals, and a second-level CLA combines those. For a 16-bit adder you get four 4-bit CLA blocks plus one 4-bit lookahead unit on the block carries. Critical path: roughly $\log_4 n$ levels of lookahead times a constant.
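
A behavioral sketch of the two-level scheme (names and structure mine). The block-level signals follow the flattened $C_4$ formula: $G^* = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0$ and $P^* = P_3 P_2 P_1 P_0$:

```python
def block_gp(a4, b4):
    """Per-bit g, p plus block-level G*, P* for one 4-bit slice."""
    g = [((a4 >> i) & (b4 >> i)) & 1 for i in range(4)]
    p = [((a4 >> i) ^ (b4 >> i)) & 1 for i in range(4)]
    Gs = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    Ps = p[3] & p[2] & p[1] & p[0]
    return g, p, Gs, Ps

def cla_add16(a, b, c0=0):
    """16-bit adder: four 4-bit blocks plus second-level lookahead on (G*, P*)."""
    gps = [block_gp((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF) for k in range(4)]
    # Second level: block carries via the same recurrence, now over G*/P*.
    # (Modeled sequentially here; hardware flattens it exactly as before.)
    bc = [c0]
    for (_, _, Gs, Ps) in gps:
        bc.append(Gs | (Ps & bc[-1]))
    total = 0
    for k, (g, p, _, _) in enumerate(gps):
        c = bc[k]                     # carry into this block, from level 2
        for i in range(4):
            total |= (p[i] ^ c) << (4 * k + i)
            c = g[i] | (p[i] & c)     # block-internal lookahead, modeled serially
    return total, bc[4]
```

Note that each block's carry-in comes from the second-level unit, not from its neighbor, which is what breaks the 16-stage ripple chain.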

Comparison Table

| Design | 16-bit critical path | Gate count |
| --- | --- | --- |
| Ripple-carry | ~32 gate delays | ~80 |
| Single-level 4-bit CLA blocks, ripple between blocks | ~16 gate delays | ~120 |
| Two-level 16-bit CLA | ~8 gate delays | ~200 |
| Kogge-Stone (full parallel-prefix) | ~5 gate delays | ~400 |
| Brent-Kung (sparse prefix) | ~7 gate delays | ~150 |

(Rough estimates — actual numbers depend on gate library and optimization.)

The takeaway: every halving of delay roughly doubles gate count. CPU designers pick the spot on the curve that matches their cycle-time budget: a modest-clock embedded core can afford a compact tree like Brent-Kung, while a high-frequency desktop core leans toward a full parallel-prefix network like Kogge-Stone.

Beyond CLA: Parallel-Prefix Adders

The carry-lookahead idea generalizes. The recurrence $C_{i+1} = G_i + P_i C_i$ can be recast as an associative operator on $(G, P)$ pairs, so you can combine adjacent positions, then pairs of pairs, in a tree. This is the parallel-prefix family:

  • Kogge-Stone — fastest: minimum depth and bounded fanout, at the cost of $O(n \log n)$ cells and dense wiring.
  • Brent-Kung — roughly half the cells of Kogge-Stone and minimal fanout, at the cost of nearly double the logic depth.
  • Han-Carlson, Ladner-Fischer, Sklansky — different trade-offs of fanout, wiring complexity, and depth.
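
The operator behind all of these fits in a few lines (illustrative Python, names mine; a real prefix adder arranges the combine steps as a tree rather than this left-to-right scan):

```python
def combine(hi, lo):
    """Associative prefix operator on (G, P) pairs:
    (G, P) o (G', P') = (G + P*G', P*P')."""
    gh, ph = hi
    gl, pl = lo
    return (gh | (ph & gl), ph & pl)

def prefix_carries(a, b, n, c0=0):
    """Carries via group signals: C[i+1] = G[0..i] | (P[0..i] & C0)."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    carries, acc = [c0], None
    for i in range(n):
        acc = gp[i] if acc is None else combine(gp[i], acc)
        G, P = acc                    # group generate/propagate of bits 0..i
        carries.append(G | (P & c0))
    return carries
```

Because `combine` is associative, the running scan can be regrouped into a $\log_2 n$-deep tree; how that tree is shaped is exactly what distinguishes Kogge-Stone from Brent-Kung and the rest.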

Modern high-end ALUs often use hybrids that mix these structures, spending gates and wiring where speed dominates and using sparser trees where area matters. The exact choice is a circuit-design specialty.

Implementing a CLA in DigiSim

You can build a 4-bit CLA from primitives — four AND gates for the generates, four XOR gates for the propagates, the carry tree from the equations above, and four XOR gates for the sums. The ADDER and ADDER_8BIT components in DigiSim use carry-lookahead internally to keep delay reasonable. To compare against the slow path, open the full-adder-with-carry template, chain four instances together as ripple-carry, and time the critical path against an equivalent CLA. The difference grows visibly as you push to 8 and 16 bits.

If you want to characterize the gate-delay scaling experimentally, the simulator's logic analyzer can tag the moment $A$ and $B$ change versus the moment $S_3$ and $C_4$ stabilize. Run the experiment with both architectures across word widths and you'll see the linear vs. logarithmic scaling directly.

Common Pitfalls

  • Forgetting that $P_i$ uses XOR, not OR. Either definition — $P_i = A_i + B_i$ or $P_i = A_i \oplus B_i$ — produces a correct carry, because the $A_i = B_i = 1$ case is already covered by $G_i$. The reason to prefer XOR is that the same wire feeds the sum stage: $S_i = P_i \oplus C_i$. With OR-defined $P_i$ that reuse breaks down and you need a separate XOR for the sum.
  • Underestimating fanin scaling. A “16-bit CLA” implies a 17-input AND somewhere unless you go hierarchical. Real designs always go hierarchical.
  • Assuming CLA is always faster. For very narrow words (4 bits or fewer) the constants in CLA exceed ripple-carry’s. The crossover is around 6 to 8 bits.
  • Mixing up sum and carry critical paths. The carry network’s depth dominates total delay, but the final sum XOR adds one more gate delay. Specifying delay as “carry-network depth” is a common error.
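
The first pitfall is easy to check empirically. This sketch (names mine) confirms that OR- and XOR-defined propagate give identical carries, while only the XOR version doubles as the sum input:

```python
def carries(a, b, n, c0, use_or):
    """Carry chain with either propagate definition: P = a|b or P = a^b."""
    c = [c0]
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g = ai & bi
        p = (ai | bi) if use_or else (ai ^ bi)
        c.append(g | (p & c[-1]))
    return c

# Identical carries either way: the a = b = 1 case is covered by G.
same = all(carries(a, b, 4, 0, True) == carries(a, b, 4, 0, False)
           for a in range(16) for b in range(16))

# But sum reuse needs XOR: with a = b = 1 and carry-in 0, the true sum bit
# is 0, while OR-propagate XOR carry-in gives (1|1) ^ 0 = 1.
```

So the OR form is not wrong, just wasteful: it costs an extra XOR per bit in the sum layer.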

What’s Next

The next post in this series, How an ALU Works: Arithmetic Logic Unit From Gates, integrates the CLA into a full multi-function arithmetic unit alongside subtraction, AND, OR, and shift operations. Following that, Booth Multiplier Explained (With Examples) applies a similar speedup philosophy to multiplication — using a different algorithmic trick to skip past long runs of zeros and ones.

To compare the two adder architectures hands-on, open the full-adder-with-carry template, chain four instances as a ripple-carry adder, and watch the bit-by-bit carry propagation on the logic analyzer. Then build the equivalent 4-bit CLA from the equations in this post and time both circuits — the difference is the entire reason high-speed ALUs exist.