arithmetic-circuits

Booth Multiplier Explained (With Examples)

Denny Denny
8 min read
A multi-bit Booth multiplier with shift register and accumulator stages, partial products being summed.

TL;DR: Booth’s algorithm multiplies signed binary numbers in two’s complement by scanning adjacent bit pairs of the multiplier and choosing one of three actions per step: add the multiplicand, subtract it, or do nothing. It handles signed values natively and skips runs of identical bits, yielding fewer operations than naive shift-and-add for many real-world inputs.

The textbook way to multiply two binary numbers is shift-and-add: for each 1 in the multiplier, add a left-shifted copy of the multiplicand to a running sum. It works perfectly for unsigned values and produces a correct n × n → 2n-bit product after n iterations. The catch: it doesn’t know what to do when the operands are signed.

For two’s complement multiplication you’d have to detect signs, take absolute values, multiply, and re-apply the sign — a procedure with several edge cases (especially the most negative value, whose absolute value doesn’t fit in the same width). Andrew Booth’s 1951 algorithm sidesteps the entire problem by treating runs of consecutive 1s as a single subtract-then-add operation. It works on two’s complement values directly and, as a bonus, often executes faster.

Why Shift-and-Add Falls Short

Take the unsigned multiplication 13 × 11 in 4-bit binary:

       1101  (multiplicand A = 13)
     × 1011  (multiplier   B = 11)
     ------
       1101  (B0=1, add A)
      11010  (B1=1, add A << 1)
     000000  (B2=0, skip)
   1101000   (B3=1, add A << 3)
   --------
   10001111  (= 143)

Three additions for three 1-bits in B. Now consider 13 × 15, where B = 1111: four additions, of A shifted by 0, 1, 2, and 3. As B’s population count grows, so does the number of additions.

Booth observed: 1111 = 10000 − 1. So A × 1111 = (A << 4) − A. One left-shift plus one subtract instead of four additions. For a run of 1s of length k, the saving is k − 2 operations. Hardware executes billions of multiplications per second; the savings compound.
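The run-of-ones identity is easy to check in a few lines of Python (a throwaway sketch, not part of any multiplier):

```python
# Booth's observation: a run of ones is one shift minus one value.
# 0b1111 == 0b10000 - 1, so A * 0b1111 == (A << 4) - A.
A = 13
assert A * 0b1111 == (A << 4) - A            # four adds collapse to shift + subtract

# General form: k ones starting at bit j multiply by 2**(j + k) - 2**j.
j, k = 2, 3                                  # multiplier 0b11100
assert A * 0b11100 == (A << (j + k)) - (A << j)
```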

Booth’s Algorithm

Pad the multiplier B on the right with an implicit zero, giving a sequence of n+1 bits. Scan adjacent pairs (B_i, B_{i-1}) from least to most significant. At each step, examine the pair and accumulate one of three actions on the partial product P:

| B_i | B_{i-1} | Action | Meaning |
|-----|---------|--------|---------|
| 0 | 0 | P ← P | Inside a run of zeros |
| 0 | 1 | P ← P + (A << i) | End of a run of ones |
| 1 | 0 | P ← P − (A << i) | Start of a run of ones |
| 1 | 1 | P ← P | Inside a run of ones |

After all n bit pairs are processed, P holds the signed 2n-bit product.

The intuition: when you transition from 0 to 1 (reading right to left, so B_{i-1} = 0, B_i = 1), you’ve started a run of 1s. Booth subtracts A at this position, which is equivalent to adding −A, anticipating that the run will continue. When the run ends (transition from 1 to 0, so B_{i-1} = 1, B_i = 0), Booth adds back the current shifted A, completing the “add big shift, subtract small shift” pair that replaces the whole run of additions.
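The recoding can be applied directly in software. Here is a minimal Python sketch (the function name `booth_multiply` is mine, not from any library) that accumulates one signed add per 0→1 or 1→0 transition:

```python
def booth_multiply(a: int, b: int, n: int) -> int:
    """Multiply two n-bit two's complement values via Booth recoding."""
    product = 0
    prev = 0                          # the implicit B_{-1} = 0
    for i in range(n):
        bit = (b >> i) & 1            # Python's >> sign-extends, so negative b works
        if (bit, prev) == (1, 0):     # start of a run of ones: subtract
            product -= a << i
        elif (bit, prev) == (0, 1):   # end of a run of ones: add
            product += a << i
        prev = bit
    return product

assert booth_multiply(-3, 5, 4) == -15
assert booth_multiply(-8, -8, 4) == 64   # most negative 4-bit value, no special case
```

Note that the most negative value needs no special handling: the sign bit’s transition simply produces one final subtract at the right weight.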

The standard implementation runs the recurrence iteratively. State variables: an n-bit accumulator P, the multiplicand A, the multiplier B, and an extra “Booth bit” B_{-1}, initially 0. Each iteration:

  1. Examine (B_0, B_{-1}).
  2. Apply add, subtract, or no-op to P.
  3. Arithmetic-shift the combined {P, B, B_{-1}} register right by one position: the bit shifted out of P’s LSB becomes the new B_{n-1}, and the bit shifted out of B_0 becomes the new B_{-1}.

After n iterations, P concatenated with B is the signed 2n-bit product.

Worked Example: −3 × 5 in 4-Bit Two’s Complement

Operands: A = −3 = 1101, B = 5 = 0101. Expected product: −15 = 11110001 (8-bit two’s complement).

Initial state: a 4-bit accumulator P and 4-bit B with the Booth bit appended.

| Step | P | B | B_{-1} | Action |
|------|---|---|--------|--------|
| 0 | 0000 | 0101 | 0 | (initial) |
| 1 | P + (−A) = 0000 + 0011 = 0011 | 0101 | 0 | (B_0, B_{-1}) = (1, 0): subtract A |
|   | 0001 | 1010 | 1 | arithmetic shift right |
| 2 | P + A = 0001 + 1101 = 1110 | 1010 | 1 | (B_0, B_{-1}) = (0, 1): add A |
|   | 1111 | 0101 | 0 | shift right |
| 3 | P + (−A) = 1111 + 0011 = 0010 | 0101 | 0 | (B_0, B_{-1}) = (1, 0): subtract A |
|   | 0001 | 0010 | 1 | shift right |
| 4 | P + A = 0001 + 1101 = 1110 | 0010 | 1 | (B_0, B_{-1}) = (0, 1): add A |
|   | 1111 | 0001 | 0 | shift right |

Final result: P || B = 11110001 = −15. The product is correct, signed, and computed in four steps.

A couple of details worth tracing:

  • Subtract A here means add −A = 0011 (the two’s complement of 1101). Two’s complement subtraction is just addition of the negated operand, the arithmetic foundation underlying all of Booth.
  • Arithmetic right shift preserves the sign bit: P = 1110 shifts to 1111, not 0111. This is essential for the algorithm to produce a correct signed result.
  • The carry out of each add/subtract is discarded. Only the low n bits of P are retained; the result still fits because two’s complement arithmetic is modular.

Booth Hardware

A minimal Booth multiplier looks like:

  • A multiplicand register holding A (and a precomputed −A, or a circuit to negate on the fly).
  • A product accumulator holding P (high half) and B (low half), typically implemented as one wide shift register holding {P, B, B_{-1}} that shifts right by one each step.
  • An adder/subtractor: one ALU operating on the high half of the accumulator and either A or −A.
  • A 2-bit Booth decoder examining (B_0, B_{-1}) and producing control signals: enable add, enable subtract, or no-op.
  • A counter to drive the FSM through n iterations.

The control unit cycles: decode, conditionally accumulate, shift. After n shifts, the answer sits in the accumulator. A textbook implementation needs n clock cycles per multiplication; pipelined or array implementations process all bits in parallel and complete in one or a few cycles.

Modified Booth (Radix-4)

Standard Booth examines two bits and processes one bit per iteration. Modified Booth (also called radix-4 Booth or Booth-2) examines three bits and processes two per iteration, halving the number of iterations. The encoding:

| B_{i+1} | B_i | B_{i-1} | Action |
|---------|-----|---------|--------|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | +A |
| 0 | 1 | 0 | +A |
| 0 | 1 | 1 | +2A |
| 1 | 0 | 0 | −2A |
| 1 | 0 | 1 | −A |
| 1 | 1 | 0 | −A |
| 1 | 1 | 1 | 0 |

The set of multiples needed is {0, ±A, ±2A}. Multiplication by 2 is a left shift, so the only “real” operation per iteration is an add or subtract — no general multiplication is required. For a 32-bit operand, modified Booth runs in 16 iterations instead of 32, with the same per-step delay. This is the algorithm used inside virtually every commercial CPU multiplier.

Higher radices exist (radix-8, radix-16) but the multiples needed grow non-trivially — 3A doesn’t reduce to a simple shift — so they’re rarely worth the complexity.
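The radix-4 encoding fits in a small lookup table. A Python sketch (the name `booth_radix4` is mine) that scans overlapping 3-bit windows, two bits per step:

```python
def booth_radix4(a: int, b: int, n: int) -> int:
    """Radix-4 (modified) Booth multiply; n-bit two's complement, n even."""
    # (B_{i+1}, B_i, B_{i-1}) window -> required multiple of A
    multiple = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    product = 0
    for i in range(0, n, 2):                    # n/2 iterations, 2 bits each
        if i == 0:
            window = (b & 0b11) << 1            # B_1 B_0 with implicit B_{-1} = 0
        else:
            window = (b >> (i - 1)) & 0b111     # B_{i+1} B_i B_{i-1}
        product += multiple[window] * (a << i)  # multiple is 0, ±1, or ±2: shift in hardware
    return product

assert booth_radix4(-3, 5, 4) == -15
```

The window for each iteration overlaps its neighbor by one bit, which is exactly what makes the recoded digits sum back to the original two’s complement value.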

Comparison: Shift-and-Add vs Booth vs Modified Booth

| Algorithm | Iterations (n-bit) | Signed support | Operations per iteration |
|-----------|--------------------|----------------|--------------------------|
| Shift-and-add (unsigned) | n | No | 0 or 1 add |
| Two’s complement shift-and-add | n | Yes (with sign extension) | 0 or 1 add; final step subtracts |
| Booth | n | Yes | 0, 1 add, or 1 subtract |
| Modified Booth | n/2 | Yes | 0, 1 add, or 1 subtract (one of ±A, ±2A) |

Modified Booth’s halving of iterations, combined with native signed support, makes it the standard for fixed-point integer multiplication in CPUs.

Beyond Sequential Multipliers: Wallace and Dadda Trees

Modern high-throughput multipliers don’t iterate at all. They generate all n partial products simultaneously using AND gates (or, for signed operands, modified Booth encoders), then sum them with a tree of carry-save adders. Wallace and Dadda trees are the classic structures, both achieving O(log n) depth. A carry-lookahead adder then sums the two carry-save outputs to produce the product. The whole multiplier completes in a single cycle at modern clock rates.
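The carry-save idea fits in a few lines. This toy 3:2 compressor (not a full tree) reduces three addends to two with no carry propagation at all:

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 carry-save compressor: three addends in, two out, no carry chain."""
    s = a ^ b ^ c                            # per-bit sum
    carry = (a & b) | (a & c) | (b & c)      # per-bit majority = carry out
    return s, carry << 1                     # each carry bit has weight 2

# The three partial products of 13 x 11 reduce to two numbers with the
# same total; a Wallace/Dadda tree chains csa() in O(log n) levels and
# one fast adder finishes the job.
s, c = csa(13, 26, 104)
assert s + c == 143
```

Because every output bit depends only on the three input bits at its own position, arbitrarily many `csa` stages can be stacked without any carry ripple until the final adder.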

But these massively parallel designs still use modified Booth at the partial-product generation stage to halve the number of partial products. The algorithmic insight from 1951 is still pulling its weight in 2026.

Building a Booth Multiplier in DigiSim

A working iterative Booth multiplier needs an FSM controller, a 4-bit register acting as the accumulator, a shift register holding the multiplier with the Booth bit appended, an adder/subtractor, and a counter to terminate after n iterations.

A concrete starting point: open the 4-bit shift register SISO template and use it as the basis for the multiplier register. Add the ALU component for the conditional add/subtract, wire in a small FSM for the controller, and you have a complete sequential Booth multiplier. The simulator’s stepper lets you single-step through the worked example above and watch each iteration produce the same intermediate states the table shows.

Common Pitfalls

  • Forgetting the Booth bit. B_{-1} starts at 0 and is essential for the very first iteration’s pair-bit decision. Drop it and the algorithm misfires on the LSB.
  • Logical vs arithmetic right shift. The shift on P must be arithmetic (sign-extending). A logical shift produces a wrong result whenever the accumulator is negative.
  • Sign extension on the partial sum. Some implementations extend A and −A to 2n bits before adding to P and skip the arithmetic-shift trick. Either approach works; mixing them silently breaks correctness.
  • Modified Booth radix confusion. Don’t conflate “examines 3 bits” with “processes 3 bits per step.” Modified Booth examines a 3-bit overlapping window and processes 2 bits per step.
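The shift pitfall is easy to see on the worked example’s accumulator. A Python sketch, treating 4-bit registers as masked integers:

```python
n = 4
P = 0b1110                                   # negative accumulator from step 2 (-2)

logical = P >> 1                             # 0111: sign bit destroyed
sign = P >> (n - 1)                          # replicate the MSB
arithmetic = (P >> 1) | (sign << (n - 1))    # 1111: sign bit preserved

assert logical == 0b0111
assert arithmetic == 0b1111
```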

What’s Next

The next post in this series, Two’s Complement Explained: Signed Binary Arithmetic, covers the signed representation that makes Booth’s subtract-as-negate trick possible — and explains why −A in two’s complement is just “invert all bits and add one.” Following that, Carry-Lookahead Adder: Faster Than Ripple-Carry attacks the addition speed problem with a different algorithmic angle.

To experiment with sequential multiplication in the simulator, open the 4-bit shift register SISO template, extend it into an 8-bit accumulator, attach an ALU for the conditional add/subtract, and step through the −3 × 5 trace from this post. Watching the accumulator evolve cycle-by-cycle is the fastest way to internalize what Booth’s algorithm actually does.