How an ALU Works: Arithmetic Logic Unit from Gates
TL;DR: An Arithmetic Logic Unit (ALU) is a combinational circuit inside a CPU that performs arithmetic operations (addition, subtraction) and bitwise logic operations (AND, OR, XOR, NOT, shifts) on two binary operands, selected by control inputs. It outputs a result word plus status flags — typically Zero, Carry, Negative, and Overflow — that other parts of the CPU use for branching and condition checks.
The Arithmetic Logic Unit is the calculator at the center of every CPU. Each ADD, SUB, AND, OR, XOR, CMP, or shift instruction your processor executes ultimately resolves to a control pattern fed into the ALU, which produces a result and a set of flags in a single clock cycle. Despite its central role, an ALU is purely combinational logic: an adder, a comparator, a few bitwise gate banks, a shifter, and a multiplexer that picks which result to forward. Understanding how those pieces fit together demystifies the entire CPU datapath.
What is an ALU?
An ALU is a combinational digital circuit that takes two N-bit operands (commonly called A and B), a small set of control bits that select an operation, and produces an N-bit result along with status flags. It contains no internal state — every output is a pure function of the current inputs. State is held outside the ALU, in registers and the flags register that captures the ALU’s flag outputs at the end of each cycle.
A typical 4-bit ALU exposes the following ports:
| Port | Width | Direction | Purpose |
|---|---|---|---|
| A | 4 | input | First operand |
| B | 4 | input | Second operand |
| Cin | 1 | input | Carry-in for arithmetic |
| Op | 3-4 | input | Operation select |
| Y | 4 | output | Result |
| Cout | 1 | output | Carry / borrow out |
| Z | 1 | output | Zero flag (Y = 0) |
| N | 1 | output | Negative flag (Y[3]) |
| V | 1 | output | Overflow flag (signed) |
The opcode width determines how many distinct operations the ALU supports: three opcode bits give 8 operations, and four give 16. The classic 74181 4-bit ALU IC, shipped in 1970 and used in the PDP-11, DG Nova, and many other minicomputers of that era, combined four select lines (S0-S3) with a mode line (M) to yield 16 arithmetic and 16 logic operations.
The five blocks inside a 4-bit ALU
A practical ALU is built from five functional blocks, all running in parallel. A final multiplexer selects which block’s output becomes the visible result.
1. Adder / subtractor
The arithmetic block is a 4-bit ripple-carry adder built from full adders. Subtraction is implemented by XORing B with a Sub control line and feeding that line into Cin: when Sub = 1, B is inverted and Cin = 1, producing two’s complement A + (~B) + 1 = A - B. This single trick lets one adder do both ADD and SUB without doubling the silicon. For deeper detail on the adder structure, see Mastering Binary Addition: Building a 4-Bit Ripple-Carry Adder and The Half Adder vs the Full Adder.
The arithmetic block also produces:
- Cout: the final carry, used directly as the C flag
- V: signed overflow, computed as $V = C_3 \oplus C_4$, where $C_3$ is the carry into the MSB and $C_4 = \text{Cout}$ is the carry out of the MSB
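A minimal Python sketch of the adder/subtractor, assuming a 4-bit width and modeling each wire as an int holding 0 or 1. The names `full_adder` and `add_sub` are illustrative, not taken from any library. It shows the XOR-with-Sub trick and how Cout and V fall out of the last stage.

```python
WIDTH = 4

def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def add_sub(a: int, b: int, sub: int) -> tuple[int, int, int]:
    """4-bit ripple-carry adder/subtractor returning (y, cout, v).

    When sub = 1, every B bit is XORed with 1 (inverted) and the same
    control line is fed in as Cin, so the adder computes A + ~B + 1 = A - B.
    """
    carry = sub                       # Cin is tied to the Sub control line
    y = 0
    carry_into_msb = 0
    for i in range(WIDTH):
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ sub    # XOR with Sub conditionally inverts B
        if i == WIDTH - 1:
            carry_into_msb = carry    # C_3: carry entering the MSB stage
        s, carry = full_adder(a_i, b_i, carry)
        y |= s << i
    cout = carry                      # C_4: carry leaving the MSB stage
    v = carry_into_msb ^ cout         # signed overflow
    return y, cout, v
```

For example, `add_sub(0b0111, 0b0001, 0)` returns `(0b1000, 0, 1)`: 7 + 1 wraps to -8 in 4-bit two's complement, so V is set even though there is no unsigned carry.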
2. Bitwise logic block
A bank of 4 AND gates, 4 OR gates, 4 XOR gates, and 4 NOT gates produces all four bitwise results in parallel. Because each bit position is independent, no carry propagation is needed, and these results are available almost instantaneously, typically after a single gate delay. The control logic then picks one of these four buses through the output multiplexer. The table below gives the complete 3-bit opcode map for the example ALU, including the arithmetic and shift operations.
| Op bits | Operation | Bit-i logic |
|---|---|---|
| 000 | ADD | (full adder) |
| 001 | SUB | (full adder, B inverted, Cin=1) |
| 010 | AND | A_i AND B_i |
| 011 | OR | A_i OR B_i |
| 100 | XOR | A_i XOR B_i |
| 101 | NOT A | NOT A_i |
| 110 | SHL | A shifted left 1 |
| 111 | SHR | A shifted right 1 |
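The logic block can be sketched in a few lines of Python, assuming 4-bit operands held in ordinary ints. The function name `logic_block` is hypothetical. Because every bit lane is independent, applying Python's bitwise operators to the whole word gives the same result as four separate per-bit gates.

```python
MASK = 0b1111  # 4-bit operands

def logic_block(a: int, b: int) -> dict[str, int]:
    """Compute every bitwise result at once, as the parallel gate banks do."""
    return {
        "AND": a & b,
        "OR": a | b,
        "XOR": a ^ b,
        # Python ints are unbounded, so ~a is negative; masking keeps 4 bits.
        "NOT_A": ~a & MASK,
    }
```

All four results exist simultaneously; nothing here chooses between them. That choice belongs to the output multiplexer.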
3. Shifter
A shifter rewires A to produce a left-shift or right-shift result. A 1-bit left shift wires Y[i] = A[i-1] with Y[0] = 0. A 1-bit right shift wires Y[i] = A[i+1] with Y[3] = 0 (logical) or Y[3] = A[3] (arithmetic, sign-extending). A fixed-distance shift needs no gates at all: it is pure rewiring inside the cell. A barrel shifter, which shifts by a variable amount in a single pass, adds layers of multiplexers on top of that wiring.
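The 1-bit shift wiring maps directly onto integer shifts, assuming a 4-bit word. The names `shl1` and `shr1` are illustrative. The `arithmetic` flag selects whether Y[3] is tied to 0 or to A[3].

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1

def shl1(a: int) -> int:
    """1-bit left shift: Y[i] = A[i-1], Y[0] = 0; the old A[3] falls off."""
    return (a << 1) & MASK

def shr1(a: int, arithmetic: bool = False) -> int:
    """1-bit right shift: Y[i] = A[i+1].

    Y[3] is 0 for a logical shift, or a copy of A[3] for an arithmetic
    (sign-extending) shift.
    """
    msb = (a >> (WIDTH - 1)) & 1
    y = a >> 1
    if arithmetic:
        y |= msb << (WIDTH - 1)   # replicate the sign bit into Y[3]
    return y
```

For instance, `shr1(0b1011, arithmetic=True)` gives `0b1101`: -5 shifted right becomes -3, which is floor(-5 / 2).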
4. Comparator
The magnitude comparator drives the equality and ordering flags used by branch instructions like JE, JL, and JG. In simple ALUs the comparator is omitted and the same job is done by subtracting and reading the Z, N, and V flags: A = B when Z = 1, A < B (signed) when N XOR V = 1. In larger ALUs a dedicated comparator runs in parallel for speed. See Digital Comparator Explained: Equality and Magnitude for the standalone circuit.
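The sketch below models a dedicated unsigned magnitude comparator, assuming 4-bit operands. The function name and the (eq, gt, lt) return order are illustrative choices. It uses the standard per-bit construction: $x_i = \overline{A_i \oplus B_i}$ marks matching bits, and A > B at the first position, scanning from the MSB, where the bits differ and $A_i = 1$. Signed ordering would need the sign bits handled separately; this sketch covers only the unsigned case.

```python
WIDTH = 4

def magnitude_compare(a: int, b: int) -> tuple[int, int, int]:
    """Unsigned 4-bit comparator; returns (eq, gt, lt), exactly one is 1."""
    eq_prefix = 1                       # AND of x_i over the higher bits seen so far
    gt = lt = 0
    for i in reversed(range(WIDTH)):    # MSB first
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        gt |= eq_prefix & a_i & (1 - b_i)   # A_i AND NOT B_i, gated by higher-bit equality
        lt |= eq_prefix & (1 - a_i) & b_i   # NOT A_i AND B_i, gated the same way
        eq_prefix &= 1 - (a_i ^ b_i)        # x_i = XNOR(A_i, B_i)
    return eq_prefix, gt, lt
```

In hardware, the loop unrolls into fixed AND-OR terms such as $A_3\overline{B_3} + x_3 A_2 \overline{B_2} + \dots$, which is why the comparator can run in parallel with the adder.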
5. Output multiplexer
A 4-bit-wide 8-to-1 multiplexer (one MUX per bit lane) selects which block’s output drives Y. The Op control bits feed the select inputs. This is the component that makes the ALU programmable: every cycle, the same hardware can be told to do something different.
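The select step can be sketched as computing every block's output and then indexing by the opcode. This assumes the 3-bit opcode map from the table in the logic-block section and a logical SHR. Python integer arithmetic stands in for the gate-level blocks, and the function name `alu` is hypothetical.

```python
MASK = 0b1111

def alu(a: int, b: int, op: int) -> int:
    """Every block produces a result; `op` picks one, like the 8-to-1 mux."""
    b_inv = ~b & MASK
    candidates = [
        (a + b) & MASK,          # 000 ADD
        (a + b_inv + 1) & MASK,  # 001 SUB: A + ~B + 1
        a & b,                   # 010 AND
        a | b,                   # 011 OR
        a ^ b,                   # 100 XOR
        ~a & MASK,               # 101 NOT A
        (a << 1) & MASK,         # 110 SHL
        a >> 1,                  # 111 SHR (logical)
    ]
    return candidates[op]        # op drives the mux select lines
```

Building the whole `candidates` list before selecting mirrors the hardware, where all blocks compute on every cycle and only one result is forwarded.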
Status flags
The flags register captures the ALU’s flag outputs at the end of the cycle. The four canonical flags are:
- Z (Zero): set when every bit of Y is 0. Computed as a 4-input NOR of Y: $Z = \overline{Y_3 + Y_2 + Y_1 + Y_0}$.
- C (Carry / Borrow): the carry out of the MSB during arithmetic. With the inverted-B subtractor used here, C = 0 indicates that a borrow occurred. ARM follows this convention, while x86 inverts it and sets its carry flag to 1 on a borrow.
- N (Negative / Sign): the MSB of Y, treated as the sign bit in two’s complement: $N = Y_3$.
- V (Overflow): set when a signed arithmetic result wraps: $V = C_3 \oplus C_4$, where $C_3$ is the carry into the MSB and $C_4$ is the carry out of the MSB.
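The following sketch computes all four flags for an ADD or SUB on 4-bit operands. The name `arithmetic_flags` and the dict layout are illustrative. It relies on one observation: the carry into the MSB ($C_3$) equals the carry out of the low three bits, so it can be read from a 3-bit partial sum instead of simulating each full adder.

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1
LOW = MASK >> 1   # 0b0111: the bits below the MSB

def arithmetic_flags(a: int, b: int, sub: int) -> dict[str, int]:
    b_eff = b ^ (MASK if sub else 0)     # XOR with Sub inverts B for subtraction
    cin = sub                            # Cin tied to Sub
    total = a + b_eff + cin
    y = total & MASK
    c4 = total >> WIDTH                  # carry out of the MSB (Cout)
    c3 = ((a & LOW) + (b_eff & LOW) + cin) >> (WIDTH - 1)  # carry into the MSB
    return {
        "Y": y,
        "Z": int(y == 0),                # NOR of all result bits
        "C": c4,
        "N": (y >> (WIDTH - 1)) & 1,     # MSB of Y
        "V": c3 ^ c4,
    }
```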
Branch instructions read these flags: JZ branches when Z = 1, JC branches when C = 1, JL (signed less than) branches when N XOR V = 1, and so on. The flags also explain how CMP works without dedicated hardware: CMP is just a SUB whose result is discarded but whose flags are kept.
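CMP and the three branch conditions named above can be modeled in a few lines, assuming 4-bit two's-complement operands and the C = 0-means-borrow convention. The `BRANCHES` table and the `taken` helper are hypothetical illustrations, not any real ISA's encoding.

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1
LOW = MASK >> 1

def cmp_flags(a: int, b: int) -> dict[str, int]:
    """CMP: a SUB whose result Y is discarded; only the flags are returned."""
    b_inv = ~b & MASK
    total = a + b_inv + 1
    y = total & MASK                 # computed for the flags, then thrown away
    c4 = total >> WIDTH
    c3 = ((a & LOW) + (b_inv & LOW) + 1) >> (WIDTH - 1)
    return {"Z": int(y == 0), "C": c4, "N": y >> (WIDTH - 1), "V": c3 ^ c4}

# Branch conditions from the text, expressed as predicates over the flags.
BRANCHES = {
    "JZ": lambda f: f["Z"] == 1,
    "JC": lambda f: f["C"] == 1,
    "JL": lambda f: (f["N"] ^ f["V"]) == 1,   # signed less than
}

def taken(mnemonic: str, a: int, b: int) -> bool:
    """Would `mnemonic` branch after CMP A, B?"""
    return BRANCHES[mnemonic](cmp_flags(a, b))
```

The V term is what keeps JL correct when the subtraction overflows. With A = 0111 (7) and B = 1000 (-8), the raw result 1111 looks negative, but V = 1 cancels N, and `taken("JL", 0b0111, 0b1000)` correctly returns False.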
How the ALU sits in the CPU datapath
Inside a textbook CPU the ALU is wired between the register file and the result bus:
- The control unit decodes the current instruction and asserts the Op bits.
- Two registers drive A and B onto the ALU’s operand ports.
- The ALU produces Y and the flags in one combinational delay.
- On the next clock edge, Y is latched into the destination register and the flags are latched into the flags register.
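The four steps above can be sketched as a tiny cycle model, assuming a hypothetical 4-entry register file and an ALU restricted to ADD/SUB for brevity. `Datapath`, `clock`, and the register indices are illustrative names. The key point is that `alu` is a pure function, and state changes only inside `clock`, which stands in for the clock edge.

```python
from dataclasses import dataclass, field

WIDTH = 4
MASK = (1 << WIDTH) - 1

def alu(a: int, b: int, sub: int) -> tuple[int, dict[str, int]]:
    """Combinational ADD/SUB: output depends only on the current inputs."""
    b_eff = b ^ (MASK if sub else 0)
    total = a + b_eff + sub
    y = total & MASK
    c3 = ((a & (MASK >> 1)) + (b_eff & (MASK >> 1)) + sub) >> (WIDTH - 1)
    c4 = total >> WIDTH
    return y, {"Z": int(y == 0), "C": c4, "N": y >> (WIDTH - 1), "V": c3 ^ c4}

@dataclass
class Datapath:
    regs: list[int] = field(default_factory=lambda: [0] * 4)  # register file
    flags: dict[str, int] = field(default_factory=dict)       # flags register

    def clock(self, dest: int, src_a: int, src_b: int, sub: int) -> None:
        # Steps 1-2: control asserts Op (here just `sub`); two registers drive A and B.
        a, b = self.regs[src_a], self.regs[src_b]
        # Step 3: the ALU settles to Y and the flags with no state involved.
        y, flags = alu(a, b, sub)
        # Step 4: on the clock edge, Y and the flags are latched.
        self.regs[dest] = y
        self.flags = flags
```

For example, with `Datapath(regs=[6, 3, 0, 0])`, calling `clock(2, 0, 1, 1)` leaves 3 in register 2 and latches Z = 0, C = 1, N = 0, V = 0.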
Because the ALU is combinational, its propagation delay sets a hard floor on the CPU’s clock period whenever it sits on the critical path, as it does in a simple single-cycle design. Every nanosecond shaved off the ripple-carry chain (via carry-lookahead, carry-select, or carry-skip schemes) directly raises the maximum clock rate. To see the ALU in the broader execution pipeline, see Case Study: Visualizing the Fetch-Decode-Execute Cycle and Mastering the 4-Bit Register.
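To make the carry-lookahead idea concrete, the sketch below computes the same 4-bit carries two ways, assuming generate $g_i = A_i B_i$ and propagate $p_i = A_i \oplus B_i$. The function names are illustrative. In `ripple_carries`, each carry waits for the previous one. In `lookahead_carries`, each $c_i$ is built only from $g$, $p$, and $c_0$. Python still evaluates the loops one after another, but because no carry reads an earlier carry, hardware can evaluate all of them in parallel as two-level AND-OR logic.

```python
WIDTH = 4

def bits(v: int) -> list[int]:
    """Split a WIDTH-bit value into bits, LSB first."""
    return [(v >> i) & 1 for i in range(WIDTH)]

def ripple_carries(a: int, b: int, c0: int) -> list[int]:
    """Carries c_0..c_4, each computed from the one before it."""
    carries = [c0]
    for a_i, b_i in zip(bits(a), bits(b)):
        c = carries[-1]
        carries.append((a_i & b_i) | (c & (a_i ^ b_i)))
    return carries

def lookahead_carries(a: int, b: int, c0: int) -> list[int]:
    """Carries c_0..c_4 from generate/propagate terms only:
    c_i = g_{i-1} | p_{i-1} g_{i-2} | ... | p_{i-1} ... p_0 c_0
    """
    g = [x & y for x, y in zip(bits(a), bits(b))]   # generate: both bits 1
    p = [x ^ y for x, y in zip(bits(a), bits(b))]   # propagate: exactly one bit 1
    carries = [c0]
    for i in range(1, WIDTH + 1):
        c = 0
        for j in range(i):              # carry generated at bit j, propagated through j+1..i-1
            term = g[j]
            for k in range(j + 1, i):
                term &= p[k]
            c |= term
        term = c0                       # c_0 propagated through bits 0..i-1
        for k in range(i):
            term &= p[k]
        carries.append(c | term)
    return carries
```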
A worked 4-bit example
Suppose A = 0110 (6), B = 0011 (3), and Op = 001 (SUB).
- The XOR/Cin trick inverts B to 1100 and forces Cin = 1.
- The adder computes 0110 + 1100 + 1 = 1 0011, so Y = 0011 (3) and Cout = 1.
- Z = 0 (result is non-zero), N = 0 (MSB is 0), C = 1 (no borrow), V = 0.
Now try A = 0011, B = 0110, Op = 001 (SUB):
- B inverted = 1001, Cin = 1.
- 0011 + 1001 + 1 = 0 1101, so Y = 1101, Cout = 0.
- Z = 0, N = 1 (negative — that’s -3 in two’s complement), C = 0 (borrow occurred), V = 0.
The signed-less-than test N XOR V evaluates to 1, correctly indicating 3 < 6.
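Both worked examples can be reproduced with a short script, assuming a 4-bit subtractor built from the inverted-B trick. `sub4` is a hypothetical helper name. The printed lines match the bullet points above.

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1

def sub4(a: int, b: int) -> dict[str, int]:
    """4-bit A - B via A + ~B + 1, returning the result and all four flags."""
    b_inv = ~b & MASK              # XOR with Sub = 1 inverts B
    total = a + b_inv + 1          # Cin = 1
    y = total & MASK
    c4 = total >> WIDTH
    c3 = ((a & (MASK >> 1)) + (b_inv & (MASK >> 1)) + 1) >> (WIDTH - 1)
    return {"Y": y, "Z": int(y == 0), "N": y >> (WIDTH - 1), "C": c4, "V": c3 ^ c4}

for a, b in [(0b0110, 0b0011), (0b0011, 0b0110)]:
    f = sub4(a, b)
    signed_less = f["N"] ^ f["V"]
    print(f"{a:04b} - {b:04b}: Y={f['Y']:04b} Z={f['Z']} N={f['N']} "
          f"C={f['C']} V={f['V']} A<B(signed)={signed_less}")
# 0110 - 0011: Y=0011 Z=0 N=0 C=1 V=0 A<B(signed)=0
# 0011 - 0110: Y=1101 Z=0 N=1 C=0 V=0 A<B(signed)=1
```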
Real-world ALUs: 74181 to modern superscalar
The Texas Instruments 74181 from 1970 was the first single-chip 4-bit ALU. It packed roughly 75 gates on one die, supported 16 arithmetic and 16 logical operations, and could be cascaded with the matching 74182 carry-lookahead chip to build 8-, 16-, or 32-bit ALUs. Several minicomputer designs of that era cascaded 74181s in their ALUs — a 16-bit ALU could be built from four chips, a 32-bit ALU from eight. Modern x86 and ARM cores use multiple parallel ALUs (often 3–4 integer ALUs plus separate FPUs and SIMD units) per core, all sharing the same fundamental block diagram: bitwise gate banks, an adder, a shifter, and a result mux.
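Cascading 4-bit slices into a wider adder can be sketched by chaining carries, assuming each slice exposes a carry-in and carry-out. `slice_add` and `cascaded_add` are hypothetical names. Note a swapped-in simplification: the 74181 plus 74182 arrangement computes the inter-slice carries with lookahead, whereas this sketch simply ripples each slice's carry-out into the next slice's carry-in. That is slower in hardware but gives the same sum.

```python
SLICE = 4
SLICE_MASK = (1 << SLICE) - 1

def slice_add(a4: int, b4: int, cin: int) -> tuple[int, int]:
    """One 4-bit slice: returns (4-bit sum, carry out)."""
    total = a4 + b4 + cin
    return total & SLICE_MASK, total >> SLICE

def cascaded_add(a: int, b: int, n_slices: int = 4, cin: int = 0) -> tuple[int, int]:
    """Chain n 4-bit slices; each slice's carry out feeds the next slice's carry in."""
    result = 0
    carry = cin
    for k in range(n_slices):
        shift = k * SLICE
        s, carry = slice_add((a >> shift) & SLICE_MASK, (b >> shift) & SLICE_MASK, carry)
        result |= s << shift
    return result, carry
```

Four slices give a 16-bit adder (`cascaded_add(0x1234, 0x0FFF)` returns `(0x2233, 0)`), and `n_slices=8` gives 32 bits, matching the chip counts above.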
The execution width has grown from 4 bits to 64, the carry network has moved from ripple to lookahead to Kogge-Stone, and the operation count has grown from 32 to hundreds, but the block diagram is still the one the 74181 embodied in 1970.
Build it and watch the flags flip
The four flags only become intuitive when you feed live operands and watch them light up. Open the 4-Bit ALU Demonstration to step through ADD, SUB, AND, OR, and XOR with controllable operands and a flag readout, then move to the 8-Bit ALU System to see how the same blocks scale up. Once the ALU is clear, the next stop is the CPU flags register, which captures Z, C, N, and V and feeds them to the branch logic that turns arithmetic results into program flow.