
Build a CPU from Scratch in a Simulator

Denny
Isometric exploded view of a complete simple CPU showing program counter, instruction register, ALU, registers, RAM, and control unit interconnected.

TL;DR: A working CPU is just five subsystems wired together — a register file, an ALU, a tri-state data bus, a RAM module, and a control unit that orchestrates them on each clock tick. Build each block separately, connect them through a shared bus, hand-assemble three instructions into RAM, and the fetch-decode-execute cycle takes care of the rest.

Most students meet the CPU as a black box. The textbook draws a rectangle, labels it “processor,” and moves on to assembly language. The actual machinery — the registers, multiplexers, latches, and tri-state buffers that turn a binary instruction into a state change — is hidden under abstraction. The fastest way to remove that abstraction is to build the thing yourself, gate by gate, and watch a real program run on a circuit you can probe.

This post is the long-form walkthrough. We will construct an 8-bit CPU from primitive logic gates: a four-register register file, an 8-bit ALU with a flags register, a shared data bus driven by tri-state buffers, a 256-byte RAM, and a control unit that sequences the fetch-decode-execute cycle. We will then hand-assemble three instructions, drop them into memory, and step the clock until they execute. Every component is something you have probably already seen in isolation; the trick is wiring them into a coherent machine.

Why Build a CPU Instead of Just Reading About One?

A diagram of a CPU lies to you in two ways. First, it shows everything connected at once, when in reality only one source drives the bus per clock tick. Second, it omits the control signals — the dozen-or-so wires from the control unit that decide which register loads, which register drives, and what the ALU does on this particular cycle. Until you see those signals toggle in time with the clock, the cycle is a fairy tale.

Building the CPU forces you to confront both lies. You will discover that the tri-state buffer is not a curiosity — it is the only thing standing between your bus and a short circuit. You will discover that registers are useless without explicit load-enable signals. You will discover that the control unit is not magic; it is a finite state machine driving a few dozen wires.

If you have already read the fetch-decode-execute case study and want the next layer down, this is it.

Section 0: The Architecture Overview

Before the first wire goes down, we need a block diagram. Our target CPU has these characteristics:

  • 8-bit data path — every register, the ALU, and the bus carry 8 bits.
  • 8-bit address space — 256 bytes of RAM, plenty for our toy programs.
  • 16-bit fixed-width instructions — two memory bytes per instruction.
  • Four general-purpose registers — R0, R1, R2, R3.
  • Two specialised registers — Accumulator (A) and a temporary B for the ALU.
  • Single shared data bus — only one component drives at a time.
  • Synchronous design — everything is clocked off a single global clock.

The block diagram, drawn as a table:

| Subsystem | Components | Purpose |
| --- | --- | --- |
| Program counter | PROGRAM_COUNTER_8BIT | Holds address of next instruction |
| Memory interface | MEMORY_ADDRESS_REGISTER, MDR | Read/write to RAM |
| Memory | RAM ($256 \times 8$) | Stores program + data |
| Decode | INSTRUCTION_REGISTER | Holds current instruction |
| Register file | R0–R3, A, B | General storage |
| Compute | ALU_8BIT, FLAGS_REGISTER | Arithmetic + logic |
| Interconnect | DATA_BUS_8BIT, TRI_STATE_BUFFERs | Shared 8-bit transport |
| Sequencer | CONTROL_UNIT | Generates timed control signals |

Picture them as a row of boxes, all hanging off a single horizontal bus, each with one wire that says “drive the bus” and one that says “load from the bus.” The control unit owns those wires.

The DigiSim template that matches this architecture is Sequential Instruction Executor — open it in another tab and use it as the reference machine while reading.

Section 1: Build the Register File

A register is a parallel array of D flip-flops, all sharing a clock and a load-enable signal. We covered this in detail in Mastering the 4-bit Register; the 8-bit version is identical, just twice as wide.

The Generic 8-Bit Register

Each register has:

  • 8 data inputs ($D_0 \ldots D_7$)
  • 1 clock input (CLK)
  • 1 load-enable input (LD)
  • 8 data outputs ($Q_0 \ldots Q_7$)
  • 8 tri-state output buffers, all gated by one output-enable wire (OE)

The load-enable usually feeds an AND gate with the clock (a gated clock: fine in a simulator, problematic on real silicon, but we will keep things simple). The output-enable feeds a bank of tri-state buffers that connect $Q$ to the bus.
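As a rough behavioural sketch of the register just described (Python rather than gates; the class name `Register8` and its methods are illustrative assumptions, not DigiSim components), the load-enable and output-enable pins behave like this:

```python
from typing import Optional


class Register8:
    """Behavioural model of one 8-bit register (hypothetical helper)."""

    def __init__(self) -> None:
        self.q = 0  # the eight stored bits, Q7..Q0

    def clock(self, bus: Optional[int], load: bool) -> None:
        # On the clock edge, capture the bus only when LD is asserted.
        if load:
            if bus is None:
                raise ValueError("loading from a floating bus")
            self.q = bus & 0xFF

    def drive(self, oe: bool) -> Optional[int]:
        # With OE low, the tri-state buffers present high impedance (None here).
        return self.q if oe else None


r0 = Register8()
r0.clock(bus=0xAA, load=True)    # LD high: the register captures 0xAA
r0.clock(bus=0x55, load=False)   # LD low: the old value is held
print(hex(r0.drive(oe=True)))    # 0xaa
print(r0.drive(oe=False))        # None (high impedance)
```

Using `None` for high impedance is a modelling shortcut; a real simulator tracks Z per bit.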

For a recap of flip-flop fundamentals, see SR vs JK flip-flops, the JK flip-flop as universal building block, and setup/hold timing. The last one will matter when you wonder why your register sometimes catches the wrong value.

The Registers We Need

Drop the following register components into your simulator and label them:

  1. PC (Program Counter) — special: it has an increment input as well as a load. Use the PROGRAM_COUNTER_8BIT component which bundles the counter + load mux.
  2. IR (Instruction Register) — 16 bits wide, since our instructions are 16 bits. Use two REGISTER_8BIT components stacked, or the dedicated INSTRUCTION_REGISTER.
  3. MAR (Memory Address Register) — 8 bits, drives the RAM’s address pins.
  4. MDR (Memory Data Register) — 8 bits, sits between RAM and bus.
  5. A (Accumulator) — 8 bits, ALU’s left input. The dedicated ACCUMULATOR component handles this.
  6. B (Temp) — 8 bits, ALU’s right input.
  7. R0, R1, R2, R3 — general purpose, four 8-bit registers.

That’s ten logical registers (PC, IR, MAR, MDR, A, B, R0–R3). Each exposes two control wires to the bus: a load wire (load_X) and an output-enable wire (oe_X). That is 20 wires for our ten registers, plus the special PC-increment wire, plus a global clock — about 22 control signals in total. Hold that number in your head; the control unit will produce all of them.

Register File Diagram

       Bus[7:0] (shared 8-bit data bus)
        |  |  |  |  |  |  |  |
   +----+--+--+--+--+--+--+--+----+
   |     R0  R1  R2  R3  A   B   |
   |     PC  IR  MAR MDR         |
   +-----------------------------+
       ^                    ^
       |                    |
    load_X (per reg)    oe_X (per reg)

Test the file in isolation: assert oe_R0, leave every other output at $Z$ (high-impedance), and verify the bus reads R0’s contents. Now drop oe_R0, raise oe_R1, and the bus should switch to R1. If two oe_X lines go HIGH simultaneously, two drivers fight over the same wires. DigiSim will flag the conflict, and preventing exactly that is the point of using tri-state buffers in the first place.

Section 2: Build the 8-Bit ALU

The ALU is purely combinational: there are no flip-flops inside it. Its job is to take two 8-bit inputs (from A and B), perform an operation selected by a 3-bit op code, and produce an 8-bit result plus four status flags.

Operations We Want

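| op[2:0] | Mnemonic | Result |
| --- | --- | --- |
| 000 | ADD | $A + B$ |
| 001 | SUB | $A - B$ |
| 010 | AND | $A \wedge B$ |
| 011 | OR | $A \vee B$ |
| 100 | XOR | $A \oplus B$ |
| 101 | NOT | $\overline{A}$ |
| 110 | SHL | $A \ll 1$ |
| 111 | SHR | $A \gg 1$ |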

The full theory of how an ALU is built up from gates lives in the upcoming pillar How an ALU Works: Arithmetic Logic Unit from Gates. For our build, the high-level recipe is:

  1. Adder lane: eight cascaded full adders, each implementing $S = A \oplus B \oplus C_{in}$, $C_{out} = AB + C_{in}(A \oplus B)$. This is exactly the 4-bit ripple-carry adder doubled.
  2. Subtraction is addition with B inverted and the carry-in forced to 1. That is two’s complement, covered in Two’s Complement: Signed Binary Arithmetic Explained.
  3. Logic lane: eight parallel AND/OR/XOR gates per bit, plus an inverter for NOT.
  4. Shifter lane: fixed wiring that maps bit $i$ to bit $i+1$ (left) or $i-1$ (right) with zero-fill.
  5. Output multiplexer: an 8-way MULTIPLEXER per bit, selected by op[2:0], picks which lane drives the result.
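The recipe above can be sketched behaviourally in a few lines of Python. This is a minimal model, not the gate-level build: the function name `alu` is an assumption, every lane is computed and then one is selected by op[2:0], which mirrors what the output multiplexer does.

```python
def alu(op: int, a: int, b: int) -> int:
    """Return the 8-bit result for a 3-bit op code, following the table above."""
    a &= 0xFF
    b &= 0xFF
    lanes = [
        a + b,                 # 000 ADD
        a + (~b & 0xFF) + 1,   # 001 SUB: B inverted, carry-in forced to 1
        a & b,                 # 010 AND
        a | b,                 # 011 OR
        a ^ b,                 # 100 XOR
        ~a,                    # 101 NOT
        a << 1,                # 110 SHL, zero-fill
        a >> 1,                # 111 SHR, zero-fill
    ]
    # The output multiplexer: pick one lane and keep only eight bits.
    return lanes[op & 0b111] & 0xFF


print(alu(0b000, 5, 3), alu(0b001, 5, 3), hex(alu(0b101, 0x0F, 0)))  # 8 2 0xf0
```

Computing all eight lanes and discarding seven looks wasteful in software, but it is what the hardware does on every cycle.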

Flags Register

Four flags fall out almost for free:

  • Zero (Z): OR all eight result bits together, then invert. $Z = \overline{R_7 + R_6 + \cdots + R_0}$.
  • Negative (N): copy the result’s MSB. $N = R_7$.
  • Carry (C): the carry-out of the topmost full adder.
  • Overflow (V): XOR of the carry-into-MSB and the carry-out-of-MSB. The reasoning lives in the upcoming CPU Flags Register post.

Wire those four signals into a 4-bit register clocked alongside the ALU. That is your FLAGS_REGISTER. It only loads on instructions that produce flags (ADD, SUB, AND, OR, XOR, NOT) — the control unit asserts load_flags only on those opcodes.
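To see the four flags fall out of the adder, here is a small sketch that ripples a carry through eight full adders using the equations from the recipe. The names `full_adder` and `add_with_flags` are assumptions made for illustration.

```python
from typing import Dict, Tuple


def full_adder(a: int, b: int, cin: int) -> Tuple[int, int]:
    """One-bit full adder: S = A xor B xor Cin, Cout = AB + Cin(A xor B)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def add_with_flags(a: int, b: int, cin: int = 0) -> Tuple[int, Dict[str, int]]:
    """8-bit ripple-carry add that also reports Z, N, C and V."""
    result, carry, carry_into_msb = 0, cin, 0
    for bit in range(8):
        if bit == 7:
            carry_into_msb = carry  # remembered for the overflow flag
        s, carry = full_adder((a >> bit) & 1, (b >> bit) & 1, carry)
        result |= s << bit
    flags = {
        "Z": int(result == 0),        # NOR of all result bits
        "N": (result >> 7) & 1,       # copy of the MSB
        "C": carry,                   # carry-out of the top adder
        "V": carry_into_msb ^ carry,  # carry-in XOR carry-out at the MSB
    }
    return result, flags


print(add_with_flags(0x7F, 0x01))  # (128, {'Z': 0, 'N': 1, 'C': 0, 'V': 1})
```

The example adds 1 to the largest positive signed value, so the result wraps to a negative number and V goes high while C stays low.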

If you want to skip the gate-level grind, the 4-bit ALU demonstration and 8-bit ALU system templates are working references you can clone and inspect.

Section 3: Build the Data Bus with Tri-State Buffers

Up to now we have been hand-waving about “the bus.” Time to make it real.

A bus is just eight wires. The challenge is that we have eleven candidate sources (PC, IR, MAR, MDR, A, B, R0, R1, R2, R3, ALU result) and only one can drive at a time. The conventional answer is the tri-state buffer: a buffer with a third state, Z (high-impedance), that effectively disconnects its output from the wire when the enable is LOW.

The Bus Discipline

Three rules:

  1. At most one driver per clock cycle. The control unit guarantees this by asserting at most one oe_X signal per cycle.
  2. Loaders sample on the clock edge. Loads happen synchronously; drives happen “during” the cycle.
  3. Idle = floating. If nothing drives the bus, every bit floats at Z. That is fine — just do not load it.

Each register-to-bus connection is a bank of eight tri-state buffers, all sharing the register’s oe_X enable. The ALU result also goes through tri-state buffers (call its enable oe_alu). The MDR sits between RAM and the bus, so its tri-state path goes both ways: on a store it captures from the bus and drives the RAM data pins, and on a load it captures from RAM and drives the bus. That means MDR needs two output-enable wires, one for each direction.

Validation

Pick any two registers and a clock. Put hex 0xAA in R0 and 0x55 in R1. On cycle 1, assert oe_R0 only — the bus should read 10101010. On cycle 2, assert oe_R1 only — bus reads 01010101. Assert both at once: DigiSim should flag a bus conflict (both buffers driving). That conflict is the simulator catching what would, on real silicon, be a smoking chip.
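The validation steps above can be mimicked in software. In this sketch, `resolve_bus` and `BusConflict` are hypothetical names, and `None` stands in for a buffer in the Z state:

```python
from typing import Dict, Optional


class BusConflict(Exception):
    """Raised when more than one tri-state driver is enabled."""


def resolve_bus(drivers: Dict[str, Optional[int]]) -> Optional[int]:
    """Return the bus value, or None if it floats; raise on a conflict."""
    active = {name: value for name, value in drivers.items() if value is not None}
    if len(active) > 1:
        raise BusConflict("multiple drivers: " + ", ".join(sorted(active)))
    return next(iter(active.values()), None)


r0, r1 = 0xAA, 0x55
cycle1 = resolve_bus({"R0": r0, "R1": None})   # oe_R0 only
cycle2 = resolve_bus({"R0": None, "R1": r1})   # oe_R1 only
print(format(cycle1, "08b"), format(cycle2, "08b"))  # 10101010 01010101

try:
    resolve_bus({"R0": r0, "R1": r1})          # both enables HIGH
except BusConflict as err:
    print("conflict:", err)
```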

We will go deeper on bus arbitration in Tri-State Buffers and Bus Arbitration Explained.

Section 4: Build the RAM Module

Our RAM is 256 bytes, organised as $256 \times 8$ bits. The RAM component already wraps this for you, but it helps to know what is inside.

Conceptually, a RAM cell is a D-latch — one bit of storage that captures its input when the write-enable is asserted. Eight of those in parallel make a byte. 256 of those bytes, addressable by an 8-bit address, make our memory. The address goes into a DECODER that picks one of 256 rows; the picked row is connected to the data lines through tri-state buffers (read) and a bank of write-enable gates (write).

Interface

The RAM in our CPU has these pins:

  • ADDR[7:0] — driven by MAR
  • DATA[7:0] — bidirectional, hangs off the bus through the MDR
  • WE — write-enable, asserted by the control unit during stores
  • OE — output-enable, asserted during loads
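A minimal behavioural model of that interface might look like the following. The class name `Ram256` and the single `cycle` method are assumptions for illustration; the list index plays the role of the row decoder.

```python
from typing import List, Optional


class Ram256:
    """256 x 8 RAM with ADDR, DATA, WE and OE pins (hypothetical helper)."""

    def __init__(self) -> None:
        self.cells: List[int] = [0] * 256

    def cycle(self, addr: int, data_in: Optional[int], we: bool, oe: bool) -> Optional[int]:
        if we and oe:
            raise ValueError("WE and OE asserted together")
        row = addr & 0xFF  # the decoder picks one of 256 rows
        if we:
            if data_in is None:
                raise ValueError("write with nothing driving DATA")
            self.cells[row] = data_in & 0xFF
            return None
        # A read drives DATA only while OE is asserted; otherwise it floats.
        return self.cells[row] if oe else None


ram = Ram256()
ram.cycle(addr=0x10, data_in=8, we=True, oe=False)            # store
print(ram.cycle(addr=0x10, data_in=None, we=False, oe=True))  # 8
```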

For our walkthrough we will pre-load the RAM by hand. In DigiSim you can right-click the RAM and paste hex values into specific addresses. Modeling the bring-up sequence is out of scope here — for that, the RAM with address control and basic RAM memory system templates have working harnesses.

We will compare RAM to ROM in the upcoming RAM vs ROM post — the short answer is that ROM has no WE and no D-latches inside.

Section 5: Build the Control Unit

The control unit is where most students stall. It is also the thing that finally makes the CPU feel like a CPU — without it, the registers and ALU just sit there.

What the Control Unit Does

It produces, on every clock cycle, the right combination of those ~20 control wires (load-enables, output-enables, ALU op code, RAM read/write) to advance the CPU through its current instruction.

That is it. There is no mystery. The control unit is a finite state machine. Its inputs are:

  • The opcode bits from the IR
  • The flags from the flags register (for conditional branches)
  • The current cycle within the instruction

Its outputs are the ~20 control wires.

State Encoding

We will use a microcode approach: a small ROM where each address corresponds to a (opcode, microstep) pair, and each output word is the bit-pattern for the 20 control wires.

ROM input (8 bits):     ROM output (20 bits):
[opcode][microstep]  →  [load_PC, load_MAR, oe_PC, ..., alu_op[2:0], we_RAM, ...]

A 4-bit opcode and a 4-bit microstep counter give a 256-entry microcode ROM. Most entries are unused; that is fine.

The microstep counter is just a small counter that increments every clock, resetting to 0 when the instruction completes. The reset signal is itself a control wire (end_instruction).
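A small sketch of the addressing scheme and the microstep counter follows. Placing the opcode in the high nibble matches the [opcode][microstep] layout above; the function names are assumptions.

```python
def rom_address(opcode: int, microstep: int) -> int:
    """Pack a 4-bit opcode and a 4-bit microstep into an 8-bit ROM address."""
    return ((opcode & 0xF) << 4) | (microstep & 0xF)


def next_microstep(microstep: int, end_instruction: bool) -> int:
    """Increment every clock; reset to 0 when end_instruction is asserted."""
    return 0 if end_instruction else (microstep + 1) & 0xF


print(rom_address(0b0100, 3))                   # 67
print(next_microstep(6, end_instruction=True))  # 0
```

With this packing, all sixteen microsteps of one opcode sit next to each other in the ROM, which makes the table easy to read in a hex dump.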

A Fetch-Cycle Example

Every instruction begins with the same three microsteps — the fetch sequence:

| Microstep | Action | Active wires |
| --- | --- | --- |
| 0 | $MAR \leftarrow PC$ | oe_PC, load_MAR |
| 1 | $MDR \leftarrow RAM[MAR]$, $PC \leftarrow PC + 1$ | oe_RAM, load_MDR, inc_PC |
| 2 | $IR \leftarrow MDR$ | oe_MDR, load_IR |

After microstep 2, the IR holds the instruction. The control unit reads its opcode and jumps to the opcode-specific microsteps. We will see those next.
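The fetch sequence can be replayed in software by treating each microstep as a set of active wires. This sketch models one 8-bit pass exactly as the table lists it; a full 16-bit instruction needs the second byte as well, which is left out here. The `step` helper and the dictionary of registers are illustrative assumptions.

```python
from typing import Dict, List, Set

FETCH: List[Set[str]] = [
    {"oe_PC", "load_MAR"},            # microstep 0
    {"oe_RAM", "load_MDR", "inc_PC"}, # microstep 1
    {"oe_MDR", "load_IR"},            # microstep 2
]


def step(regs: Dict[str, int], ram: List[int], wires: Set[str]) -> None:
    # Exactly one source drives the bus in each fetch microstep.
    if "oe_PC" in wires:
        bus = regs["PC"]
    elif "oe_RAM" in wires:
        bus = ram[regs["MAR"]]
    else:  # oe_MDR
        bus = regs["MDR"]
    # Every asserted load samples the bus on the same clock edge.
    for name in ("MAR", "MDR", "IR"):
        if "load_" + name in wires:
            regs[name] = bus
    if "inc_PC" in wires:
        regs["PC"] = (regs["PC"] + 1) & 0xFF


regs = {"PC": 0, "MAR": 0, "MDR": 0, "IR": 0}
ram = [0] * 256
ram[0] = 0x10  # first byte of an instruction
for wires in FETCH:
    step(regs, ram, wires)
print(regs)  # {'PC': 1, 'MAR': 0, 'MDR': 16, 'IR': 16}
```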

For an alternate FSM encoding (one-hot, gate-based) see Counters and State Machines: Controlling Digital Sequences. The ROM-based microcode approach is what microcoded CISC chips such as the Motorola 68000 and the Intel 8086 used.

Section 6: The Fetch-Decode-Execute Cycle

We have all the pieces. Time to wire them up.

Instruction Format

Two 16-bit formats. (Same as the fetch-decode-execute case study, so the encoding feels familiar.)

Format A, register mode (ADD, SUB, AND, OR, XOR):

| 15 14 13 12 | 11 10 9 8 | 7 6 5 4 | 3 2 1 0 |
|   OPCODE    |   REG_A   |  REG_B  |  REG_D  |

Format B, immediate mode (LDI, LD, ST, JMP, JZ):

| 15 14 13 12 | 11 10 9 8 |   7 6 5 4 3 2 1 0   |
|   OPCODE    |   REG_D   |    8-BIT IMMEDIATE  |
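Splitting a 16-bit word into those fields is a matter of shifts and masks. A minimal sketch, with hypothetical function names, one per format:

```python
from typing import Dict


def decode_format_a(word: int) -> Dict[str, int]:
    """Format A: OPCODE | REG_A | REG_B | REG_D, four bits each."""
    return {
        "opcode": (word >> 12) & 0xF,
        "reg_a": (word >> 8) & 0xF,
        "reg_b": (word >> 4) & 0xF,
        "reg_d": word & 0xF,
    }


def decode_format_b(word: int) -> Dict[str, int]:
    """Format B: OPCODE | REG_D | 8-bit immediate."""
    return {
        "opcode": (word >> 12) & 0xF,
        "reg_d": (word >> 8) & 0xF,
        "imm": word & 0xFF,
    }


print(decode_format_a(0x4012))  # {'opcode': 4, 'reg_a': 0, 'reg_b': 1, 'reg_d': 2}
print(decode_format_b(0x1005))  # {'opcode': 1, 'reg_d': 0, 'imm': 5}
```

In hardware there is no decode step as such: the fields are simply different groups of wires coming out of the IR.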

Our minimal opcode set:

| Opcode | Mnemonic | Format | Meaning |
| --- | --- | --- | --- |
| 0000 | NOP | - | No operation |
| 0001 | LDI | B | $R_D \leftarrow \text{IMM}$ |
| 0010 | LD | B | $R_D \leftarrow \text{RAM}[\text{IMM}]$ |
| 0011 | ST | B | $\text{RAM}[\text{IMM}] \leftarrow R_D$ |
| 0100 | ADD | A | $R_D \leftarrow R_A + R_B$ |
| 0101 | SUB | A | $R_D \leftarrow R_A - R_B$ |
| 0110 | JMP | B | $PC \leftarrow \text{IMM}$ |
| 0111 | JZ | B | If Z=1, $PC \leftarrow \text{IMM}$ |
| 1000 | HLT | - | Stop the clock |

Microsteps for ADD

After the universal fetch (microsteps 0–2):

| Microstep | Action | Active wires |
| --- | --- | --- |
| 3 | $A \leftarrow R_A$ | oe_R_A, load_A |
| 4 | $B \leftarrow R_B$ | oe_R_B, load_B |
| 5 | ALU computes $A + B$, $R_D \leftarrow$ result | alu_op = 000, oe_alu, load_R_D, load_flags |
| 6 | End instruction | end_instruction |

Note oe_R_A is itself decoded from the REG_A field of the IR — typically by a 2-to-4 decoder wired to drive whichever register the IR points at.
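That register-select decoder is simple enough to sketch directly. The function name `decode_2to4` is an assumption, as is the choice to use only the low two bits of the 4-bit register field, since there are just four general-purpose registers.

```python
from typing import List


def decode_2to4(select: int, enable: bool) -> List[int]:
    """One-hot outputs [oe_R0, oe_R1, oe_R2, oe_R3]; all low when disabled."""
    index = select & 0b11
    return [int(enable and line == index) for line in range(4)]


ir = 0x4012                  # ADD R2, R0, R1
reg_a = (ir >> 8) & 0xF      # REG_A field
reg_b = (ir >> 4) & 0xF      # REG_B field
print(decode_2to4(reg_a, enable=True))   # [1, 0, 0, 0]
print(decode_2to4(reg_b, enable=True))   # [0, 1, 0, 0]
print(decode_2to4(reg_a, enable=False))  # [0, 0, 0, 0]
```

The `enable` input is where the microcode bit oe_R_A connects, so the decoder output is silent on every microstep that does not ask for it.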

Microsteps for LDI

After fetch:

| Microstep | Action | Active wires |
| --- | --- | --- |
| 3 | $R_D \leftarrow$ IR[7:0] (the immediate) | oe_IR_imm, load_R_D |
| 4 | End instruction | end_instruction |

oe_IR_imm is just a tri-state path from the IR’s lower 8 bits to the bus.

Microsteps for JMP

After fetch:

| Microstep | Action | Active wires |
| --- | --- | --- |
| 3 | $PC \leftarrow$ IR[7:0] | oe_IR_imm, load_PC |
| 4 | End instruction | end_instruction |

For JZ, microstep 3 is conditional: assert load_PC only if the Z flag is HIGH. That conditional is one AND gate between Z and the microcode bit.
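The gating for JMP and JZ can be written as one boolean expression. In this sketch the `is_jz` input, which tells the gate whether the Z condition applies, is an assumption about how the two opcodes share the same microcode bit:

```python
def next_pc(pc: int, imm: int, load_pc_bit: bool, is_jz: bool, z_flag: bool) -> int:
    """Return the PC after microstep 3 of JMP or JZ."""
    # For JZ the microcode bit is ANDed with Z; for JMP it passes straight through.
    load_pc = load_pc_bit and (z_flag or not is_jz)
    return imm & 0xFF if load_pc else pc & 0xFF


print(hex(next_pc(0x04, 0x20, True, is_jz=False, z_flag=False)))  # 0x20 (JMP)
print(hex(next_pc(0x04, 0x20, True, is_jz=True, z_flag=True)))    # 0x20 (JZ taken)
print(hex(next_pc(0x04, 0x20, True, is_jz=True, z_flag=False)))   # 0x4  (JZ not taken)
```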

Section 7: Run a Three-Instruction Program

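Time to actually run something. Our program: load 5 into R0, load 3 into R1, add them, store the result at memory address 0x10, then halt. The three instructions that do the arithmetic (two LDIs and an ADD) are followed by a ST and a HLT, so five encoded instructions go into RAM.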

Hand-Assembly

| Address | Instruction | Encoding (hex) |
| --- | --- | --- |
| 0x00 | LDI R0, #5 | 1 0 05 (0x1005) |
| 0x02 | LDI R1, #3 | 1 1 03 (0x1103) |
| 0x04 | ADD R2, R0, R1 | 4 0 1 2 (0x4012) |
| 0x06 | ST R2, #0x10 | 3 2 10 (0x3210) |
| 0x08 | HLT | 8 0 00 (0x8000) |

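Each instruction is two bytes, so the PC has to advance by 2 per instruction. Either the PC increments by 2 per fetch, or you store each byte at consecutive addresses and the PC increments by 1 per fetch over a two-fetch sequence. The fetch table in Section 5 shows a single pass for brevity; which variant you wire up is an implementation detail.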
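If hand-assembly gets tedious, the encoding is mechanical enough to script. This is a minimal sketch: the helper names `encode_a` and `encode_b` are assumptions, and it produces 16-bit words only, leaving the byte order in RAM to your implementation.

```python
OPCODES = {
    "NOP": 0x0, "LDI": 0x1, "LD": 0x2, "ST": 0x3, "ADD": 0x4,
    "SUB": 0x5, "JMP": 0x6, "JZ": 0x7, "HLT": 0x8,
}


def encode_a(mnemonic: str, rd: int, ra: int, rb: int) -> int:
    """Format A: OPCODE | REG_A | REG_B | REG_D."""
    return (OPCODES[mnemonic] << 12) | ((ra & 0xF) << 8) | ((rb & 0xF) << 4) | (rd & 0xF)


def encode_b(mnemonic: str, rd: int = 0, imm: int = 0) -> int:
    """Format B: OPCODE | REG_D | 8-bit immediate."""
    return (OPCODES[mnemonic] << 12) | ((rd & 0xF) << 8) | (imm & 0xFF)


program = [
    encode_b("LDI", 0, 5),       # LDI R0, #5
    encode_b("LDI", 1, 3),       # LDI R1, #3
    encode_a("ADD", 2, 0, 1),    # ADD R2, R0, R1
    encode_b("ST", 2, 0x10),     # ST R2, #0x10
    encode_b("HLT"),             # HLT
]
print([f"0x{word:04X}" for word in program])
# ['0x1005', '0x1103', '0x4012', '0x3210', '0x8000']
```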

Loading and Stepping

In DigiSim, open Sequential Instruction Executor and:

  1. Right-click the RAM, paste in the five instructions at addresses 0x00 through 0x09.
  2. Set the PC to 0x00.
  3. Click the clock once. Watch microstep 0 light up: PC drives the bus, MAR loads.
  4. Click again. Microstep 1: RAM drives the bus, MDR loads, PC increments.
  5. Click again. Microstep 2: MDR drives the bus, IR loads. Now the IR holds 0x1005.
  6. The control unit decodes opcode 0001 (LDI) and jumps to LDI’s microsteps.
  7. Microstep 3: the immediate byte (0x05) drives the bus, R0 loads. R0 now holds 5.
  8. End instruction; PC is already at 0x02.

Walk through the rest at your own pace. By the time you reach the HLT at 0x08, R2 should hold 8 ($5 + 3$) and RAM[0x10] should also hold 8.
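To cross-check the expected end state before stepping the clock by hand, the program can be run through an instruction-level model. This sketch makes several simplifying assumptions: the program is held as 16-bit words keyed by even address and kept separate from data RAM, the PC advances by 2 per instruction, and JZ and the flags are left out.

```python
from typing import Dict, List, Tuple

PROGRAM: Dict[int, int] = {
    0x00: 0x1005,  # LDI R0, #5
    0x02: 0x1103,  # LDI R1, #3
    0x04: 0x4012,  # ADD R2, R0, R1
    0x06: 0x3210,  # ST R2, #0x10
    0x08: 0x8000,  # HLT
}


def run(program: Dict[int, int], max_steps: int = 100) -> Tuple[List[int], List[int]]:
    regs, ram, pc = [0] * 4, [0] * 256, 0
    for _ in range(max_steps):
        word = program.get(pc, 0x0000)  # unlisted addresses read as NOP
        pc = (pc + 2) & 0xFF
        opcode, hi, mid, lo = word >> 12, (word >> 8) & 3, (word >> 4) & 3, word & 3
        imm = word & 0xFF
        if opcode == 0x1:      # LDI
            regs[hi] = imm
        elif opcode == 0x2:    # LD
            regs[hi] = ram[imm]
        elif opcode == 0x3:    # ST
            ram[imm] = regs[hi]
        elif opcode == 0x4:    # ADD: R_D <- R_A + R_B
            regs[lo] = (regs[hi] + regs[mid]) & 0xFF
        elif opcode == 0x5:    # SUB: R_D <- R_A - R_B
            regs[lo] = (regs[hi] - regs[mid]) & 0xFF
        elif opcode == 0x6:    # JMP
            pc = imm
        elif opcode == 0x8:    # HLT
            break
    return regs, ram


regs, ram = run(PROGRAM)
print(regs[2], ram[0x10])  # 8 8
```

If your circuit disagrees with this model at the end, step both one instruction at a time and find the first instruction after which the registers differ.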

If something is wrong, the diagnostic loop is always:

  1. Pick the cycle where it broke.
  2. Check which control wires are HIGH.
  3. Compare to the table above.
  4. Trace backward from the wrong wire to its microcode entry.

This is the same loop a real chip designer runs in a waveform viewer. You are doing real work.

Common Mistakes (and How to Spot Them)

| Symptom | Likely cause |
| --- | --- |
| Two registers seem to hold the same value forever | Two load_X wires driven by the same microcode bit; one was meant to be elsewhere |
| Bus reads XXXXXXXX (conflict) | Two oe_X HIGH at once, usually a microcode typo |
| Bus reads ZZZZZZZZ (floating) | No oe_X HIGH; register-mode instruction missing the drive step |
| PC jumps to wrong address | Off-by-one in increment vs. load order |
| Flags wrong after ADD | load_flags not asserted, or asserted on wrong microstep |
| ADD result sign-extends incorrectly | You are reading the result as signed but storing as unsigned; review two’s complement |
| Instruction never completes | end_instruction wire not wired into microstep counter reset |
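Several of these mistakes can be caught before the clock ever ticks by linting the microcode table itself. The sketch below assumes a hypothetical representation in which each (mnemonic, microstep) pair maps to the set of wire names it asserts; it checks for two bus drivers, a load with no driver, and a missing end_instruction.

```python
from typing import Dict, List, Set, Tuple

Microcode = Dict[Tuple[str, int], Set[str]]


def lint_microcode(rom: Microcode) -> List[str]:
    problems: List[str] = []
    for (mnemonic, step), wires in sorted(rom.items()):
        drivers = sorted(w for w in wires if w.startswith("oe_"))
        # load_flags samples the ALU directly, not the bus, so it is excluded.
        loaders = [w for w in wires if w.startswith("load_") and w != "load_flags"]
        if len(drivers) > 1:
            problems.append(f"{mnemonic} step {step}: bus conflict between {', '.join(drivers)}")
        if loaders and not drivers:
            problems.append(f"{mnemonic} step {step}: load from a floating bus")
    for mnemonic in sorted({m for m, _ in rom}):
        steps = [w for (m, _), w in rom.items() if m == mnemonic]
        if not any("end_instruction" in w for w in steps):
            problems.append(f"{mnemonic}: never asserts end_instruction")
    return problems


rom: Microcode = {
    ("ADD", 3): {"oe_R_A", "load_A"},
    ("ADD", 4): {"oe_R_B", "load_B"},
    ("ADD", 5): {"oe_alu", "load_R_D", "load_flags"},
    ("ADD", 6): {"end_instruction"},
    ("LDI", 3): {"oe_IR_imm", "oe_R_A", "load_R_D"},  # deliberate typo
}
problems = lint_microcode(rom)
for line in problems:
    print(line)
```

Here the ADD entries pass, while the deliberately broken LDI entry is reported twice: once for the double driver and once for never ending.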

Where to Go Next

You have built a CPU. It is a small one, but every concept inside it scales: real CPUs use the same register file + ALU + bus + control unit pattern, just wider, faster, and pipelined. Specific next steps:

  • Extend the instruction set. Add JZ, AND, OR, XOR, NOT, SHL, SHR. Each is one new microcode entry per microstep.
  • Add interrupts. A second PC (“interrupt vector”) and a flip-flop that latches an external IRQ signal — when it is HIGH and interrupts are enabled, the control unit performs an extra microstep that swaps in the vector.
  • Pipeline it. Split fetch, decode, execute into three concurrent stages. You will need stage-boundary registers (the kind covered in propagation delay) and you will discover hazards.
  • Read the deep dive. Our upcoming How a Microprocessor Works: Fetch-Decode-Execute Deep Dive takes the same architecture, adds pipelining and a richer ISA, and runs a 50-instruction sort program.

The biggest lesson from this build is not any one circuit; it is that the CPU is all the circuits at once, sequenced by a clock and a small ROM. Once that clicks, every architecture textbook reads differently.

Try It Yourself

Open the Sequential Instruction Executor template, paste in the three-instruction program from Section 7, and step the clock by hand. Then try the 4-bit ALU demonstration and the 4-bit register with clock templates if any specific block needs extra attention. Build the CPU flags register once, then drop it into your CPU as the FLAGS_REGISTER block — the wiring is the same.

When you are ready to go deeper on the math the ALU is doing, the natural next read is Two’s Complement: Signed Binary Arithmetic Explained — that one post will retire most of the lingering “but what does the MSB mean?” confusion left over from this build.