IEEE 754 Floating Point: From Bits to Numbers
TL;DR: IEEE 754 represents real numbers as $(-1)^s \times 1.m \times 2^{e - \text{bias}}$, with a sign bit, a biased exponent, and a mantissa whose leading 1 is implicit. The format reserves specific exponent values for zero, infinity, NaN, and denormals — and the rounding rules are why floating-point arithmetic isn’t associative.
Fixed-point binary works fine for counting, addressing, and basic accumulation. It breaks the moment you need to multiply two values whose magnitudes differ by many orders. A 32-bit fixed-point number with 16 fractional bits can represent values up to about $\pm 32{,}768$ with $2^{-16} \approx 1.5 \times 10^{-5}$ precision — fine for many embedded uses, useless for representing the gravitational constant ($6.674 \times 10^{-11}$) or a galactic mass ($\sim 10^{42}$ kg) in the same calculation.
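To make the mismatch concrete, here is a small C illustration using a Q16.16 fixed-point layout (the 16-integer/16-fraction split described above). The constant underflows the format's resolution before any arithmetic even happens:

```c
// Why 16.16 fixed point can't hold both G and a galactic mass: the
// constant underflows to zero, and the mass overflows the 32-bit range.
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Q16.16: value = raw / 2^16, range ~ +-32768, resolution ~1.5e-5
    int32_t g = (int32_t)(6.674e-11 * 65536.0);  // gravitational constant
    printf("G in Q16.16: %d\n", g);              // prints 0: underflowed
    // 1e42 * 65536 exceeds int32_t by ~38 orders of magnitude,
    // so the galactic mass isn't representable at all.
    return 0;
}
```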
Floating-point representations solve this by allocating bits to an exponent, letting the radix point “float” along the magnitude axis. IEEE 754, ratified in 1985 and last revised in 2019, is the universal standard. Every commodity CPU implements it; every numerical library depends on its bit-level reproducibility.
Why Floating Point at All?
Scientific notation handles the same problem on paper: a compact significand times a power of ten (e.g., $6.022 \times 10^{23}$). IEEE 754 is binary scientific notation with strict rules about how many bits each part gets and how they’re encoded.
Two design pressures shape the format:
- Single, well-defined bit pattern per number. Comparisons and equality must be unambiguous.
- Fast hardware decoding. Exponent comparison and significand alignment must be possible with simple combinational logic in a floating-point unit (FPU) pipeline — a register-rich pipeline conceptually similar to the 4-bit register chains that anchor a CPU datapath.
The trade-offs taken to achieve both — biased exponents, implicit leading bits, special-purpose exponent codes — give IEEE 754 its distinctive shape.
The Single-Precision Layout (32-bit)
| 31 | 30 … 23 | 22 … 0 |
|---|---|---|
| s | e (8 bits) | m (23 bits) |
- Sign bit $s$: 1 bit. 0 means non-negative, 1 means negative.
- Exponent $e$: 8 bits, stored as an unsigned integer with a bias of 127.
- Mantissa $m$: 23 bits, representing the fractional part of the significand; the integer part is implicit.
For normalized numbers (the common case), the value is:

$$\text{value} = (-1)^s \times 1.m \times 2^{e - 127}$$

The "1." in the significand is implicit — the format saves a bit by exploiting the fact that every normalized binary number can be written with a leading 1 by choice of exponent.
The Double-Precision Layout (64-bit)
Same structure, wider fields:
| 63 | 62 … 52 | 51 … 0 |
|---|---|---|
| s | e (11 bits) | m (52 bits) |
- Sign bit $s$: 1 bit.
- Exponent $e$: 11 bits, bias 1023.
- Mantissa $m$: 52 bits.
Value (normalized):

$$\text{value} = (-1)^s \times 1.m \times 2^{e - 1023}$$
Range: roughly $\pm 1.8 \times 10^{308}$, with about 15–17 significant decimal digits.
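Those figures are easy to confirm from the constants `<float.h>` publishes for the 64-bit format:

```c
// Checking the quoted double-precision range and digit counts against
// the standard <float.h> constants.
#include <float.h>
#include <stdio.h>

int main(void) {
    printf("DBL_MAX     = %g\n", DBL_MAX);      // ~1.79769e+308
    printf("DBL_MIN     = %g\n", DBL_MIN);      // ~2.22507e-308 (smallest normal)
    printf("DBL_EPSILON = %g\n", DBL_EPSILON);  // ~2.22045e-16
    printf("DBL_DIG     = %d\n", DBL_DIG);      // 15 decimal digits guaranteed
    return 0;
}
```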
Why Bias the Exponent?
Storing the exponent as $e_{\text{actual}} + 127$ rather than in two’s complement has a subtle but important benefit: floating-point comparison reduces to integer comparison on the bit pattern (with a sign-bit flip). If two non-negative floats are compared as 32-bit unsigned integers, the larger float is the larger integer. Two’s complement exponents would break this monotonic relationship at the boundary between positive and negative exponents.
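A quick demonstration of that monotonicity: comparing the raw bit patterns of two non-negative floats orders them the same way as comparing the floats themselves.

```c
// Biased exponents make non-negative floats' bit patterns monotonic:
// unsigned-integer comparison of the patterns matches float comparison.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t as_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

int main(void) {
    float a = 1.5f, b = 2.25f;  // both non-negative
    printf("%d %d\n", a < b, as_bits(a) < as_bits(b));  // 1 1: same order
    return 0;
}
```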
The bias also makes $e = 0$ (all-zero exponent bits) representable as the most negative exponent — which the standard reserves for denormals — and $e = 255$ (all-ones) the most positive, reserved for infinities and NaN. The “useful” normalized range is $1 \le e \le 254$, corresponding to actual exponents $-126$ to $+127$ for single precision.
The Special Cases
| Exponent bits | Mantissa bits | Meaning |
|---|---|---|
| All zeros | All zeros | $\pm 0$ (sign bit determines which) |
| All zeros | Non-zero | Denormal (subnormal) |
| 1–254 (anything in between) | Anything | Normalized number |
| All ones | All zeros | $\pm\infty$ |
| All ones | Non-zero | NaN (Not a Number) |
Zero has two encodings ($+0$ and $-0$). They compare equal under IEEE comparison but produce different signed results in some operations (e.g., $1/(+0) = +\infty$, $1/(-0) = -\infty$).
Denormals (also called subnormals) fill the gap between zero and the smallest normalized number, $2^{-126}$. The implicit leading 1 is replaced with a leading 0, and the exponent is fixed at $-126$ (single precision). This gives gradual underflow — the precision degrades smoothly toward zero rather than jumping suddenly.
Infinity results from overflow or division by zero. It propagates through arithmetic in the obvious ways: $\infty + 1 = \infty$, $1/\infty = 0$.
NaN signals an invalid operation: $0/0$, $\sqrt{-1}$, $\infty - \infty$. NaN is contagious — any arithmetic involving NaN yields NaN. Critically, `NaN == NaN` is false in IEEE 754 comparison, which is why test suites use `isnan()` rather than `==`.
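All three special cases can be poked at directly. A short sketch, assuming a standard IEEE 754 environment (the classification macro is from `<math.h>`):

```c
// Probing the reserved encodings: signed zero, infinity, and NaN.
#include <math.h>
#include <stdio.h>

int main(void) {
    float pz = 0.0f, nz = -0.0f;
    printf("%d\n", pz == nz);                 // 1: +0 and -0 compare equal
    printf("%g %g\n", 1.0f / pz, 1.0f / nz);  // inf -inf: signs differ

    float inf = INFINITY;
    printf("%g %g\n", inf + 1.0f, 1.0f / inf);  // inf 0

    float qnan = 0.0f / 0.0f;                 // invalid operation -> NaN
    printf("%d %d\n", qnan == qnan, isnan(qnan));  // 0 1: NaN != NaN
    return 0;
}
```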
Worked Example: Decoding 0x40490FDB
A famous 32-bit constant. In binary:
0x40490FDB = 0100 0000 0100 1001 0000 1111 1101 1011
Split into fields:
- Sign: `0` — positive
- Exponent: `1000 0000` = $128$ — actual exponent $128 - 127 = 1$
- Mantissa: `100 1001 0000 1111 1101 1011`
The significand is $1.m$, i.e. $1.100\,1001\,0000\,1111\,1101\,1011_2$.
Compute the mantissa as a fraction:

$$m = \frac{\texttt{0x490FDB}}{2^{23}} = \frac{4{,}788{,}187}{8{,}388{,}608} \approx 0.5707964$$
So the significand is $1 + 0.5707964 = 1.5707964$. Multiply by $2^1$: $1.5707964 \times 2 = 3.1415927$.
That’s $\pi$ to single-precision accuracy. The constant 0x40490FDB is the canonical IEEE 754 single-precision encoding of $\pi$ and appears in numerical libraries, GPU shaders, and exam questions worldwide.
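The same decode, scripted. The arithmetic below mirrors the worked steps above rather than using a general-purpose decoder:

```c
// Reproducing the worked decode of 0x40490FDB step by step.
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t bits = 0x40490FDB;
    uint32_t e = (bits >> 23) & 0xFF;  // 128 -> actual exponent 1
    uint32_t m = bits & 0x7FFFFF;      // 0x490FDB = 4788187

    double significand = 1.0 + (double)m / (1 << 23);
    double value = ldexp(significand, (int)e - 127);  // multiply by 2^1
    printf("significand = %.7f\n", significand);      // 1.5707964
    printf("value       = %.7f\n", value);            // 3.1415927
    return 0;
}
```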
Worked Example: Encoding 5.75
Convert $5.75$ to single-precision IEEE 754.
Step 1: convert to binary. $5 = 101_2$. $0.75 = 0.11_2$. Combined: $5.75 = 101.11_2$.
Step 2: normalize. Shift the binary point left until exactly one digit remains to its left:

$$101.11_2 = 1.0111_2 \times 2^2$$

So the actual exponent is $2$.
Step 3: encode each field.
- Sign: positive, so $s = 0$.
- Exponent: actual + bias = $2 + 127 = 129$ = `1000 0001`.
- Mantissa: drop the leading 1, take the next 23 bits. Mantissa bits = `0111` followed by 19 zeros: `01110000 00000000 0000000`.
Step 4: assemble.
0 | 10000001 | 01110000000000000000000
In hex: 0x40B80000. You can verify by reversing the decoding procedure on this pattern and recovering 5.75 exactly.
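A quick check via bit-cast, using the standard memcpy idiom:

```c
// Verifying the assembled pattern by bit-casting it back to a float.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint32_t bits = 0x40B80000;
    float f;
    memcpy(&f, &bits, sizeof f);
    printf("%g\n", f);  // 5.75, exactly
    return 0;
}
```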
Note that 5.75 has an exact binary representation — its fractional part is a sum of negative powers of two ($0.75 = 2^{-1} + 2^{-2}$). Most decimal fractions don’t. $0.1$ in single precision is 0x3DCCCCCD, which decodes back to $0.100000001490\ldots$, off by about $1.5 \times 10^{-9}$. That tiny error is the root cause of every "$0.1 + 0.2 \ne 0.3$" surprise newcomers run into.
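That error is easiest to see in double precision, where the famous sum misses its target in the 17th significant digit:

```c
// The classic symptom of 0.1's inexact binary encoding.
#include <stdio.h>

int main(void) {
    double a = 0.1 + 0.2;
    printf("%.17g\n", a);      // 0.30000000000000004
    printf("%d\n", a == 0.3);  // 0: the rounding error survives
    return 0;
}
```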
Rounding Modes
IEEE 754 defines four rounding modes:
- Round to nearest, ties to even (default). Round to the closer representable value; on a tie, round to the value whose mantissa LSB is 0.
- Round toward zero (truncation).
- Round toward $+\infty$ (ceiling).
- Round toward $-\infty$ (floor).
The default mode minimizes statistical bias — straightforward “round half up” introduces a slight upward bias when many values are accumulated, while round-half-to-even averages it out.
The rounding mode is part of the FPU’s runtime state, switchable per-thread on most CPUs. Numerical code that needs reproducibility across machines must lock in the rounding mode explicitly.
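On a C99 toolchain the mode is switchable through `<fenv.h>`. A sketch: whether a compiler fully honors runtime mode changes depends on its `FENV_ACCESS` support, so `volatile` is used here to keep the division out of constant folding.

```c
// Switching the FPU's rounding mode at runtime via <fenv.h>.
#include <fenv.h>
#include <stdio.h>

// Request honest fenv semantics; some compilers ignore this pragma.
#pragma STDC FENV_ACCESS ON

int main(void) {
    volatile float n = 1.0f, d = 3.0f;  // volatile defeats constant folding

    fesetround(FE_TONEAREST);
    printf("to nearest:  %.9f\n", n / d);
    fesetround(FE_DOWNWARD);
    printf("toward -inf: %.9f\n", n / d);  // differs in the last place
    fesetround(FE_TOWARDZERO);
    printf("toward 0:    %.9f\n", n / d);
    fesetround(FE_TONEAREST);              // restore the default
    return 0;
}
```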
Why Floating-Point Arithmetic Isn’t Associative
Consider three IEEE 754 single-precision values: $a = 10^{20}$, $b = -10^{20}$, $c = 1.0$.
$(a + b) + c$: First, $a + b = 10^{20} - 10^{20} = 0$ exactly (the cancellation is clean). Then $0 + 1.0 = 1.0$. Final result: $1.0$.
$a + (b + c)$: First, $b + c = -10^{20} + 1.0$. The exponent of $10^{20}$ is 66; the exponent of $1.0$ is 0. To add, $1.0$’s significand is shifted right by 66 bit positions to align with $-10^{20}$’s exponent — which shifts every bit of $1.0$ off the end of the 23-bit mantissa. So $-10^{20} + 1.0 = -10^{20}$ in single precision. Then $10^{20} + (-10^{20}) = 0$. Final result: $0.0$.
The two parenthesizations produce different answers. Floating-point addition is not associative — it’s only approximately associative when all values have similar magnitudes. Compiler optimizations that reorder floating-point sums are forbidden by default in C and C++ (`-fno-associative-math` is the safe setting). Reproducibility-sensitive code (financial software, scientific replication, blockchain consensus) treats float reordering as a correctness bug.
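The example is short enough to run as-is:

```c
// The worked example above, executable: the two parenthesizations of
// a + b + c disagree in single precision.
#include <stdio.h>

int main(void) {
    float a = 1e20f, b = -1e20f, c = 1.0f;
    printf("(a + b) + c = %g\n", (a + b) + c);  // 1
    printf("a + (b + c) = %g\n", a + (b + c));  // 0
    return 0;
}
```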
The same property applies to subtraction (cancellation cases), multiplication (extremely small products may underflow to zero), and division (division by very small numbers may overflow to infinity).
Hardware: The FPU
Every modern CPU includes a floating-point unit. The sequence for a single-precision add looks like:
- Decode both operands’ fields.
- Compare exponents; shift the smaller-exponent significand right by the difference, with a guard, round, and sticky bit retained for rounding.
- Add or subtract the aligned significands (using a wide carry-lookahead-style adder).
- Normalize the result by left-shifting until a leading 1 emerges; adjust the exponent accordingly.
- Round per the active rounding mode.
- Encode the result into IEEE 754 format.
The hardware uses an ALU-style integer pipeline for the significand arithmetic, plus dedicated shifters for alignment and normalization. RISC-V’s F extension defines 32-bit single-precision floats; the D extension adds 64-bit doubles. Both are optional — embedded RISC-V cores routinely omit them, and software libraries like softfloat provide emulation when hardware is absent.
For wider data paths the underlying integer arithmetic borrows directly from the binary techniques in the 3,500-year history of binary arithmetic and the algebraic foundations covered in mastering Boolean algebra. The FPU is just an integer ALU with extra plumbing.
Wider Formats
Beyond single and double precision, IEEE 754 defines:
- Half precision (16-bit): 1 sign + 5 exponent + 10 mantissa. Used in graphics shaders and increasingly in machine-learning inference where low precision is acceptable.
- Quadruple precision (128-bit): 1 + 15 + 112. Rare; used in numerical analysis and some financial computation.
- bfloat16 (Google’s variant, not strictly IEEE 754): 1 + 8 + 7. Same exponent range as single precision but tiny mantissa. Designed for ML training where exponent range matters more than precision.
Each variant trades off range against precision for its target workload.
Common Pitfalls
- Comparing floats with `==`. Floating-point arithmetic accumulates rounding error. Tests should use a tolerance — see the sketch after this list.
- Forgetting that `NaN != NaN`. Use `isnan()` to detect NaN.
- Assuming associativity. Reordering sums can change the answer. Compiler flags can enable unsafe reassociation; understand them before using them.
- Mixing decimal and binary intuitions. $0.1$ has no exact binary representation. If you need exact decimal arithmetic (currency), use a fixed-point or BCD (binary-coded decimal) representation instead.
- Treating denormals as free. Many CPUs handle denormal arithmetic in microcode at 10x to 100x the cost of normalized arithmetic. High-performance code often disables denormals (`flush-to-zero` mode).
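A tolerance test usually combines an absolute bound (for values near zero) with a relative one. The thresholds below are illustrative placeholders, not recommendations; pick them for your problem's scale.

```c
// One common tolerance test, mixing absolute and relative bounds.
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

static bool nearly_equal(double a, double b, double abs_tol, double rel_tol) {
    double diff = fabs(a - b);
    if (diff <= abs_tol) return true;                 // near-zero case
    return diff <= rel_tol * fmax(fabs(a), fabs(b));  // proportional case
}

int main(void) {
    printf("%d\n", 0.1 + 0.2 == 0.3);                          // 0
    printf("%d\n", nearly_equal(0.1 + 0.2, 0.3, 1e-12, 1e-9)); // 1
    return 0;
}
```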
Connecting to Wider CPU Architecture
A general-purpose register file built from 4-bit registers holds integer operands. The FPU has its own register file — typically 32 single- or double-precision registers — and its own datapath, but it communicates with the integer side through load/store instructions. The boundary between integer and float is one of the cleanest separations in modern CPU design, which is why it’s possible to omit the FPU entirely on minimal embedded cores.
For wider arithmetic the integer side relies on something like the 8-bit adder extended to 32 or 64 bits. The float side adds the alignment, normalization, and rounding stages on top.
What’s Next
The next post in this series, Two’s Complement Explained: Signed Binary Arithmetic, covers the signed integer encoding that the floating-point significand’s sign bit conceptually mirrors — and explains why two’s complement is the universal choice for integer arithmetic in CPUs. Following that, BCD: Binary-Coded Decimal Fundamentals covers the alternative numeric representation used for exact decimal arithmetic in cases where IEEE 754’s binary fractions cause unacceptable rounding.
To experiment with floating-point at the bit level, build an ALU-based integer datapath in DigiSim wide enough to hold a 32-bit operand, then implement the alignment and normalization shifts as discrete steps. Stepping through an FP addition cycle-by-cycle clarifies why the FPU is one of the most complex blocks in any modern CPU.