Jolt Book | ePrint 2023/1217 | a16z, 2023
Authors: Arasu Arun, Srinath Setty, Justin Thaler

Jolt ("Just One Lookup Table") is a RISC-V zkVM built around sum-check and lookup arguments. As of v0.2.0, it uses Twist & Shout for memory checking / lookups, Spartan for R1CS, and Dory as the polynomial commitment scheme. It supports RV64IMAC.

Instruction Fetch (Bytecode)

Source: Jolt Book - Bytecode

During preprocessing, the ELF binary is decoded into a table of instructions. Each entry is a tuple:

(rs1, rs2, rd, imm, circuit_flags, lookup_table_flags, address)

where:

rs1, rs2: Source register indices (which of the 64 registers to read).
rd: Destination register index (which register to write the result to).
imm: Immediate value embedded in the instruction.
address: The instruction’s ELF memory address (unexpanded_pc). This is distinct from the bytecode index $k$ (its position in the preprocessed table).
circuit_flags: Booleans consumed by the R1CS constraints. They encode the instruction type for the constraint system. Examples:
- jump_flag (JAL, JALR)
- branch_flag
- load_flag (LB, LW, …)
- add_operands (ADD, ADDI, AUIPC — instructions whose operands are added in the field)
- is_rd_not_zero, write_lookup_to_rd
- left_is_rs1, left_is_pc, right_is_rs2, right_is_imm (which values to feed as the left/right operand)
- is_noop, virtual_instruction, is_first_in_sequence
lookup_table_flags: Booleans that select which instruction lookup table to query during instruction execution (see below). There is one flag per supported instruction (e.g. one for XOR, one for AND, one for SLT, …).

At each cycle, the current PC indexes into this table to “fetch” the instruction. This is a read-only lookup, so it uses a Shout instance. The $Val$ polynomial is a random linear combination of all the fields above, so that a single read-check proves the entire tuple at once.
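The batching idea can be sketched as follows. This is a toy illustration, not Jolt's code: the small modulus, the use of powers of a single challenge `gamma` as the random coefficients, and the packing of the flag vectors into integers are all assumptions made for brevity.

```python
# Toy prime field modulus (an assumption; the real field is much larger).
P = 2**61 - 1

def batch_fields(entry, gamma):
    """Fold a bytecode tuple into one field element: sum_i gamma^i * field_i mod P."""
    acc, power = 0, 1
    for value in entry:
        acc = (acc + power * (value % P)) % P
        power = (power * gamma) % P
    return acc

# Hypothetical entry: (rs1, rs2, rd, imm, circuit_flags, lookup_table_flags, address),
# with the two flag vectors packed into integers for illustration only.
entry = (5, 6, 7, 42, 0b0010, 0b0001, 0x80000000)
tampered = (5, 6, 8, 42, 0b0010, 0b0001, 0x80000000)  # only rd differs

gamma = 123456789  # stands in for a verifier-chosen random challenge
val = batch_fields(entry, gamma)
val_tampered = batch_fields(tampered, gamma)
```

Two tuples that differ in any field give different combined values except for an unlucky choice of `gamma`, which is why one read-check on the combined value can stand in for a check on every field.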

Registers

Source: Jolt Book - Registers

Jolt has 64 registers (32 real RISC-V + 32 virtual):

Registers	Purpose
0–31	Standard RISC-V registers (x0 hardwired to zero)
32–33	LR/SC reservation addresses (atomics)
34–39	M-mode Control and Status Registers (CSRs), handles traps
40–46	Temporaries for virtual instruction sequences
47–63	Temporaries for inline sequences

Register read/write correctness is proven using Twist with $K = 64$ and $d = 1$ . Each cycle reads up to two source registers (rs1, rs2) and writes one destination (rd), so there are two $ra$ polynomials and one $wa$ polynomial, all batched into a single sum-check instance.

No separate one-hot checks for registers: Normally in Twist, the prover commits to $ra_{i}$ polynomials and must prove they’re one-hot. For registers this is unnecessary: which register gets accessed is already determined by the instruction’s rs1/rs2/rd fields, and the bytecode Shout already proves those. So the register $ra$ at a random point can be derived via a sum-check over the bytecode’s $ra$ and the register index fields — no separate commitment or one-hot proof needed.
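A minimal sketch of that derivation, evaluated on Boolean inputs rather than at a random point. The bytecode table, the trace, and the function names are hypothetical; the real protocol proves the same sum via sum-check instead of computing it directly.

```python
NUM_REGISTERS = 64

# Hypothetical bytecode table: the rs1 index of each preprocessed instruction.
bytecode_rs1 = [1, 2, 1, 33]
# Hypothetical trace: the bytecode index fetched at each cycle.
trace_pc = [0, 1, 2, 1, 3]

def bytecode_ra(k, j):
    """One-hot bytecode read address: 1 iff cycle j fetched bytecode entry k."""
    return 1 if trace_pc[j] == k else 0

def register_ra(reg, j):
    """Register read address derived from bytecode:
    sum over k of bytecode_ra(k, j) * [rs1(k) == reg]."""
    return sum(
        bytecode_ra(k, j) * (1 if bytecode_rs1[k] == reg else 0)
        for k in range(len(bytecode_rs1))
    )

# The derived rows are one-hot by construction, with no extra check needed.
rows = [[register_ra(r, j) for r in range(NUM_REGISTERS)] for j in range(len(trace_pc))]
```

Because `bytecode_ra` is one-hot in `k` and each bytecode entry names exactly one `rs1`, every derived row has a single 1 at the register that cycle reads.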

Instruction Execution

Source: Jolt Book - Instruction Execution

The key insight of Jolt: instruction execution is a single giant lookup. For each instruction, the two 64-bit operands $(x, y)$ form the lookup index, and the table returns the correct output. Their bits are interleaved:

(k_{1}, k_{2}, \dots, k_{128}) = (x_{1}, y_{1}, x_{2}, y_{2}, \dots, x_{64}, y_{64})

This gives a table of size $K = 2^{128}$ , which obviously cannot be materialized.
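The interleaving can be sketched in a few lines. The function name is hypothetical, and the sketch assumes $x_{1}$ and $y_{1}$ are the most significant bits, consistent with the "high bits" wording used for the SLT example.

```python
def interleave(x, y, bits=64):
    """Return the 2*bits-bit index whose bits are x_1, y_1, x_2, y_2, ...
    with x_1 / y_1 the most significant bits of x / y."""
    index = 0
    for i in range(bits - 1, -1, -1):  # walk from the most significant bit down
        index = (index << 1) | ((x >> i) & 1)
        index = (index << 1) | ((y >> i) & 1)
    return index

# Tiny example with 2-bit operands: x = 0b10, y = 0b01 gives bits 1,0,0,1.
small = interleave(0b10, 0b01, bits=2)

# Full-width example: x all ones, y zero gives the pattern 1010...10 over 128 bits.
wide = interleave(2**64 - 1, 0)
```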

Prefix-Suffix Decomposition

The trick: for many instructions, the table’s MLE has prefix-suffix structure (Appendix A of Proving CPU Executions in Small Space). A multilinear polynomial $a (x_{1}, \dots, x_{n})$ has prefix-suffix structure for cutoff $i$ with $k$ terms if:

a (x_{1}, \dots, x_{n}) = \sum_{j = 1}^{k} prefix_{j} (x_{1}, \dots, x_{i}) \cdot suffix_{j} (x_{i + 1}, \dots, x_{n})

The prefix-suffix inner product protocol exploits this to run the sum-check $\sum_{x} u (x) \cdot a (x)$ (where $u$ is the sparse one-hot address) without ever materializing the $2^{128}$-entry table. Writing $N = 2^{n}$ for the table size, it proceeds in $C$ stages, each handling $n / C$ sum-check rounds:

At each stage, the prover makes a single pass over the sparse $u$ to build a small array $Q$ of size $N^{1/ C}$ (aggregating suffix contributions).
It builds a small array $P$ of size $N^{1/ C}$ (prefix evaluations).
The sum-check for those $n / C$ rounds reduces to an inner product $P (y) \cdot Q (y)$ , which is only $N^{1/ C}$ -dimensional.

For this to work, $a$ must have prefix-suffix structure at every cutoff $n / C, 2 n / C, \dots, (C - 1) n / C$ (not just one). The total prover cost is $O (C \cdot k \cdot m)$ where $m$ is the sparsity of $u$ (= number of trace cycles $T$ ).
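The identity behind the first stage can be sketched on a toy 8-bit index. Everything here is an illustrative assumption: the two-term table, the cutoff, and the use of a dictionary for $Q$ (the protocol uses a dense array). The sketch covers only the Boolean-hypercube sum, not the sum-check rounds or later stages.

```python
from collections import defaultdict

SUFFIX_BITS = 4  # cutoff: the low 4 bits of an 8-bit index form the suffix

# Hypothetical 2-term decomposition: a(idx) = p * (s + 1) + 3 * s,
# where p and s are the prefix and suffix of idx read as integers.
TERMS = [
    (lambda p: p, lambda s: s + 1),
    (lambda p: 3, lambda s: s),
]

def split(idx):
    return idx >> SUFFIX_BITS, idx & ((1 << SUFFIX_BITS) - 1)

def table(idx):
    """Entry of the full table, used only for the reference computation."""
    p, s = split(idx)
    return sum(pre(p) * suf(s) for pre, suf in TERMS)

def direct_sum(u):
    """Reference: sum_x u(x) * a(x) over the non-zero entries of u."""
    return sum(weight * table(idx) for idx, weight in u.items())

def prefix_suffix_sum(u):
    """Same sum computed as inner products P . Q, one pair per term."""
    total = 0
    for pre, suf in TERMS:
        q = defaultdict(int)
        for idx, weight in u.items():  # single pass over the sparse vector
            p, s = split(idx)
            q[p] += weight * suf(s)    # aggregate suffix contributions by prefix
        total += sum(pre(p) * q_p for p, q_p in q.items())  # P[p] = pre(p)
    return total

u = {0x12: 1, 0x1F: 1, 0xA3: 2, 0xFF: 1}  # sparse vector with m = 4 non-zeros
```

The work in `prefix_suffix_sum` is proportional to the number of terms times the number of non-zeros of `u`, and the full table is never touched.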

Example: SLT (Set Less Than)

SLT maps $(x, y) \mapsto 1$ if $x < y$ , else $0$ . Recall the index is interleaved: $k = (x_{1}, y_{1}, x_{2}, y_{2}, \dots)$ . With a cutoff after 16 bits (8 bit-pairs), $k_{prefix}$ contains the high bits $(x_{1}, y_{1}, \dots, x_{8}, y_{8})$ and $k_{suffix}$ the low bits $(x_{9}, y_{9}, \dots, x_{64}, y_{64})$ .

The comparison decomposes as: $x < y$ iff the high bits of $x$ are less than those of $y$ , OR the high bits are equal and the low bits of $x$ are less than those of $y$ :

Val_{SLT} (k_{prefix}, k_{suffix}) = LT_{high} (k_{prefix}) \cdot 1 + EQ_{high} (k_{prefix}) \cdot LT_{low} (k_{suffix})

This is $k = 2$ terms. Term 2 has non-trivial factors on both sides: whether the suffix matters is gated by the prefix bits being equal. This structure holds at every cutoff boundary (it’s the standard recursive definition of lexicographic comparison), so it works for any $C$ .
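The decomposition can be checked exhaustively at a toy width. This sketch assumes 8-bit operands compared as unsigned integers (matching the simplified description above) and works on the operand halves directly, since the prefix of the interleaved index is exactly the high bits of $x$ and $y$. The constants and function names are hypothetical.

```python
BITS = 8         # toy operand width (the real operands are 64 bits)
HIGH = 3         # number of high bit-pairs placed in the prefix (arbitrary cutoff)
LOW = BITS - HIGH

def split(v):
    """Return (high bits, low bits) of an operand at the chosen cutoff."""
    return v >> LOW, v & ((1 << LOW) - 1)

def slt_decomposed(x, y):
    x_hi, x_lo = split(x)
    y_hi, y_lo = split(y)
    lt_high = int(x_hi < y_hi)   # prefix factor of term 1
    eq_high = int(x_hi == y_hi)  # prefix factor of term 2
    lt_low = int(x_lo < y_lo)    # suffix factor of term 2
    return lt_high * 1 + eq_high * lt_low

all_match = all(
    slt_decomposed(x, y) == int(x < y)
    for x in range(1 << BITS)
    for y in range(1 << BITS)
)
```

Changing `HIGH` to any value from 1 to 7 leaves the check passing, which mirrors the claim that the structure holds at every cutoff.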

Example: XOR

XOR is a simpler case. Each output bit depends on a single input bit-pair independently: $x_{i} \oplus y_{i} = x_{i} + y_{i} - 2 x_{i} y_{i}$ . So the decomposition is purely additive with no cross-boundary interaction:

Val_{XOR} (k_{prefix}, k_{suffix}) = prefix_{XOR} (k_{prefix}) \cdot 1 + 1 \cdot suffix_{XOR} (k_{suffix})

where $prefix_{XOR}$ sums the weighted per-bit XORs for bits in the prefix, and $suffix_{XOR}$ does the same for the suffix. This is $k = 2$ but both terms have a trivial factor (constant 1 on one side).
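A small check of the additive split, again at a toy width. The 8-bit operands, the cutoff, and the helper names are assumptions for illustration.

```python
BITS = 8  # toy operand width
LOW = 5   # number of low bits placed in the suffix (arbitrary cutoff)

def xor_bit(a, b):
    """Arithmetic form of single-bit XOR: a + b - 2ab."""
    return a + b - 2 * a * b

def weighted_xor(a, b, width, shift):
    """Sum of per-bit XORs, each weighted by its power of two."""
    return sum(
        xor_bit((a >> i) & 1, (b >> i) & 1) << (i + shift)
        for i in range(width)
    )

def split(v):
    return v >> LOW, v & ((1 << LOW) - 1)

def xor_decomposed(x, y):
    x_hi, x_lo = split(x)
    y_hi, y_lo = split(y)
    prefix_xor = weighted_xor(x_hi, y_hi, BITS - LOW, LOW)  # depends on prefix only
    suffix_xor = weighted_xor(x_lo, y_lo, LOW, 0)           # depends on suffix only
    return prefix_xor * 1 + 1 * suffix_xor

all_match = all(
    xor_decomposed(x, y) == x ^ y
    for x in range(1 << BITS)
    for y in range(1 << BITS)
)
```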

Multiplexing Between Instructions

Since different instructions have different tables, a boolean lookup table flag $flag_{ℓ} (j)$ (fetched from the bytecode) indicates which table is active at cycle $j$ . The multiplexed read-checking sum-check becomes:

rv (r_{cycle}) = \sum_{k, j} eq (r_{cycle}, j) \cdot \left( \prod_{i = 1}^{d} ra_{i} (k_{i}, j) \right) \cdot \left( \sum_{ℓ} flag_{ℓ} (j) \cdot Val_{ℓ} (k) \right)

In practice, $d = 16$ for instruction execution, giving $K^{1/ d} = 2^{128/16} = 2^{8} = 256$ (i.e., each of the 16 one-hot chunks has 256 entries). The sum-check degree per round is $d + 1 = 17$ . Jolt uses techniques from Karatsuba/Toom-Cook to optimize the degree-17 polynomial evaluations.
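The chunking arithmetic can be sketched as below. The helper names and the most-significant-chunk-first ordering are assumptions. The point is only that 16 one-hot vectors of length 256 identify a single position among $2^{128}$.

```python
D = 16
CHUNK_BITS = 128 // D  # 8 bits per chunk, so each chunk has 2**8 = 256 values

def chunks(index):
    """Split a 128-bit index into D chunk values, most significant chunk first."""
    mask = (1 << CHUNK_BITS) - 1
    return [(index >> (CHUNK_BITS * (D - 1 - i))) & mask for i in range(D)]

def one_hot(value, size=1 << CHUNK_BITS):
    """Length-256 one-hot vector with a 1 at position `value`."""
    return [1 if k == value else 0 for k in range(size)]

def ra_product(index, query):
    """Product over chunks of ra_i(query_i); equals 1 iff query == index."""
    result = 1
    for c, q in zip(chunks(index), chunks(query)):
        result *= one_hot(c)[q]
    return result

idx = (0xDEADBEEF << 96) | 0x1234
total_entries = D * (1 << CHUNK_BITS)  # entries stored per cycle across all chunks
```

Per cycle this stores `total_entries` = 4096 Boolean entries, of which only 16 are non-zero, instead of a one-hot vector of length $2^{128}$.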

Recovering Dense Operands (raf)

The R1CS constraints need the dense operand values (as field elements), not one-hot encodings. These are recovered via raf-evaluation sum-checks (see Twist and Shout, “Recovering dense addresses”). For two interleaved operands:

LeftOperand (r) = \sum_{k, j} eq (r, j) \cdot ra (k, j) \cdot \left( \sum_{ℓ = 0}^{\log (K) / 2 - 1} 2^{ℓ} \cdot k_{2 ℓ} \right)

RightOperand (r) = \sum_{k, j} eq (r, j) \cdot ra (k, j) \cdot \left( \sum_{ℓ = 0}^{\log (K) / 2 - 1} 2^{ℓ} \cdot k_{2 ℓ + 1} \right)

These extract the even-indexed bits (left operand $x$ ) and odd-indexed bits (right operand $y$ ) respectively. The resulting dense values are fed into the R1CS constraints for linking.

For instructions with a single operand (e.g., range checks), the index is not interleaved: the first 64 bits are zero-padded, and only RightOperand is non-trivial.
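For the interleaved case, the inner weighted sums amount to de-interleaving the index. The sketch below shows this on plain integers. The function names are hypothetical, and the bit-ordering convention (left-operand bit in the more significant slot of each pair, pairs ordered from most significant to least) is an assumption that may not match the indexing used in the formulas above.

```python
def interleave(x, y, bits=64):
    """Build the index x_1, y_1, x_2, y_2, ... (most significant bits first)."""
    index = 0
    for i in range(bits - 1, -1, -1):
        index = (index << 1) | ((x >> i) & 1)
        index = (index << 1) | ((y >> i) & 1)
    return index

def recover_operands(index, bits=64):
    """Recover (left, right) as weighted sums of alternating index bits."""
    left = right = 0
    for i in range(bits):  # i = 0 is the least significant bit-pair
        right += ((index >> (2 * i)) & 1) << i      # y bits sit in the lower slot
        left += ((index >> (2 * i + 1)) & 1) << i   # x bits sit in the upper slot
    return left, right

x, y = 0x0123456789ABCDEF, 0xFEDCBA9876543210
recovered = recover_operands(interleave(x, y))
```

In the protocol the same weighted sums are taken against the one-hot $ra$ inside a sum-check, so the verifier ends up with the dense operand values at a random point rather than per cycle.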

Committed Polynomials (according to Claude)

All committed polynomials are opened via Dory, which has “pay-per-bit” costs: boolean entries (the one-hot 1s) are much cheaper to commit than full field elements.

Polynomial	Component	Hypercube size	Non-zero entries	Entry type
$ra_{1}, \dots, ra_{16}$	Instruction exec (Shout)	$2^{8} \times T$ each	$T$ each (sparse)	Boolean
$ra_{1}, \dots, ra_{d}$	Bytecode (Shout)	$P^{1/ d} \times T$ each	$T$ each (sparse)	Boolean
$ra_{1}, \dots, ra_{d}$	RAM (Twist)	$K^{1/ d} \times T$ each	$\leq T$ each (sparse)	Boolean
$z$	Spartan (R1CS witness)	$W \times T$	Dense	Field elements
$Inc$	Registers (Twist)	$T$	Dense	Field elements
Parameters:

Instruction execution: $d = 16$ , $K^{1/ d} = 2^{8} = 256$ . This is the largest commitment: $16 T$ boolean entries total.
Bytecode: $d$ depends on program size $P$ (not a fixed constant).
RAM: $K^{1/ d} = 2^{4}$ or $2^{8}$ depending on $T$ . Only $ra$ (no separate $wa$ ) since RV64IMAC does at most one memory operation per cycle.
Spartan: $W$ is the number of R1CS variables per cycle (exact value unknown).
Advice: Not committed as separate polynomials. They are folded into the RAM initial state polynomial $ram\_init$. “Trusted” advice has an externally-generated commitment; “untrusted” advice is committed by the prover. Both occupy the lowest addresses in the RAM table.

Uncertain / needs verification

Register $ra$ / $wa$ : The register addresses are derived from bytecode (see “no separate one-hot checks” above). This suggests they are virtual (not committed), meaning the register Twist only commits to $Inc$ . But the docs are not fully explicit.
RAM $Inc$ : The architecture overview lists it as committed, but the RAM-specific page describes it as virtual (proven through sum-checks). Unclear which is correct.
RAM $wa$ : With one memory op per cycle, there may be a single merged address polynomial rather than separate $ra$ / $wa$ .

TODO

How Spartan R1CS glues the components (~20 constraints/cycle, PC updates, linking)
RAM (Twist instance, memory layout, output verification)
Virtual instructions (division decomposition, virtual registers 40-46)
Inlines (custom instructions, SHA-256 benchmarks, virtual registers 47-63)
The proof DAG: committed vs virtual polynomials, sum-check stages, where time is spent
Interesting facts (e.g., d values per component)
