neuralboot

Trapetum

Press the model, keep the oil.

A trapetum is the Roman olive press: it crushes the whole fruit to extract the concentrated essence. Trapetum does the same to an LLM: it presses the weights down to a small 4-bit code and decodes that code directly inside the matmul, so the full-precision weight matrix is never materialized in memory. It is built from fused CUDA kernels plus a pure-Rust runtime.
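To make "decoding inside the matmul" concrete, here is a minimal CPU sketch in Rust of a matrix-vector product that decodes 4-bit codes on the fly. It is not Trapetum's kernel or file format. The `PackedMatrix` layout (two codes per byte, one scale per row, a shared 16-entry codebook) and every name in it are assumptions chosen for illustration.

```rust
/// Hypothetical 4-bit layout (an assumption, not the .cbk format):
/// two codes per byte, low nibble first, one f32 scale per row,
/// and a single 16-entry codebook shared by the whole matrix.
pub struct PackedMatrix {
    pub rows: usize,
    pub cols: usize,        // must be even so rows align to whole bytes
    pub codes: Vec<u8>,     // rows * cols / 2 bytes
    pub scales: Vec<f32>,   // one scale per row
    pub codebook: [f32; 16],
}

impl PackedMatrix {
    /// Computes y = W x. Each weight is decoded from its 4-bit code at the
    /// moment it is used, so W never exists as a dense f32 matrix.
    pub fn matvec(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols);
        assert_eq!(self.cols % 2, 0);
        let bytes_per_row = self.cols / 2;
        let mut y = vec![0.0f32; self.rows];
        for r in 0..self.rows {
            let row = &self.codes[r * bytes_per_row..(r + 1) * bytes_per_row];
            let mut acc = 0.0f32;
            for (i, &byte) in row.iter().enumerate() {
                // Decode two weights from one byte and consume them immediately.
                let lo = self.codebook[(byte & 0x0F) as usize];
                let hi = self.codebook[(byte >> 4) as usize];
                acc += lo * x[2 * i] + hi * x[2 * i + 1];
            }
            // Apply the per-row scale once, after the accumulation.
            y[r] = acc * self.scales[r];
        }
        y
    }
}

/// Packs 4-bit codes (values 0..=15) two per byte, low nibble first.
pub fn pack_nibbles(codes: &[u8]) -> Vec<u8> {
    assert_eq!(codes.len() % 2, 0);
    codes
        .chunks(2)
        .map(|pair| (pair[0] & 0x0F) | ((pair[1] & 0x0F) << 4))
        .collect()
}
```

In this sketch only the packed bytes, the scales, and the codebook are ever read from memory. A GPU kernel would apply the same idea per thread block, but the tiling and memory layout used by the real kernels are not described here.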

16/16 tokens match HuggingFace
135 tok/s on one RTX 4090
3.5 GB of weights, vs 13 GB in fp16
2.1× less energy per token

Real Llama-2-7B, 4-bit, in pure Rust. No Python at runtime.


NVIDIA CUDA

Runs on any modern NVIDIA GPU, not just the 4090

The RTX 4090 above is only our reference card. Trapetum is built on CUDA and runs on any Ampere, Ada, or Hopper GPU (A100, A40, RTX 30/40, H100, and more): recompile with one -arch flag, or ship a single multi-arch binary. The 4-bit memory win is universal (a 7B model in ~3.5 GB fits an 8 GB card). The decode speedup follows the bandwidth law: batch-1 decode is limited by how fast the weights can be read, so the gain is largest on consumer cards and shrinks to parity on an H100. See the per-GPU table.
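The memory figure and the bandwidth argument are both simple arithmetic, sketched below in Rust. The function names are illustrative, and the bound assumes that every generated token reads all weights exactly once and ignores the KV cache, activations, and kernel overhead. Any bandwidth figure you plug in (for example a nominal ~1008 GB/s for an RTX 4090, a vendor spec that does not come from this page) is an assumption, not a measurement.

```rust
/// Weight footprint in GB (10^9 bytes) for `params` parameters
/// stored at `bits` bits each. Scales, codebooks, and any layers
/// kept in higher precision are ignored in this estimate.
pub fn weight_gb(params: f64, bits: f64) -> f64 {
    params * bits / 8.0 / 1e9
}

/// Upper bound on batch-1 decode speed (tokens/s) under the assumption
/// that each token requires streaming all weights from memory once.
pub fn max_decode_tok_s(bandwidth_gb_s: f64, weights_gb: f64) -> f64 {
    bandwidth_gb_s / weights_gb
}

/// Ratio of the two bandwidth bounds: how much faster the smaller
/// format could be at best if memory traffic were the only cost.
pub fn ideal_speedup(baseline_gb: f64, compressed_gb: f64) -> f64 {
    baseline_gb / compressed_gb
}
```

With 7e9 parameters, `weight_gb` gives 3.5 GB at 4 bits and 14 GB at 16 bits. Llama-2-7B has roughly 6.74e9 parameters, which lands near the 13 GB fp16 figure quoted above. Under the nominal 1008 GB/s assumption the bound for 3.5 GB of weights is 288 tok/s, so the measured 135 tok/s sits below the ceiling, as expected for a real kernel.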

It reproduces HuggingFace, token for token

Loaded from a 3.5 GB .cbk file, the runtime decodes the same quantized weights used by the HuggingFace reference and reproduces its greedy generation exactly, with no Python in the loop.

$ ./trapetum llama-2-7b.cbk          # pure Rust, no Python

loaded  llama-2-7b.cbk  (3.5 GB)  in 3.1 s
prompt:  "The capital of France is"

logits vs HuggingFace (worst)    7.9e-3
top-1 agreement with HF          6 / 6
greedy continuation = HF        16 / 16 tokens   OK
decode throughput               135 tok/s
energy                          2.58 J/token   (2.1x less than fp16)
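The three parity lines in the output above (worst logit difference, top-1 agreement, greedy continuation match) can be computed with a few small helpers. The Rust below is a sketch of how such metrics are commonly defined, not the runtime's own test code. The function names and the choice of `u32` token ids are assumptions.

```rust
/// Largest absolute element-wise difference between two logit vectors.
pub fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0f32, f32::max)
}

/// Index of the largest value (first one wins on ties).
pub fn argmax(v: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in v.iter().enumerate() {
        if x > v[best] {
            best = i;
        }
    }
    best
}

/// Counts positions where both models pick the same top-1 token.
/// Returns (agreeing positions, total positions).
pub fn top1_agreement(ours: &[Vec<f32>], reference: &[Vec<f32>]) -> (usize, usize) {
    assert_eq!(ours.len(), reference.len());
    let agree = ours
        .iter()
        .zip(reference)
        .filter(|(a, b)| argmax(a) == argmax(b))
        .count();
    (agree, ours.len())
}

/// Number of leading tokens on which two greedy continuations agree.
pub fn matching_prefix(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}
```

Under these definitions, a result such as "16 / 16 tokens" would mean `matching_prefix` returned 16 for two 16-token continuations. A small nonzero `max_abs_diff` is compatible with an exact greedy match as long as it never changes which logit is largest.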

An honest Pareto, one machine

Speed, memory, accuracy, and energy for the same Llama-2-7B on an RTX 4090, with batch-1 decode at the same context length. The pure-Rust runtime is the fastest and the lowest-energy path.
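Energy per token and throughput are linked by average power, which gives a quick sanity check on the reported numbers. The Rust sketch below assumes that both figures come from the same run and that power is averaged over the decode phase. The function names are illustrative.

```rust
/// Energy per token (J) from average power (W) and throughput (tok/s).
pub fn joules_per_token(avg_power_w: f64, tok_s: f64) -> f64 {
    avg_power_w / tok_s
}

/// Average power (W) implied by an energy-per-token and a throughput figure.
pub fn implied_power_w(j_per_token: f64, tok_s: f64) -> f64 {
    j_per_token * tok_s
}

/// Baseline energy per token implied by an "N times less energy" claim.
pub fn implied_baseline_j(j_per_token: f64, ratio: f64) -> f64 {
    j_per_token * ratio
}
```

Plugging in the figures from this page, 2.58 J/token at 135 tok/s implies an average draw of about 348 W, and a 2.1× ratio implies roughly 5.4 J/token for the fp16 baseline.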

Figure: Trapetum vs baselines, decode speed and energy on Llama-2-7B. Compares fp16, Trapetum 4-bit (Python and Rust), and AQLM 2-bit; the Rust runtime is shown in orange.

Honest limits

Everything is public, measured on real hardware, and reproducible with one command, including the negative results.