Run real LLMs privately, at 4-bit, on your own GPU.

Pure Rust, no Python at runtime, nothing leaves your machine during inference. Trapetum presses the weights into a small 4-bit code and decodes it straight inside the matmul, so the full weight matrix is never materialized. (The name is the Roman olive press: crush the fruit, keep the essence.)
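To make "decoded inside the matmul" concrete, here is a minimal CPU sketch in Rust. It is not Trapetum's kernel or its .cbk format. It assumes a hypothetical layout in which every weight is a 4-bit index into one shared 16-entry codebook, packed two indices per byte, and the names PackedMatrix, quantize and matvec are invented for illustration. The point it shows is that the product reads the packed bytes directly and never builds a full f32 weight matrix.

```rust
/// Hypothetical packed layout (an assumption, not the real .cbk format):
/// row-major 4-bit codebook indices, two per byte, low nibble first.
struct PackedMatrix {
    rows: usize,
    cols: usize, // must be even so that byte pairs never straddle two rows
    codebook: [f32; 16],
    packed: Vec<u8>, // rows * cols / 2 bytes
}

impl PackedMatrix {
    /// Quantize each weight to its nearest codebook entry and pack the indices.
    fn quantize(weights: &[f32], rows: usize, cols: usize, codebook: [f32; 16]) -> Self {
        assert_eq!(weights.len(), rows * cols);
        assert!(cols % 2 == 0);
        let nearest = |w: f32| -> u8 {
            let mut best = 0u8;
            let mut best_dist = f32::INFINITY;
            for (i, &c) in codebook.iter().enumerate() {
                let dist = (w - c).abs();
                if dist < best_dist {
                    best_dist = dist;
                    best = i as u8;
                }
            }
            best
        };
        let packed: Vec<u8> = weights
            .chunks_exact(2)
            .map(|pair| {
                let lo = nearest(pair[0]);
                let hi = nearest(pair[1]);
                lo | (hi << 4)
            })
            .collect();
        Self { rows, cols, codebook, packed }
    }

    /// y = W x, decoding each nibble at the moment it is multiplied.
    /// No dequantized copy of W is ever allocated.
    fn matvec(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols);
        let bytes_per_row = self.cols / 2;
        (0..self.rows)
            .map(|r| {
                let row = &self.packed[r * bytes_per_row..(r + 1) * bytes_per_row];
                let mut acc = 0.0f32;
                for (j, &byte) in row.iter().enumerate() {
                    acc += self.codebook[(byte & 0x0F) as usize] * x[2 * j];
                    acc += self.codebook[(byte >> 4) as usize] * x[2 * j + 1];
                }
                acc
            })
            .collect()
    }
}
```

In this sketch a 2 x 4 matrix occupies 4 bytes instead of 32. A GPU version would apply the same idea per thread block, which is where the memory-traffic saving comes from.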

$ ./trapetum llama-2-7b.cbk          # pure Rust, no Python

loaded  llama-2-7b.cbk  (3.5 GB)  in 3.1 s
prompt:  "The capital of France is"

logits vs HuggingFace (worst)    7.9e-3
top-1 agreement with HF          6 / 6
greedy continuation = HF        16 / 16 tokens   OK
decode throughput               135 tok/s
energy                          2.58 J/token   (2.1x less than fp16)

$ curl -fsSL get.neuralboot.com | sh     # Linux; see also: Windows and details


16/16 tokens match HuggingFace
135 tok/s on one RTX 4090
3.9× smaller, 3.5 GB vs 13.5 GB fp16
2.1× less energy per token

Real Llama-2-7B, 4-bit, in pure Rust. No Python at runtime.
Watch the live 4-bit vs fp16 comparison on two identical RTX 4090s, or read the plain-language overview.

What do these numbers mean?

16/16 tokens match HuggingFace: given the same prompt, the 4-bit model emits exactly the same next 16 tokens (greedy) as the original fp16 model. This is evidence that the quantization, the kernel and the Rust pipeline are correct end to end; the worst-case logit gap is 7.9e-3.
135 tok/s on one RTX 4090: generation speed, one token at a time (batch 1), on a single consumer GPU. How fast it “types” the answer.
3.5 GB on disk (4.7 GB VRAM), vs ~13.5 GB fp16: the weights shrink from ~13.5 GB (16-bit) to 3.5 GB (4-bit), so a 7B model fits on a small or consumer GPU instead of needing a datacenter card.
2.1× less energy per token: each generated token costs 2.1× less energy than fp16 (2.58 vs 5.45 joules per token, measured from GPU power draw). Cheaper and greener to run.
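The headline ratios can be re-derived from the raw figures quoted above. The short Rust sketch below does that arithmetic. The helper names are ours, and the implied-power step assumes that joules per token is simply average GPU power divided by throughput, which the page suggests ("measured from GPU power draw") but does not spell out.

```rust
/// How many times smaller (or cheaper) the candidate is than the baseline.
fn ratio(baseline: f64, candidate: f64) -> f64 {
    baseline / candidate
}

/// Average power implied by an energy-per-token figure and a decode speed:
/// (J / token) * (token / s) = J / s = W.
/// Assumes energy per token = average power / throughput.
fn implied_power_watts(joules_per_token: f64, tokens_per_second: f64) -> f64 {
    joules_per_token * tokens_per_second
}

/// Energy in watt-hours to generate `tokens` tokens at a fixed cost per token.
fn energy_wh(joules_per_token: f64, tokens: u64) -> f64 {
    joules_per_token * tokens as f64 / 3600.0
}
```

With the numbers on this page, ratio(13.5, 3.5) is about 3.86 (the "3.9× smaller"), ratio(5.45, 2.58) is about 2.11 (the "2.1× less energy"), and implied_power_watts(2.58, 135.0) is about 348 W of average draw during 4-bit decode.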

NVIDIA CUDA + APPLE METAL

Runs on any modern NVIDIA GPU, and on Apple Silicon

The RTX 4090 above is only our reference card. Trapetum has two GPU backends: CUDA for any Ampere, Ada or Hopper NVIDIA card (benchmarked on RTX 4090, A40 and H100), and Metal for Apple Silicon Macs, M1 to M4 (benchmarked on M4). The 4-bit memory win is universal (a 7B in ~3.5 GB fits an 8 GB card); the decode speedup follows the bandwidth law, largest where memory bandwidth is scarcest. See the per-GPU table and the Apple M4 numbers.
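One way to picture the bandwidth law is a toy model, sketched here in Rust. It is our simplification, not the model used in the benchmarks: each batch-1 decode step streams all the weights once from GPU memory and also pays a fixed per-token cost that does not shrink with the weights. The fixed_s parameter is a hypothetical knob, not a measured value, and the model ignores KV-cache and activation traffic.

```rust
/// Upper bound on batch-1 decode speed if streaming the weights were the only cost.
fn ceiling_tok_s(weight_gb: f64, bandwidth_gb_s: f64) -> f64 {
    bandwidth_gb_s / weight_gb
}

/// Toy time for one decode step: stream every weight once, plus a fixed
/// per-token cost `fixed_s` (hypothetical: launches, attention, sampling).
fn token_time_s(weight_gb: f64, bandwidth_gb_s: f64, fixed_s: f64) -> f64 {
    weight_gb / bandwidth_gb_s + fixed_s
}

/// Modeled speedup of the 4-bit model over fp16 on a GPU with the given bandwidth.
fn modeled_speedup(fp16_gb: f64, q4_gb: f64, bandwidth_gb_s: f64, fixed_s: f64) -> f64 {
    token_time_s(fp16_gb, bandwidth_gb_s, fixed_s) / token_time_s(q4_gb, bandwidth_gb_s, fixed_s)
}
```

With no fixed cost the speedup equals the size ratio, 13.5 / 3.5, on every GPU. As soon as fixed_s is above zero, raising the bandwidth shrinks the weight-streaming term and the modeled speedup falls, which is the qualitative shape described above. As a sanity check, the RTX 4090's published 1008 GB/s gives ceiling_tok_s(3.5, 1008.0) = 288 tok/s, comfortably above the measured 135 tok/s.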

APPLE SILICON LIVE · AMD COMING SOON

Now on Apple Silicon; AMD is on the way

The fused 4-bit decode is ported to Metal and runs on the Apple GPU today (M1 to M4), no NVIDIA needed. Download for macOS or see the measured M4 numbers. An AMD backend (ROCm / HIP) is still in active development: the pure-Rust runtime and the codebook format are already portable, so the remaining work is the kernel. The 4-bit memory win carries over unchanged; the decode speedup follows the same bandwidth law. Want AMD first? Tell us your GPU.

It reproduces HuggingFace, token for token

Loaded from a 3.5 GB .cbk file, the runtime decodes the same quantized weights as HuggingFace and reproduces its greedy generation exactly, with no Python in the loop. The terminal output above is a real run, not a mockup: worst-case logit gap 7.9e-3, greedy continuation identical for 16 of 16 tokens. Full model-level results.
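The three figures in the terminal output (worst logit gap, top-1 agreement, greedy match) are simple to define. The Rust sketch below shows one plausible reading of each. The function names are ours, and we assume the greedy figure counts the matching prefix of the two token streams; the actual benchmark harness may compute these differently.

```rust
/// Largest absolute difference between two logit vectors of equal length.
fn max_abs_gap(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0_f32, f32::max)
}

/// Index of the largest logit (the greedy token choice).
fn argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .expect("logits must be non-empty")
}

/// Number of positions where both models pick the same top-1 token.
fn top1_agreement(ours: &[Vec<f32>], reference: &[Vec<f32>]) -> usize {
    ours.iter()
        .zip(reference)
        .filter(|(a, b)| argmax(a) == argmax(b))
        .count()
}

/// Number of leading generated tokens that are identical in both streams.
fn matching_prefix(ours: &[u32], reference: &[u32]) -> usize {
    ours.iter()
        .zip(reference)
        .take_while(|(a, b)| a == b)
        .count()
}
```

A nonzero logit gap and a perfect token match are compatible: greedy decoding only changes when a gap is large enough to reorder the top two logits.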

An honest Pareto, one machine

Speed, memory, accuracy and energy for the same Llama-2-7B on an RTX 4090, batch-1 decode, iso-context. The pure-Rust runtime is the fastest and the lowest-energy path.

Figure: Trapetum vs baselines, decode speed and energy on Llama-2-7B (fp16 vs Trapetum 4-bit in Python and Rust vs AQLM 2-bit). Rust runtime in orange.

Honest limits

It does not beat dense fp16 on accuracy. Nothing does; quantization trades a little perplexity for a lot of memory.
The kernel speedup is a memory-bandwidth law: largest where memory bandwidth is scarcest, near parity on an H100 where fp16 is already near roofline.
Batch-1 decode for now. Batched throughput is an open problem, marked as such.

Everything is public, measured on real hardware, and reproducible with one command, including the negative results.

github.com/neuralboot/trapetum