Press the model, keep the oil.
Run real LLMs at 4-bit on your own GPU. Pure Rust, no Python at runtime, nothing leaves your machine. Trapetum presses the weights into a small 4-bit code and decodes it straight inside the matmul, so the full weight matrix is never materialized. (The name is the Roman olive press: crush the fruit, keep the essence.)
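To make "decoded inside the matmul" concrete, here is a minimal CPU sketch in Rust. It assumes a deliberately simple layout that is not Trapetum's actual .cbk format or CUDA kernel: two 4-bit codes per byte, one shared 16-entry codebook, and one scale per row. All names (`PackedMatrix`, `uniform_codebook`, `nearest`, `quantize`, `matvec`) are hypothetical. The only point is that each weight is decoded on the fly inside the dot product and used once, so no dense f32 weight matrix is ever allocated.

```rust
/// Hypothetical 4-bit layout (an assumption, not the real .cbk format):
/// two codes per byte, one shared 16-entry codebook, one scale per row.
struct PackedMatrix {
    rows: usize,
    cols: usize, // assumed even, so every row ends on a byte boundary
    codes: Vec<u8>,
    scales: Vec<f32>,
    codebook: [f32; 16],
}

/// 16 evenly spaced levels in [-1, 1]; a stand-in for a real codebook.
fn uniform_codebook() -> [f32; 16] {
    std::array::from_fn(|i| -1.0 + 2.0 * i as f32 / 15.0)
}

/// Index of the codebook entry closest to `v`.
fn nearest(codebook: &[f32; 16], v: f32) -> u8 {
    let mut best = 0usize;
    for i in 1..16 {
        if (codebook[i] - v).abs() < (codebook[best] - v).abs() {
            best = i;
        }
    }
    best as u8
}

impl PackedMatrix {
    /// Pack a row-major dense matrix into 4-bit codes (done once, offline).
    fn quantize(dense: &[f32], rows: usize, cols: usize) -> Self {
        assert_eq!(dense.len(), rows * cols);
        assert!(cols % 2 == 0, "cols must be even");
        let codebook = uniform_codebook();
        let mut codes = vec![0u8; rows * cols / 2];
        let mut scales = vec![1.0f32; rows];
        for r in 0..rows {
            let row = &dense[r * cols..(r + 1) * cols];
            let max_abs = row.iter().fold(0.0f32, |m, w| m.max(w.abs()));
            if max_abs > 0.0 {
                scales[r] = max_abs;
            }
            for (c, &w) in row.iter().enumerate() {
                let code = nearest(&codebook, w / scales[r]);
                // Even columns go in the low nibble, odd columns in the high one.
                let shift = if c % 2 == 0 { 0 } else { 4 };
                codes[(r * cols + c) / 2] |= code << shift;
            }
        }
        PackedMatrix { rows, cols, codes, scales, codebook }
    }

    /// y = W x, decoding each nibble right where it is multiplied.
    fn matvec(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols);
        let bytes_per_row = self.cols / 2;
        let mut y = vec![0.0f32; self.rows];
        for r in 0..self.rows {
            let row = &self.codes[r * bytes_per_row..(r + 1) * bytes_per_row];
            let mut acc = 0.0f32;
            for (i, &b) in row.iter().enumerate() {
                acc += self.codebook[(b & 0x0F) as usize] * x[2 * i];
                acc += self.codebook[(b >> 4) as usize] * x[2 * i + 1];
            }
            y[r] = acc * self.scales[r];
        }
        y
    }
}
```

Calling `PackedMatrix::quantize(&dense, rows, cols)` once and then `matvec(&x)` per step keeps only `rows * cols / 2` bytes of codes resident. A GPU kernel would apply the same idea per thread block, which this sketch does not attempt to show.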
Real Llama-2-7B, 4-bit, in pure Rust. No Python at runtime.
NVIDIA CUDA
The RTX 4090 above is only our reference card. Trapetum is built on CUDA and runs on any
Ampere, Ada or Hopper GPU (A100, A40, RTX 30/40, H100, and more): recompile with a single
-arch flag, or ship one multi-arch binary. The 4-bit memory saving is
universal (a 7B model in ~3.5 GB fits an 8 GB card); the decode speedup follows the
bandwidth law: largest on consumer cards, parity on an H100.
See the per-GPU table.
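The numbers in this paragraph can be checked with back-of-the-envelope arithmetic. The sketch below rests on several assumptions: a nominal 7 billion parameters, no per-row scales, no KV cache and no activations. It also reads the "bandwidth law" as a simple ceiling in which every decoded token streams all weight bytes from GPU memory once. The function names are hypothetical.

```rust
/// Bytes needed to store `params` weights at `bits` bits each
/// (ignores scales, codebooks and any other metadata).
fn weight_bytes(params: u64, bits: u64) -> u64 {
    params * bits / 8
}

/// Upper bound on batch-1 decode speed in tokens per second, assuming each
/// token reads every weight byte once and memory bandwidth is the only limit.
fn decode_ceiling_tok_s(bandwidth_gb_s: f64, bytes: u64) -> f64 {
    bandwidth_gb_s * 1e9 / bytes as f64
}
```

With 7,000,000,000 parameters, `weight_bytes` gives 3.5 GB at 4 bits and 14 GB at 16 bits. Plugging in the RTX 4090's published peak bandwidth of 1008 GB/s (a spec-sheet figure, not a measurement from this project), the ceilings come out at 288 and 72 tok/s. Under this model the gap between 4-bit and fp16 is widest when memory bandwidth is the bottleneck.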
Loaded from a 3.5 GB .cbk file, the runtime decodes the same quantized weights as
HuggingFace and reproduces its greedy generation exactly, with no Python in the loop.
$ ./trapetum llama-2-7b.cbk   # pure Rust, no Python
loaded llama-2-7b.cbk (3.5 GB) in 3.1 s
prompt: "The capital of France is"
logits vs HuggingFace (worst)    7.9e-3
top-1 agreement with HF          6 / 6
greedy continuation = HF         16 / 16 tokens OK
decode throughput                135 tok/s
energy                           2.58 J/token (2.1x less than fp16)
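The agreement figures in this log are simple to define. The following is a minimal sketch of how such checks could be computed from two sets of logits. It is not the project's actual test harness, and the function names are hypothetical.

```rust
/// Largest absolute element-wise difference between two logit vectors
/// (the "worst" logit deviation).
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).fold(0.0f32, |m, (x, y)| m.max((x - y).abs()))
}

/// Index of the largest logit, i.e. the token greedy decoding would pick.
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Number of positions at which both runs pick the same top-1 token.
fn top1_agreement(ours: &[Vec<f32>], reference: &[Vec<f32>]) -> usize {
    assert_eq!(ours.len(), reference.len());
    ours.iter()
        .zip(reference)
        .filter(|(a, b)| argmax(a) == argmax(b))
        .count()
}
```

Here `ours` and `reference` hold one logit vector per position. A result equal to `ours.len()` corresponds to a line such as "6 / 6" above, and a greedy continuation matches when every generated token index is identical in both runs.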
Speed, memory, accuracy and energy for the same Llama-2-7B on an RTX 4090, measured at batch-1 decode and equal context length. The pure-Rust runtime is the fastest and the lowest-energy path.
Everything is public, measured on real hardware, and reproducible with one command, including the negative results.