[000_calculator]
diff. LLM inference simulator
Estimates model VRAM fit, prefill latency from prompt size and compute, and decode throughput from memory bandwidth. Results are simplified planning signals, not benchmark guarantees.
Estimate model fit, prefill latency, and decode throughput across practical GPU choices.
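To make the inputs and outputs concrete, here is a minimal sketch of the shapes implied by the form and the Calculation panel below. All type and field names are illustrative assumptions, not the calculator's actual code.

```typescript
// Inputs mirrored from the form: model, quantization, prompt size, GPU choice.
interface SimulationInput {
  modelParamsB: number;            // model size in billions of parameters
  quant: "fp16" | "int8" | "int4"; // quantization (assumed option set)
  promptTokens: number;            // prompt tokens (prefill), e.g. 512
  gpuVramGb: number;               // VRAM per GPU, in GB
  gpuTflops: number;               // peak compute per GPU, in TFLOPS
  gpuBandwidthGbps: number;        // memory bandwidth per GPU, in GB/s
  gpuCount: number;                // number of GPUs
}

// Outputs mirrored from the Calculation panel.
interface SimulationResult {
  modelSizeGb: number;     // weight footprint, GB
  fitsInVram: boolean;     // whether the weights fit in aggregate VRAM
  prefillMs: number;       // estimated prompt-processing time, ms
  decodeTokPerSec: number; // estimated single-stream decode throughput, tok/s
}
```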
SYS: ONLINE
V.0.1 [LIVE]
GPU Provider
Launch on RunPod
Need GPUs to run this?
Spin up GPU instances on RunPod and benchmark your model against this estimate.
Models
—
GPU
—
1
Inference
—
Model size
Quantization
Prompt tokens (prefill)
512
Calculation
Model size
—
GB
Fits in VRAM
—
Prefill time
—
ms
Tokens/s (decode)
—
tok/s
Recommendation
Run a simulation to size the setup.
Select models and hardware, then simulate to get a practical read on fit, prompt latency, and decode speed.
VRAM
Weights must fit in aggregate GPU memory before serving is realistic.
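A minimal sketch of how this fit check could be computed, assuming the model is described only by parameter count and quantization. The bytes-per-parameter table, the ~10% overhead margin, and the function names are illustrative assumptions, not the calculator's actual code.

```typescript
// Approximate bytes per parameter for common quantizations (assumed mapping).
const BYTES_PER_PARAM: Record<string, number> = {
  fp16: 2,
  int8: 1,
  int4: 0.5,
};

// Estimated weight footprint in GB for a model of `paramsB` billion parameters.
function weightGb(paramsB: number, quant: string): number {
  return (paramsB * 1e9 * BYTES_PER_PARAM[quant]) / 1e9;
}

// Fit check against aggregate VRAM, with a rough ~10% margin reserved for
// activations, KV cache, and runtime overhead (an assumption, not a measurement).
function fitsInVram(paramsB: number, quant: string, vramPerGpuGb: number, gpuCount: number): boolean {
  return weightGb(paramsB, quant) * 1.1 <= vramPerGpuGb * gpuCount;
}
```

By this check, a 70B model at 4-bit is roughly 35 GB of weights and fits a single 80 GB GPU with room to spare.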
Prefill
Prompt processing time scales with model size and prompt length, and is limited by total available compute.
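A minimal sketch of a common compute-bound prefill estimate: roughly 2 FLOPs per parameter per prompt token, divided by usable compute. The efficiency factor and names are assumptions for illustration, not the calculator's actual formula.

```typescript
// Approximate prefill time in milliseconds, treating prefill as compute-bound.
// paramsB: model size in billions of parameters
// promptTokens: prompt length, e.g. the default 512
// totalTflops: peak compute summed across GPUs, in TFLOPS
// efficiency: fraction of peak FLOPS actually achieved (assumed ~0.5)
function prefillMs(paramsB: number, promptTokens: number, totalTflops: number, efficiency = 0.5): number {
  const flops = 2 * paramsB * 1e9 * promptTokens; // ~2 FLOPs per weight per token
  return (flops / (totalTflops * 1e12 * efficiency)) * 1000;
}
```

Under these assumptions, a 7B model with a 512-token prompt and about 300 TFLOPS of peak compute comes out around 50 ms; real prefill also pays attention and kernel-launch overheads this ignores.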
Decode
Generation speed is usually constrained by memory bandwidth, since each new token rereads the model weights.
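A minimal sketch of a bandwidth-bound decode estimate, assuming each generated token streams the full weights from memory once. The efficiency factor and names are illustrative assumptions.

```typescript
// Approximate single-stream decode throughput in tokens/second.
// modelWeightGb: weight footprint in GB after quantization
// totalBandwidthGbps: memory bandwidth summed across GPUs, in GB/s
// efficiency: fraction of peak bandwidth actually achieved (assumed ~0.7)
function decodeTokensPerSec(modelWeightGb: number, totalBandwidthGbps: number, efficiency = 0.7): number {
  return (totalBandwidthGbps * efficiency) / modelWeightGb;
}
```

For roughly 35 GB of weights on a GPU with about 3 TB/s of bandwidth, this gives on the order of 60 tok/s for a single stream.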
Select one or more models and run a simulation to see concurrent streamed output…