[000_calculator]
diff. LLM inference simulator
Estimates model VRAM fit, prefill latency from prompt size and compute, and decode throughput from memory bandwidth. Results are simplified planning signals, not benchmark guarantees.
Estimate model fit, prefill latency, and decode throughput across practical GPU choices.
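To make the inputs and outputs concrete, here is a minimal sketch of the shapes implied by the form and the Calculation panel below. All type and field names are illustrative assumptions, not the calculator's actual code.

```typescript
// Inputs mirrored from the form: model, quantization, prompt size, GPU choice.
interface SimulationInput {
  modelParamsB: number;            // model size in billions of parameters
  quant: "fp16" | "int8" | "int4"; // quantization (assumed option set)
  promptTokens: number;            // prompt tokens (prefill), e.g. 512
  gpuVramGb: number;               // VRAM per GPU, in GB
  gpuTflops: number;               // peak compute per GPU, in TFLOPS
  gpuBandwidthGbps: number;        // memory bandwidth per GPU, in GB/s
  gpuCount: number;                // number of GPUs
}

// Outputs mirrored from the Calculation panel.
interface SimulationResult {
  modelSizeGb: number;     // weight footprint, GB
  fitsInVram: boolean;     // whether the weights fit in aggregate VRAM
  prefillMs: number;       // estimated prompt-processing time, ms
  decodeTokPerSec: number; // estimated single-stream decode throughput, tok/s
}
```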
SYS: ONLINE
V.0.1 [LIVE]
GPU Provider
Launch on RunPod
Need GPUs to run this?
Spin up GPU instances on RunPod and benchmark your model against this estimate.
Models
—
GPU
—
1
Inference
—
Model size
Quantization
Prompt tokens (prefill)
512
Calculation
Model size
—
GB
Fits in VRAM
—
Prefill time
—
ms
Tokens/s (decode)
—
tok/s
Recommendation
Run a simulation to size the setup.
Select models and hardware, then simulate to get a practical read on fit, prompt latency, and decode speed.
VRAM
Weights must fit in aggregate GPU memory before serving is realistic.
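A minimal sketch of how this fit check could be computed, assuming the model is described only by parameter count and quantization. The bytes-per-parameter table, the ~10% overhead margin, and the function names are illustrative assumptions, not the calculator's actual code.

```typescript
// Approximate bytes per parameter for common quantizations (assumed mapping).
const BYTES_PER_PARAM: Record<string, number> = {
  fp16: 2,
  int8: 1,
  int4: 0.5,
};

// Estimated weight footprint in GB for a model of `paramsB` billion parameters.
function weightGb(paramsB: number, quant: string): number {
  return (paramsB * 1e9 * BYTES_PER_PARAM[quant]) / 1e9;
}

// Fit check against aggregate VRAM, with a rough ~10% margin reserved for
// activations, KV cache, and runtime overhead (an assumption, not a measurement).
function fitsInVram(paramsB: number, quant: string, vramPerGpuGb: number, gpuCount: number): boolean {
  return weightGb(paramsB, quant) * 1.1 <= vramPerGpuGb * gpuCount;
}
```

By this check, a 70B model at 4-bit is roughly 35 GB of weights and fits a single 80 GB GPU with room to spare.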
Prefill
Prompt processing time scales with model size and prompt length, and is limited by total available compute.
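A minimal sketch of a common compute-bound prefill estimate: roughly 2 FLOPs per parameter per prompt token, divided by usable compute. The efficiency factor and names are assumptions for illustration, not the calculator's actual formula.

```typescript
// Approximate prefill time in milliseconds, treating prefill as compute-bound.
// paramsB: model size in billions of parameters
// promptTokens: prompt length, e.g. the default 512
// totalTflops: peak compute summed across GPUs, in TFLOPS
// efficiency: fraction of peak FLOPS actually achieved (assumed ~0.5)
function prefillMs(paramsB: number, promptTokens: number, totalTflops: number, efficiency = 0.5): number {
  const flops = 2 * paramsB * 1e9 * promptTokens; // ~2 FLOPs per weight per token
  return (flops / (totalTflops * 1e12 * efficiency)) * 1000;
}
```

Under these assumptions, a 7B model with a 512-token prompt and about 300 TFLOPS of peak compute comes out around 50 ms; real prefill also pays attention and kernel-launch overheads this ignores.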
Decode
Generation speed is usually constrained by memory bandwidth, since each new token rereads the model weights.
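A minimal sketch of a bandwidth-bound decode estimate, assuming each generated token streams the full weights from memory once. The efficiency factor and names are illustrative assumptions.

```typescript
// Approximate single-stream decode throughput in tokens/second.
// modelWeightGb: weight footprint in GB after quantization
// totalBandwidthGbps: memory bandwidth summed across GPUs, in GB/s
// efficiency: fraction of peak bandwidth actually achieved (assumed ~0.7)
function decodeTokensPerSec(modelWeightGb: number, totalBandwidthGbps: number, efficiency = 0.7): number {
  return (totalBandwidthGbps * efficiency) / modelWeightGb;
}
```

For roughly 35 GB of weights on a GPU with about 3 TB/s of bandwidth, this gives on the order of 60 tok/s for a single stream.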
Select one or more models and run a simulation to see concurrent streamed output…