Last updated: April 5, 2026 · Core Concepts · by Daniel Ashford
What is Inference?
The process of an LLM generating a response to your input.
Definition
Inference is the process of running a trained language model to generate output from a given input. When you send a prompt to an LLM API and receive a response, that entire process is inference. Unlike training (which updates the model weights), inference uses existing weights to produce predictions.
How It Works
Inference performance is measured along two axes: time-to-first-token (TTFT), the delay before the first token of the response appears, and tokens per second (TPS), the rate at which subsequent tokens are generated. Lower TTFT and higher TPS mean lower perceived latency. For hosted APIs, inference cost is what you pay per call, typically billed per million input and output tokens. Self-hosted inference instead requires your own GPU hardware, with costs measured in compute time.
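The metrics above can be sketched in a few lines. This is a minimal illustration with made-up timestamps, token counts, and a placeholder price — not any provider's real billing rate.

```python
# Hypothetical measurements for a single streaming API call.
# All values here are illustrative assumptions, not real data.
request_sent = 0.00        # seconds
first_token_at = 0.35      # seconds
last_token_at = 2.10       # seconds
output_tokens = 640

ttft = first_token_at - request_sent              # time-to-first-token: 0.35 s
generation_time = last_token_at - first_token_at  # streaming phase: 1.75 s
tps = output_tokens / generation_time             # tokens per second

# Hosted APIs typically quote prices per million tokens; this rate
# is a placeholder, not a real price.
price_per_million_output = 15.00  # USD, assumed
cost = output_tokens / 1_000_000 * price_per_million_output
```

In practice TTFT dominates perceived responsiveness for short replies, while TPS dominates for long streamed outputs.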
Example
When Claude Opus 4 generates a 500-word response in 2.1 seconds, that entire generation is inference. The 2.1 seconds is the inference latency, and the cost is calculated from the tokens consumed — both the prompt you sent and the tokens generated in response.
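To put a rough throughput number on that example: a common rule of thumb is that one English word is about 1.33 tokens. That ratio is an approximation, not an exact tokenizer count, so treat the result as a ballpark.

```python
# Ballpark throughput for the 500-word / 2.1-second example,
# assuming ~1.33 tokens per English word (a rule of thumb).
words = 500
latency_s = 2.1
approx_tokens = words * 1.33        # ~665 tokens
tps = approx_tokens / latency_s     # ~317 tokens per second
```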
See How Models Compare
Understanding inference is important when choosing the right AI model. See how 12 models compare on our leaderboard.