Last updated: April 5, 2026 · Core Concepts · by Daniel Ashford

What is Inference?

QUICK ANSWER

The process of an LLM generating a response to your input.

Definition

Inference is the process of running a trained language model to generate output from a given input. When you send a prompt to an LLM API and receive a response, that entire process is inference. Unlike training (which updates the model weights), inference uses existing weights to produce predictions.
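The definition above can be sketched as a toy loop. The lookup table below is entirely made up for illustration: it stands in for a model's frozen, trained weights, which inference reads but never updates, and each loop iteration mimics one decode step of autoregressive generation.

```python
# Toy sketch of autoregressive inference (not a real LLM).
# NEXT_TOKEN plays the role of frozen, trained "weights":
# inference only reads them; training would update them.

NEXT_TOKEN = {  # hypothetical weights, invented for this example
    "<start>": "Hello",
    "Hello": "world",
    "world": "<end>",
}

def generate(prompt: str, max_tokens: int = 10) -> list[str]:
    """Generate tokens one at a time until an end token or the limit."""
    output = []
    token = prompt
    for _ in range(max_tokens):
        token = NEXT_TOKEN.get(token, "<end>")  # one inference step
        if token == "<end>":
            break
        output.append(token)
    return output

print(generate("<start>"))  # prints ['Hello', 'world']
```

A real model replaces the lookup table with billions of learned parameters, but the shape of the loop, producing one token at a time from fixed weights, is the same.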

How It Works

Inference performance is measured in tokens per second (TPS), which captures throughput, and time-to-first-token (TTFT), which captures latency. Faster inference means lower latency and higher throughput. Inference cost is what you pay per API call — typically priced per million tokens. Self-hosted inference instead requires your own GPU hardware, with cost measured in compute time.
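These metrics reduce to simple arithmetic. The sketch below uses made-up timings and an illustrative price, not any provider's actual rates:

```python
# How the two common inference metrics are computed.
# All numbers here are hypothetical, for illustration only.

def ttft(first_token_time: float, request_time: float) -> float:
    """Time-to-first-token: delay between sending the request
    and receiving the first token of the response."""
    return first_token_time - request_time

def cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """API cost: tokens consumed times the price per million tokens."""
    return tokens / 1_000_000 * price_per_million_usd

# Request sent at t=10.0 s, first token arrives at t=10.35 s.
print(round(ttft(10.35, 10.0), 2))   # 0.35 seconds of latency
# One million tokens at an assumed $15 per million tokens.
print(cost_usd(1_000_000, 15.0))     # 15.0 dollars
```

In practice, providers bill input and output tokens at different rates, so a real cost calculation would apply `cost_usd` separately to each.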

Example

When Claude Opus 4 generates a 500-word response in 2.1 seconds, that is inference. The 2.1 seconds is the inference latency, and the cost is calculated based on the tokens consumed.
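Using the rough rule of thumb that a token is about three quarters of a word (so a word is about 4/3 tokens), the numbers in this example work out as follows. The conversion factor is an approximation, not an exact tokenizer count:

```python
# Back-of-the-envelope numbers for the 500-word example above.
# Assumes ~4/3 tokens per word -- a rough rule of thumb.

words = 500
seconds = 2.1

tokens = round(words * 4 / 3)   # ≈667 tokens for 500 words
tps = tokens / seconds          # decode throughput in tokens/second

print(tokens)          # 667
print(round(tps, 1))   # 317.6
```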

Related Terms

Latency
How long it takes to receive a response; often measured from the request to the arrival of the first token (TTFT).
Tokens
The basic units of text that LLMs process — roughly 3/4 of a word.
GPU (Graphics Processing Unit)
The specialized hardware that LLMs run on.
API (Application Programming Interface)
The technical interface that lets your software send prompts to an LLM and receive responses.

See How Models Compare

Understanding inference is important when choosing the right AI model. See how 12 models compare on our leaderboard.
