Last updated: April 5, 2026 · Pricing & Deployment · by Daniel Ashford

What is Latency?

QUICK ANSWER

How long it takes to receive the first token of a response.

Definition

Latency refers to the time delay between sending a request and receiving the response. The most common metric is Time to First Token (TTFT) — how long until the first token appears.

How It Works

Latency varies dramatically between models: Gemini 2.5 Flash achieves a 0.4s TTFT, while Claude Opus 4 averages 2.1s. For real-time chat, sub-second TTFT is generally required. Contributing factors include model size, server load, geographic distance, and prompt length. Streaming improves perceived latency by delivering tokens as they are generated instead of waiting for the complete response.
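To make TTFT concrete, here is a minimal Python sketch of how you might measure it against any streaming token iterator. The simulated generator below is an assumption standing in for a real streaming API response; with a real provider SDK you would pass its stream object instead.

```python
import time

def measure_ttft(stream):
    """Return (first_token, seconds elapsed) for a streaming token iterator."""
    start = time.perf_counter()
    first = next(stream)          # block until the first token arrives
    return first, time.perf_counter() - start

def simulated_stream(delay=0.05, tokens=("Hello", ",", " world")):
    # Hypothetical stand-in for a real API stream: tokens arrive one at a time.
    for tok in tokens:
        time.sleep(delay)
        yield tok

token, ttft = measure_ttft(simulated_stream())
print(f"First token {token!r} arrived after {ttft:.3f}s")
```

Because the timer stops at the first token rather than the full response, this captures exactly the TTFT metric described above, and it works unchanged whether the stream takes 0.4s or 2.1s to start.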

Example

Claude Opus 4 averages a 2.1s TTFT, which is acceptable for research tasks but potentially too slow for a real-time chatbot.

Related Terms

Inference
The process of an LLM generating a response to your input.
Streaming
Receiving the model response word-by-word in real-time instead of waiting for the full answer.
API (Application Programming Interface)
The technical interface that lets your software send prompts to an LLM and receive responses.

See How Models Compare

Understanding latency is important when choosing the right AI model. See how 12 models compare on our leaderboard.

Daniel Ashford
Founder & Lead Evaluator · 200+ models evaluated