Last updated: April 5, 2026 · Pricing & Deployment · by Daniel Ashford
What is Latency?
How long it takes to receive the first token of a response.
Definition
Latency is the delay between sending a request and receiving a response. The most common metric for LLMs is Time to First Token (TTFT): the time from sending the request until the first token of the response appears.
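The definition above can be sketched in code. This is a minimal, hypothetical example (not any provider's actual API): `fake_stream` stands in for a streaming model response, and TTFT is simply the wall-clock time until the first token arrives.

```python
import time

def measure_ttft(stream):
    """Return (seconds until the first token, the first token itself)."""
    start = time.monotonic()
    first_token = next(stream)          # blocks until the first token arrives
    return time.monotonic() - start, first_token

# Simulated streaming response: the "model" thinks for 0.2s, then emits tokens.
def fake_stream(delay=0.2):
    time.sleep(delay)                   # stand-in for model + network latency
    yield from ["Hello", ",", " world"]

ttft, token = measure_ttft(fake_stream())
print(f"TTFT: {ttft:.2f}s, first token: {token!r}")
```

With a real provider, the same measurement works by timing the first chunk of a streamed response.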
How It Works
Latency varies widely across models: Gemini 2.5 Flash achieves a 0.4s TTFT, while Claude Opus 4 averages 2.1s. For real-time chat, sub-second TTFT is generally required. Contributing factors include model size, server load, geographic distance to the serving region, and prompt length. Streaming improves perceived latency by displaying tokens as they are generated rather than waiting for the full response.
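The effect of streaming on perceived latency can be demonstrated with a toy simulation (the token generator below is hypothetical, not a real model): blocking waits for the whole response, while streaming shows output as soon as the first token is ready.

```python
import time

def generate_tokens(n=5, per_token=0.05):
    """Simulated model emitting one token every `per_token` seconds."""
    for i in range(n):
        time.sleep(per_token)
        yield f"tok{i}"

# Blocking: the user sees nothing until every token has been generated.
start = time.monotonic()
full = list(generate_tokens())
blocking_wait = time.monotonic() - start

# Streaming: the user sees output as soon as the first token arrives.
start = time.monotonic()
stream = generate_tokens()
first = next(stream)
streaming_wait = time.monotonic() - start
rest = list(stream)                     # remaining tokens render incrementally

print(f"first visible output: {streaming_wait:.2f}s (streaming) "
      f"vs {blocking_wait:.2f}s (blocking)")
```

Total generation time is identical in both cases; streaming only changes when the user first sees something, which is why it improves perceived latency without changing throughput.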
Example
Claude Opus 4 averages a 2.1s TTFT — acceptable for research tasks, but potentially too slow for a real-time chatbot.
See How Models Compare
Latency is a key factor when choosing an AI model. See how 12 models compare on our leaderboard.