Last updated: April 5, 2026 · Pricing & Deployment · by Daniel Ashford

What is Self-Hosting?

QUICK ANSWER

Running an LLM on your own hardware instead of using a cloud API.

Definition

Self-hosting means running a language model on your own infrastructure rather than using a third-party API. This gives you complete control over your data, replaces per-token API fees with fixed infrastructure costs, and allows customization through fine-tuning.

How It Works

Self-hosting is only practical with open-weight models such as Llama 4, Mistral, Qwen, and DeepSeek. The main hardware requirement is NVIDIA GPUs with enough VRAM to hold the model. Monthly costs range from $2-5K for a single-node setup to $10-30K for a production cluster. The break-even point where self-hosting becomes cheaper than API calls typically falls above 50-100M tokens per month.
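The break-even math above is simple to sketch: divide the fixed monthly hosting cost by the API's per-token price. The figures below (a $2,500/month setup, an API priced at $30 per million tokens) are illustrative assumptions, not vendor quotes:

```python
def breakeven_tokens_per_month(hosting_cost_usd: float,
                               api_price_per_million_usd: float) -> float:
    """Monthly token volume at which fixed hosting cost equals API spend.

    Above this volume, self-hosting is cheaper; below it, the API wins.
    """
    return hosting_cost_usd / api_price_per_million_usd * 1_000_000

# Assumed: $2,500/month single-node setup vs. a $30-per-million-token API.
tokens = breakeven_tokens_per_month(2500, 30)
print(f"{tokens / 1e6:.0f}M tokens/month")  # roughly 83M, inside the 50-100M range
```

Cheaper API tiers push the break-even point much higher, which is why the crossover only matters at sustained high volume.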

Example

A hospital self-hosts Llama 3.1 405B, quantized to 4-bit so it fits on 4x H100 GPUs (~$8K/month), to ensure patient data never leaves their infrastructure.

Related Terms

Open Source / Open Weights
LLMs whose model weights are publicly available for download and self-hosting.
GPU (Graphics Processing Unit)
The specialized hardware that LLMs run on.
VRAM
The GPU memory that determines which models can run on which hardware.
Quantization
Compressing an LLM to use less memory by reducing numerical precision.
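VRAM, parameter count, and quantization tie together through a common rule of thumb: weight memory is parameters times bytes per parameter, plus some headroom for activations and the KV cache. A minimal sketch, assuming a ~20% overhead factor (the exact overhead varies by serving stack and context length):

```python
def vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB: weight memory plus ~20% headroom.

    params_billion: model size in billions of parameters
    bits: numerical precision per weight (16 = FP16, 8 = INT8, 4 = 4-bit)
    """
    bytes_per_param = bits / 8
    return params_billion * bytes_per_param * overhead

# A 405B model at 4-bit needs roughly 243 GB, so it fits across
# 4x H100 80GB (320 GB total); at FP16 it would need ~972 GB.
print(vram_gb(405, 4))   # 243.0
print(vram_gb(405, 16))  # 972.0
```

This is why quantization is usually the first lever when a model doesn't fit: halving the bits halves the weight memory.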

See How Models Compare

Understanding self-hosting is important when choosing the right AI model. See how 12 models compare on our leaderboard.

View Leaderboard →
Our Methodology
← Browse all 47 glossary terms
Daniel Ashford
Founder & Lead Evaluator · 200+ models evaluated