In the Weeds: How to Run Multiple AI Models with LiteLLM
A technical deep-dive on model routing, fallbacks, and cost optimization using LiteLLM — the universal API gateway for AI models.
The Problem With One Model
Here is a scenario every AI developer hits eventually. You have built your app on GPT-4. It is working great. Then one day, your costs triple because usage spiked. Or the API goes down for thirty minutes during peak hours. Or you realize Claude is better at the specific task your app needs. Or a new open-source model drops that could handle 60% of your requests at a tenth the cost.
You are locked in. Switching means rewriting your API calls, changing your prompt format, updating your response parsing. Every model provider has a slightly different API shape, different auth, different quirks.
LiteLLM exists to solve exactly this problem. And once you understand it, you will wonder why you ever talked to a model provider directly.
What LiteLLM Does
LiteLLM is a universal API gateway. It provides a single, OpenAI-compatible interface that routes to over 100 different AI model providers. Anthropic, OpenAI, Google, Mistral, Cohere, Replicate, local models through Ollama — all of them, same API format.
You write your code once. LiteLLM handles the translation.
```python
from litellm import completion

# Call Claude
response = completion(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Call GPT-4 — same exact interface
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Call a local Ollama model — still the same
response = completion(
    model="ollama/llama3",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
Same function. Same parameters. Same response format. The only thing that changes is the model string.
Setting Up the Proxy Server
LiteLLM can run as a Python library (as shown above) or as a standalone proxy server. The proxy server is where things get really powerful, because it lets any application — regardless of language — talk to any model through a single endpoint.
Install and start:
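A minimal setup looks like this. The CLI flags follow LiteLLM's conventions and the model string is just an example; check the docs for your installed version:

```shell
# Install LiteLLM with proxy-server support
pip install 'litellm[proxy]'

# Start the proxy, routing all requests to Claude (model string is an example)
litellm --model claude-sonnet-4-20250514 --port 4000
```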
That is it. You now have a local server running on port 4000 that accepts OpenAI-format requests and routes them to Claude. Any tool that speaks the OpenAI API can now use Claude without any code changes.
But here is where it gets interesting.
Model Routing and Fallbacks
The real power of LiteLLM is in its routing configuration. Create a config.yaml:
```yaml
model_list:
  - model_name: "fast"
    litellm_params:
      model: "claude-haiku-4-20250514"
      api_key: "sk-ant-..."
  - model_name: "smart"
    litellm_params:
      model: "claude-sonnet-4-20250514"
      api_key: "sk-ant-..."
  - model_name: "smart"
    litellm_params:
      model: "gpt-4"
      api_key: "sk-..."
  - model_name: "cheap"
    litellm_params:
      model: "ollama/llama3"
      api_base: "http://localhost:11434"
```
Notice that "smart" has two entries. LiteLLM will load-balance between them. If one fails, it automatically falls back to the other. Your app just asks for "smart" and gets whichever is available.
Start the proxy with this config:
```bash
litellm --config config.yaml
```
Now your app can request "fast", "smart", or "cheap" models without knowing or caring which provider is actually handling the request.
Cost Optimization Strategies
This is where LiteLLM pays for itself. Literally.
Route by task complexity. Simple tasks — classification, extraction, yes/no questions — go to the "cheap" tier. Complex tasks — analysis, code generation, creative writing — go to "smart." You can cut your AI spend by 40-60% without any quality loss on the tasks that matter.
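That routing decision can be as simple as a lookup. A minimal sketch, assuming the model aliases from the config above ("cheap" and "smart") and a hypothetical task-type label supplied by the caller; a real system might use a small classifier instead:

```python
# Task types we consider simple enough for the cheap tier (illustrative set)
SIMPLE_TASKS = {"classification", "extraction", "yes_no"}

def pick_model(task_type: str) -> str:
    """Map a task type to a LiteLLM model alias from the config."""
    return "cheap" if task_type in SIMPLE_TASKS else "smart"
```

Your application then passes `pick_model(task_type)` as the model name on every request, and the proxy handles the rest.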
Set budgets and rate limits. LiteLLM supports per-key and per-model budgets:
```yaml
litellm_settings:
  max_budget: 100  # USD per month
  budget_duration: "monthly"
```
When you hit the budget, requests get rejected with a clear error instead of silently running up your bill.
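Conceptually, the budget check behaves like this sketch (a simplified illustration of the behavior described above, not LiteLLM's actual implementation):

```python
class BudgetExceededError(Exception):
    """Raised when a request would push spend past the configured budget."""

class BudgetTracker:
    def __init__(self, max_budget_usd: float):
        self.max_budget = max_budget_usd
        self.spent = 0.0

    def record(self, cost_usd: float) -> None:
        """Reject the request if it would exceed the budget; otherwise record it."""
        if self.spent + cost_usd > self.max_budget:
            raise BudgetExceededError(f"budget of ${self.max_budget} exhausted")
        self.spent += cost_usd
```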
Track spending in real time. LiteLLM logs every request with cost data. You can see exactly how much each model, each user, each feature is costing you.
Load Balancing
When you define multiple providers for the same model name, LiteLLM balances across them. But you can control how:
```yaml
router_settings:
  routing_strategy: "least-busy"  # or "simple-shuffle", "latency-based-routing"
  num_retries: 3
  timeout: 30
```
The "least-busy" strategy sends requests to whichever provider has the fewest in-flight requests. "Latency-based-routing" learns which provider is fastest and prefers it. Both include automatic failover — if a provider returns a 500 or times out, the request silently retries on the next one.
Your users never see an error. They just get a response.
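The least-busy choice itself is simple to picture. A toy sketch, with made-up provider names and in-flight request counts:

```python
def least_busy(in_flight: dict) -> str:
    """Return the provider currently handling the fewest in-flight requests."""
    return min(in_flight, key=in_flight.get)
```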
Integration With the Vercel AI SDK
If you are building a web app, the Vercel AI SDK pairs beautifully with LiteLLM. The AI SDK provides frontend streaming and React hooks. LiteLLM provides the backend routing.
Point the AI SDK at your LiteLLM proxy:
```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const litellm = createOpenAI({
  baseURL: "http://localhost:4000/v1",
  apiKey: "your-litellm-key",
});

const result = await generateText({
  model: litellm("smart"),
  prompt: "Explain quantum computing",
});
```
Now your frontend gets streaming responses from whichever model LiteLLM routes to. You can switch from Claude to GPT-4 to a local model by changing the config file. No code changes. No redeployment.
Caching
LiteLLM supports response caching, which can dramatically reduce costs for repeated queries:
```yaml
litellm_settings:
  cache: true
  cache_params:
    type: "redis"
    host: "localhost"
    port: 6379
    ttl: 3600
```
Identical requests return cached responses instantly. This is especially valuable for classification and extraction tasks where the same inputs appear frequently.
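Conceptually, the cache keys on the full request: identical (model, messages) pairs hash to the same key, so the second call is a hit. A sketch of that idea (not LiteLLM's internal key scheme):

```python
import hashlib
import json

def cache_key(model: str, messages: list) -> str:
    """Derive a deterministic cache key from the model and message list."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```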
Monitoring and Observability
You cannot optimize what you cannot measure. LiteLLM includes built-in logging and integrates with observability tools:
```yaml
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
```
Every request gets logged with: model used, latency, token count, cost, success/failure. You can see which models are performing best, which are costing the most, and where failures are happening.
Try This Now
1. Install LiteLLM: `pip install 'litellm[proxy]'`
2. Create a config.yaml with at least two model providers
3. Start the proxy and point an existing app at it
4. Monitor costs for a week and see where you can optimize
Production Deployment Tips
Run the proxy as a service. Use Docker, systemd, or your orchestrator of choice. LiteLLM provides an official Docker image.
Put it behind a reverse proxy. Nginx or Caddy in front of LiteLLM gives you TLS, rate limiting, and access logging.
Use virtual keys. LiteLLM can issue API keys that map to specific models and budgets. Give each team or service its own key with its own limits.
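For example, keys can be issued through the proxy's key-management endpoint. A sketch based on LiteLLM's documented API; the endpoint path and field names should be verified against your installed version, and the budget and model aliases here are examples:

```shell
# Issue a virtual key limited to two model aliases and a $25 budget
# (assumes the proxy is running locally and LITELLM_MASTER_KEY is set)
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"models": ["fast", "cheap"], "max_budget": 25}'
```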
Start with two providers. You do not need to set up ten models on day one. Start with your primary provider and one fallback. Add more as you understand your traffic patterns.
The Bigger Picture
LiteLLM is infrastructure. It is the kind of tool that sits quietly in your stack, doing its job, saving you money, and preventing outages. You do not think about it most days. But the day your primary model provider goes down and your app keeps working because LiteLLM silently failed over? That is the day you send the maintainers a thank-you note.
Model lock-in is a real risk in AI development. LiteLLM eliminates it. And that freedom — to choose the best model for each task, to switch providers without code changes, to control costs without sacrificing quality — is worth more than any single model improvement.