Building a Multi-Provider AI Gateway with LiteLLM for Web Applications (Complete 2026 Guide)
A complete tutorial for building a production-ready AI Gateway using LiteLLM + FastAPI: from concepts and architecture through a runnable implementation to best practices and common pitfalls you must avoid.
Level: Intermediate
Estimated reading time: 15 minutes
Stack: Python, FastAPI, LiteLLM Proxy, Redis (optional), Docker
1) Introduction — What & Why
If you are building AI features in a web application, the first problem is usually not “which model is the smartest,” but how to manage multiple models cleanly.
Real-world examples:
- Today you use OpenAI because it is fast.
- Tomorrow you need Anthropic because its reasoning quality is better.
- Next week the finance team asks for a cheaper model for specific endpoints.
- Then rate limits, observability, and ballooning bills become problems.
If every provider change forces you to modify many parts of your application code, that is a sign your architecture is not yet scalable.
This is where an AI Gateway comes in. With a gateway, your application talks to one internal endpoint, and the gateway handles routing to the right AI provider. One of the most popular open-source tools for this in 2026 is LiteLLM (currently trending on GitHub).
In this tutorial, we will build a practical architecture:
- FastAPI app (your web backend)
- LiteLLM Proxy (model gateway)
- fallback, timeout, error handling
- logging + production best practices
The goal is simple: you can switch AI providers without overhauling the application.
2) Prerequisites
Before starting, make sure you have:
- Python 3.11+
- Basic FastAPI and REST API knowledge
- At least one provider API key (e.g., OpenAI or Anthropic)
- Docker & Docker Compose (optional, but highly recommended)
- Basic understanding of environment variables
Install local dependencies:
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install fastapi uvicorn httpx pydantic-settings python-dotenv
For the gateway:
pip install "litellm[proxy]"
3) Core Concepts (with analogy)
Imagine you run a delivery restaurant.
- Customers: your web application
- Kitchen A: OpenAI
- Kitchen B: Anthropic
- Kitchen C: local model
- Kitchen manager: LiteLLM Gateway
Customers do not need to know which kitchen is cooking. They just order through the manager, and the manager picks the best kitchen based on rules: price, load, or quality.
Key concepts you should understand:
- Unified API format: all providers are wrapped in an OpenAI-like format, so client code stays consistent (see the client sketch below).
- Routing & fallback: if the primary model fails or hits a rate limit, the request automatically moves to a backup model.
- Policy layer: rules such as timeouts, budgets, key management, and logging live in the gateway instead of being scattered across the app.
- Observability: usage, latency, and errors can be monitored from a single entry point.
In short, an AI Gateway is like a universal adapter + traffic controller.
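Because the gateway exposes an OpenAI-compatible API, existing OpenAI client code can point at it with only a base URL change. A minimal sketch, assuming the official openai Python package is installed and a LiteLLM proxy is already running on localhost:4000 with a model alias named primary-chat (we set this up in Steps A and B below):

from openai import OpenAI

# Point the stock OpenAI client at the gateway instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:4000/v1",   # the LiteLLM proxy, not a provider
    api_key="super-secret-master-key",     # a key accepted by the gateway
)

resp = client.chat.completions.create(
    model="primary-chat",  # internal alias; the gateway resolves the real model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)

The client never knows which provider served the request; swapping providers is a config change in the gateway, not a code change here.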
4) Architecture / Diagram
Simple architecture we will implement:
+---------------------+            +--------------------------+
| Frontend (Web/App)  |   HTTPS    |      FastAPI Backend     |
| React / Next.js     +----------->+ /api/chat                |
+---------------------+            | - validate input         |
                                   | - auth/rate limit app    |
                                   +------------+-------------+
                                                |
                                                | OpenAI-compatible API
                                                v
                                   +--------------------------+
                                   | LiteLLM Proxy Gateway    |
                                   | - model routing          |
                                   | - fallback               |
                                   | - cost control           |
                                   | - logs/metrics           |
                                   +-----+--------------+-----+
                                         |              |
                                         v              v
                                    OpenAI API    Anthropic API
Why keep the backend layer (instead of frontend calling gateway directly)?
- Keep API keys secure
- Add business rules (user quotas, role-based behavior)
- Easier auditing and tracing per internal user
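To make the business-rules point concrete, here is a minimal sketch of a per-user daily quota enforced in the backend before any request reaches the gateway. The limit, in-memory storage, and function name are illustrative; in production you would back this with Redis or your user database.

from collections import defaultdict

from fastapi import HTTPException

DAILY_LIMIT = 50                            # illustrative per-user cap
_usage: dict[str, int] = defaultdict(int)   # swap for Redis in production

def enforce_quota(user_id: str) -> None:
    """Reject the request with 429 once a user exceeds the daily limit."""
    if _usage[user_id] >= DAILY_LIMIT:
        raise HTTPException(status_code=429, detail="Daily AI quota exceeded")
    _usage[user_id] += 1

Calling enforce_quota(req.user_id) at the top of the /api/chat handler (built in Step C) keeps this rule in the app layer, while model routing stays in the gateway.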
5) Step-by-Step Implementation (Complete Runnable)
Step A — Prepare LiteLLM config
Create litellm_config.yaml:
model_list:
  - model_name: primary-chat
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: fallback-chat
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 2
  timeout: 20

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
Note: the model names primary-chat and fallback-chat are internal aliases; clients reference the alias, and the gateway maps it to the actual provider model.
Step B — Run LiteLLM Proxy
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export LITELLM_MASTER_KEY="super-secret-master-key"

litellm --config litellm_config.yaml --port 4000
If successful, the gateway is active at http://localhost:4000.
Step C — Create FastAPI app
Create app.py:
from __future__ import annotations

import os
from typing import Any

import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI(title="AI Gateway Demo", version="1.0.0")

LITELLM_BASE_URL = os.getenv("LITELLM_BASE_URL", "http://localhost:4000")
LITELLM_API_KEY = os.getenv("LITELLM_API_KEY", "super-secret-master-key")
REQUEST_TIMEOUT_SECONDS = float(os.getenv("REQUEST_TIMEOUT_SECONDS", "25"))


class ChatRequest(BaseModel):
    message: str = Field(min_length=1, max_length=4000)
    user_id: str = Field(min_length=1, max_length=128)


class ChatResponse(BaseModel):
    reply: str
    model_used: str


@app.get("/health")
async def health() -> dict[str, str]:
    return {"status": "ok"}


@app.post("/api/chat", response_model=ChatResponse)
async def chat(req: ChatRequest) -> ChatResponse:
    payload: dict[str, Any] = {
        "model": "primary-chat",
        "messages": [
            {
                "role": "system",
                "content": "You are a technical assistant who answers briefly, clearly, and accurately.",
            },
            {"role": "user", "content": req.message},
        ],
        "temperature": 0.3,
        "metadata": {
            "app_user_id": req.user_id,
            "feature": "web_chat",
        },
    }
    headers = {
        "Authorization": f"Bearer {LITELLM_API_KEY}",
        "Content-Type": "application/json",
    }

    try:
        async with httpx.AsyncClient(timeout=REQUEST_TIMEOUT_SECONDS) as client:
            resp = await client.post(
                f"{LITELLM_BASE_URL}/v1/chat/completions",
                headers=headers,
                json=payload,
            )

            # Handle error responses from the gateway
            if resp.status_code >= 400:
                detail = resp.text[:500]
                raise HTTPException(
                    status_code=502,
                    detail=f"Gateway error ({resp.status_code}): {detail}",
                )

            data = resp.json()
            choices = data.get("choices", [])
            if not choices:
                raise HTTPException(status_code=502, detail="Empty response from gateway")

            message = choices[0].get("message", {})
            content = message.get("content", "")
            model_used = data.get("model", "unknown")

            return ChatResponse(reply=content, model_used=model_used)

    except httpx.TimeoutException:
        raise HTTPException(status_code=504, detail="AI request timed out")
    except httpx.HTTPError as exc:
        raise HTTPException(status_code=502, detail=f"Network error to gateway: {exc}")
    except ValueError:
        raise HTTPException(status_code=502, detail="Invalid JSON from gateway")
Run backend:
export LITELLM_BASE_URL="http://localhost:4000"
export LITELLM_API_KEY="super-secret-master-key"

uvicorn app:app --reload --port 8000
Test endpoint:
curl -X POST "http://localhost:8000/api/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Explain what an AI Gateway is in 2 sentences.",
    "user_id": "user-123"
  }'
Step D — Add manual fallback at application level (optional)
Even though the gateway already has retries, sometimes you need specific fallback logic per feature. Example helper:
async def request_with_fallback(
    client: httpx.AsyncClient, headers: dict[str, str], user_text: str
) -> dict:
    models = ["primary-chat", "fallback-chat"]
    for model_name in models:
        payload = {
            "model": model_name,
            "messages": [{"role": "user", "content": user_text}],
        }
        try:
            resp = await client.post(
                "http://localhost:4000/v1/chat/completions",
                headers=headers,
                json=payload,
                timeout=20,
            )
            if resp.status_code < 400:
                return resp.json()
        except httpx.HTTPError:
            # Move on to the next model
            continue
    raise RuntimeError("All models failed")
This is useful for critical scenarios, such as support chatbots that must keep responding.
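For completeness, a sketch of how this helper could be wired into an endpoint. The wrapper name and error messages are illustrative, and it reuses the request_with_fallback helper defined above:

import os

import httpx
from fastapi import HTTPException

async def chat_with_manual_fallback(user_text: str) -> str:
    """Illustrative wrapper: call the gateway with per-feature fallback, then parse defensively."""
    headers = {
        "Authorization": f"Bearer {os.getenv('LITELLM_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    try:
        async with httpx.AsyncClient() as client:
            data = await request_with_fallback(client, headers, user_text)
    except RuntimeError:
        raise HTTPException(status_code=502, detail="All AI models are unavailable")

    choices = data.get("choices", [])
    if not choices:
        raise HTTPException(status_code=502, detail="Empty response from gateway")
    return choices[0].get("message", {}).get("content", "")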
Step E — Docker Compose (ready to run)
Minimal docker-compose.yml:
version: "3.9"

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    ports:
      - "4000:4000"
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
    volumes:
      - ./litellm_config.yaml:/app/config.yaml:ro

  backend:
    build: .
    ports:
      - "8000:8000"
    environment:
      LITELLM_BASE_URL: http://litellm:4000
      LITELLM_API_KEY: ${LITELLM_MASTER_KEY}
    depends_on:
      - litellm
This makes dev/staging environments more consistent.
6) Best Practices (industry tips)
- Separate app concerns from gateway concerns:
  - App: user auth, business logic
  - Gateway: model routing, AI policy
- Set strict timeouts: do not let AI requests hang too long; ideally 10–30 seconds depending on the use case.
- Use model tiering:
  - cheaper model for drafts/summarization
  - premium model for heavy reasoning
- Tag metadata per request: store user_id, feature, and team so cost monitoring is more transparent.
- Implement graceful degradation: if all models fail, return an informative fallback response, not a raw 500 (see the sketch after this list).
- Do not hardcode API keys: always use a secret manager or environment variables.
- Audit prompts & outputs: for production products, add safety filters (PII redaction, prompt-injection guardrails).
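To illustrate the graceful-degradation point, here is a minimal sketch that wraps the /api/chat handler from Step C and converts gateway failures into a friendly canned reply. The wrapper name and message text are illustrative:

from fastapi import HTTPException

FALLBACK_REPLY = (
    "The AI assistant is temporarily unavailable. "
    "Please try again in a few minutes."
)

async def chat_with_degradation(req: ChatRequest) -> ChatResponse:
    """Turn upstream failures (502/504) into a canned reply instead of a raw error."""
    try:
        return await chat(req)  # the /api/chat handler defined in app.py (Step C)
    except HTTPException as exc:
        if exc.status_code in (502, 504):
            return ChatResponse(reply=FALLBACK_REPLY, model_used="degraded")
        raise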
7) Common Mistakes (that happen often)
Mistake 1 — Frontend directly hits provider
Consequence: API keys leak, auditing is difficult, policy enforcement is difficult.
Solution: route through backend + gateway.
Mistake 2 — No fallback
When a provider outage happens, your AI features go down entirely.
Solution: prepare at least one cross-vendor backup model.
Mistake 3 — No cost monitoring per feature
Bills increase but the team does not know which endpoint caused it.
Solution: metadata + cost tracking dashboard.
Mistake 4 — Blind retries
Unlimited retries worsen latency and costs.
Solution: limited retries + exponential backoff + circuit breaker.
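A minimal sketch of bounded retries with exponential backoff and jitter around a gateway call. The attempt count and delays are illustrative, and a real circuit breaker would add failure-rate tracking on top:

import asyncio
import random

import httpx

async def post_with_backoff(client: httpx.AsyncClient, url: str, **kwargs) -> httpx.Response:
    """Retry transient failures (5xx, 429, network errors) with capped exponential backoff."""
    max_attempts = 3
    base_delay = 0.5  # seconds
    for attempt in range(1, max_attempts + 1):
        try:
            resp = await client.post(url, **kwargs)
            if resp.status_code < 500 and resp.status_code != 429:
                return resp  # success or a non-retryable client error
        except httpx.TransportError:
            pass  # network hiccup: fall through and retry
        if attempt < max_attempts:
            await asyncio.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.2))
    raise RuntimeError("Gateway request failed after retries")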
Mistake 5 — Overly optimistic response parsing
Assuming choices[0].message.content always exists.
Solution: validate JSON structure before using it.
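One way to avoid optimistic parsing is to validate the completion payload with Pydantic before using it. A sketch with illustrative model names (these are not types from any provider SDK):

from pydantic import BaseModel

class _Message(BaseModel):
    content: str = ""

class _Choice(BaseModel):
    message: _Message

class _Completion(BaseModel):
    model: str = "unknown"
    choices: list[_Choice] = []

def extract_reply(data: dict) -> str:
    """Raise a clear error instead of an IndexError/KeyError on malformed payloads."""
    parsed = _Completion.model_validate(data)
    if not parsed.choices:
        raise ValueError("Gateway returned no choices")
    return parsed.choices[0].message.content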
8) Advanced Tips (for those who want to go deeper)
- Canary routing for new models: send 5–10% of traffic to a new model and compare quality, cost, and latency.
- Dynamic model selection: choose the model based on prompt length, SLA, or user tier (free/pro).
- Response caching: for repeated questions (e.g., FAQ), caching can drastically reduce costs (see the sketch after this list).
- Integrated observability: pipe logs into observability platforms (e.g., Langfuse/MLflow/Helicone) for end-to-end traces.
- A/B testing prompt templates: do not only switch models; sometimes proper prompt engineering can cut costs without reducing quality.
- Multi-region failover: if users are global, prepare region-aware routing for lower latency.
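As a concrete example of the caching tip, here is a minimal in-memory sketch keyed on a hash of the model alias plus a normalized prompt. The helper names are illustrative; in production you would use Redis with a TTL:

import hashlib

_cache: dict[str, str] = {}  # swap for Redis in production

def _cache_key(model: str, prompt: str) -> str:
    """Normalize whitespace/case so trivially different prompts share a cache entry."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()

def get_cached_reply(model: str, prompt: str) -> str | None:
    return _cache.get(_cache_key(model, prompt))

def store_reply(model: str, prompt: str, reply: str) -> None:
    _cache[_cache_key(model, prompt)] = reply

Check get_cached_reply before calling the gateway and call store_reply after a successful response; for FAQ-style traffic this can remove a large share of paid requests.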
9) Summary & Next Steps
We have covered how to build a modern AI Gateway with LiteLLM:
- Why an AI Gateway is important for production web applications
- Core concepts of routing, fallback, unified interface
- Full implementation with FastAPI + LiteLLM
- Best practices to keep it secure, cost-efficient, and scalable
- Common mistakes and advanced strategies
Next steps I recommend:
- Deploy this architecture to staging.
- Add per-user rate limiting on the /api/chat endpoint.
- Build an internal dashboard for metrics: p95 latency, error rate, cost per feature.
- Run outage simulation (turn off primary provider, ensure fallback works).
If this stage is complete, you already have an AI platform foundation that is far more production-ready than direct provider integration.
10) References
- LiteLLM Docs (Getting Started): https://docs.litellm.ai/docs/
- LiteLLM Proxy Quick Start: https://docs.litellm.ai/docs/proxy/quick_start
- LiteLLM GitHub Repo: https://github.com/BerriAI/litellm
- GitHub Trending: https://github.com/trending
- X Explore: https://x.com/explore
- Medium Software Engineering Tag: https://medium.com/tag/software-engineering
If you want, in the next advanced version we can discuss: “Implementing per-team budget enforcement + internal chargeback with LiteLLM” so AI cost control becomes more precise.