Building a Multi-Provider AI Gateway with LiteLLM for Web Applications (Complete 2026 Guide)
A complete tutorial for building a production-ready AI Gateway using LiteLLM + FastAPI: from concepts and architecture through a runnable implementation to best practices and common pitfalls you must avoid.
Level: Intermediate
Estimated reading time: 15 minutes
Stack: Python, FastAPI, LiteLLM Proxy, Redis (optional), Docker
1) Introduction — What & Why
If you are building AI features in a web application, the first problem is usually not “which model is the smartest,” but how to manage multiple models cleanly.
Real-world examples:
- Today you use OpenAI because it is fast.
- Tomorrow you need Anthropic because its reasoning quality is better.
- Next week the finance team asks for a cheaper model for specific endpoints.
- Then rate limits, observability, and ballooning bills become problems.
If every provider change forces you to modify many parts of your application code, that is a sign your architecture is not yet scalable.
This is where an AI Gateway comes in. With a gateway, your application talks to one internal endpoint, and the gateway handles routing to the right AI provider. One of the most popular open-source tools for this in 2026 is LiteLLM (currently trending on GitHub).
In this tutorial, we will build a practical architecture:
- FastAPI app (your web backend)
- LiteLLM Proxy (model gateway)
- fallback, timeout, error handling
- logging + production best practices
The goal is simple: you can switch AI providers without overhauling the application.
2) Prerequisites
Before starting, make sure you have:
- Python 3.11+
- Basic FastAPI and REST API knowledge
- At least one provider API key (e.g., OpenAI or Anthropic)
- Docker & Docker Compose (optional, but highly recommended)
- Basic understanding of environment variables
Install local dependencies:
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install fastapi uvicorn httpx pydantic-settings python-dotenv
For the gateway:
pip install "litellm[proxy]"
3) Core Concepts (with analogy)
Imagine you run a delivery restaurant.
- Customers: your web application
- Kitchen A: OpenAI
- Kitchen B: Anthropic
- Kitchen C: local model
- Kitchen manager: LiteLLM Gateway
Customers do not need to know which kitchen is cooking. They just order through the manager, and the manager picks the best kitchen based on rules: price, load, or quality.
Key concepts you should understand:
- Unified API format: all providers are wrapped in an OpenAI-like format, so client code stays consistent (see the client sketch below).
- Routing & fallback: if the primary model fails or hits a rate limit, the request automatically moves to a backup model.
- Policy layer: rules such as timeouts, budgets, key management, and logging live in the gateway instead of being scattered across the app.
- Observability: usage, latency, and errors can be monitored from a single entry point.
In short, an AI Gateway is like a universal adapter + traffic controller.
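Because the gateway exposes an OpenAI-compatible API, existing OpenAI client code can point at it with only a base URL change. A minimal sketch, assuming the official openai Python package is installed and a LiteLLM proxy is already running on localhost:4000 with a model alias named primary-chat (we set this up in Steps A and B below):

from openai import OpenAI

# Point the stock OpenAI client at the gateway instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:4000/v1",   # the LiteLLM proxy, not a provider
    api_key="super-secret-master-key",     # a key accepted by the gateway
)

resp = client.chat.completions.create(
    model="primary-chat",  # internal alias; the gateway resolves the real model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)

The client never knows which provider served the request; swapping providers is a config change in the gateway, not a code change here.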
4) Architecture / Diagram
Simple architecture we will implement:
+---------------------+            +--------------------------+
| Frontend (Web/App)  |   HTTPS    |      FastAPI Backend     |
| React / Next.js     +----------->+ /api/chat                |
+---------------------+            | - validate input         |
                                   | - auth/rate limit app    |
                                   +------------+-------------+
                                                |
                                                | OpenAI-compatible API
                                                v
                                   +--------------------------+
                                   | LiteLLM Proxy Gateway    |
                                   | - model routing          |
                                   | - fallback               |
                                   | - cost control           |
                                   | - logs/metrics           |
                                   +-----+--------------+-----+
                                         |              |
                                         v              v
                                    OpenAI API    Anthropic API
Why keep the backend layer (instead of frontend calling gateway directly)?
- Keep API keys secure
- Add business rules (user quotas, role-based behavior)
- Easier auditing and tracing per internal user
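To make the business-rules point concrete, here is a minimal sketch of a per-user daily quota enforced in the backend before any request reaches the gateway. The limit, in-memory storage, and function name are illustrative; in production you would back this with Redis or your user database.

from collections import defaultdict

from fastapi import HTTPException

DAILY_LIMIT = 50                            # illustrative per-user cap
_usage: dict[str, int] = defaultdict(int)   # swap for Redis in production

def enforce_quota(user_id: str) -> None:
    """Reject the request with 429 once a user exceeds the daily limit."""
    if _usage[user_id] >= DAILY_LIMIT:
        raise HTTPException(status_code=429, detail="Daily AI quota exceeded")
    _usage[user_id] += 1

Calling enforce_quota(req.user_id) at the top of the /api/chat handler (built in Step C) keeps this rule in the app layer, while model routing stays in the gateway.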
5) Step-by-Step Implementation (Complete Runnable)
Step A — Prepare LiteLLM config
Create litellm_config.yaml:
model_list:
  - model_name: primary-chat
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: fallback-chat
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 2
  timeout: 20

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
Note: the model names primary-chat and fallback-chat are internal aliases; clients reference the alias, and the gateway maps it to the actual provider model.
Step B — Run LiteLLM Proxy
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export LITELLM_MASTER_KEY="super-secret-master-key"

litellm --config litellm_config.yaml --port 4000
If successful, the gateway is active at http://localhost:4000.
Step C — Create FastAPI app
Create app.py:
from __future__ import annotations

import os
from typing import Any

import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI(title="AI Gateway Demo", version="1.0.0")

LITELLM_BASE_URL = os.getenv("LITELLM_BASE_URL", "http://localhost:4000")
LITELLM_API_KEY = os.getenv("LITELLM_API_KEY", "super-secret-master-key")
REQUEST_TIMEOUT_SECONDS = float(os.getenv("REQUEST_TIMEOUT_SECONDS", "25"))


class ChatRequest(BaseModel):
    message: str = Field(min_length=1, max_length=4000)
    user_id: str = Field(min_length=1, max_length=128)


class ChatResponse(BaseModel):
    reply: str
    model_used: str


@app.get("/health")
async def health() -> dict[str, str]:
    return {"status": "ok"}


@app.post("/api/chat", response_model=ChatResponse)
async def chat(req: ChatRequest) -> ChatResponse:
    payload: dict[str, Any] = {
        "model": "primary-chat",
        "messages": [
            {
                "role": "system",
                "content": "You are a technical assistant who answers briefly, clearly, and accurately.",
            },
            {"role": "user", "content": req.message},
        ],
        "temperature": 0.3,
        "metadata": {
            "app_user_id": req.user_id,
            "feature": "web_chat",
        },
    }
    headers = {
        "Authorization": f"Bearer {LITELLM_API_KEY}",
        "Content-Type": "application/json",
    }

    try:
        async with httpx.AsyncClient(timeout=REQUEST_TIMEOUT_SECONDS) as client:
            resp = await client.post(
                f"{LITELLM_BASE_URL}/v1/chat/completions",
                headers=headers,
                json=payload,
            )

            # Handle error responses from the gateway
            if resp.status_code >= 400:
                detail = resp.text[:500]
                raise HTTPException(
                    status_code=502,
                    detail=f"Gateway error ({resp.status_code}): {detail}",
                )

            data = resp.json()
            choices = data.get("choices", [])
            if not choices:
                raise HTTPException(status_code=502, detail="Empty response from gateway")

            message = choices[0].get("message", {})
            content = message.get("content", "")
            model_used = data.get("model", "unknown")

            return ChatResponse(reply=content, model_used=model_used)

    except httpx.TimeoutException:
        raise HTTPException(status_code=504, detail="AI request timed out")
    except httpx.HTTPError as exc:
        raise HTTPException(status_code=502, detail=f"Network error to gateway: {exc}")
    except ValueError:
        raise HTTPException(status_code=502, detail="Invalid JSON from gateway")
Run backend:
export LITELLM_BASE_URL="http://localhost:4000"
export LITELLM_API_KEY="super-secret-master-key"

uvicorn app:app --reload --port 8000
Test endpoint:
curl -X POST "http://localhost:8000/api/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Explain what an AI Gateway is in 2 sentences.",
    "user_id": "user-123"
  }'
Step D — Add manual fallback at application level (optional)
Even though the gateway already has retries, sometimes you need specific fallback logic per feature. Example helper:
async def request_with_fallback(
    client: httpx.AsyncClient, headers: dict[str, str], user_text: str
) -> dict:
    models = ["primary-chat", "fallback-chat"]
    for model_name in models:
        payload = {
            "model": model_name,
            "messages": [{"role": "user", "content": user_text}],
        }
        try:
            resp = await client.post(
                "http://localhost:4000/v1/chat/completions",
                headers=headers,
                json=payload,
                timeout=20,
            )
            if resp.status_code < 400:
                return resp.json()
        except httpx.HTTPError:
            # Move on to the next model
            continue
    raise RuntimeError("All models failed")
This is useful for critical scenarios, such as support chatbots that must keep responding.
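For completeness, a sketch of how this helper could be wired into an endpoint. The wrapper name and error messages are illustrative, and it reuses the request_with_fallback helper defined above:

import os

import httpx
from fastapi import HTTPException

async def chat_with_manual_fallback(user_text: str) -> str:
    """Illustrative wrapper: call the gateway with per-feature fallback, then parse defensively."""
    headers = {
        "Authorization": f"Bearer {os.getenv('LITELLM_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    try:
        async with httpx.AsyncClient() as client:
            data = await request_with_fallback(client, headers, user_text)
    except RuntimeError:
        raise HTTPException(status_code=502, detail="All AI models are unavailable")

    choices = data.get("choices", [])
    if not choices:
        raise HTTPException(status_code=502, detail="Empty response from gateway")
    return choices[0].get("message", {}).get("content", "")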
Step E — Docker Compose (ready to run)
Minimal docker-compose.yml:
version: "3.9"

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    ports:
      - "4000:4000"
    environment:
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
    volumes:
      - ./litellm_config.yaml:/app/config.yaml:ro

  backend:
    build: .
    ports:
      - "8000:8000"
    environment:
      LITELLM_BASE_URL: http://litellm:4000
      LITELLM_API_KEY: ${LITELLM_MASTER_KEY}
    depends_on:
      - litellm
This makes dev/staging environments more consistent.
6) Best Practices (industry tips)
- Separate app concerns from gateway concerns:
  - App: user auth, business logic
  - Gateway: model routing, AI policy
- Set strict timeouts: do not let AI requests hang too long; ideally 10–30 seconds depending on the use case.
- Use model tiering:
  - cheaper model for drafts/summarization
  - premium model for heavy reasoning
- Tag metadata per request: store user_id, feature, and team so cost monitoring is more transparent.
- Implement graceful degradation: if all models fail, return an informative fallback response, not a raw 500 (see the sketch after this list).
- Do not hardcode API keys: always use a secret manager or environment variables.
- Audit prompts & outputs: for production products, add safety filters (PII redaction, prompt-injection guardrails).
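To illustrate the graceful-degradation point, here is a minimal sketch that wraps the /api/chat handler from Step C and converts gateway failures into a friendly canned reply. The wrapper name and message text are illustrative:

from fastapi import HTTPException

FALLBACK_REPLY = (
    "The AI assistant is temporarily unavailable. "
    "Please try again in a few minutes."
)

async def chat_with_degradation(req: ChatRequest) -> ChatResponse:
    """Turn upstream failures (502/504) into a canned reply instead of a raw error."""
    try:
        return await chat(req)  # the /api/chat handler defined in app.py (Step C)
    except HTTPException as exc:
        if exc.status_code in (502, 504):
            return ChatResponse(reply=FALLBACK_REPLY, model_used="degraded")
        raise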
7) Common Mistakes (that happen often)
Mistake 1 — Frontend directly hits provider
Consequence: API keys leak, auditing is difficult, policy enforcement is difficult.
Solution: route through backend + gateway.
Mistake 2 — No fallback
When a provider outage happens, your AI features go down entirely.
Solution: prepare at least one cross-vendor backup model.
Mistake 3 — No cost monitoring per feature
Bills increase but the team does not know which endpoint caused it.
Solution: metadata + cost tracking dashboard.
Mistake 4 — Blind retries
Unlimited retries worsen latency and costs.
Solution: limited retries + exponential backoff + circuit breaker.
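A minimal sketch of bounded retries with exponential backoff and jitter around a gateway call. The attempt count and delays are illustrative, and a real circuit breaker would add failure-rate tracking on top:

import asyncio
import random

import httpx

async def post_with_backoff(client: httpx.AsyncClient, url: str, **kwargs) -> httpx.Response:
    """Retry transient failures (5xx, 429, network errors) with capped exponential backoff."""
    max_attempts = 3
    base_delay = 0.5  # seconds
    for attempt in range(1, max_attempts + 1):
        try:
            resp = await client.post(url, **kwargs)
            if resp.status_code < 500 and resp.status_code != 429:
                return resp  # success or a non-retryable client error
        except httpx.TransportError:
            pass  # network hiccup: fall through and retry
        if attempt < max_attempts:
            await asyncio.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.2))
    raise RuntimeError("Gateway request failed after retries")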
Mistake 5 — Overly optimistic response parsing
Assuming choices[0].message.content always exists.
Solution: validate JSON structure before using it.
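One way to avoid optimistic parsing is to validate the completion payload with Pydantic before using it. A sketch with illustrative model names (these are not types from any provider SDK):

from pydantic import BaseModel

class _Message(BaseModel):
    content: str = ""

class _Choice(BaseModel):
    message: _Message

class _Completion(BaseModel):
    model: str = "unknown"
    choices: list[_Choice] = []

def extract_reply(data: dict) -> str:
    """Raise a clear error instead of an IndexError/KeyError on malformed payloads."""
    parsed = _Completion.model_validate(data)
    if not parsed.choices:
        raise ValueError("Gateway returned no choices")
    return parsed.choices[0].message.content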
8) Advanced Tips (for those who want to go deeper)
- Canary routing for new models: send 5–10% of traffic to a new model and compare quality, cost, and latency.
- Dynamic model selection: choose the model based on prompt length, SLA, or user tier (free/pro).
- Response caching: for repeated questions (e.g., FAQ), caching can drastically reduce costs (see the sketch after this list).
- Integrated observability: pipe logs into observability platforms (e.g., Langfuse/MLflow/Helicone) for end-to-end traces.
- A/B testing prompt templates: do not only switch models; sometimes proper prompt engineering can cut costs without reducing quality.
- Multi-region failover: if users are global, prepare region-aware routing for lower latency.
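As a concrete example of the caching tip, here is a minimal in-memory sketch keyed on a hash of the model alias plus a normalized prompt. The helper names are illustrative; in production you would use Redis with a TTL:

import hashlib

_cache: dict[str, str] = {}  # swap for Redis in production

def _cache_key(model: str, prompt: str) -> str:
    """Normalize whitespace/case so trivially different prompts share a cache entry."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()

def get_cached_reply(model: str, prompt: str) -> str | None:
    return _cache.get(_cache_key(model, prompt))

def store_reply(model: str, prompt: str, reply: str) -> None:
    _cache[_cache_key(model, prompt)] = reply

Check get_cached_reply before calling the gateway and call store_reply after a successful response; for FAQ-style traffic this can remove a large share of paid requests.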
9) Summary & Next Steps
We have covered how to build a modern AI Gateway with LiteLLM:
- Why an AI Gateway is important for production web applications
- Core concepts of routing, fallback, unified interface
- Full implementation with FastAPI + LiteLLM
- Best practices to keep it secure, cost-efficient, and scalable
- Common mistakes and advanced strategies
Next steps I recommend:
- Deploy this architecture to staging.
- Add per-user rate limiting on the /api/chat endpoint.
- Build an internal dashboard for metrics: p95 latency, error rate, cost per feature.
- Run outage simulation (turn off primary provider, ensure fallback works).
If this stage is complete, you already have an AI platform foundation that is far more production-ready than direct provider integration.
10) References
- LiteLLM Docs (Getting Started): https://docs.litellm.ai/docs/
- LiteLLM Proxy Quick Start: https://docs.litellm.ai/docs/proxy/quick_start
- LiteLLM GitHub Repo: https://github.com/BerriAI/litellm
- GitHub Trending: https://github.com/trending
- X Explore: https://x.com/explore
- Medium Software Engineering Tag: https://medium.com/tag/software-engineering
If you want, in the next advanced version we can discuss: “Implementing per-team budget enforcement + internal chargeback with LiteLLM” so AI cost control becomes more precise.