Routing and Reliability
LLM Gateway routes requests by model, credential configuration, provider availability, and fallback rules. The goal is to keep application code stable while operators control provider behavior centrally.
Request flow
Application
-> Axiom LLM Gateway OpenAI-compatible endpoint
-> credential and model selection
-> optional load balancing
-> provider request
-> optional fallback retry
-> response, analytics, audit logs, and metrics
Applications keep using the same gateway URL even when you rotate keys, add providers, disable a credential, or change fallback order.
Model-aware credential selection
When a request includes a model, the gateway uses that model to select an appropriate credential.
Typical behavior:
- Prefer credentials whose default model matches the requested model.
- If multiple matching credentials exist, use their configured weights.
- If no model-specific credential matches, fall back to the available credentials for that provider according to configuration.
This allows teams to configure separate credentials for expensive, latency-sensitive, regional, or experimental models without changing application code.
Weighted load balancing
Weights split traffic across credentials in the same provider and model group. Each provider and model group has its own budget.
Example:
| Credential | Provider | Default model | Weight |
|---|---|---|---|
| OpenAI primary | OpenAI | gpt-4o | 0.50 |
| OpenAI secondary | OpenAI | gpt-4o | 0.30 |
| OpenAI burst | OpenAI | gpt-4o | 0.20 |
Requests for gpt-4o are split proportionally across those credentials. A separate gpt-4o-mini credential does not consume the same weight budget unless it is configured in that model group.
Use weights for:
- Gradual provider or key migration.
- Regional traffic balancing.
- Splitting load across provider accounts.
- Controlled rollout of new credentials.
Fallback routing
Fallback configurations define ordered backup paths. A fallback may point to another credential for the same provider or to a credential for a different provider.
Use fallback routing for:
- Provider outages.
- Provider rate limits.
- Temporary credential disablement.
- Regional provider incidents.
- Business continuity for critical applications.
Recommended fallback pattern:
- Primary provider and model.
- Same-provider secondary credential.
- Cross-provider equivalent model.
- Lower-cost or lower-capacity emergency model.
Disabled credentials and providers
Disable credentials instead of deleting them when you need a reversible operational change. Disabled credentials can be bypassed by fallback routing, which lets operators remove a provider from active traffic without deploying application changes.
Delete credentials only when they are no longer needed and historical references are not required for operations.
Streaming reliability
Streaming chat completions use Server-Sent Events. For production streaming clients:
- Treat
data: [DONE]as the normal end of stream. - Handle partial output if the client disconnects.
- Apply application-level retry carefully because repeated prompts can duplicate work or cost.
- Track provider errors and gateway errors separately in dashboards.
Operational patterns
Provider migration
- Add the new provider credential with a low weight.
- Send test traffic.
- Compare latency, error rate, output quality, and cost.
- Increase weight gradually.
- Keep the old provider as fallback until the migration is stable.
Key rotation
- Add the replacement credential.
- Give it a small weight or test-only model mapping.
- Confirm traffic succeeds.
- Move production weight to the replacement.
- Disable the old credential.
- Delete the old credential after the provider key has been revoked and retention needs are satisfied.
Incident response
- Open Overview and Analytics to identify affected provider, model, and credential.
- Disable the failing credential or provider route.
- Confirm fallback traffic is flowing.
- Watch latency, error rate, token volume, and cost.
- Re-enable primary routing only after test requests are clean.
Reliability checklist
- Critical models have at least one fallback credential.
- Fallback order is documented and reviewed.
- Load balancing weights are intentional and current.
- Disabled credentials are reviewed periodically.
- Dashboards separate gateway overhead from provider duration.
- Alerting covers error rate, latency, active requests, and cost spikes.