Routing and Reliability

LLM Gateway routes requests by model, credential configuration, provider availability, and fallback rules. The goal is to keep application code stable while operators control provider behavior centrally.

Request flow

Application
  -> Axiom LLM Gateway OpenAI-compatible endpoint
  -> credential and model selection
  -> optional load balancing
  -> provider request
  -> optional fallback retry
  -> response, analytics, audit logs, and metrics

Applications keep using the same gateway URL even when you rotate keys, add providers, disable a credential, or change fallback order.

Model-aware credential selection

When a request includes a model, the gateway uses that model to select an appropriate credential.

Typical behavior:

Prefer credentials whose default model matches the requested model.
If multiple matching credentials exist, use their configured weights.
If no model-specific credential matches, fall back to the available credentials for that provider according to configuration.

This allows teams to configure separate credentials for expensive, latency-sensitive, regional, or experimental models without changing application code.

Weighted load balancing

Weights split traffic across credentials in the same provider and model group. Each provider and model group has its own budget.

Example:

Credential	Provider	Default model	Weight
OpenAI primary	OpenAI	`gpt-4o`	`0.50`
OpenAI secondary	OpenAI	`gpt-4o`	`0.30`
OpenAI burst	OpenAI	`gpt-4o`	`0.20`

Requests for gpt-4o are split proportionally across those credentials. A separate gpt-4o-mini credential does not consume the same weight budget unless it is configured in that model group.

Use weights for:

Gradual provider or key migration.
Regional traffic balancing.
Splitting load across provider accounts.
Controlled rollout of new credentials.

Fallback routing

Fallback configurations define ordered backup paths. A fallback may point to another credential for the same provider or to a credential for a different provider.

Use fallback routing for:

Provider outages.
Provider rate limits.
Temporary credential disablement.
Regional provider incidents.
Business continuity for critical applications.

Recommended fallback pattern:

Primary provider and model.
Same-provider secondary credential.
Cross-provider equivalent model.
Lower-cost or lower-capacity emergency model.

Disabled credentials and providers

Disable credentials instead of deleting them when you need a reversible operational change. Disabled credentials can be bypassed by fallback routing, which lets operators remove a provider from active traffic without deploying application changes.

Delete credentials only when they are no longer needed and historical references are not required for operations.

Streaming reliability

Streaming chat completions use Server-Sent Events. For production streaming clients:

Treat data: [DONE] as the normal end of stream.
Handle partial output if the client disconnects.
Apply application-level retry carefully because repeated prompts can duplicate work or cost.
Track provider errors and gateway errors separately in dashboards.

Operational patterns

Provider migration

Add the new provider credential with a low weight.
Send test traffic.
Compare latency, error rate, output quality, and cost.
Increase weight gradually.
Keep the old provider as fallback until the migration is stable.

Key rotation

Add the replacement credential.
Give it a small weight or test-only model mapping.
Confirm traffic succeeds.
Move production weight to the replacement.
Disable the old credential.
Delete the old credential after the provider key has been revoked and retention needs are satisfied.

Incident response

Open Overview and Analytics to identify affected provider, model, and credential.
Disable the failing credential or provider route.
Confirm fallback traffic is flowing.
Watch latency, error rate, token volume, and cost.
Re-enable primary routing only after test requests are clean.

Reliability checklist

Critical models have at least one fallback credential.
Fallback order is documented and reviewed.
Load balancing weights are intentional and current.
Disabled credentials are reviewed periodically.
Dashboards separate gateway overhead from provider duration.
Alerting covers error rate, latency, active requests, and cost spikes.