Observability and FinOps
LLM Gateway records operational and cost signals so teams can understand reliability, latency, usage, and spend across providers and models.
What to monitor
| Area | Questions answered |
|---|---|
| Requests | How many requests are flowing through the gateway? |
| Latency | Is time spent inside the gateway or at the upstream provider? |
| Errors | Which provider, credential, model, or request type is failing? |
| Tokens | Which teams, users, providers, and models are consuming tokens? |
| Cost | What are the estimated spend and the actual provider spend, and how far apart are they? |
| Fallbacks | Are requests relying on backup routes more often than expected? |
| Rate limits | Are users or providers hitting configured limits? |
Embedded dashboards
The LLM Gateway UI exposes operational views for:
- Active provider count.
- Request rate.
- Error rate.
- Latency percentiles.
- Provider and model trends.
- Credential-level usage.
- FinOps and billing breakdowns where provider billing integration is configured.
Use the embedded dashboards for daily operations and Prometheus-compatible metrics for long-term alerting and retention.
Metrics endpoint
Prometheus-compatible metrics are available at:
/metrics/ai-gateway
The endpoint requires a bearer token generated by an operator. Do not reuse user session tokens for metrics scraping.
Example Prometheus scrape:
scrape_configs:
  - job_name: axiomcloud-ai-gateway
    scheme: https
    bearer_token_file: /etc/prometheus/axiomcloud-token
    static_configs:
      - targets: ["axiomcloud.example.com"]
    metrics_path: /metrics/ai-gateway
    scrape_interval: 30s
Example VictoriaMetrics scrape:
scrape_configs:
  - job_name: axiomcloud-ai-gateway
    scheme: https
    authorization:
      type: Bearer
      credentials_file: /etc/vmagent/axiomcloud-token
    static_configs:
      - targets: ["axiomcloud.example.com"]
    metrics_path: /metrics/ai-gateway
    scrape_interval: 30s
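To spot-check the endpoint outside a scraper, a request like the following may be enough. This is a minimal sketch using only the Python standard library; the function names and the token file location are illustrative assumptions, and only the `/metrics/ai-gateway` path and the bearer-token requirement come from this section.

```python
import urllib.request

METRICS_PATH = "/metrics/ai-gateway"  # path documented in this section


def build_metrics_request(host: str, token: str) -> urllib.request.Request:
    """Build an HTTPS GET request that carries the operator-issued bearer token."""
    url = f"https://{host}{METRICS_PATH}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_metrics(host: str, token_file: str, timeout: float = 10.0) -> str:
    """Read the token from a file (hypothetical location) and return the raw metrics text."""
    with open(token_file, encoding="utf-8") as fh:
        token = fh.read().strip()
    request = build_metrics_request(host, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Reading the token from a file mirrors the scrape configurations above and keeps the secret out of shell history and process arguments.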
Key metrics
| Metric | Purpose |
|---|---|
| llm_gateway_requests_total | Total gateway requests by organization, user, provider, credential, model, request type, and status. |
| llm_gateway_request_duration_seconds | End-to-end gateway request duration. |
| llm_gateway_provider_duration_seconds | Upstream provider duration. |
| llm_gateway_active_requests | Current active requests. |
| llm_gateway_tokens_total | Prompt and completion token volume. |
| llm_gateway_estimated_cost_usd_total | Estimated spend based on token usage and pricing. |
| llm_gateway_actual_cost_usd | Actual provider billing values when billing sync is configured. |
| llm_gateway_errors_total | Gateway and provider errors. |
| llm_gateway_rate_limit_hits_total | Requests blocked or delayed by rate limits. |
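For ad hoc analysis of a scraped payload, the counters above can be grouped by a label. The sketch below parses the Prometheus text exposition format with regular expressions and sums one metric by one label. The label names used in the usage example (`provider`, `model`) are assumptions; check the actual label names in your endpoint's output.

```python
import re
from collections import defaultdict

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces, allowing escaped characters in the value.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def sum_by_label(text: str, metric: str, label: str) -> dict[str, float]:
    """Sum all samples of `metric`, grouped by the value of `label`."""
    totals: dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group(1) != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group(2) or ""))
        # Samples without the label are grouped under the empty string.
        totals[labels.get(label, "")] += float(match.group(3))
    return dict(totals)
```

For example, `sum_by_label(payload, "llm_gateway_tokens_total", "provider")` would return token volume per provider, assuming the label is named `provider`.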
Latency decomposition
Track both gateway duration and provider duration:
- Gateway overhead: Time spent inside Axiom infrastructure for routing, authentication, accounting, and response handling.
- Provider duration: Time spent waiting on the upstream model provider.
When latency rises:
- Check provider duration first.
- Compare affected providers and models.
- Check whether fallback routing is active.
- Check gateway overhead for local infrastructure pressure.
- Review active requests and error metrics.
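The triage steps above can be sketched as a comparison against a baseline. This example assumes gateway overhead can be approximated as end-to-end duration minus provider duration, which may not hold exactly when a request makes several upstream attempts through fallback routing. The function names and the 25% tolerance are illustrative, not part of the product.

```python
def gateway_overhead(request_seconds: float, provider_seconds: float) -> float:
    """Approximate overhead as end-to-end duration minus upstream provider duration."""
    return max(request_seconds - provider_seconds, 0.0)


def latency_source(
    baseline: tuple[float, float],
    current: tuple[float, float],
    tolerance: float = 0.25,
) -> str:
    """Classify where latency rose.

    Each tuple is (request_seconds, provider_seconds), for example P95 values
    over the same window. Returns "provider", "gateway", "both", or "none".
    """
    base_provider = baseline[1]
    base_overhead = gateway_overhead(*baseline)
    cur_provider = current[1]
    cur_overhead = gateway_overhead(*current)

    # A component counts as "up" when it exceeds its baseline by the tolerance.
    provider_up = cur_provider > base_provider * (1 + tolerance)
    overhead_up = cur_overhead > base_overhead * (1 + tolerance)

    if provider_up and overhead_up:
        return "both"
    if provider_up:
        return "provider"
    if overhead_up:
        return "gateway"
    return "none"
```

A result of "provider" points toward the upstream model provider, while "gateway" suggests checking local infrastructure pressure and active requests.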
FinOps workflow
Use FinOps views and metrics to answer:
- Which providers drive the most spend?
- Which models are growing fastest?
- Which credentials are tied to expensive workloads?
- Are estimated costs close to actual provider invoices?
- Are budget alerts firing before spend becomes a surprise?
Recommended operating cadence:
- Review provider and model spend weekly.
- Compare estimated and actual cost after each billing sync.
- Investigate high variance by provider and credential.
- Tune model selection and fallback order for cost-sensitive workloads.
- Set budget thresholds for teams or environments with predictable usage.
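The comparison of estimated and actual cost can be sketched as a relative variance per provider or credential. The inputs here are plain dictionaries of USD totals keyed by a name of your choosing; how you export those totals, and the 10% threshold, are assumptions for illustration.

```python
def cost_variance(estimated: dict[str, float], actual: dict[str, float]) -> dict[str, float]:
    """Relative variance (actual - estimated) / actual for keys present in both maps."""
    variance: dict[str, float] = {}
    for key, actual_usd in actual.items():
        if key not in estimated or actual_usd == 0:
            continue  # no estimate to compare, or nothing billed
        variance[key] = (actual_usd - estimated[key]) / actual_usd
    return variance


def high_variance(variance: dict[str, float], threshold: float = 0.10) -> list[str]:
    """Return the keys whose absolute variance exceeds the threshold, sorted by name."""
    return sorted(key for key, value in variance.items() if abs(value) > threshold)
```

A positive variance means the provider billed more than the gateway estimated, which may indicate stale pricing data or usage that bypassed the gateway.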
Alert recommendations
Start with alerts for:
- Error rate above baseline for a provider or model.
- P95 or P99 latency above service target.
- Provider duration increase without a matching gateway overhead increase.
- Gateway overhead increase across all providers.
- Active requests stuck above normal levels.
- Token usage spike by organization or user.
- Estimated cost spike over a short interval.
- Rate limit hits above expected values.
Alerts should include provider, credential, model, organization, and request type labels where available.
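As an illustration of the first recommendation, an error rate can be derived from two snapshots of the request and error counters and compared with a baseline. This sketch assumes each error corresponds to one counted request; the multiplier and the minimum floor are hypothetical tuning values.

```python
def error_rate(
    requests_before: float,
    requests_after: float,
    errors_before: float,
    errors_after: float,
) -> float:
    """Error rate over the interval between two counter snapshots."""
    delta_requests = requests_after - requests_before
    delta_errors = errors_after - errors_before
    if delta_requests <= 0 or delta_errors < 0:
        return 0.0  # no traffic, or a counter reset between snapshots
    return delta_errors / delta_requests


def should_alert(rate: float, baseline: float, factor: float = 2.0, floor: float = 0.01) -> bool:
    """Alert when the rate exceeds both a multiple of baseline and an absolute floor."""
    return rate > max(baseline * factor, floor)
```

The absolute floor keeps a very low baseline from producing alerts on a handful of failed requests.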
Audit logs
Use audit logs to investigate configuration changes and operational events:
- Credential creation, update, disablement, and deletion.
- Fallback configuration changes.
- Provider errors.
- Fallback activation events.
Audit logs are tenant-scoped and should be part of incident review whenever routing behavior changes unexpectedly.