Observability and FinOps

LLM Gateway records operational and cost signals so teams can understand reliability, latency, usage, and spend across providers and models.

What to monitor

Area	Questions answered
Requests	How many requests are flowing through the gateway?
Latency	Is time spent inside the gateway or at the upstream provider?
Errors	Which provider, credential, model, or request type is failing?
Tokens	Which teams, users, providers, and models are consuming tokens?
Cost	What are the estimated spend and the actual provider spend, and how far apart are they?
Fallbacks	Are requests relying on backup routes more often than expected?
Rate limits	Are users or providers hitting configured limits?

Embedded dashboards

The LLM Gateway UI exposes operational views for:

Active provider count.
Request rate.
Error rate.
Latency percentiles.
Provider and model trends.
Credential-level usage.
FinOps and billing breakdowns where provider billing integration is configured.

Use the embedded dashboards for day-to-day operations, and scrape the Prometheus-compatible metrics for alerting and long-term retention.

Metrics endpoint

Prometheus-compatible metrics are available at:

/metrics/ai-gateway

The endpoint requires a bearer token generated by an operator. Do not reuse user session tokens for metrics scraping.

Example Prometheus scrape:

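scrape_configs:
  - job_name: axiomcloud-ai-gateway
    scheme: https
    metrics_path: /metrics/ai-gateway
    scrape_interval: 30s
    # Operator-generated bearer token, read from a file so it stays out of the config.
    # Newer Prometheus releases prefer this block over the deprecated bearer_token_file field.
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/axiomcloud-token
    static_configs:
      # No port is given, so the https scheme implies port 443.
      - targets: ["axiomcloud.example.com"]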

Example VictoriaMetrics scrape:

scrape_configs:
  - job_name: axiomcloud-ai-gateway
    scheme: https
    authorization:
      type: Bearer
      credentials_file: /etc/vmagent/axiomcloud-token
    static_configs:
      - targets: ["axiomcloud.example.com"]
    metrics_path: /metrics/ai-gateway
    scrape_interval: 30s

Key metrics

Metric	Purpose
`llm_gateway_requests_total`	Total gateway requests by organization, user, provider, credential, model, request type, and status.
`llm_gateway_request_duration_seconds`	End-to-end gateway request duration.
`llm_gateway_provider_duration_seconds`	Upstream provider duration.
`llm_gateway_active_requests`	Current active requests.
`llm_gateway_tokens_total`	Prompt and completion token volume.
`llm_gateway_estimated_cost_usd_total`	Estimated spend based on token usage and pricing.
`llm_gateway_actual_cost_usd`	Actual provider billing values when billing sync is configured.
`llm_gateway_errors_total`	Gateway and provider errors.
`llm_gateway_rate_limit_hits_total`	Requests blocked or delayed by rate limits.
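
The counters above are usually easier to read as rates. The following is a minimal sketch of Prometheus recording rules that derive request rate, error ratio, and token throughput per provider and model. The label names `provider` and `model` are assumptions inferred from the metric descriptions, and the rule names are illustrative; check the labels actually exposed at `/metrics/ai-gateway` before using it.

```yaml
# Recording rules sketch. The label names (provider, model) are assumed
# from the metric descriptions, not confirmed against the gateway's output.
groups:
  - name: llm_gateway_overview
    interval: 1m
    rules:
      # Requests per second, per provider and model.
      - record: provider_model:llm_gateway_requests:rate5m
        expr: sum by (provider, model) (rate(llm_gateway_requests_total[5m]))

      # Errors per second, per provider and model.
      - record: provider_model:llm_gateway_errors:rate5m
        expr: sum by (provider, model) (rate(llm_gateway_errors_total[5m]))

      # Fraction of requests that produced an error. The series is absent
      # (not zero) for provider/model pairs with no recorded errors.
      - record: provider_model:llm_gateway_error_ratio:rate5m
        expr: |
          provider_model:llm_gateway_errors:rate5m
            / provider_model:llm_gateway_requests:rate5m

      # Tokens per second, per provider and model.
      - record: provider_model:llm_gateway_tokens:rate5m
        expr: sum by (provider, model) (rate(llm_gateway_tokens_total[5m]))
```

Rules within one group are evaluated in order, so the ratio rule can reuse the two rates recorded before it.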

Latency decomposition

Track both gateway duration and provider duration:

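Gateway overhead: Time spent inside Axiom infrastructure for routing, authentication, accounting, and response handling. None of the key metrics reports it directly; approximate it as `llm_gateway_request_duration_seconds` minus `llm_gateway_provider_duration_seconds`.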
Provider duration: Time spent waiting on the upstream model provider.

When latency rises:

Check provider duration first.
Compare affected providers and models.
Check whether fallback routing is active.
Check gateway overhead for local infrastructure pressure.
Review active requests and error metrics.
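
The first and fourth checks above can be made routine with recording rules that split mean latency into its provider and gateway parts. This is a sketch under two assumptions: that both duration metrics are Prometheus histograms (so `_sum` and `_count` series exist) and that they carry a `provider` label. The rule names are illustrative.

```yaml
# Latency decomposition sketch. Assumes both duration metrics are
# histograms exposing _sum and _count, and that a provider label exists.
groups:
  - name: llm_gateway_latency
    rules:
      # Mean end-to-end request duration per provider over 5 minutes.
      - record: provider:llm_gateway_request_duration_seconds:mean5m
        expr: |
          sum by (provider) (rate(llm_gateway_request_duration_seconds_sum[5m]))
            / sum by (provider) (rate(llm_gateway_request_duration_seconds_count[5m]))

      # Mean time spent waiting on the upstream provider.
      - record: provider:llm_gateway_provider_duration_seconds:mean5m
        expr: |
          sum by (provider) (rate(llm_gateway_provider_duration_seconds_sum[5m]))
            / sum by (provider) (rate(llm_gateway_provider_duration_seconds_count[5m]))

      # Approximate gateway overhead: end-to-end mean minus provider mean.
      - record: provider:llm_gateway_overhead_seconds:mean5m
        expr: |
          provider:llm_gateway_request_duration_seconds:mean5m
            - provider:llm_gateway_provider_duration_seconds:mean5m
```

Treat the overhead series as an approximation. If a request triggers a fallback and calls more than one provider, the two histograms may not count the same number of observations, and the difference becomes less precise.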

FinOps workflow

Use FinOps views and metrics to answer:

Which providers drive the most spend?
Which models are growing fastest?
Which credentials are tied to expensive workloads?
Are estimated costs close to actual provider invoices?
Are budget alerts firing before spend becomes a surprise?

Recommended operating cadence:

Review provider and model spend weekly.
Compare estimated and actual cost after each billing sync.
Investigate high variance by provider and credential.
Tune model selection and fallback order for cost-sensitive workloads.
Set budget thresholds for teams or environments with predictable usage.
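
For the weekly review, the spend questions above can be sketched as recording rules over the estimated-cost counter. The label names `provider`, `model`, and `credential` are assumptions, and the rule names are illustrative. The sketch covers estimated spend only, because how `llm_gateway_actual_cost_usd` is reported (per billing period, cumulative, or otherwise) is not specified here.

```yaml
# FinOps sketch. Label names are assumed; 7d ranges are expensive to
# evaluate, so the group runs on a slower interval.
groups:
  - name: llm_gateway_finops
    interval: 5m
    rules:
      # Estimated spend over the last 7 days, per provider and model.
      - record: provider_model:llm_gateway_estimated_cost_usd:increase7d
        expr: |
          sum by (provider, model) (
            increase(llm_gateway_estimated_cost_usd_total[7d])
          )

      # Estimated spend over the last 7 days, per credential.
      - record: credential:llm_gateway_estimated_cost_usd:increase7d
        expr: |
          sum by (credential) (
            increase(llm_gateway_estimated_cost_usd_total[7d])
          )

      # Week-over-week ratio per model: 1.0 is flat, 2.0 is doubled spend.
      # Needs at least 14 days of retained data.
      - record: model:llm_gateway_estimated_cost_usd:wow_ratio
        expr: |
          sum by (model) (increase(llm_gateway_estimated_cost_usd_total[7d]))
            / sum by (model) (increase(llm_gateway_estimated_cost_usd_total[7d] offset 7d))
```

Sorting the first two series in descending order shows which providers, models, and credentials drive spend; the ratio highlights the models that are growing fastest.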

Alert recommendations

Start with alerts for:

Error rate above baseline for a provider or model.
P95 or P99 latency above service target.
Provider duration increase without a matching gateway overhead increase.
Gateway overhead increase across all providers.
Active requests stuck above normal levels.
Token usage spike by organization or user.
Estimated cost spike over a short interval.
Rate limit hits above expected values.

Alerts should include provider, credential, model, organization, and request type labels where available.
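
A few of these recommendations are sketched below as Prometheus alerting rules. Every threshold is a placeholder to replace with your own baseline or service target. The label names (`provider`, `model`, `organization`) and the `_bucket` series on the request duration metric are assumptions, and the alert names are illustrative.

```yaml
# Alerting rules sketch. Thresholds are placeholders; label names and the
# histogram _bucket series are assumed, not confirmed.
groups:
  - name: llm_gateway_alerts
    rules:
      # More than 5% of requests failing for a provider/model pair.
      - alert: LLMGatewayHighErrorRatio
        expr: |
          sum by (provider, model) (rate(llm_gateway_errors_total[5m]))
            / sum by (provider, model) (rate(llm_gateway_requests_total[5m]))
            > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error ratio above 5% for {{ $labels.provider }} / {{ $labels.model }}"

      # P95 end-to-end latency above a 10 second target.
      - alert: LLMGatewayHighP95Latency
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, provider, model) (
              rate(llm_gateway_request_duration_seconds_bucket[5m])
            )
          ) > 10
        for: 10m
        labels:
          severity: ticket
        annotations:
          summary: "P95 latency above 10s for {{ $labels.provider }} / {{ $labels.model }}"

      # Sustained rate limiting for an organization.
      - alert: LLMGatewayRateLimitHits
        expr: sum by (organization) (rate(llm_gateway_rate_limit_hits_total[5m])) > 1
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Sustained rate limit hits for {{ $labels.organization }}"

      # Hourly estimated spend above 10 USD and more than 3x the same hour yesterday.
      - alert: LLMGatewayEstimatedCostSpike
        expr: |
          (sum by (organization) (increase(llm_gateway_estimated_cost_usd_total[1h])) > 10)
            and
          (sum by (organization) (increase(llm_gateway_estimated_cost_usd_total[1h]))
            > 3 * sum by (organization) (increase(llm_gateway_estimated_cost_usd_total[1h] offset 1d)))
        labels:
          severity: ticket
        annotations:
          summary: "Estimated cost spike for {{ $labels.organization }}"
```

Aggregating with `sum by (...)` keeps only the listed labels on the alert, so add `credential` or the request type label to the grouping when responders need them. The absolute floor in the cost rule prevents it from firing when the previous day's spend was close to zero.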

Audit logs

Use audit logs to investigate configuration changes and operational events:

Credential creation, update, disablement, and deletion.
Fallback configuration changes.
Provider errors.
Fallback activation events.

Audit logs are tenant-scoped and should be part of incident review whenever routing behavior changes unexpectedly.