Observability and FinOps

LLM Gateway records operational and cost signals so teams can understand reliability, latency, usage, and spend across providers and models.

What to monitor

| Area | Questions answered |
| --- | --- |
| Requests | How many requests are flowing through the gateway? |
| Latency | Is time spent inside the gateway or at the upstream provider? |
| Errors | Which provider, credential, model, or request type is failing? |
| Tokens | Which teams, users, providers, and models are consuming tokens? |
| Cost | What is estimated spend, actual provider spend, and variance? |
| Fallbacks | Are requests relying on backup routes more often than expected? |
| Rate limits | Are users or providers hitting configured limits? |

Embedded dashboards

The LLM Gateway UI exposes operational views for:

  • Active provider count.
  • Request rate.
  • Error rate.
  • Latency percentiles.
  • Provider and model trends.
  • Credential-level usage.
  • FinOps and billing breakdowns where provider billing integration is configured.

Use the embedded dashboards for daily operations and Prometheus-compatible metrics for long-term alerting and retention.

Metrics endpoint

Prometheus-compatible metrics are available at:

/metrics/ai-gateway

The endpoint requires a bearer token generated by an operator. Do not reuse user session tokens for metrics scraping.

Example Prometheus scrape:

scrape_configs:
  - job_name: axiomcloud-ai-gateway
    scheme: https
    bearer_token_file: /etc/prometheus/axiomcloud-token
    static_configs:
      - targets: ["axiomcloud.example.com"]
    metrics_path: /metrics/ai-gateway
    scrape_interval: 30s

Example VictoriaMetrics scrape:

scrape_configs:
  - job_name: axiomcloud-ai-gateway
    scheme: https
    authorization:
      type: Bearer
      credentials_file: /etc/vmagent/axiomcloud-token
    static_configs:
      - targets: ["axiomcloud.example.com"]
    metrics_path: /metrics/ai-gateway
    scrape_interval: 30s

Key metrics

| Metric | Purpose |
| --- | --- |
| llm_gateway_requests_total | Total gateway requests by organization, user, provider, credential, model, request type, and status. |
| llm_gateway_request_duration_seconds | End-to-end gateway request duration. |
| llm_gateway_provider_duration_seconds | Upstream provider duration. |
| llm_gateway_active_requests | Current active requests. |
| llm_gateway_tokens_total | Prompt and completion token volume. |
| llm_gateway_estimated_cost_usd_total | Estimated spend based on token usage and pricing. |
| llm_gateway_actual_cost_usd | Actual provider billing values when billing sync is configured. |
| llm_gateway_errors_total | Gateway and provider errors. |
| llm_gateway_rate_limit_hits_total | Requests blocked or delayed by rate limits. |
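
As a starting point, the counters above can be turned into per-second rates with Prometheus recording rules. The following is a minimal sketch, not a shipped configuration: the label names `provider` and `model`, the group name, and the recorded series names are assumptions, so check the actual label names in the `/metrics/ai-gateway` output before using it.

```yaml
# Hypothetical recording rules file, loaded through rule_files in prometheus.yml.
# Label names (provider, model) are assumed; verify them against the endpoint output.
groups:
  - name: llm_gateway_overview
    interval: 1m
    rules:
      # Requests per second, per provider and model.
      - record: llm_gateway:requests:rate5m
        expr: sum by (provider, model) (rate(llm_gateway_requests_total[5m]))
      # Errors per second, per provider and model.
      - record: llm_gateway:errors:rate5m
        expr: sum by (provider, model) (rate(llm_gateway_errors_total[5m]))
      # Tokens per second, per provider and model.
      - record: llm_gateway:tokens:rate5m
        expr: sum by (provider, model) (rate(llm_gateway_tokens_total[5m]))
```

Precomputed rates like these keep dashboard and alert queries cheap, and they give long-term retention a smaller set of series to store.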

Latency decomposition

Track both end-to-end gateway duration and upstream provider duration. The difference between the two approximates gateway overhead:

  • Gateway overhead: Time spent inside Axiom infrastructure for routing, authentication, accounting, and response handling.
  • Provider duration: Time spent waiting on the upstream model provider.

When latency rises:

  1. Check provider duration first.
  2. Compare affected providers and models.
  3. Check whether fallback routing is active.
  4. Check gateway overhead for local infrastructure pressure.
  5. Review active requests and error metrics.
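
One way to make the decomposition above visible is to record mean gateway duration, mean provider duration, and their difference. This sketch assumes both duration metrics expose `_sum` and `_count` series (as Prometheus histograms and summaries do) and carry a `provider` label; the recorded series names are hypothetical.

```yaml
# Hypothetical recording rules; the provider label and the _sum/_count series are assumptions.
groups:
  - name: llm_gateway_latency
    rules:
      # Mean end-to-end gateway duration per provider over 5 minutes.
      - record: llm_gateway:request_duration_seconds:mean5m
        expr: |
          sum by (provider) (rate(llm_gateway_request_duration_seconds_sum[5m]))
            /
          sum by (provider) (rate(llm_gateway_request_duration_seconds_count[5m]))
      # Mean upstream provider duration per provider over 5 minutes.
      - record: llm_gateway:provider_duration_seconds:mean5m
        expr: |
          sum by (provider) (rate(llm_gateway_provider_duration_seconds_sum[5m]))
            /
          sum by (provider) (rate(llm_gateway_provider_duration_seconds_count[5m]))
      # Approximate gateway overhead: end-to-end mean minus provider mean.
      - record: llm_gateway:overhead_seconds:mean5m
        expr: |
          llm_gateway:request_duration_seconds:mean5m
            - llm_gateway:provider_duration_seconds:mean5m
```

The subtraction uses means because percentiles cannot be subtracted meaningfully. Treat the result as an approximation: if a single gateway request triggers several provider calls through fallback routing, the two means are not measured over the same set of events.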

FinOps workflow

Use FinOps views and metrics to answer:

  • Which providers drive the most spend?
  • Which models are growing fastest?
  • Which credentials are tied to expensive workloads?
  • Are estimated costs close to actual provider invoices?
  • Are budget alerts firing before spend becomes a surprise?

Recommended operating cadence:

  1. Review provider and model spend weekly.
  2. Compare estimated and actual cost after each billing sync.
  3. Investigate high variance by provider and credential.
  4. Tune model selection and fallback order for cost-sensitive workloads.
  5. Set budget thresholds for teams or environments with predictable usage.
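
The weekly spend review and budget thresholds above could be backed by rules along these lines. This is a sketch under assumptions: the `provider`, `model`, and `organization` label names, the rule names, and the 500 USD daily threshold are all placeholders. It uses only the estimated cost counter, because the exact semantics of `llm_gateway_actual_cost_usd` (for example, its reporting period) are not described here.

```yaml
# Hypothetical FinOps rules; label names and the 500 USD threshold are placeholders.
groups:
  - name: llm_gateway_finops
    rules:
      # Estimated spend per provider and model over the last 7 days.
      - record: llm_gateway:estimated_cost_usd:increase7d
        expr: sum by (provider, model) (increase(llm_gateway_estimated_cost_usd_total[7d]))
      # Fires when an organization's estimated spend over the last day exceeds the budget.
      - alert: LLMGatewayDailyBudgetExceeded
        expr: sum by (organization) (increase(llm_gateway_estimated_cost_usd_total[1d])) > 500
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Estimated LLM spend for {{ $labels.organization }} exceeded the daily budget"
```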

Alert recommendations

Start with alerts for:

  • Error rate above baseline for a provider or model.
  • P95 or P99 latency above service target.
  • Provider duration increase without a matching gateway overhead increase.
  • Gateway overhead increase across all providers.
  • Active requests stuck above normal levels.
  • Token usage spike by organization or user.
  • Estimated cost spike over a short interval.
  • Rate limit hits above expected values.

Alerts should include provider, credential, model, organization, and request type labels where available.
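
A few of the alerts listed above might look like the following Prometheus alerting rules. The thresholds (5% errors, 10 seconds, 1 hit per second), the alert names, and the label names are illustrative assumptions, and the latency rule additionally assumes `llm_gateway_request_duration_seconds` is exported as a histogram with `_bucket` series.

```yaml
# Hypothetical alerting rules; thresholds, alert names, and label names are placeholders.
groups:
  - name: llm_gateway_alerts
    rules:
      # Error ratio above 5% for a provider and model, sustained for 10 minutes.
      - alert: LLMGatewayHighErrorRatio
        expr: |
          sum by (provider, model) (rate(llm_gateway_errors_total[5m]))
            /
          sum by (provider, model) (rate(llm_gateway_requests_total[5m]))
            > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High error ratio for {{ $labels.provider }} / {{ $labels.model }}"
      # P95 end-to-end latency above 10 seconds (assumes a histogram metric).
      - alert: LLMGatewayHighP95Latency
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, provider) (rate(llm_gateway_request_duration_seconds_bucket[5m]))
          ) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above target for {{ $labels.provider }}"
      # Sustained rate limit hits for an organization.
      - alert: LLMGatewayRateLimitHits
        expr: sum by (organization) (rate(llm_gateway_rate_limit_hits_total[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Rate limit hits above expected level for {{ $labels.organization }}"
```

Aggregating with `sum by (...)` keeps only the listed labels on the alert, so add `credential` or the request type label to the `by` clause where responders need them.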

Audit logs

Use audit logs to investigate configuration changes and operational events:

  • Credential creation, update, disablement, and deletion.
  • Fallback configuration changes.
  • Provider errors.
  • Fallback activation events.

Audit logs are tenant-scoped and should be part of incident review whenever routing behavior changes unexpectedly.