
LLM Trading Metrics & Alerting

Applicable version: axon-llm v0.2.0+
Prerequisites: overview.md §5

This document describes the TradingMetrics metric set, its data outlets, and integration templates that applications can use to connect to a monitoring backend.

Key Decision: axon-llm does not impose a specific monitoring stack: it ships no built-in Prometheus exporter, no axon-monitor dependency, and no Grafana dashboard or Prometheus alerting YAML. TradingMetrics is self-contained (Mutex + AtomicU64), and applications connect it to any monitoring backend (Prometheus / OpenTelemetry / StatsD / custom) through the callback and snapshot data outlets.

Each team configures Grafana dashboards and Prometheus alerting rules for its own monitoring stack; maintaining them centrally is not on the axon roadmap.

1. Core Metrics (5 Metrics, 3 Types)

1.1 Counter: trading_orders_total{tool,side,status}

Increments by 1 each time a tool places, cancels, or replaces an order. Optional labels:

Label | Values | Description
tool | place / cancel / replace | Which tool triggered the call
side | buy / sell / none | Order direction (none for cancel / replace)
status | success / rejected / failed | Execution result
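The label rules above can be sketched as a small label-keyed counter. This is only an illustration under assumptions: `OrdersTotal`, `Tool`, `Side`, and `Status` are hypothetical names, not the axon-llm types, and the real TradingMetrics may store its series differently.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical label enums mirroring the table above (not the axon-llm types).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Tool { Place, Cancel, Replace }

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Side { Buy, Sell, None }

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Status { Success, Rejected, Failed }

/// One u64 per (tool, side, status) combination.
#[derive(Default)]
struct OrdersTotal {
    series: Mutex<HashMap<(Tool, Side, Status), u64>>,
}

impl OrdersTotal {
    /// Adds 1 to the matching series. Cancel and replace carry no direction,
    /// so they are always recorded with side = none.
    fn inc(&self, tool: Tool, side: Side, status: Status) {
        let side = if tool == Tool::Place { side } else { Side::None };
        *self.series.lock().unwrap().entry((tool, side, status)).or_insert(0) += 1;
    }

    /// Current value of one series, 0 if it was never incremented.
    fn get(&self, tool: Tool, side: Side, status: Status) -> u64 {
        self.series.lock().unwrap().get(&(tool, side, status)).copied().unwrap_or(0)
    }
}

fn main() {
    let orders = OrdersTotal::default();
    orders.inc(Tool::Place, Side::Buy, Status::Success);
    orders.inc(Tool::Place, Side::Sell, Status::Failed);
    orders.inc(Tool::Replace, Side::Sell, Status::Success);
    orders.inc(Tool::Cancel, Side::Buy, Status::Rejected); // stored with side = none
    println!("{}", orders.get(Tool::Cancel, Side::None, Status::Rejected));
}
```

Using enums rather than free-form strings keeps the label cardinality bounded, which matters for most monitoring backends.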

1.2 Counter: trading_risk_rejections_total{source}

Increments +1 for each risk control rejection. Optional labels:

Label | Values | Description
source | risk_limits / risk_gate / safety_mode | Which line of defense rejected the request

1.3 Counter: trading_backend_errors_total{backend,kind}

Increments +1 for each backend call failure. Optional labels:

Label | Values | Description
backend | mock / exchange / oms / backtest | Which backend
kind | network / rejected / timeout / other | Error type

1.4 Histogram: trading_tool_execute_duration_seconds{tool}

Tool::execute() end-to-end latency distribution, typical buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0] (seconds).
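To make the bucket layout concrete, here is a minimal sketch of a fixed-bucket histogram using the typical bounds listed above. The `Histogram` type and its methods are hypothetical and only illustrate the "atomic add + bucket lookup" idea; the internal layout of TradingMetrics is not documented here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Upper bounds in seconds, copied from the typical buckets above.
const BUCKETS: [f64; 8] = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0];

/// Hypothetical histogram: one atomic per finite bucket plus a final +Inf slot.
struct Histogram {
    counts: [AtomicU64; 9],
}

impl Histogram {
    fn new() -> Self {
        Self { counts: std::array::from_fn(|_| AtomicU64::new(0)) }
    }

    /// Index of the first bucket whose upper bound is >= the value
    /// (index 8 is the +Inf bucket).
    fn bucket_index(value_secs: f64) -> usize {
        BUCKETS.iter().position(|&le| value_secs <= le).unwrap_or(BUCKETS.len())
    }

    /// Records one observation with a single atomic add.
    fn observe(&self, value_secs: f64) {
        self.counts[Self::bucket_index(value_secs)].fetch_add(1, Ordering::Relaxed);
    }

    /// Cumulative counts, the shape Prometheus uses for `_bucket` series.
    fn cumulative(&self) -> Vec<u64> {
        let mut total = 0;
        self.counts
            .iter()
            .map(|c| {
                total += c.load(Ordering::Relaxed);
                total
            })
            .collect()
    }
}

fn main() {
    let h = Histogram::new();
    for v in [0.0004, 0.003, 0.2, 7.5] {
        h.observe(v);
    }
    println!("{:?}", h.cumulative());
}
```

Observations above 5.0 s land in the +Inf slot, so latencies beyond the largest finite bound are counted but not resolved any further.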

1.5 Gauge: trading_daily_orders_count

Daily cumulative order count (mirrored from RiskLimits::DailyCounter); resets automatically each day at 00:00 UTC.
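One way to get a reset at 00:00 UTC without a timer is to store the UTC day number next to the count and reset lazily on the next increment. The sketch below assumes that approach; `DailyOrderCount` is a hypothetical stand-in, and RiskLimits::DailyCounter may work differently.

```rust
use std::sync::Mutex;
use std::time::{SystemTime, UNIX_EPOCH};

const SECS_PER_DAY: u64 = 86_400;

/// Days since the Unix epoch; this value changes exactly at 00:00 UTC.
fn utc_day(unix_secs: u64) -> u64 {
    unix_secs / SECS_PER_DAY
}

/// Hypothetical daily counter holding (utc_day, count).
struct DailyOrderCount {
    state: Mutex<(u64, u64)>,
}

impl DailyOrderCount {
    fn new() -> Self {
        Self { state: Mutex::new((0, 0)) }
    }

    /// Increments the count for the day containing `unix_secs`, resetting
    /// first if the stored day is stale. Returns the new count.
    fn increment_at(&self, unix_secs: u64) -> u64 {
        let mut s = self.state.lock().unwrap();
        let today = utc_day(unix_secs);
        if s.0 != today {
            *s = (today, 0);
        }
        s.1 += 1;
        s.1
    }

    /// Same as `increment_at`, using the system clock.
    fn increment(&self) -> u64 {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock is before 1970")
            .as_secs();
        self.increment_at(now)
    }
}

fn main() {
    let count = DailyOrderCount::new();
    count.increment();
    println!("orders today: {}", count.increment());
}
```

Taking the timestamp as a parameter in `increment_at` keeps the reset logic testable without waiting for midnight.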

2. Data Outlets

2.1 Callback (Real-time Push)

use std::sync::Arc;
use axon_llm::trading::metrics::{TradingMetrics, MetricSample};

let metrics = TradingMetrics::new();

// Register callback: called immediately on each metric change
metrics.set_callback(Arc::new(|sample: MetricSample| {
    match sample {
        MetricSample::CounterInc { name, labels, value } => {
            println!("[counter] {} {:?} += {}", name, labels, value);
            // Push to Prometheus / OTLP / custom sink
        }
        MetricSample::HistogramObserve { name, labels, value_secs } => {
            println!("[histogram] {} {:?} observe {}s", name, labels, value_secs);
        }
        MetricSample::GaugeSet { name, value } => {
            println!("[gauge] {} = {}", name, value);
        }
    }
}));

Notes:

  • TradingMetrics stores the callback in a Mutex<Option<Arc<dyn Fn ...>>>, so only one callback can be registered at a time.
  • A blocking callback stalls all subsequent metric recording, so callbacks must be non-blocking (for example, hand samples off through an mpsc channel).
  • Recommended: move the callback's work to a separate tokio task.
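The non-blocking requirement can be met by having the callback do nothing but enqueue. The sketch below is an assumption-laden illustration: it defines a trimmed-down local `MetricSample` (the real enum also carries labels), uses a bounded std channel and a plain thread instead of a tokio task so that it runs without dependencies, and `channel_callback` is a hypothetical helper.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::sync::Arc;
use std::thread;

// Simplified local stand-in for axon_llm's MetricSample (assumption).
#[derive(Debug, Clone, PartialEq)]
enum MetricSample {
    CounterInc { name: String, value: u64 },
    GaugeSet { name: String, value: f64 },
}

type Callback = Arc<dyn Fn(MetricSample) + Send + Sync>;

/// Builds a callback that only enqueues samples, plus the receiving end
/// that a worker drains at its own pace.
fn channel_callback(capacity: usize) -> (Callback, Receiver<MetricSample>) {
    let (tx, rx) = sync_channel::<MetricSample>(capacity);
    let cb: Callback = Arc::new(move |sample: MetricSample| {
        // try_send never blocks: if the queue is full or the worker is gone,
        // the sample is dropped instead of stalling metric recording.
        let _ = tx.try_send(sample);
    });
    (cb, rx)
}

fn main() {
    let (cb, rx) = channel_callback(1024);

    // The worker is where slow work (network export, logging) would happen.
    let worker = thread::spawn(move || rx.into_iter().count());

    cb(MetricSample::CounterInc { name: "trading_orders_total".into(), value: 1 });
    cb(MetricSample::GaugeSet { name: "trading_daily_orders_count".into(), value: 1.0 });

    drop(cb); // dropping the callback closes the channel, so the worker exits
    println!("worker handled {} samples", worker.join().unwrap());
}
```

With the real API, the returned callback would be passed to metrics.set_callback(...). Dropping samples on a full queue trades completeness for latency; an application that cannot lose samples would need a larger queue or a different backpressure policy.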

2.2 Snapshot (On-Demand Pull)

let snapshot: Vec<MetricSample> = metrics.snapshot();

for sample in snapshot {
    println!("{:?}", sample);
}

Typical Use Cases:

  • Periodic reporting (every 10s / 30s)
  • A health check endpoint that returns the current metric state
  • Integration tests that verify metric increments
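For the periodic-reporting and health-check cases, a snapshot can be rendered into Prometheus-style text lines. This is a rough sketch under stated assumptions: the local `MetricSample` is a simplified stand-in that uses a BTreeMap for labels to get stable ordering, snapshot counters are assumed to carry cumulative totals, and label values are not escaped.

```rust
use std::collections::BTreeMap;

// Simplified local stand-in for axon_llm's MetricSample (assumption).
#[derive(Debug, Clone)]
enum MetricSample {
    CounterInc { name: String, labels: BTreeMap<String, String>, value: u64 },
    GaugeSet { name: String, value: f64 },
}

/// Formats labels as {k="v",...}; returns an empty string when there are none.
fn render_labels(labels: &BTreeMap<String, String>) -> String {
    if labels.is_empty() {
        return String::new();
    }
    let body: Vec<String> = labels.iter().map(|(k, v)| format!("{k}=\"{v}\"")).collect();
    format!("{{{}}}", body.join(","))
}

/// Renders one line per sample in the form `name{labels} value`.
fn render(snapshot: &[MetricSample]) -> String {
    let mut out = String::new();
    for sample in snapshot {
        match sample {
            MetricSample::CounterInc { name, labels, value } => {
                out.push_str(&format!("{name}{} {value}\n", render_labels(labels)));
            }
            MetricSample::GaugeSet { name, value } => {
                out.push_str(&format!("{name} {value}\n"));
            }
        }
    }
    out
}

fn main() {
    let mut labels = BTreeMap::new();
    labels.insert("tool".to_string(), "place".to_string());
    labels.insert("side".to_string(), "buy".to_string());
    labels.insert("status".to_string(), "success".to_string());

    let snapshot = vec![
        MetricSample::CounterInc { name: "trading_orders_total".into(), labels, value: 3 },
        MetricSample::GaugeSet { name: "trading_daily_orders_count".into(), value: 42.0 },
    ];
    print!("{}", render(&snapshot));
}
```

In an application, `snapshot` would come from metrics.snapshot(), and the rendered string could be returned from a health endpoint or logged on a timer.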

3. Application Integration Examples

3.1 Rust + Prometheus (using prometheus crate)

use std::sync::Arc;

use axon_llm::trading::metrics::{MetricSample, TradingMetrics};
use prometheus::{
    register_histogram_vec_with_registry, register_int_counter_vec_with_registry, Registry,
};

let registry = Registry::new();
let orders_total = register_int_counter_vec_with_registry!(
    "trading_orders_total", "Total trading orders",
    &["tool", "side", "status"], registry
)?;
let duration = register_histogram_vec_with_registry!(
    "trading_tool_execute_duration_seconds", "Tool execution duration",
    &["tool"], registry
)?;

let metrics = TradingMetrics::new();
metrics.set_callback(Arc::new(move |sample| match sample {
    MetricSample::CounterInc { name, labels, value } if name == "trading_orders_total" => {
        orders_total.with_label_values(&[&labels["tool"], &labels["side"], &labels["status"]]).inc_by(value);
    }
    MetricSample::HistogramObserve { name, labels, value_secs } if name == "trading_tool_execute_duration_seconds" => {
        duration.with_label_values(&[&labels["tool"]]).observe(value_secs);
    }
    _ => {}
}));

// Expose to Prometheus: on each scrape, encode registry.gather() with the
// encoder and write the result to the HTTP response.
let encoder = prometheus::TextEncoder::new();

3.2 Python + Prometheus (using prometheus_client)

import axon_quant
from prometheus_client import Counter, Histogram, start_http_server

orders_total = Counter(
    'trading_orders_total', 'Total trading orders',
    ['tool', 'side', 'status']
)
duration = Histogram(
    'trading_tool_execute_duration_seconds', 'Tool execution duration',
    ['tool']
)

# Start Prometheus exporter (separate HTTP port)
start_http_server(9100)

# Register callback
def on_sample(sample):
    kind, data = sample
    if kind == 'counter_inc' and data['name'] == 'trading_orders_total':
        orders_total.labels(**data['labels']).inc(data['value'])
    elif kind == 'histogram_observe' and data['name'] == 'trading_tool_execute_duration_seconds':
        duration.labels(**data['labels']).observe(data['value_secs'])

axon_quant.set_metrics_callback(on_sample)

3.3 Rust + OpenTelemetry (using opentelemetry crate)

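use opentelemetry::metrics::MeterProvider;
use opentelemetry::KeyValue;

// The OTLP pipeline builder differs between opentelemetry-otlp releases; this form
// matches the releases that still use `.init()` on instrument builders.
let provider = opentelemetry_otlp::new_pipeline()
    .metrics(opentelemetry_sdk::runtime::Tokio)
    .with_exporter(opentelemetry_otlp::new_exporter().tonic())
    .build()?;
let meter = provider.meter("axon-llm");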

let orders_counter = meter.u64_counter("trading_orders_total").init();
let duration_hist = meter.f64_histogram("trading_tool_execute_duration_seconds").init();

metrics.set_callback(Arc::new(move |sample| match sample {
    MetricSample::CounterInc { name, labels, value } if name == "trading_orders_total" => {
        // KeyValue needs owned keys and values, so clone out of the label map
        let attrs = labels.iter().map(|(k, v)| KeyValue::new(k.clone(), v.clone())).collect::<Vec<_>>();
        orders_counter.add(value, &attrs);
    }
    // ...
    _ => {}
}));

4. Alert Recommendations (Application Configuration)

axon does not provide centralized alerting rules. The recommendations below, based on the metrics above, are a starting point for applications:

Alert Name | Trigger Condition | Severity | Action
HighRiskRejection | rate(trading_risk_rejections_total[5m]) > 10 | warning | Check if LLM prompt has been jailbroken
CircuitBreakerOpen | trading_risk_gate_blocked_total > 0 | critical | Immediate human takeover, check decision logs
BackendErrorSpike | rate(trading_backend_errors_total[5m]) > 5 | critical | Check exchange API / OMS status
LatencyP99TooHigh | histogram_quantile(0.99, rate(trading_tool_execute_duration_seconds_bucket[5m])) > 5 | warning | Check backend latency, network quality
DailyOrderBurst | trading_daily_orders_count > 0.8 * max_daily_orders | info | Approaching risk limit, prepare rate limiting
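To show what the LatencyP99TooHigh condition evaluates, the sketch below estimates a quantile from cumulative bucket counts using linear interpolation inside the matching bucket, the same approach PromQL's histogram_quantile takes. It is a simplified illustration, not Prometheus's implementation; the function name and the sample counts are made up for the example.

```rust
/// Estimates quantile `q` (0 < q <= 1) from cumulative bucket counts.
/// `bounds` holds the finite upper bounds; `cumulative` has one extra
/// trailing entry for the +Inf bucket.
fn histogram_quantile(q: f64, bounds: &[f64], cumulative: &[u64]) -> Option<f64> {
    let total = *cumulative.last()? as f64;
    if total == 0.0 || cumulative.len() != bounds.len() + 1 || !(q > 0.0 && q <= 1.0) {
        return None;
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the target rank.
    let idx = cumulative.iter().position(|&c| c as f64 >= rank)?;
    if idx == bounds.len() {
        // The quantile falls in +Inf: report the largest finite bound.
        return bounds.last().copied();
    }
    let lower = if idx == 0 { 0.0 } else { bounds[idx - 1] };
    let prev = if idx == 0 { 0.0 } else { cumulative[idx - 1] as f64 };
    let in_bucket = cumulative[idx] as f64 - prev;
    // Assume observations are spread evenly within the bucket.
    Some(lower + (bounds[idx] - lower) * (rank - prev) / in_bucket)
}

fn main() {
    let bounds = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0];
    // Made-up cumulative counts for 100 observations.
    let cumulative: [u64; 9] = [0, 0, 10, 50, 90, 98, 99, 100, 100];
    let p99 = histogram_quantile(0.99, &bounds, &cumulative);
    println!("p99 = {:?}", p99);
    println!("alert fires: {}", p99.map_or(false, |v| v > 5.0));
}
```

One consequence worth checking against your bucket layout: when the quantile falls in the +Inf bucket, the estimate is capped at the largest finite bound. With the typical buckets from section 1.4 that cap is 5.0, so a strict `> 5` threshold would only fire if a larger finite bucket (for example 10.0) is configured or the threshold is lowered.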

5. Performance Overhead

TradingMetrics performance overhead is minimal:

  • Counter increment: 1 atomic add (~10ns)
  • Histogram observation: 1 atomic add + bucket lookup (~50ns)
  • Callback overhead: negligible when no callback is registered (the recorder only locks and reads the Mutex<Option<...>>)

Measured: in a stress test at 10K orders/sec, the metrics module used less than 0.1% CPU.

Next Steps