
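LLM Trading Metrics and Alerts

Applies to: axon-llm v0.2.0+; prerequisite reading: overview.md §5

This document describes the TradingMetrics metric set, its data outlets, and example templates for integrating it into a monitoring backend.

Key decision: axon-llm does not impose a particular monitoring stack (no built-in Prometheus exporter, no dependency on axon-monitor, no bundled Grafana dashboards or Prometheus alerting YAML). TradingMetrics is self-contained, built on Mutex + AtomicU64, and applications wire it to any monitoring backend (Prometheus / OpenTelemetry / StatsD / custom) through two data outlets: callback and snapshot.

Each team configures Grafana dashboards and Prometheus alerting rules for its own monitoring stack; the axon roadmap does not maintain them centrally.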

1. Core metrics (5 metrics)

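1.1 Counter: trading_orders_total{tool,side,status}

Incremented by 1 each time a tool places, cancels, or replaces an order. Labels:

| Label | Values | Meaning |
|---|---|---|
| tool | place / cancel / replace | Which tool triggered the operation |
| side | buy / sell / none | Order side (none for cancel/replace) |
| status | success / rejected / failed | Execution result |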

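1.2 Counter: trading_risk_rejections_total{source}

Incremented by 1 on each risk-control rejection. Labels:

| Label | Values | Meaning |
|---|---|---|
| source | risk_limits / risk_gate / safety_mode | Which line of defense rejected the request |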

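1.3 Counter: trading_backend_errors_total{backend,kind}

Incremented by 1 on each failed backend call. Labels:

| Label | Values | Meaning |
|---|---|---|
| backend | mock / exchange / oms / backtest | Which backend failed |
| kind | network / rejected / timeout / other | Error type |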

1.4 Histogram: trading_tool_execute_duration_seconds{tool}

End-to-end latency distribution of Tool::execute(). Typical buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0] (seconds).
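To make the bucket layout concrete, here is a minimal sketch of how an observation could be mapped onto the bounds listed above. It assumes Prometheus-style buckets whose upper bound is inclusive, plus an implicit +Inf overflow bucket; `BUCKET_BOUNDS_SECS` and `bucket_index` are illustrative names, not part of the axon-llm API.

```rust
/// Upper bounds (seconds) of the typical buckets listed above.
const BUCKET_BOUNDS_SECS: [f64; 8] = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0];

/// Returns the index of the first bucket whose upper bound is >= `value_secs`.
/// An index equal to `BUCKET_BOUNDS_SECS.len()` denotes the implicit +Inf bucket.
/// (Assumption: upper bounds are inclusive, as in Prometheus `le` buckets.)
fn bucket_index(value_secs: f64) -> usize {
    // The bounds are sorted, so a binary search finds the slot.
    BUCKET_BOUNDS_SECS.partition_point(|&bound| bound < value_secs)
}
```

With these bounds, a 3 ms call (`0.003`) lands in the second bucket (index 1, bound 0.005), and anything slower than 5 s falls into the overflow bucket.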

1.5 Gauge: trading_daily_orders_count

Cumulative order count for the current day (mirrored from RiskLimits::DailyCounter); resets automatically at 00:00 UTC each day.
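The reset rule can be illustrated with a small sketch. This is not the actual RiskLimits::DailyCounter implementation; `DailyOrderCounter` and its methods are hypothetical names, and the sketch assumes the day boundary is derived from Unix time, where every multiple of 86,400 seconds falls on 00:00 UTC.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

const SECS_PER_DAY: u64 = 86_400;

/// Hypothetical daily counter that resets when the UTC day changes.
struct DailyOrderCounter {
    day_index: u64, // days since the Unix epoch (UTC)
    count: u64,
}

impl DailyOrderCounter {
    fn new(now_unix_secs: u64) -> Self {
        Self { day_index: now_unix_secs / SECS_PER_DAY, count: 0 }
    }

    /// Records one order at `now_unix_secs` and returns the gauge value to publish.
    fn record_order(&mut self, now_unix_secs: u64) -> u64 {
        let today = now_unix_secs / SECS_PER_DAY;
        if today != self.day_index {
            // Crossed 00:00 UTC: start a fresh day.
            self.day_index = today;
            self.count = 0;
        }
        self.count += 1;
        self.count
    }
}

/// Current wall-clock time as Unix seconds (0 if the clock is before the epoch).
fn unix_now_secs() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}
```

Passing the timestamp in explicitly keeps the reset logic testable without waiting for midnight; production code would feed it `unix_now_secs()`.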

2. Data outlets

2.1 Callback (real-time push)

use std::sync::Arc;
use axon_llm::trading::metrics::{TradingMetrics, MetricSample};

let metrics = TradingMetrics::new();

// Register a callback: invoked immediately on every metric change
metrics.set_callback(Arc::new(|sample: MetricSample| {
    match sample {
        MetricSample::CounterInc { name, labels, value } => {
            println!("[counter] {} {:?} += {}", name, labels, value);
            // Push to Prometheus / OTLP / a custom sink
        }
        MetricSample::HistogramObserve { name, labels, value_secs } => {
            println!("[histogram] {} {:?} observe {}s", name, labels, value_secs);
        }
        MetricSample::GaugeSet { name, value } => {
            println!("[gauge] {} = {}", name, value);
        }
    }
}));

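Notes:

  • TradingMetrics stores the callback internally as Mutex<Option<Arc<dyn Fn ...>>>, so only one callback can be registered at a time.
  • A callback that blocks will block all subsequent metric recording, so it must be non-blocking (hand the sample off through a channel / mpsc / coroutine).
  • The recommended pattern is to move the work done inside the callback to a dedicated tokio task.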
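One way to satisfy the non-blocking requirement is a bounded queue with `try_send`, sketched below using only the standard library. `Sample` is a reduced stand-in for `MetricSample`, and `non_blocking_callback` is a hypothetical helper; the drop-on-full policy is an assumption that favors trading latency over metric completeness.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Arc;

/// Stand-in for `axon_llm::trading::metrics::MetricSample`, reduced to one variant.
#[derive(Debug, Clone, PartialEq)]
enum Sample {
    CounterInc { name: String, value: u64 },
}

/// Builds a callback that never blocks: samples go into a bounded queue and
/// are dropped (and counted) when the queue is full or the consumer is gone.
fn non_blocking_callback(
    capacity: usize,
) -> (Arc<dyn Fn(Sample) + Send + Sync>, Receiver<Sample>, Arc<AtomicU64>) {
    let (tx, rx): (SyncSender<Sample>, Receiver<Sample>) = sync_channel(capacity);
    let dropped = Arc::new(AtomicU64::new(0));
    let dropped_in_cb = Arc::clone(&dropped);
    let callback: Arc<dyn Fn(Sample) + Send + Sync> = Arc::new(move |sample: Sample| {
        match tx.try_send(sample) {
            Ok(()) => {}
            // Drop the sample rather than stall the trading path.
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                dropped_in_cb.fetch_add(1, Ordering::Relaxed);
            }
        }
    });
    (callback, rx, dropped)
}
```

A consumer thread or task then drains the `Receiver` and does the slow work (network export, formatting). In an async application, a bounded tokio channel with its own `try_send` plays the same role.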

2.2 Snapshot (on-demand pull)

let snapshot: Vec<MetricSample> = metrics.snapshot();

for sample in snapshot {
    println!("{:?}", sample);
}

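Typical uses:

  • Periodic reporting (every 10 s / 30 s)
  • Returning the current metric state from a health-check endpoint
  • Verifying metric increments in integration tests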
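For the integration-test use case, a small helper that totals counter samples by name and label is often enough. The sketch below works on a stand-in type: `CounterSample`, the `BTreeMap` label representation, and `counter_total` are assumptions for illustration, since the exact shape of the samples returned by `snapshot()` is not specified here.

```rust
use std::collections::BTreeMap;

/// Stand-in for the counter variant of `MetricSample`; the label map type is an assumption.
#[derive(Debug, Clone)]
struct CounterSample {
    name: String,
    labels: BTreeMap<String, String>,
    value: u64,
}

/// Sums the values of all counter samples with the given name whose labels
/// contain every `(key, value)` pair in `wanted`.
fn counter_total(snapshot: &[CounterSample], name: &str, wanted: &[(&str, &str)]) -> u64 {
    snapshot
        .iter()
        .filter(|s| s.name == name)
        .filter(|s| {
            wanted
                .iter()
                .all(|(k, v)| s.labels.get(*k).map(String::as_str) == Some(*v))
        })
        .map(|s| s.value)
        .sum()
}
```

A test can take the total before and after invoking a tool and assert on the difference, which stays valid even if other tests touched the same metrics earlier.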

3. Application integration examples

3.1 Rust + Prometheus (using the prometheus crate)

use std::sync::Arc;

use axon_llm::trading::metrics::{MetricSample, TradingMetrics};
use prometheus::{register_histogram_vec_with_registry, register_int_counter_vec_with_registry, Registry};

let registry = Registry::new();
let orders_total = register_int_counter_vec_with_registry!(
    "trading_orders_total", "Total trading orders",
    &["tool", "side", "status"], registry
)?;
let duration = register_histogram_vec_with_registry!(
    "trading_tool_execute_duration_seconds", "Tool execution duration",
    &["tool"], registry
)?;

let metrics = TradingMetrics::new();
metrics.set_callback(Arc::new(move |sample| match sample {
    MetricSample::CounterInc { name, labels, value } if name == "trading_orders_total" => {
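        // Assumes `labels` maps label names to String values; `as_str()` yields the `&str` the prometheus crate expects.
        orders_total.with_label_values(&[labels["tool"].as_str(), labels["side"].as_str(), labels["status"].as_str()]).inc_by(value);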
    }
    MetricSample::HistogramObserve { name, labels, value_secs } if name == "trading_tool_execute_duration_seconds" => {
        duration.with_label_values(&[labels["tool"].as_str()]).observe(value_secs);
    }
    _ => {}
}));

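// Expose to Prometheus: on each scrape, encode registry.gather() into the HTTP response body
use prometheus::Encoder; // trait that provides `encode`
let encoder = prometheus::TextEncoder::new();
let mut body = Vec::new();
encoder.encode(&registry.gather(), &mut body)?;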

3.2 Python + Prometheus (using prometheus_client)

import axon_quant
from prometheus_client import Counter, Histogram, start_http_server

orders_total = Counter(
    'trading_orders_total', 'Total trading orders',
    ['tool', 'side', 'status']
)
duration = Histogram(
    'trading_tool_execute_duration_seconds', 'Tool execution duration',
    ['tool']
)

# Start the Prometheus exporter (on its own HTTP port)
start_http_server(9100)

# Register the callback
def on_sample(sample):
    kind, data = sample
    if kind == 'counter_inc' and data['name'] == 'trading_orders_total':
        orders_total.labels(**data['labels']).inc(data['value'])
    elif kind == 'histogram_observe' and data['name'] == 'trading_tool_execute_duration_seconds':
        duration.labels(**data['labels']).observe(data['value_secs'])

axon_quant.set_metrics_callback(on_sample)

3.3 Rust + OpenTelemetry (using the opentelemetry crate)

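use opentelemetry::metrics::MeterProvider;
use opentelemetry::KeyValue;

// Assumption: an older opentelemetry-otlp release that still exposes `new_pipeline()`.
// Metrics pipelines are built with `.metrics(runtime)`; `install_simple()` belongs to the tracing/logging pipelines.
let provider = opentelemetry_otlp::new_pipeline()
    .metrics(opentelemetry_sdk::runtime::Tokio)
    .with_exporter(opentelemetry_otlp::new_exporter().tonic())
    .build()?;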
let meter = provider.meter("axon-llm");

let orders_counter = meter.u64_counter("trading_orders_total").init();
let duration_hist = meter.f64_histogram("trading_tool_execute_duration_seconds").init();

metrics.set_callback(Arc::new(move |sample| match sample {
    MetricSample::CounterInc { name, labels, value } if name == "trading_orders_total" => {
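        // Assumes `labels` maps String to String; KeyValue needs owned keys and values, hence the clones.
        let attrs = labels.iter().map(|(k, v)| KeyValue::new(k.clone(), v.clone())).collect::<Vec<_>>();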
        orders_counter.add(value, &attrs);
    }
    // ...
    _ => {}
}));

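4. Alerting suggestions (configured by the application)

axon does not ship centralized alerting rules. The following metric-based alerts are suggestions for application teams to adapt:

| Alert | Trigger condition | Severity | Action |
|---|---|---|---|
| HighRiskRejection | rate(trading_risk_rejections_total[5m]) > 10 | warning | Check whether the LLM prompt has been jailbroken |
| CircuitBreakerOpen | trading_risk_gate_blocked_total > 0 | critical | Take over manually at once and inspect the decision log |
| BackendErrorSpike | rate(trading_backend_errors_total[5m]) > 5 | critical | Check the exchange API / OMS status |
| LatencyP99TooHigh | histogram_quantile(0.99, rate(trading_tool_execute_duration_seconds_bucket[5m])) > 5 | warning | Check backend latency and network quality |
| DailyOrderBurst | trading_daily_orders_count > 0.8 * max_daily_orders | info | Approaching the risk limit; prepare to throttle |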
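Teams without a Prometheus server can evaluate the same conditions in process from two snapshots. The sketch below is a simplified approximation, not PromQL: `per_second_rate` mimics `rate()` without its extrapolation, and both function names are hypothetical.

```rust
/// Per-second rate between two counter readings taken `elapsed_secs` apart,
/// similar in spirit to PromQL `rate()` (without its extrapolation).
fn per_second_rate(earlier: u64, later: u64, elapsed_secs: f64) -> f64 {
    if elapsed_secs <= 0.0 {
        return 0.0;
    }
    // A counter that went backwards is treated as a restart from zero.
    let delta = if later >= earlier { later - earlier } else { later };
    delta as f64 / elapsed_secs
}

/// DailyOrderBurst check: true once the daily count exceeds 80% of the limit.
/// Integer arithmetic avoids floating-point rounding at the boundary.
fn daily_order_burst(daily_orders: u64, max_daily_orders: u64) -> bool {
    (daily_orders as u128) * 10 > (max_daily_orders as u128) * 8
}
```

For example, a rejection counter that moves from 100 to 3,400 over a 300-second window yields 11 rejections per second, which would cross the HighRiskRejection threshold of 10.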

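5. Performance overhead

TradingMetrics adds very little overhead:

  • Counter increment: 1 atomic add (~10 ns)
  • Histogram observation: 1 atomic add plus bucket lookup (~50 ns)
  • Callback overhead: none when no callback is registered, beyond checking the internal Mutex<Option<...>>

Measured: in a load test at 10K orders per second, the metrics module used < 0.1% CPU.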
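As an illustration of what "1 atomic add" means on the hot path, here is a minimal lock-free counter built on AtomicU64. `AtomicCounter` and `hammer` are illustrative names rather than TradingMetrics internals, and the sketch makes no claim about the timings quoted above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Minimal lock-free counter: one atomic add per increment.
#[derive(Default)]
struct AtomicCounter(AtomicU64);

impl AtomicCounter {
    fn inc_by(&self, n: u64) {
        // Relaxed suffices for a statistic that does not order other memory accesses.
        self.0.fetch_add(n, Ordering::Relaxed);
    }

    fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Increments one shared counter from `threads` threads, `per_thread` times each,
/// and returns the final value after all threads have been joined.
fn hammer(threads: usize, per_thread: u64) -> u64 {
    let counter = Arc::new(AtomicCounter::default());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    c.inc_by(1);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().expect("worker thread panicked");
    }
    counter.get()
}
```

No lock is taken on the increment path, so concurrent tool executions do not serialize on the counter, and no increments are lost.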

Next steps