Scenario 3 — Production Deployment and Monitoring¶
Related examples: -
examples/09_exchange/binance_demo.py— Exchange integration -examples/12_inference/inference_demo.py— Inference engine -examples/14_ensemble/ensemble_demo.py— Ensemble learning -examples/15_explain/explain_demo.py— Explainability
This document covers the complete process of moving AXON quantitative trading framework from laboratory to production, including inference backend selection, model hot updates, explainability auditing, model ensemble dynamic weighting, and exchange integration.
1. Inference Engine Three-Backend Selection Guide¶
AXON's inference engine (axon-inference) supports three backends, each suited for different deployment scenarios.
1.1 Backend Comparison Table¶
| Dimension | ONNX | tch (PyTorch C++) | Candle (Pure Rust) |
|---|---|---|---|
| Dependencies | ort (ONNX Runtime) | tch-rs (LibTorch) | candle-core + candle-nn |
| Binary Size | Medium (+ ONNX Runtime) | Large (+ LibTorch) | Small (Pure Rust) |
| Startup Speed | Fast | Medium | Very Fast |
| CPU Inference Latency | < 500µs | < 1ms | < 500µs |
| GPU Support | CUDA / TensorRT | CUDA / ROCm | CUDA (Experimental) |
| Model Format | .onnx | .pt / .torchscript | .safetensors |
| Hot Update Support | replace_session | replace_session | load(new_path) |
| Use Case | Production preferred | Research/fast iteration | Minimal deployment without Python |
1.2 Backend Selection Decision Tree¶
Need GPU acceleration?
├── Yes → Need TensorRT?
│ ├── Yes → ONNX (Level3 optimization + TensorRT EP)
│ └── No → tch (CUDA) or ONNX (CUDA EP)
└── No → Mind Python dependency?
├── Yes → Candle (pure Rust, zero Python)
└── No → ONNX (CPU, most mature ecosystem)
1.3 Configuration Example¶
from axon_quant import InferenceBackend, Device, ModelConfig
# ONNX production configuration
onnx_config = ModelConfig(
path="models/production.onnx",
backend=InferenceBackend.ONNX,
device=Device.CUDA(0), # Use first GPU
input_shape=[1, 64, 128], # [batch, seq_len, features]
output_dim=3, # Buy / Sell / Hold
fp16=True, # Enable FP16 inference
num_threads=4, # ONNX Runtime threads
)
# Candle zero-dependency configuration
candle_config = ModelConfig(
path="models/production.safetensors",
backend=InferenceBackend.CANDLE,
device=Device.CPU,
input_shape=[1, 4, 1], # input_dim = 1*4*1 = 4
output_dim=3,
fp16=False, # Candle doesn't support FP16 yet
num_threads=4,
)
2. Model Hot Update¶
In production environments, models need to be updated without restarting services. AXON achieves atomic replacement via ModelHotReloader + notify file monitoring.
2.1 Core Mechanism¶
File system monitoring (notify)
│
▼
Detect model file change
│
▼
Debounce processing (500ms aggregation of consecutive events)
│
▼
Calculate new model SHA256 checksum
│
▼
Acquire backend write lock → load new model → release write lock
│
▼
Atomic version increment → broadcast via watch channel
2.2 Hot Update Code Example¶
import asyncio
from axon_quant import ModelHotReloader, OnnxBackend, ModelConfig
async def setup_hot_reload():
"""
Configure model hot update system.
When the model file changes, automatically:
1. Verify SHA256 checksum
2. Acquire write lock
3. Load new model
4. Broadcast version change
"""
config = ModelConfig(
path="models/production.onnx",
backend=InferenceBackend.ONNX,
device=Device.CUDA(0),
)
engine = OnnxBackend(config)
engine.load(Path(config.path))
reloader = ModelHotReloader(engine, config)
# Start file watcher
reloader.spawn_watcher()
# Subscribe to version changes
version_rx = reloader.subscribe()
async def watch_versions():
while True:
await version_rx.changed()
version = version_rx.borrow()
print(f"Model updated to version {version}")
asyncio.create_task(watch_versions())
return reloader
3. Explainability Audit¶
AXON provides built-in explainability via axon-explain, supporting SHAP feature attribution, counterfactual explanations, and decision reports.
3.1 SHAP Feature Attribution¶
from axon_quant.explain import KernelSHAP
explainer = KernelSHAP(model)
# Explain a single prediction
explanation = explainer.explain(
observation=obs,
action=predicted_action,
background_data=background_samples,
)
# Visualize feature importance
print("Feature importance:")
for feature, importance in sorted(
zip(explanation.feature_names, explanation.feature_importances),
key=lambda x: abs(x[1]),
reverse=True,
):
print(f" {feature}: {importance:.4f}")
3.2 Counterfactual Explanations¶
from axon_quant.explain import CounterfactualGenerator
generator = CounterfactualGenerator(model)
# Generate counterfactual: "What if the action was different?"
counterfactuals = generator.generate(
observation=obs,
original_action=original_action,
n_samples=100,
)
# Analyze: "What would need to change for a different outcome?"
for cf in counterfactuals:
print(f"Change {cf.feature_name} by {cf.delta:.4f}")
print(f" Original: {cf.original_value:.4f}")
print(f" Counterfactual: {cf.counterfactual_value:.4f}")
3.3 Decision Report Generation¶
from axon_quant.explain import ReportGenerator
generator = ReportGenerator(model)
# Generate comprehensive decision report
report = generator.generate_report(
observation=obs,
action=predicted_action,
include_shap=True,
include_counterfactuals=True,
include_feature_importance=True,
)
# Export as HTML/PDF
report.export_html("decision_report.html")
report.export_pdf("decision_report.pdf")
4. Model Ensemble Dynamic Weighting¶
AXON's axon-ensemble module provides dynamic weight adjustment based on real-time performance monitoring.
4.1 DynamicWeightedEnsemble¶
from axon_quant.ensemble import DynamicWeightedEnsemble
# Create ensemble with multiple models
ensemble = DynamicWeightedEnsemble(
models=[ppo_model, sac_model, rule_based_model],
initial_weights=[0.4, 0.4, 0.2],
performance_window=100, # Last 100 trades for weight calculation
)
# Update weights based on performance
ensemble.update_weights(
performances=[ppo_sharpe, sac_sharpe, rule_sharpe]
)
# Get weighted prediction
action = ensemble.predict(observation)
4.2 Performance Monitoring¶
# Monitor ensemble performance
metrics = ensemble.get_metrics()
print(f"Ensemble Sharpe: {metrics['sharpe']:.2f}")
print(f"Model weights: {ensemble.weights}")
print(f"Active models: {metrics['active_count']}")
5. Exchange Integration¶
AXON provides production-ready exchange adapters for Binance and OKX.
5.1 Binance Adapter Configuration¶
from axon_quant.exchange import BinanceAdapter, ExchangeConfig
config = ExchangeConfig(
api_key="YOUR_API_KEY",
api_secret="YOUR_API_SECRET",
testnet=False, # Production
rate_limit=RateLimitConfig(
requests_per_second=10,
orders_per_minute=60,
),
reconnect=ReconnectConfig(
max_retries=10,
initial_backoff_ms=500,
),
)
adapter = BinanceAdapter(config)
await adapter.connect()
5.2 Production Order Flow¶
# Place order with risk checks
order = Order(
symbol="BTCUSDT",
side=Side.Buy,
order_type=OrderType.Limit,
price=50000.0,
quantity=0.001,
)
# Risk check before submission
if risk_engine.check_order(order):
order_id = await adapter.place_order(order)
print(f"Order placed: {order_id}")
else:
print("Order rejected by risk engine")
6. Monitoring and Alerting¶
6.1 Metrics Collection¶
from axon_quant.metrics import MetricsCollector
collector = MetricsCollector()
# Record trading metrics
collector.record_order(side="buy", symbol="BTCUSDT", quantity=0.001)
collector.record_latency(operation="place_order", duration_ms=45.2)
collector.record_pnl(pnl=150.0, symbol="BTCUSDT")
# Export to Prometheus
collector.export_prometheus(port=9100)
6.2 Alert Rules¶
# Example Prometheus alert rules
groups:
- name: trading_alerts
rules:
- alert: HighLatency
expr: trading_order_latency_seconds > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "Trading order latency is high"
- alert: LargeDrawdown
expr: trading_max_drawdown > 0.1
for: 1m
labels:
severity: critical
annotations:
summary: "Portfolio drawdown exceeds 10%"
7. Deployment Checklist¶
Pre-Deployment¶
- Run full test suite:
cargo test --workspace - Verify configuration:
axon validate-config -c production.toml - Load test with simulated traffic
- Set up monitoring and alerting
- Configure backup and rollback procedures
Deployment¶
- Deploy to staging environment first
- Run integration tests against staging
- Gradual rollout (canary deployment)
- Monitor metrics for anomalies
- Keep previous version ready for rollback
Post-Deployment¶
- Verify all trading pairs are operational
- Check latency metrics are within SLA
- Monitor error rates
- Review PnL and position metrics
- Document any issues encountered
Next Steps¶
- Operations Runbook — Deployment and troubleshooting
- Risk & Safety — Risk control configuration
- Metrics & Alerting — Monitoring setup