SRE Architecture & Observability Strategy
This document outlines the Site Reliability Engineering (SRE) foundation for Legends of Hastinapur.
Observability Pillars
1. Metrics (Prometheus)
Model: Pull-based metrics collection
Implementation:
- Client Library: metrics crate + metrics-exporter-prometheus
- Endpoint: http://localhost:9000/metrics (client), http://localhost:9001/metrics (server)
- Format: Prometheus exposition format
Key Metrics:
- tick_duration_seconds: Game loop duration (histogram)
- entities_count: Number of active entities (gauge)
- fps: Frames per second, client only (gauge)
- network_packet_size_bytes: Bandwidth usage (histogram)
- active_connections: Number of connected players, server only (gauge)
- combat_events_total: Combat interactions (counter)
- quest_completions_total: Quest completions by quest_id (counter)
Prometheus Configuration: See prometheus_config.md
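For orientation, the text a scraper receives from the endpoints above can be sketched with the standard library alone. This illustrates the exposition format only; it is not how metrics-exporter-prometheus renders output internally, and the Sample type and render function are hypothetical names introduced for the example. Label-value escaping is deliberately left out.

```rust
/// One metric sample. A hypothetical type, for illustration only.
pub struct Sample<'a> {
    pub name: &'a str,
    pub kind: &'a str, // "counter" or "gauge"
    pub help: &'a str,
    pub labels: &'a [(&'a str, &'a str)],
    pub value: f64,
}

/// Render samples in the Prometheus text exposition format.
/// Simplified: no escaping of label values, one sample per metric name.
pub fn render(samples: &[Sample<'_>]) -> String {
    let mut out = String::new();
    for s in samples {
        out.push_str(&format!("# HELP {} {}\n", s.name, s.help));
        out.push_str(&format!("# TYPE {} {}\n", s.name, s.kind));
        if s.labels.is_empty() {
            out.push_str(&format!("{} {}\n", s.name, s.value));
        } else {
            // Labels are rendered as name{key="value",...} value
            let labels: Vec<String> = s
                .labels
                .iter()
                .map(|(k, v)| format!("{}=\"{}\"", k, v))
                .collect();
            out.push_str(&format!("{}{{{}}} {}\n", s.name, labels.join(","), s.value));
        }
    }
    out
}
```

Rendering entities_count as a gauge and quest_completions_total with a quest_id label yields one HELP line, one TYPE line and one sample line per metric, which is the shape Prometheus expects when it scrapes /metrics.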
2. Structured Logging (Tracing)
Library: tracing + tracing-subscriber
Log Formats:
- Development: Pretty printing with fmt::layer
  let fmt_layer = tracing_subscriber::fmt::layer()
      .with_target(false)
      .with_thread_ids(true)
      .with_level(true);
- Production: JSON output (the fmt layer with .json(), which requires the json feature of tracing-subscriber) for ingestion by Loki/Elasticsearch
  let fmt_layer = tracing_subscriber::fmt::layer()
      .json()
      .with_current_span(true)
      .with_span_list(true);
Log Levels:
- TRACE: Detailed debugging (disabled in production)
- DEBUG: Development diagnostics
- INFO: Significant game events (player login, quest complete, item drop)
- WARN: Recoverable issues (packet drop, interpolation correction, retry attempts)
- ERROR: System failures (DB connection lost, asset load failure)
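How a per-target filter applies these levels can be sketched with the standard library. The parser below handles only the default,target=level shape of a directive such as info,legends_client=debug. It is a simplified illustration, not the EnvFilter implementation in tracing-subscriber, and Level, Filter and parse_filter are hypothetical names.

```rust
/// Severity levels, ordered from least to most verbose.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

fn parse_level(s: &str) -> Option<Level> {
    match s.trim().to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// A default maximum level plus per-target overrides.
pub struct Filter {
    default: Level,
    targets: Vec<(String, Level)>,
}

/// Parse "level" and "target=level" directives separated by commas.
/// Unparseable parts are ignored; the fallback default is an assumption.
pub fn parse_filter(spec: &str) -> Filter {
    let mut filter = Filter { default: Level::Error, targets: Vec::new() };
    for part in spec.split(',').map(str::trim).filter(|p| !p.is_empty()) {
        match part.split_once('=') {
            Some((target, level)) => {
                if let Some(level) = parse_level(level) {
                    filter.targets.push((target.trim().to_string(), level));
                }
            }
            None => {
                if let Some(level) = parse_level(part) {
                    filter.default = level;
                }
            }
        }
    }
    filter
}

impl Filter {
    /// An event is enabled when its level is no more verbose than the
    /// maximum configured for its target (longest matching prefix wins).
    pub fn enabled(&self, target: &str, level: Level) -> bool {
        let max = self
            .targets
            .iter()
            .filter(|(t, _)| {
                target == t.as_str() || target.starts_with(format!("{}::", t).as_str())
            })
            .max_by_key(|(t, _)| t.len())
            .map(|(_, l)| *l)
            .unwrap_or(self.default);
        level <= max
    }
}
```

With the directive info,legends_client=debug, a DEBUG event from a module under legends_client passes, while a DEBUG event from any other target is dropped because the default ceiling is INFO.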
Environment Variable:
RUST_LOG=info,legends_client=debug
3. Error Tracking (Sentry)
Integration: sentry + sentry-tracing
Scope:
- Panic capturing (automatic)
- Error-level log events (via sentry-tracing layer)
- Custom error contexts (breadcrumbs)
Configuration:
// An unset or empty SENTRY_DSN leaves the Sentry client disabled.
let sentry_dsn = std::env::var("SENTRY_DSN").unwrap_or_default();
// Keep the guard alive for the life of the program; dropping it flushes pending events.
let _guard = sentry::init((sentry_dsn, sentry::ClientOptions {
    release: sentry::release_name!(),
    environment: Some("production".into()),
    ..Default::default()
}));
Environment Variable:
SENTRY_DSN=https://<key>@sentry.io/<project>
4. Distributed Tracing (Future - OpenTelemetry)
Goal: Trace requests across Client → Game Server → Database
Integration Plan:
- Add tracing-opentelemetry layer
- Export to Tempo or Jaeger backend
- Correlate spans across service boundaries
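Correlating spans across service boundaries generally means carrying the trace ID and parent span ID along with each request. One common carrier is the W3C Trace Context traceparent header. Whether this project would use it is an assumption, as are the TraceContext type and the function names below. The sketch only formats and parses the header with the standard library and does not involve tracing-opentelemetry.

```rust
/// Identifiers propagated between services. Hypothetical type for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TraceContext {
    pub trace_id: u128,
    pub span_id: u64,
    pub sampled: bool,
}

/// Format as a version-00 traceparent value:
/// 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
pub fn to_traceparent(ctx: &TraceContext) -> String {
    format!(
        "00-{:032x}-{:016x}-{:02x}",
        ctx.trace_id, ctx.span_id, ctx.sampled as u8
    )
}

fn is_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Parse a version-00 traceparent value. Returns None on any malformed
/// field or on all-zero IDs, which the format treats as invalid.
pub fn parse_traceparent(header: &str) -> Option<TraceContext> {
    let mut parts = header.trim().split('-');
    let version = parts.next()?;
    let trace = parts.next()?;
    let span = parts.next()?;
    let flags = parts.next()?;
    if version != "00" || parts.next().is_some() {
        return None;
    }
    if !is_hex(trace, 32) || !is_hex(span, 16) || !is_hex(flags, 2) {
        return None;
    }
    let trace_id = u128::from_str_radix(trace, 16).ok()?;
    let span_id = u64::from_str_radix(span, 16).ok()?;
    let flags = u8::from_str_radix(flags, 16).ok()?;
    if trace_id == 0 || span_id == 0 {
        return None;
    }
    Some(TraceContext { trace_id, span_id, sampled: flags & 0x01 == 0x01 })
}
```

In this picture the client would attach the header to each request, and the game server would parse it and start its own span as a child of the received span ID, so both ends show up under one trace in Tempo or Jaeger.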
Example Span:
#[tracing::instrument(skip(inventory))]
fn process_item_use(player_id: u64, item_id: ItemId, inventory: &mut Inventory) {
    // Automatic span creation with player_id and item_id
}
Infrastructure Components
Local Development Stack
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
Production Stack (Future)
- Metrics: Prometheus + Thanos (long-term storage)
- Logs: Loki + Grafana
- Traces: Tempo + Grafana
- Dashboards: Grafana
- Alerting: AlertManager
Alerting Rules (Future)
Critical Alerts
- High Crash Rate: > 1% of sessions crash within 5 minutes
- Server Down: No metrics received for 1 minute
- Database Unavailable: Connection pool exhausted
Warning Alerts
- Low TPS: Server TPS < 20 for 5 minutes
- High Latency: P95 tick duration > 50ms
- Memory Leak: Memory usage increasing > 10% per hour
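In practice these conditions would be written as Prometheus alerting rules evaluated over histogram data. Purely to make the High Latency rule concrete, the sketch below computes a nearest-rank P95 over raw tick durations and compares it with the 50 ms threshold. The function names are hypothetical, and nearest-rank is an assumption chosen for simplicity; Prometheus estimates quantiles from histogram buckets instead.

```rust
use std::time::Duration;

/// Nearest-rank percentile: the smallest sample such that at least
/// `pct` percent of all samples are less than or equal to it.
/// Returns None for empty input or a percentile outside 1..=100.
pub fn percentile(samples: &[Duration], pct: usize) -> Option<Duration> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Integer form of ceil(n * pct / 100); always within 1..=n here.
    let rank = (sorted.len() * pct + 99) / 100;
    Some(sorted[rank - 1])
}

/// True when the P95 of the given tick durations exceeds the threshold.
pub fn high_latency(samples: &[Duration], threshold: Duration) -> bool {
    percentile(samples, 95).map_or(false, |p95| p95 > threshold)
}
```

A percentile-based rule is less jumpy than a maximum: with nineteen 10 ms ticks and a single 200 ms spike, the P95 is still 10 ms and the warning does not fire.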
Metrics Collection Pattern
use metrics::{counter, histogram, gauge};
// Counter: Increment-only
counter!("quest_completions_total", "quest_id" => quest_id).increment(1);
// Histogram: Distribution of values
histogram!("tick_duration_seconds").record(duration.as_secs_f64());
// Gauge: Point-in-time value
gauge!("entities_count").set(entity_count as f64);
Logging Best Practices
Use Structured Fields
// Good: Structured
info!(player_id = %player.id, quest_id = %quest.id, "Quest completed");
// Bad: String interpolation
info!("Quest {} completed by player {}", quest.id, player.id);
Use Spans for Context
let span = info_span!("combat_tick", attacker = %attacker_id, target = %target_id);
let _enter = span.enter();
// All logs within this scope inherit the span context
debug!("Calculating damage");
info!(damage = %final_damage, "Damage dealt");
Avoid Logging in Hot Paths
// Bad: Logs every frame
for entity in entities.iter() {
    trace!("Processing entity {:?}", entity); // Too verbose
}
// Good: Log summaries
debug!(entity_count = entities.len(), "Processed entities");
Integration with Bevy
See bevy_telemetry_integration.md for details on integrating this observability stack with Bevy's plugin system.
CI/CD Integration
Automated Checks
- cargo check: Compilation
- cargo clippy: Linting
- cargo test: Unit tests
- cargo audit: Security vulnerabilities
- cargo deny: License compliance
Deployment Pipeline
- Build release binary
- Run integration tests
- Deploy to staging
- Smoke tests (verify metrics endpoint, Sentry connection)
- Deploy to production
- Monitor error rates and latency
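The smoke-test step that verifies the metrics endpoint could, for instance, fetch /metrics and confirm that the expected metric families are present. The sketch below covers only the checking half, applied to an already-fetched response body. The missing_metrics name is hypothetical, and the check assumes the exporter emits a TYPE comment line for every metric family.

```rust
use std::collections::HashSet;

/// Return the expected metric names that have no `# TYPE <name> <kind>`
/// line in the scraped body. An empty result means the check passed.
pub fn missing_metrics(body: &str, expected: &[&str]) -> Vec<String> {
    // Collect family names from TYPE lines, so histogram families are
    // found under their base name rather than _bucket/_sum/_count series.
    let present: HashSet<&str> = body
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            match (parts.next(), parts.next(), parts.next()) {
                (Some("#"), Some("TYPE"), Some(name)) => Some(name),
                _ => None,
            }
        })
        .collect();

    expected
        .iter()
        .copied()
        .filter(|name| !present.contains(name))
        .map(str::to_string)
        .collect()
}
```

A pipeline step could fail the deployment when the returned list is non-empty and print the missing names, which tends to catch a metric that was renamed or never registered before it reaches production dashboards.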
Cost Considerations
Free Tier Limits
- Sentry: 5,000 events/month
- Prometheus: Self-hosted (storage costs only)
- Grafana Cloud: 10k metrics, 50GB logs (free tier)
Optimization Strategies
- Sample high-volume metrics (e.g., 10% of tick durations)
- Use log levels to reduce volume in production
- Aggregate metrics before export (e.g., per-minute summaries)
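As a rough sketch of the first and third strategies, the code below keeps every Nth tick duration and folds the kept samples into a compact summary before export. TickSampler and Summary are hypothetical names, and fixed-interval sampling (every 10th tick for roughly 10%) is an assumption; random sampling would be an alternative.

```rust
/// Keeps one observation out of every `every` calls.
pub struct TickSampler {
    every: u64,
    seen: u64,
}

impl TickSampler {
    /// `every` is clamped to at least 1 (record everything).
    pub fn new(every: u64) -> Self {
        Self { every: every.max(1), seen: 0 }
    }

    /// Returns true for the 1st, (every+1)th, (2*every+1)th, ... call.
    pub fn should_record(&mut self) -> bool {
        let keep = self.seen % self.every == 0;
        self.seen += 1;
        keep
    }
}

/// Running aggregate exported in place of individual samples.
/// Assumes non-negative values (such as durations), so max starts at 0.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct Summary {
    pub count: u64,
    pub sum: f64,
    pub max: f64,
}

impl Summary {
    pub fn observe(&mut self, value: f64) {
        self.count += 1;
        self.sum += value;
        if value > self.max {
            self.max = value;
        }
    }

    /// Mean of observed values, or None if nothing was observed.
    pub fn mean(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}
```

One caveat with sampling: counts derived from sampled data need to be scaled by the sampling interval, and rare spikes can be missed entirely, so it suits distributions like tick duration better than event counters.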
Related Documentation
- tracing_panic_fix.md: Fixing tracing initialization conflicts
- prometheus_config.md: Prometheus scrape configuration