SRE Architecture & Observability Strategy

This document outlines the Site Reliability Engineering (SRE) foundation for Legends of Hastinapur.

Observability Pillars

1. Metrics (Prometheus)

Model: Pull-based metrics collection
Implementation:
  • Client Library: metrics crate + metrics-exporter-prometheus
  • Endpoint: http://localhost:9000/metrics (client), http://localhost:9001/metrics (server)
  • Format: Prometheus exposition format
Key Metrics:
  • tick_duration_seconds: Game loop duration (histogram)
  • entities_count: Number of active entities (gauge)
  • fps: Frames per second - client only (gauge)
  • network_packet_size_bytes: Bandwidth usage (histogram)
  • active_connections: Number of connected players - server only (gauge)
  • combat_events_total: Combat interactions (counter)
  • quest_completions_total: Quest completions by quest_id (counter)
Prometheus Configuration: See prometheus_config.md
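
As a rough illustration of what Prometheus reads from the /metrics endpoints, the sketch below renders a gauge and a labelled counter in the text exposition format using only the standard library. Sample and render are hypothetical names for this example; in the real client and server the metrics-exporter-prometheus crate produces this output, and details such as label escaping and HELP lines are omitted here.

```rust
use std::fmt::Write;

/// One sample in the Prometheus text exposition format.
/// `Sample` and `render` are illustrative names, not part of any crate.
struct Sample<'a> {
    name: &'a str,
    kind: &'a str, // "counter", "gauge", ...
    labels: &'a [(&'a str, &'a str)],
    value: f64,
}

/// Renders each sample as a `# TYPE` line followed by `name{labels} value`.
/// Assumes one sample per metric name and label values that need no escaping.
fn render(samples: &[Sample<'_>]) -> String {
    let mut out = String::new();
    for s in samples {
        writeln!(out, "# TYPE {} {}", s.name, s.kind).unwrap();
        if s.labels.is_empty() {
            writeln!(out, "{} {}", s.name, s.value).unwrap();
        } else {
            let labels: Vec<String> = s
                .labels
                .iter()
                .map(|(k, v)| format!("{k}=\"{v}\""))
                .collect();
            writeln!(out, "{}{{{}}} {}", s.name, labels.join(","), s.value).unwrap();
        }
    }
    out
}
```

Rendering entities_count with value 42 yields a "# TYPE entities_count gauge" line followed by "entities_count 42".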

2. Structured Logging (Tracing)

Library: tracing + tracing-subscriber
Log Formats:
  • Development: Human-readable output with fmt::layer
    let fmt_layer = tracing_subscriber::fmt::layer()
       .with_target(false)
       .with_thread_ids(true)
       .with_level(true);
  • Production: JSON via fmt::layer().json() (requires the json feature of tracing-subscriber) for ingestion by Loki/Elasticsearch
    let fmt_layer = tracing_subscriber::fmt::layer()
       .json()
       .with_current_span(true)
       .with_span_list(true);
Log Levels:
  • TRACE: Detailed debugging (disabled in production)
  • DEBUG: Development diagnostics
  • INFO: Significant game events (player login, quest complete, item drop)
  • WARN: Recoverable issues (packet drop, interpolation correction, retry attempts)
  • ERROR: System failures (DB connection lost, asset load failure)
Environment Variable: RUST_LOG=info,legends_client=debug (INFO by default, DEBUG for the legends_client crate)
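
To make the RUST_LOG directive syntax concrete, here is a simplified, standard-library-only sketch of how a string such as info,legends_client=debug could be interpreted. Level, Filter, and parse_filter are hypothetical names; the real parsing is done by tracing-subscriber's EnvFilter, which supports more syntax (span and field filters, bare target names) than this sketch handles.

```rust
use std::collections::HashMap;

/// Verbosity levels, ordered from least to most verbose.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level { Error, Warn, Info, Debug, Trace }

fn parse_level(s: &str) -> Option<Level> {
    match s.trim().to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// A default level plus per-target overrides.
struct Filter {
    default: Level,
    targets: HashMap<String, Level>,
}

/// Parses comma-separated directives of the form `level` or `target=level`.
/// Unrecognised directives are ignored; the fallback default is an assumption.
fn parse_filter(spec: &str) -> Filter {
    let mut filter = Filter { default: Level::Error, targets: HashMap::new() };
    for directive in spec.split(',').map(str::trim).filter(|d| !d.is_empty()) {
        match directive.split_once('=') {
            Some((target, level)) => {
                if let Some(l) = parse_level(level) {
                    filter.targets.insert(target.trim().to_string(), l);
                }
            }
            None => {
                if let Some(l) = parse_level(directive) {
                    filter.default = l;
                }
            }
        }
    }
    filter
}

impl Filter {
    /// True if an event at `level` from module path `target` should be emitted.
    /// The longest matching target prefix wins; otherwise the default applies.
    fn enabled(&self, target: &str, level: Level) -> bool {
        let max = self
            .targets
            .iter()
            .filter(|(t, _)| {
                target
                    .strip_prefix(t.as_str())
                    .map_or(false, |rest| rest.is_empty() || rest.starts_with("::"))
            })
            .max_by_key(|(t, _)| t.len())
            .map(|(_, l)| *l)
            .unwrap_or(self.default);
        level <= max
    }
}
```

With the directive above, a DEBUG event from legends_client::net passes the filter, while a DEBUG event from any other crate is dropped and its INFO events are kept.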

3. Error Tracking (Sentry)

Integration: sentry + sentry-tracing
Scope:
  • Panic capturing (automatic)
  • Error-level log events (via sentry-tracing layer)
  • Custom error contexts (breadcrumbs)
Configuration:
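// An unset SENTRY_DSN yields an empty string, which leaves Sentry disabled.
let sentry_dsn = std::env::var("SENTRY_DSN").unwrap_or_default();
// Keep the guard alive for the life of the process; dropping it flushes pending events.
let _guard = sentry::init((sentry_dsn, sentry::ClientOptions {
    release: sentry::release_name!(),
    environment: Some("production".into()),
    ..Default::default()
}));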
Environment Variable: SENTRY_DSN=https://<key>@sentry.io/<project>

4. Distributed Tracing (Future - OpenTelemetry)

Goal: Trace requests across Client → Game Server → Database
Integration Plan:
  • Add tracing-opentelemetry layer
  • Export to Tempo or Jaeger backend
  • Correlate spans across service boundaries
Example Span:
#[tracing::instrument(skip(inventory))]
fn process_item_use(player_id: u64, item_id: ItemId, inventory: &mut Inventory) {
    // The span is created automatically and records player_id and item_id as fields;
    // inventory is skipped. Recorded arguments must implement Debug.
}
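
Correlating spans across service boundaries requires carrying a trace identifier with each request. One common wire format is the W3C Trace Context traceparent header, which OpenTelemetry propagators use by default. The standard-library sketch below shows only that format; TraceContext is a hypothetical type, the parser is more lenient than the specification (it accepts uppercase hex, for example), and in practice the OpenTelemetry SDK would handle propagation. For a custom game protocol the same fields could be embedded in a packet header, which is an assumption rather than part of the current plan.

```rust
/// Minimal trace context following the W3C `traceparent` layout:
/// `version-traceid-parentid-flags`, e.g. `00-<32 hex>-<16 hex>-<2 hex>`.
/// `TraceContext` is an illustrative type, not an OpenTelemetry API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TraceContext {
    trace_id: u128,
    span_id: u64,
    sampled: bool,
}

impl TraceContext {
    /// Serialises the context as a version-00 `traceparent` value.
    fn to_header(&self) -> String {
        format!(
            "00-{:032x}-{:016x}-{:02x}",
            self.trace_id, self.span_id, self.sampled as u8
        )
    }

    /// Parses a version-00 `traceparent` value; returns None if malformed.
    fn from_header(header: &str) -> Option<Self> {
        let mut parts = header.trim().split('-');
        let (version, trace, span, flags) =
            (parts.next()?, parts.next()?, parts.next()?, parts.next()?);
        if version != "00" || parts.next().is_some() {
            return None;
        }
        if trace.len() != 32 || span.len() != 16 || flags.len() != 2 {
            return None;
        }
        let trace_id = u128::from_str_radix(trace, 16).ok()?;
        let span_id = u64::from_str_radix(span, 16).ok()?;
        let flags = u8::from_str_radix(flags, 16).ok()?;
        // All-zero IDs are invalid per the specification.
        if trace_id == 0 || span_id == 0 {
            return None;
        }
        Some(Self { trace_id, span_id, sampled: flags & 0x01 == 0x01 })
    }
}
```

The client would attach to_header() to an outgoing request, and the server would call from_header() to continue the same trace under a new span ID.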

Infrastructure Components

Local Development Stack

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true

Production Stack (Future)

  • Metrics: Prometheus + Thanos (long-term storage)
  • Logs: Loki + Grafana
  • Traces: Tempo + Grafana
  • Dashboards: Grafana
  • Alerting: Alertmanager

Alerting Rules (Future)

Critical Alerts

  • High Crash Rate: > 1% of sessions crash within 5 minutes
  • Server Down: No successful metrics scrape for 1 minute
  • Database Unavailable: Connection pool exhausted

Warning Alerts

  • Low TPS: Server TPS < 20 for 5 minutes
  • High Latency: P95 tick duration > 50 ms (the per-tick budget at 20 TPS)
  • Memory Leak: Memory usage increasing > 10% per hour
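
The TPS and latency thresholds are two views of the same budget: at 20 ticks per second each tick has 1/20 s = 50 ms. The sketch below shows that relationship and a nearest-rank P95 check over raw tick durations. The function names are hypothetical, and in production the percentile would more likely be computed by Prometheus from histogram buckets (for example with histogram_quantile) than in application code.

```rust
/// Nearest-rank percentile of `samples` for a whole-number percentile `p`
/// (e.g. 95). Returns None for an empty slice.
fn percentile(samples: &[f64], p: usize) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // rank = ceil(p * n / 100), computed with integer arithmetic.
    let rank = (p * sorted.len() + 99) / 100;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Time available per tick, in seconds, at a target tick rate.
fn tick_budget_seconds(target_tps: f64) -> f64 {
    1.0 / target_tps
}

/// True if the P95 tick duration exceeds the budget for 20 TPS (50 ms).
fn latency_warning(tick_seconds: &[f64]) -> bool {
    let budget = tick_budget_seconds(20.0);
    percentile(tick_seconds, 95).map_or(false, |p95| p95 > budget)
}
```

A single slow tick in twenty does not trip the warning under this definition, while two slow ticks in twenty do, because the 95th-percentile rank then lands on a slow sample.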

Metrics Collection Pattern

use metrics::{counter, histogram, gauge};

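// Counter: Increment-only. Label values must be strings (String or &'static str),
// so a numeric ID needs converting first, e.g. quest_id.to_string().
counter!("quest_completions_total", "quest_id" => quest_id).increment(1);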

// Histogram: Distribution of values
histogram!("tick_duration_seconds").record(duration.as_secs_f64());

// Gauge: Point-in-time value
gauge!("entities_count").set(entity_count as f64);

Logging Best Practices

Use Structured Fields

// Good: Structured
info!(player_id = %player.id, quest_id = %quest.id, "Quest completed");

// Bad: String interpolation
info!("Quest {} completed by player {}", quest.id, player.id);

Use Spans for Context

let span = info_span!("combat_tick", attacker = %attacker_id, target = %target_id);
let _enter = span.enter();

// All logs within this scope inherit the span context
debug!("Calculating damage");
info!(damage = %final_damage, "Damage dealt");

Avoid Logging in Hot Paths

// Bad: Logs every entity on every frame
for entity in entities.iter() {
    trace!("Processing entity {:?}", entity); // Too verbose
}

// Good: Log summaries
debug!(entity_count = entities.len(), "Processed entities");

Integration with Bevy

See bevy_telemetry_integration.md for details on integrating this observability stack with Bevy's plugin system.

CI/CD Integration

Automated Checks

  • cargo check: Compilation
  • cargo clippy: Linting
  • cargo test: Unit tests
  • cargo audit: Security vulnerabilities
  • cargo deny: License compliance

Deployment Pipeline

  1. Build release binary
  2. Run integration tests
  3. Deploy to staging
  4. Smoke tests (verify metrics endpoint, Sentry connection)
  5. Deploy to production
  6. Monitor error rates and latency
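
The metrics-endpoint smoke test in step 4 could be as small as the following sketch, which uses only the standard library. It issues a plain HTTP/1.0 GET to /metrics and checks for a 200 status and an expected metric name. fetch_metrics and exposes_metric are hypothetical helpers, and the address passed in (for example 127.0.0.1:9001 for the server) is an assumption about the staging setup.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;
use std::time::Duration;

/// Fetches the raw HTTP response from `/metrics` at `addr` (host:port).
/// HTTP/1.0 is used so the server closes the connection after responding.
fn fetch_metrics(addr: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    stream.set_read_timeout(Some(Duration::from_secs(5)))?;
    write!(stream, "GET /metrics HTTP/1.0\r\nHost: {addr}\r\n\r\n")?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}

/// True if the response has status 200 and contains a sample line
/// (not a `#` comment) whose name starts with `metric`.
fn exposes_metric(response: &str, metric: &str) -> bool {
    let status_ok = response
        .lines()
        .next()
        .map_or(false, |line| line.split_whitespace().nth(1) == Some("200"));
    let has_metric = response
        .lines()
        .any(|line| !line.starts_with('#') && line.starts_with(metric));
    status_ok && has_metric
}
```

A pipeline step might call fetch_metrics on the staging server and fail the deployment if exposes_metric returns false for active_connections.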

Cost Considerations

Free Tier Limits

  • Sentry: 5,000 events/month
  • Prometheus: Self-hosted (storage costs only)
  • Grafana Cloud: 10k metrics, 50GB logs (free tier)

Optimization Strategies

  • Sample high-volume metrics (e.g., 10% of tick durations)
  • Use log levels to reduce volume in production
  • Aggregate metrics before export (e.g., per-minute summaries)
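
As a minimal sketch of the first and third strategies, the code below keeps every N-th observation (N = 10 gives roughly 10% of tick durations) and folds observations into a per-window summary before export. EveryNth and WindowSummary are hypothetical types; resetting the summary once per minute and handing the result to the metrics crate is left to the caller. Note that fixed-interval sampling can miss spikes that recur at the same period, so random sampling may be preferable in some workloads.

```rust
/// Keeps one observation out of every `n`.
struct EveryNth {
    n: u64,
    seen: u64,
}

impl EveryNth {
    fn new(n: u64) -> Self {
        // Treat n = 0 as "keep everything" to avoid a division by zero.
        Self { n: n.max(1), seen: 0 }
    }

    /// Returns true for the 1st, (n+1)-th, (2n+1)-th, ... call.
    fn should_record(&mut self) -> bool {
        let keep = self.seen % self.n == 0;
        self.seen += 1;
        keep
    }
}

/// Running summary of non-negative values observed in one window.
#[derive(Default)]
struct WindowSummary {
    count: u64,
    sum: f64,
    max: f64,
}

impl WindowSummary {
    fn observe(&mut self, value: f64) {
        self.count += 1;
        self.sum += value;
        self.max = self.max.max(value);
    }

    /// Mean of the window, or None if nothing was observed.
    fn mean(&self) -> Option<f64> {
        (self.count > 0).then(|| self.sum / self.count as f64)
    }
}
```

In a game loop, should_record() would gate the call that records tick_duration_seconds, while the summary's count, mean, and max could be exported once per window instead of once per tick.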