SRE Architecture & Observability Strategy

This document outlines the Site Reliability Engineering (SRE) foundation for Legends of Hastinapur.

Observability Pillars

1. Metrics (Prometheus)

Model: Pull-based metrics collection

Implementation:

Client Library: metrics crate + metrics-exporter-prometheus
Endpoint: http://localhost:9000/metrics (client), http://localhost:9001/metrics (server)
Format: Prometheus exposition format
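
To make the exposition format concrete, here is a minimal sketch of how a single counter or gauge sample is rendered as text. In practice metrics-exporter-prometheus produces this output; the `render_sample` function and `MetricKind` enum below are hypothetical names used only for illustration.

```rust
use std::fmt::Write;

/// Metric kinds covered by this sketch (histograms are omitted for brevity).
#[derive(Clone, Copy)]
pub enum MetricKind {
    Counter,
    Gauge,
}

/// Escape a label value as the text format requires: backslash, quote, newline.
fn escape_label_value(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}

/// Render one sample: a `# TYPE` line followed by `name{label="value"} value`.
/// Hypothetical helper; not part of any crate used in this project.
pub fn render_sample(
    name: &str,
    kind: MetricKind,
    labels: &[(&str, &str)],
    value: f64,
) -> String {
    let kind_str = match kind {
        MetricKind::Counter => "counter",
        MetricKind::Gauge => "gauge",
    };
    let mut out = String::new();
    writeln!(out, "# TYPE {name} {kind_str}").unwrap();
    out.push_str(name);
    if !labels.is_empty() {
        let rendered: Vec<String> = labels
            .iter()
            .map(|(key, val)| format!("{key}=\"{}\"", escape_label_value(val)))
            .collect();
        write!(out, "{{{}}}", rendered.join(",")).unwrap();
    }
    writeln!(out, " {value}").unwrap();
    out
}
```

Calling `render_sample("quest_completions_total", MetricKind::Counter, &[("quest_id", "q1")], 3.0)` yields a `# TYPE` line followed by `quest_completions_total{quest_id="q1"} 3`, which is the shape Prometheus expects when it scrapes the endpoint.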

Key Metrics:

tick_duration_seconds: Game loop duration (histogram)
entities_count: Number of active entities (gauge)
fps: Frames per second - client only (gauge)
network_packet_size_bytes: Bandwidth usage (histogram)
active_connections: Number of connected players - server only (gauge)
combat_events_total: Combat interactions (counter)
quest_completions_total: Quest completions by quest_id (counter)

Prometheus Configuration: See prometheus_config.md

2. Structured Logging (Tracing)

Library: tracing + tracing-subscriber

Log Formats:

Development: Human-readable output with fmt::layer

let fmt_layer = tracing_subscriber::fmt::layer()
   .with_target(false)
   .with_thread_ids(true)
   .with_level(true);

Production: JSON output via fmt::layer().json() for ingestion by Loki/Elasticsearch

let fmt_layer = tracing_subscriber::fmt::layer()
   .json()
   .with_current_span(true)
   .with_span_list(true);

Log Levels:

TRACE: Detailed debugging (disabled in production)
DEBUG: Development diagnostics
INFO: Significant game events (player login, quest complete, item drop)
WARN: Recoverable issues (packet drop, interpolation correction, retry attempts)
ERROR: System failures (DB connection lost, asset load failure)

Environment Variable: RUST_LOG=info,legends_client=debug
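
The directive string above reads as "info by default, debug for the legends_client target". The sketch below is a simplified model of that interpretation, not how tracing-subscriber's EnvFilter is implemented; the `Filter` and `Level` types are hypothetical, and span or field directives are deliberately ignored.

```rust
/// Verbosity levels, ordered from least to most verbose.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

fn parse_level(text: &str) -> Option<Level> {
    match text.trim().to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// Simplified stand-in for an env filter (assumption: only `level` and
/// `target=level` directives are supported; anything else is skipped).
pub struct Filter {
    fallback: Option<Level>,
    targets: Vec<(String, Level)>,
}

impl Filter {
    pub fn parse(spec: &str) -> Filter {
        let mut fallback = None;
        let mut targets = Vec::new();
        for directive in spec.split(',').map(str::trim).filter(|d| !d.is_empty()) {
            match directive.split_once('=') {
                Some((target, level)) => {
                    if let Some(level) = parse_level(level) {
                        targets.push((target.trim().to_string(), level));
                    }
                }
                None => {
                    if let Some(level) = parse_level(directive) {
                        fallback = Some(level);
                    }
                }
            }
        }
        Filter { fallback, targets }
    }

    /// The most specific (longest) matching target wins; otherwise the
    /// bare default level applies. With no match at all, nothing is logged.
    pub fn enabled(&self, target: &str, level: Level) -> bool {
        let max_level = self
            .targets
            .iter()
            .filter(|(t, _)| {
                target == t.as_str() || target.starts_with(format!("{t}::").as_str())
            })
            .max_by_key(|(t, _)| t.len())
            .map(|(_, l)| *l)
            .or(self.fallback);
        match max_level {
            Some(max_level) => level <= max_level,
            None => false,
        }
    }
}
```

With `Filter::parse("info,legends_client=debug")`, a DEBUG event from `legends_client::net` passes, while a DEBUG event from any other crate is dropped.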

3. Error Tracking (Sentry)

Integration: sentry + sentry-tracing

Scope:

Panic capturing (automatic)
Error-level log events (via sentry-tracing layer)
Custom error contexts (breadcrumbs)

Configuration:

let sentry_dsn = std::env::var("SENTRY_DSN").unwrap_or_default();
let _guard = sentry::init((sentry_dsn, sentry::ClientOptions {
    release: sentry::release_name!(),
    environment: Some("production".into()),
    ..Default::default()
}));

Environment Variable: SENTRY_DSN=https://<key>@sentry.io/<project>

4. Distributed Tracing (Future - OpenTelemetry)

Goal: Trace requests across Client → Game Server → Database

Integration Plan:

Add tracing-opentelemetry layer
Export to Tempo or Jaeger backend
Correlate spans across service boundaries

Example Span:

#[tracing::instrument(skip(inventory))]
fn process_item_use(player_id: u64, item_id: ItemId, inventory: &mut Inventory) {
    // #[instrument] opens a span named process_item_use with player_id and
    // item_id recorded as fields; inventory is skipped and not recorded.
}
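
Correlating spans across the client, game server, and database requires passing a trace identifier with each request. One common carrier is the W3C Trace Context `traceparent` header, which OpenTelemetry propagators can read and write. The sketch below shows only the header layout; the `TraceContext` type is hypothetical, and a real integration would rely on the OpenTelemetry propagator rather than hand-written parsing.

```rust
/// Minimal model of a W3C `traceparent` header (version 00):
/// `00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TraceContext {
    pub trace_id: u128,
    pub parent_span_id: u64,
    pub sampled: bool,
}

fn is_lower_hex(text: &str) -> bool {
    text.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

impl TraceContext {
    /// Serialize for an outgoing request (client to server, server to DB proxy).
    pub fn to_header(&self) -> String {
        format!(
            "00-{:032x}-{:016x}-{:02x}",
            self.trace_id, self.parent_span_id, self.sampled as u8
        )
    }

    /// Parse an incoming header; returns None for anything malformed.
    pub fn from_header(header: &str) -> Option<TraceContext> {
        let mut parts = header.trim().split('-');
        let version = parts.next()?;
        let trace_id = parts.next()?;
        let span_id = parts.next()?;
        let flags = parts.next()?;
        if version != "00" || parts.next().is_some() {
            return None;
        }
        if trace_id.len() != 32 || span_id.len() != 16 || flags.len() != 2 {
            return None;
        }
        if !(is_lower_hex(trace_id) && is_lower_hex(span_id) && is_lower_hex(flags)) {
            return None;
        }
        let trace_id = u128::from_str_radix(trace_id, 16).ok()?;
        let parent_span_id = u64::from_str_radix(span_id, 16).ok()?;
        let flags = u8::from_str_radix(flags, 16).ok()?;
        // All-zero ids are invalid according to the specification.
        if trace_id == 0 || parent_span_id == 0 {
            return None;
        }
        Some(TraceContext {
            trace_id,
            parent_span_id,
            sampled: (flags & 0x01) == 0x01,
        })
    }
}
```

The receiving side starts its own span as a child of `parent_span_id` under the same `trace_id`, which is what lets a backend such as Tempo or Jaeger stitch the hops into one trace.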

Infrastructure Components

Local Development Stack

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true

Production Stack (Future)

Metrics: Prometheus + Thanos (long-term storage)
Logs: Loki + Grafana
Traces: Tempo + Grafana
Dashboards: Grafana
Alerting: AlertManager

Alerting Rules (Future)

Critical Alerts

High Crash Rate: > 1% of sessions crash within 5 minutes
Server Down: No metrics received for 1 minute
Database Unavailable: Connection pool exhausted

Warning Alerts

Low TPS: Server TPS < 20 for 5 minutes
High Latency: P95 tick duration > 50ms
Memory Leak: Memory usage increasing > 10% per hour
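
As an illustration of the High Latency rule, the sketch below computes a nearest-rank P95 over raw tick durations and compares it with the 50 ms threshold. This is only a model of the rule's logic: in the planned stack the check would more likely be expressed as an alerting rule over the tick_duration_seconds histogram. The function names are hypothetical.

```rust
use std::time::Duration;

/// Nearest-rank percentile over raw samples.
/// `percent` must be in 1..=100; returns None for empty input.
pub fn percentile(samples: &[Duration], percent: usize) -> Option<Duration> {
    if samples.is_empty() || percent == 0 || percent > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // rank = ceil(percent / 100 * n), computed with integers to avoid
    // floating-point rounding surprises.
    let rank = (sorted.len() * percent + 99) / 100;
    Some(sorted[rank - 1])
}

/// True when the P95 tick duration exceeds the 50 ms warning threshold.
pub fn high_latency(samples: &[Duration]) -> bool {
    const THRESHOLD: Duration = Duration::from_millis(50);
    matches!(percentile(samples, 95), Some(p95) if p95 > THRESHOLD)
}
```

A single slow tick among a hundred fast ones does not trip the rule, because P95 ignores the slowest 5% of samples; sustained slowness does.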

Metrics Collection Pattern

use metrics::{counter, histogram, gauge};

// Counter: Increment-only
counter!("quest_completions_total", "quest_id" => quest_id).increment(1);

// Histogram: Distribution of values
histogram!("tick_duration_seconds").record(duration.as_secs_f64());

// Gauge: Point-in-time value
gauge!("entities_count").set(entity_count as f64);

Logging Best Practices

Use Structured Fields

// Good: Structured
info!(player_id = %player.id, quest_id = %quest.id, "Quest completed");

// Bad: String interpolation
info!("Quest {} completed by player {}", quest.id, player.id);

Use Spans for Context

let span = info_span!("combat_tick", attacker = %attacker_id, target = %target_id);
let _enter = span.enter();

// All logs within this scope inherit the span context
debug!("Calculating damage");
info!(damage = %final_damage, "Damage dealt");

Avoid Logging in Hot Paths

// Bad: Logs every entity on every frame
for entity in entities.iter() {
    trace!("Processing entity {:?}", entity); // Too verbose
}

// Good: Log summaries
debug!(entity_count = entities.len(), "Processed entities");

Integration with Bevy

See bevy_telemetry_integration.md for details on integrating this observability stack with Bevy's plugin system.

CI/CD Integration

Automated Checks

cargo check: Compilation
cargo clippy: Linting
cargo test: Unit tests
cargo audit: Security vulnerabilities
cargo deny: License compliance

Deployment Pipeline

Build release binary
Run integration tests
Deploy to staging
Smoke tests (verify metrics endpoint, Sentry connection)
Deploy to production
Monitor error rates and latency
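
The metrics part of the smoke-test step could be sketched as follows: given the text body scraped from the /metrics endpoint, report which expected metric names are absent. Fetching the body is left out, and `missing_metrics` is a hypothetical helper. Histogram series are matched through their `_bucket`, `_sum`, and `_count` suffixes.

```rust
use std::collections::HashSet;

/// Return the expected metric names that do not appear in a scraped
/// Prometheus text body. Comment lines (`# HELP`, `# TYPE`) are ignored.
pub fn missing_metrics<'a>(body: &str, expected: &[&'a str]) -> Vec<&'a str> {
    let mut present: HashSet<&str> = HashSet::new();
    for line in body.lines().map(str::trim) {
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        // The metric name ends at the first `{` (labels) or whitespace (value).
        let name = line
            .split(|c: char| c == '{' || c.is_whitespace())
            .next()
            .unwrap_or("");
        present.insert(name);
        // Histogram and summary series carry a suffix on the base name.
        for suffix in ["_bucket", "_sum", "_count"] {
            if let Some(base) = name.strip_suffix(suffix) {
                present.insert(base);
            }
        }
    }
    expected
        .iter()
        .copied()
        .filter(|name| !present.contains(*name))
        .collect()
}
```

A smoke test would fail the deployment when the returned list is non-empty, for example when `entities_count` is missing from the server's output.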

Cost Considerations

Free Tier Limits

Sentry: 5,000 events/month
Prometheus: Self-hosted (storage costs only)
Grafana Cloud: 10k metrics, 50GB logs (free tier)

Optimization Strategies

Sample high-volume metrics (e.g., 10% of tick durations)
Use log levels to reduce volume in production
Aggregate metrics before export (e.g., per-minute summaries)
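
The sampling strategy can be sketched with a fixed-stride sampler that lets through one observation in every N, so N = 10 records roughly 10% of tick durations. The `EveryNth` type is hypothetical. Fixed-stride sampling is an assumption made here for determinism; random sampling is an alternative if tick behaviour is periodic.

```rust
/// Deterministic sampler: admits the 1st, (n+1)th, (2n+1)th, ... observation.
pub struct EveryNth {
    n: u64,
    seen: u64,
}

impl EveryNth {
    /// `n` is the stride; `n = 10` keeps one observation in ten.
    pub fn new(n: u64) -> Self {
        assert!(n > 0, "stride must be at least 1");
        Self { n, seen: 0 }
    }

    /// Returns true when the current observation should be recorded.
    pub fn should_record(&mut self) -> bool {
        let hit = self.seen % self.n == 0;
        self.seen = self.seen.wrapping_add(1);
        hit
    }
}
```

In the game loop, the `histogram!("tick_duration_seconds").record(...)` call would be wrapped in `if sampler.should_record() { ... }`. Note that the histogram's sample count then reflects only the sampled ticks, so rate calculations based on it need to be scaled by the stride.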

Related Documents

tracing_panic_fix.md: Fixing tracing initialization conflicts
prometheus_config.md: Prometheus scrape configuration