Backend Scalability & Archival System Walkthrough

This document describes the implementation of the scalable backend infrastructure: data tiering (to support 100M+ MAU), the SRE logging stack, and the hardware strategy.

1. Data Tiering & Archival System

To keep the primary "Hot" data set small at 100M+ MAU, player data is split across three tiers according to how long a player has been inactive.

Tier Definitions

Tier   Storage Location           Latency   Use Case
Hot    PostgreSQL (Cloud/NVMe)    <10 ms    Active players (active within the last 3 days).
Warm   PostgreSQL (same DB)       <10 ms    Inactive players (3 to 30 days). Excluded from hot queries.
Cold   S3, schema-less JSON       ~500 ms   Archived players (inactive > 30 days). Detail rows are deleted from the DB; the players row remains with tier='archived'.
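
The thresholds in the table can be read as a pure function of inactivity time. The following is a minimal sketch under that reading, not the code in api-ops: the names Tier, tier_for_inactivity, WARM_AFTER and COLD_AFTER are hypothetical, and the boundary handling (exactly 3 or 30 days stays in the lower tier) is an assumption based on the "> 3 days" and ">30d" wording.

```rust
use std::time::Duration;

const SECS_PER_DAY: u64 = 24 * 60 * 60;

// Thresholds taken from the tier table: inactive > 3 days is warm, > 30 days is cold.
const WARM_AFTER: Duration = Duration::from_secs(3 * SECS_PER_DAY);
const COLD_AFTER: Duration = Duration::from_secs(30 * SECS_PER_DAY);

/// Hypothetical tier enum; the real schema stores the tier as a string column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Hot,
    Warm,
    Cold,
}

/// Maps "time since last activity" to the tier the player should end up in.
/// Strict comparisons keep a player in the lower tier at exactly 3 or 30 days.
fn tier_for_inactivity(inactive_for: Duration) -> Tier {
    if inactive_for > COLD_AFTER {
        Tier::Cold
    } else if inactive_for > WARM_AFTER {
        Tier::Warm
    } else {
        Tier::Hot
    }
}
```

Keeping the rule in one function like this would let the hourly jobs and any ad hoc tooling agree on where the boundaries fall.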

Implementation Details

  • Worker Binary: api-ops/src/bin/archival_worker.rs
  • Logic Module: api-ops/src/archival_worker.rs
  • Snapshot Logic: api-ops/src/player_snapshot.rs
  • Storage Adapter: api-ops/src/storage_adapter.rs (S3 implementation)

Archive Lifecycle

  1. Cooldown: process_hot_to_warm runs hourly and moves players inactive for more than 3 days to the 'warm' tier.
  2. Deep Freeze: process_warm_to_cold runs hourly and archives players inactive for more than 30 days:
    • Fetches full player state (Stats, Inventory, Bank, Friends).
    • Serializes to JSON.
    • Uploads to S3 (bucket/players/{uuid}.json).
    • Deletes rows from bank_storage, player_equipment, friends.
    • Updates players table row to tier='archived', sets archive_url.
  3. Thaw: process_all_restores runs hourly (or on demand).
    • Finds players with tier='restoring'.
    • Downloads JSON from S3.
    • Restores all data to PostgreSQL tables.
    • Sets tier='hot'.
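
The Deep Freeze and Thaw steps above can be sketched as two state transitions. This is an illustrative sketch only: ObjectStore, MemoryStore, Player, freeze and thaw are hypothetical names, the in-memory store stands in for the S3 adapter, and detail_json stands in for the rows in bank_storage, player_equipment and friends. The one design point it tries to show is ordering: upload first, delete second, so a failed upload leaves the database rows untouched.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Hot,
    Warm,
    Archived,
    Restoring,
}

/// Hypothetical stand-in for the storage adapter; the real trait may differ.
trait ObjectStore {
    fn put(&mut self, key: &str, body: String) -> Result<(), String>;
    fn get(&self, key: &str) -> Result<String, String>;
}

/// In-memory store used here instead of S3 so the sketch is self-contained.
#[derive(Default)]
struct MemoryStore {
    objects: HashMap<String, String>,
    fail_puts: bool, // lets a caller simulate an upload failure
}

impl ObjectStore for MemoryStore {
    fn put(&mut self, key: &str, body: String) -> Result<(), String> {
        if self.fail_puts {
            return Err("simulated upload failure".to_string());
        }
        self.objects.insert(key.to_string(), body);
        Ok(())
    }

    fn get(&self, key: &str) -> Result<String, String> {
        self.objects
            .get(key)
            .cloned()
            .ok_or_else(|| format!("missing object: {key}"))
    }
}

struct Player {
    uuid: String,
    tier: Tier,
    archive_url: Option<String>,
    /// Stand-in for the detail rows, already serialized to JSON.
    detail_json: Option<String>,
}

/// Object key layout from the lifecycle description: players/{uuid}.json.
fn archive_key(uuid: &str) -> String {
    format!("players/{uuid}.json")
}

/// Warm -> archived. Nothing is deleted unless the upload succeeded.
fn freeze(p: &mut Player, store: &mut dyn ObjectStore) -> Result<(), String> {
    if p.tier != Tier::Warm {
        return Err("only warm players can be archived".to_string());
    }
    let body = p
        .detail_json
        .clone()
        .ok_or_else(|| "no detail rows to archive".to_string())?;
    let key = archive_key(&p.uuid);
    store.put(&key, body)?; // upload first; on error we return before deleting
    p.detail_json = None; // "delete rows" only after a successful upload
    p.archive_url = Some(key);
    p.tier = Tier::Archived;
    Ok(())
}

/// Restoring -> hot. Whether archive_url is cleared afterwards is not
/// specified in the walkthrough, so this sketch leaves it untouched.
fn thaw(p: &mut Player, store: &dyn ObjectStore) -> Result<(), String> {
    if p.tier != Tier::Restoring {
        return Err("player is not marked for restore".to_string());
    }
    let key = p
        .archive_url
        .clone()
        .ok_or_else(|| "archived player has no archive_url".to_string())?;
    p.detail_json = Some(store.get(&key)?);
    p.tier = Tier::Hot;
    Ok(())
}
```

In a real worker the delete and the players row update would presumably share one database transaction, which this in-memory sketch does not model.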

Verification

Ran cargo check -p api-ops to confirm that the new worker and S3 integration compile:
Finished dev profile [unoptimized + debuginfo] target(s) in 0.58s

2. Infrastructure & Logging (PLG Stack)

To handle the log volume at 100k CCU (est. 1TB+/month), we moved from ELK to the lighter-weight PLG stack (Promtail, Loki, Grafana):
  • Promtail: Scrapes Docker container logs natively. Defined in loh-devops/infrastructure/game/config/promtail.yml.
  • Loki: Stores logs with high compression (S3-ready) and indexes only labels rather than full log text. Defined in loh-devops/infrastructure/game/config/loki.yml.
  • Grafana: Visualizes logs. Datasource auto-provisioned in loh-devops/infrastructure/backend/grafana/provisioning/datasources/loki_datasource.yml.
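
To put the 1TB+/month estimate in terms of sustained ingest, a rough back-of-the-envelope conversion helps. The sketch below assumes a 30-day month, decimal terabytes (10^12 bytes), and a constant 100k CCU; the function names are hypothetical. Real traffic is bursty, so peak rates will be higher than this average.

```rust
const SECONDS_PER_DAY: f64 = 86_400.0;
const BYTES_PER_TB: f64 = 1e12; // decimal terabyte (assumption)

/// Average ingest rate in bytes per second for a given monthly log volume.
fn avg_ingest_bytes_per_sec(tb_per_month: f64, days_in_month: f64) -> f64 {
    tb_per_month * BYTES_PER_TB / (days_in_month * SECONDS_PER_DAY)
}

/// Average log bytes per second attributable to one concurrent player.
fn per_player_bytes_per_sec(total_bytes_per_sec: f64, ccu: f64) -> f64 {
    total_bytes_per_sec / ccu
}
```

Under these assumptions, 1 TB over 30 days averages roughly 386 KB/s in total, or just under 4 bytes per second per concurrent player.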

3. Hardware Strategy

For the 100k CCU / 100M MAU target:
  • Database: Managed Postgres (Cloud), recommended for reliability.
  • Compute: Refurbished Mini-PC cluster (e.g., Lenovo ThinkCentre Tiny) for cost-effective self-hosting of game servers and archival workers.