Data Tiering & Archival Strategy (100k CCU / 100M MAU)

1. High-Level Strategy

To support 100M MAU with 100k Concurrent Users (CCU) without a massive multi-petabyte database, we will implement a Time-Based Tiering Architecture.
Core Concept: "Active" players live in the fast, expensive DB. "Inactive" players are moved to cheaper, slower storage by background workers.

2. Storage Tiers

| Tier   | Name           | Target Users                           | Storage Technology               | Latency | Cost/GB |
|--------|----------------|----------------------------------------|----------------------------------|---------|---------|
| Tier 1 | Hot (Active)   | Currently online + played < 3 days ago | Postgres (NVMe)                  | < 1 ms  | $$$     |
| Tier 2 | Warm (Recent)  | Played 3-30 days ago                   | Postgres (HDD/SATA SSD) or Redis | < 10 ms | $$      |
| Tier 3 | Cold (Archive) | Played > 30 days ago                   | S3 / Blob Storage                | ~200 ms | $       |

Capacity Breakdown (100M Users)

  • Hot Tier (100k CCU + Buffer): ~1M Users (1% of total).
    • Size: ~100 GB (Fits in RAM).
    • Load: 100% of Write Traffic.
  • Warm Tier (Churned/Casual): ~9M Users.
    • Size: ~1 TB.
    • Load: Occasional reads (Login checks), low writes (Worker migrations).
  • Cold Tier (Long Tail): ~90M Users.
    • Size: ~10 TB.
    • Load: Rare reads (Reactivation), Write-once (Archival).
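
The sizing above works out if each player carries roughly 100 KB of persisted state (an assumption implied by "1M users ≈ 100 GB", not stated explicitly). A quick sanity check:

```python
# Assumption: ~100 KB of persisted state per player (implied by
# the Hot Tier figure of 1M users ~= 100 GB).
KB = 1024
PER_PLAYER_BYTES = 100 * KB

def tier_size_gb(users: int) -> float:
    """Approximate storage footprint for a tier, in GB."""
    return users * PER_PLAYER_BYTES / (1024 ** 3)

hot_gb = tier_size_gb(1_000_000)     # ~95 GB   -> "~100 GB"
warm_gb = tier_size_gb(9_000_000)    # ~858 GB  -> "~1 TB"
cold_gb = tier_size_gb(90_000_000)   # ~8.6 TB  -> "~10 TB"
```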

3. Worker Architecture

We will run a set of dedicated Archival Workers separate from the Game Servers.

A. The "Cool Down" Worker (Hot -> Warm)

  • Trigger: Scheduled Job (e.g., every hour).
  • Query: SELECT id FROM players WHERE last_login < NOW() - INTERVAL '3 days' AND tier = 'hot'
  • Action:
    1. Move rows to players_warm table (or separate DB partition).
    2. Delete from players_hot.
    3. Update index / lookup table.
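
The copy-then-delete ordering above matters: a crash between steps 1 and 2 leaves a duplicate row (harmless, the job is re-runnable), never a lost player. A minimal sketch of one batch, with plain dicts standing in for the players_hot / players_warm tables (cool_down_batch and all names here are illustrative, not part of the design):

```python
from datetime import datetime, timedelta

# In-memory stand-ins for the players_hot / players_warm tables.
players_hot: dict[str, dict] = {}
players_warm: dict[str, dict] = {}

COOL_DOWN_AFTER = timedelta(days=3)

def cool_down_batch(now: datetime, batch_size: int = 1000) -> int:
    """Move players idle for > 3 days from hot to warm storage.

    Copy first, delete second: the worst failure mode is a
    duplicate row, never a missing one.
    """
    cutoff = now - COOL_DOWN_AFTER
    stale = [pid for pid, row in players_hot.items()
             if row["last_login"] < cutoff][:batch_size]
    for pid in stale:
        players_warm[pid] = players_hot[pid]   # 1. copy to warm
        del players_hot[pid]                   # 2. delete from hot
    return len(stale)
```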

B. The "Deep Freeze" Worker (Warm -> Cold)

  • Trigger: Daily Job.
  • Query: SELECT id FROM players WHERE last_login < NOW() - INTERVAL '30 days' AND tier = 'warm'
  • Action:
    1. Serialize full player state to JSON/Protobuf.
    2. Upload to S3: s3://game-archives/players/{uuid}.json.
    3. Update Lookup Table: SET status='archived', s3_path='...'.
    4. Delete from players_warm.
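
The four steps above can be sketched the same way. Dicts stand in for the warm table, the S3 bucket, and the lookup table; deep_freeze_batch and the store names are illustrative:

```python
import json
from datetime import datetime, timedelta

players_warm: dict[str, dict] = {}
s3_bucket: dict[str, bytes] = {}   # stand-in for s3://game-archives
lookup: dict[str, dict] = {}       # archive lookup table

DEEP_FREEZE_AFTER = timedelta(days=30)

def deep_freeze_batch(now: datetime) -> int:
    """Archive players idle for > 30 days from warm to cold storage."""
    cutoff = now - DEEP_FREEZE_AFTER
    frozen = 0
    for pid in [p for p, r in players_warm.items()
                if r["last_login"] < cutoff]:
        row = players_warm[pid]
        key = f"players/{pid}.json"
        s3_bucket[key] = json.dumps(row, default=str).encode()  # 1-2. serialize + upload
        lookup[pid] = {"status": "archived", "s3_path": key}    # 3. update lookup
        del players_warm[pid]                                   # 4. delete from warm
        frozen += 1
    return frozen
```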

C. The "Thaw" Process (On Login)

  • Trigger: Player logs in.
  • Logic:
    1. Check Hot DB -> Found? Login.
    2. Check Warm DB -> Found? Move to Hot (Async or Sync) -> Login.
    3. Check Cold Lookup -> Found?
      • Blocking: User sees "Restoring Account..." spinner (2-3s).
      • Action: Download S3 JSON -> Insert to Hot DB -> Login.
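
The three-step lookup can be sketched as a single function. Plain dicts again stand in for each tier, and login and the store names are illustrative:

```python
import json

# In-memory stand-ins for the three tiers plus the archive lookup.
players_hot: dict[str, dict] = {}
players_warm: dict[str, dict] = {}
s3_bucket: dict[str, bytes] = {}
lookup: dict[str, dict] = {}

def login(pid: str) -> str:
    """Resolve a player's tier at login, promoting the row to hot."""
    if pid in players_hot:                       # 1. hot: log in
        return "hot"
    if pid in players_warm:                      # 2. warm: promote to hot
        players_hot[pid] = players_warm.pop(pid)
        return "warm"
    entry = lookup.get(pid)                      # 3. cold: blocking thaw
    if entry and entry.get("status") == "archived":
        row = json.loads(s3_bucket[entry["s3_path"]])
        players_hot[pid] = row
        entry["status"] = "active"
        return "cold"
    return "new"                                 # no record anywhere
```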

4. 100k CCU Hardware Implications

With this strategy, the Hot Database only needs to hold the ~1M hot-tier users (100k concurrent plus a buffer of recently active players), not all 100M.
  • Hot DB Specs (100k CCU):
    • Writes: ~25k - 50k TPS (Heavy Load).
    • Size: Small (~100GB).
    • Hardware: 1x Primary + 2x Replicas.
      • CPU: High frequency (Ryzen/Core i9 equivalent or Xeon Gold).
      • RAM: 128GB (Fit entire Hot DB in RAM).
      • Disk: 2x 1TB NVMe RAID 1.
  • Workers:
    • Can run on cheap commodity hardware / spot instances.
    • Queue-based (RabbitMQ/Redis) to throttle load on the DB.
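
One way to realize the throttling, assuming the worker consumes player IDs from the queue: a batching helper (throttled_batches is an illustrative name) that sleeps between fixed-size batches so migrations never monopolize the hot DB's write capacity:

```python
import itertools
import time
from typing import Iterable, Iterator

def throttled_batches(ids: Iterable, batch_size: int = 500,
                      delay_s: float = 0.0) -> Iterator[list]:
    """Yield IDs in fixed-size batches, sleeping between batches
    to cap the worker's write rate against the hot DB."""
    it = iter(ids)
    while batch := list(itertools.islice(it, batch_size)):
        yield batch
        time.sleep(delay_s)
```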

5. Schema Strategy

Option A: Partitioned Tables (Postgres Native)

Use Postgres declarative partitioning:
CREATE TABLE players (
    id UUID,
    last_login TIMESTAMP,
    ...
) PARTITION BY RANGE (last_login);
-- Caveat: with last_login as the partition key, every login UPDATE
-- moves the row to a new partition (internally a DELETE + INSERT),
-- which is expensive at 100k CCU.

Option B: Explicit Tables

Explicit tables give finer control over storage engines and tablespaces:
  • players_active (On NVMe Tablespace)
  • players_archive_lookup (Pointer to S3)
  • player_data_blob (Optional: Intermediate blob store)

Next Steps

  1. Define the PlayerArchive capabilities (S3-compatible store).
  2. Prototype the "Thaw" logic in the Login Server.
  3. Implement the ArchivalWorker service.