Data Tiering & Archival Strategy (100k CCU / 100M MAU)

1. High-Level Strategy

To support 100M MAU with 100k Concurrent Users (CCU) without a massive multi-petabyte database, we will implement a Time-Based Tiering Architecture.
Core Concept: "Active" players live in the fast, expensive DB. "Inactive" players are moved to cheaper, slower storage by background workers.

2. Storage Tiers

| Tier   | Name           | Target Users                           | Storage Technology               | Latency | Cost/GB |
|--------|----------------|----------------------------------------|----------------------------------|---------|---------|
| Tier 1 | Hot (Active)   | Currently online + played < 3 days ago | Postgres (NVMe)                  | < 1 ms  | $$$     |
| Tier 2 | Warm (Recent)  | Played 3-30 days ago                   | Postgres (HDD/SATA SSD) or Redis | < 10 ms | $$      |
| Tier 3 | Cold (Archive) | Played > 30 days ago                   | S3 / Blob Storage                | ~200 ms | $       |

Capacity Breakdown (100M Users)

  • Hot Tier (100k CCU + Buffer): ~1M Users (1% of total).
    • Size: ~100 GB (Fits in RAM).
    • Load: 100% of Write Traffic.
  • Warm Tier (Churned/Casual): ~9M Users.
    • Size: ~1 TB.
    • Load: Occasional reads (Login checks), low writes (Worker migrations).
  • Cold Tier (Long Tail): ~90M Users.
    • Size: ~10 TB.
    • Load: Rare reads (Reactivation), Write-once (Archival).
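
The sizing above works out if each player carries roughly 100 KB of persisted state (an assumption implied by "1M users ≈ 100 GB", not stated explicitly). A quick sanity check:

```python
# Assumption: ~100 KB of persisted state per player (implied by
# the Hot Tier figure of 1M users ~= 100 GB).
KB = 1024
PER_PLAYER_BYTES = 100 * KB

def tier_size_gb(users: int) -> float:
    """Approximate storage footprint for a tier, in GB."""
    return users * PER_PLAYER_BYTES / (1024 ** 3)

hot_gb = tier_size_gb(1_000_000)     # ~95 GB   -> "~100 GB"
warm_gb = tier_size_gb(9_000_000)    # ~858 GB  -> "~1 TB"
cold_gb = tier_size_gb(90_000_000)   # ~8.6 TB  -> "~10 TB"
```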

3. Worker Architecture

We will run a set of dedicated Archival Workers separate from the Game Servers.

A. The "Cool Down" Worker (Hot -> Warm)

  • Trigger: Scheduled Job (e.g., every hour).
  • Query: SELECT id FROM players WHERE last_login < NOW() - INTERVAL '3 days' AND tier = 'hot'
  • Action:
    1. Move rows to players_warm table (or separate DB partition).
    2. Delete from players_hot.
    3. Update index / lookup table.
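
The copy-then-delete ordering above matters: a crash between steps 1 and 2 leaves a duplicate row (harmless, the job is re-runnable), never a lost player. A minimal sketch of one batch, with plain dicts standing in for the players_hot / players_warm tables (cool_down_batch and all names here are illustrative, not part of the design):

```python
from datetime import datetime, timedelta

# In-memory stand-ins for the players_hot / players_warm tables.
players_hot: dict[str, dict] = {}
players_warm: dict[str, dict] = {}

COOL_DOWN_AFTER = timedelta(days=3)

def cool_down_batch(now: datetime, batch_size: int = 1000) -> int:
    """Move players idle for > 3 days from hot to warm storage.

    Copy first, delete second: the worst failure mode is a
    duplicate row, never a missing one.
    """
    cutoff = now - COOL_DOWN_AFTER
    stale = [pid for pid, row in players_hot.items()
             if row["last_login"] < cutoff][:batch_size]
    for pid in stale:
        players_warm[pid] = players_hot[pid]   # 1. copy to warm
        del players_hot[pid]                   # 2. delete from hot
    return len(stale)
```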

B. The "Deep Freeze" Worker (Warm -> Cold)

  • Trigger: Daily Job.
  • Query: SELECT id FROM players WHERE last_login < NOW() - INTERVAL '30 days' AND tier = 'warm'
  • Action:
    1. Serialize full player state to JSON/Protobuf.
    2. Upload to S3: s3://game-archives/players/{uuid}.json.
    3. Update Lookup Table: SET status='archived', s3_path='...'.
    4. Delete from players_warm.
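
The four steps above can be sketched the same way. Dicts stand in for the warm table, the S3 bucket, and the lookup table; deep_freeze_batch and the store names are illustrative:

```python
import json
from datetime import datetime, timedelta

players_warm: dict[str, dict] = {}
s3_bucket: dict[str, bytes] = {}   # stand-in for s3://game-archives
lookup: dict[str, dict] = {}       # archive lookup table

DEEP_FREEZE_AFTER = timedelta(days=30)

def deep_freeze_batch(now: datetime) -> int:
    """Archive players idle for > 30 days from warm to cold storage."""
    cutoff = now - DEEP_FREEZE_AFTER
    frozen = 0
    for pid in [p for p, r in players_warm.items()
                if r["last_login"] < cutoff]:
        row = players_warm[pid]
        key = f"players/{pid}.json"
        s3_bucket[key] = json.dumps(row, default=str).encode()  # 1-2. serialize + upload
        lookup[pid] = {"status": "archived", "s3_path": key}    # 3. update lookup
        del players_warm[pid]                                   # 4. delete from warm
        frozen += 1
    return frozen
```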

C. The "Thaw" Process (On Login)

  • Trigger: Player logs in.
  • Logic:
    1. Check Hot DB -> Found? Login.
    2. Check Warm DB -> Found? Move to Hot (Async or Sync) -> Login.
    3. Check Cold Lookup -> Found?
      • Blocking: User sees "Restoring Account..." spinner (2-3s).
      • Action: Download S3 JSON -> Insert to Hot DB -> Login.
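
The three-step lookup can be sketched as a single function. Plain dicts again stand in for each tier, and login and the store names are illustrative:

```python
import json

# In-memory stand-ins for the three tiers plus the archive lookup.
players_hot: dict[str, dict] = {}
players_warm: dict[str, dict] = {}
s3_bucket: dict[str, bytes] = {}
lookup: dict[str, dict] = {}

def login(pid: str) -> str:
    """Resolve a player's tier at login, promoting the row to hot."""
    if pid in players_hot:                       # 1. hot: log in
        return "hot"
    if pid in players_warm:                      # 2. warm: promote to hot
        players_hot[pid] = players_warm.pop(pid)
        return "warm"
    entry = lookup.get(pid)                      # 3. cold: blocking thaw
    if entry and entry.get("status") == "archived":
        row = json.loads(s3_bucket[entry["s3_path"]])
        players_hot[pid] = row
        entry["status"] = "active"
        return "cold"
    return "new"                                 # no record anywhere
```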

4. 100k CCU Hardware Implications

With this strategy, the Hot Database only needs to hold the ~1M hot-tier users (100k concurrent plus a buffer of recently active players), not all 100M.
  • Hot DB Specs (100k CCU):
    • Writes: ~25k - 50k TPS (Heavy Load).
    • Size: Small (~100GB).
    • Hardware: 1x Primary + 2x Replicas.
      • CPU: High frequency (Ryzen/Core i9 equivalent or Xeon Gold).
      • RAM: 128GB (Fit entire Hot DB in RAM).
      • Disk: 2x 1TB NVMe RAID 1.
  • Workers:
    • Can run on cheap commodity hardware / spot instances.
    • Queue-based (RabbitMQ/Redis) to throttle load on the DB.
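
One way to realize the throttling, assuming the worker consumes player IDs from the queue: a batching helper (throttled_batches is an illustrative name) that sleeps between fixed-size batches so migrations never monopolize the hot DB's write capacity:

```python
import itertools
import time
from typing import Iterable, Iterator

def throttled_batches(ids: Iterable, batch_size: int = 500,
                      delay_s: float = 0.0) -> Iterator[list]:
    """Yield IDs in fixed-size batches, sleeping between batches
    to cap the worker's write rate against the hot DB."""
    it = iter(ids)
    while batch := list(itertools.islice(it, batch_size)):
        yield batch
        time.sleep(delay_s)
```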

5. Schema Strategy

Option A: Partitioned Tables (Postgres Native)

Use Postgres declarative partitioning:
CREATE TABLE players (
    id UUID,
    last_login TIMESTAMP,
    ...
) PARTITION BY RANGE (last_login);
-- Caveat: with last_login as the partition key, every login UPDATE
-- moves the row to a new partition (internally a DELETE + INSERT),
-- which is expensive at 100k CCU.

Option B: Explicit Tables

Explicit tables give finer control over storage engines and tablespaces:
  • players_active (On NVMe Tablespace)
  • players_archive_lookup (Pointer to S3)
  • player_data_blob (Optional: Intermediate blob store)

Next Steps

  1. Define the PlayerArchive capabilities (S3-compatible store).
  2. Prototype the "Thaw" logic in the Login Server.
  3. Implement the ArchivalWorker service.