Disaster Recovery Plan

Recovery Objectives

  • RTO (Recovery Time Objective): 4 hours
    • Maximum acceptable downtime
  • RPO (Recovery Point Objective): 1 hour
    • Maximum acceptable data loss
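
These targets can be checked mechanically after an incident. A minimal sketch (hypothetical helper, not part of the existing tooling; assumes GNU `date`):

```shell
#!/bin/bash
# rto_check.sh -- compare an outage window against the 4-hour RTO above.

RTO_SECONDS=$((4 * 3600))

# outage_within_rto START END -> prints OK or BREACH with elapsed seconds
outage_within_rto() {
  local start_epoch end_epoch elapsed
  start_epoch=$(date -d "$1" +%s)
  end_epoch=$(date -d "$2" +%s)
  elapsed=$((end_epoch - start_epoch))
  if [ "$elapsed" -le "$RTO_SECONDS" ]; then
    echo "OK: recovered in ${elapsed}s (within ${RTO_SECONDS}s RTO)"
  else
    echo "BREACH: ${elapsed}s exceeds ${RTO_SECONDS}s RTO"
  fi
}

outage_within_rto "2025-01-05 14:00:00" "2025-01-05 16:30:00"
```

The same pattern works for RPO: compare the timestamp of the last restorable backup against the incident start.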

Disaster Scenarios

1. Database Server Failure

Detection: Database health check fails, application can't connect
Recovery Steps:
# 1. Verify database is truly down
psql -h db-primary.loh.internal -U loh_user -d loh_production -c "SELECT 1;"

# 2. Promote read replica to primary
# (Automatic via Patroni, or manual if needed)
patronictl failover loh-cluster

# 3. Update application config to point to new primary
export DATABASE_URL="postgresql://new-primary:5432/loh_production"

# 4. Restart application
sudo systemctl restart loh-backend

# 5. Verify
curl https://api.loh.game/health
Duration: 15-30 minutes
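
The final curl in step 5 can fail transiently while the backend warms up after restart; a retry wrapper is more robust. This `wait_for_healthy` helper is a hypothetical addition, not existing tooling:

```shell
#!/bin/bash
# wait_for_healthy CMD [ATTEMPTS] [DELAY] -- retry a health probe until it
# succeeds or the attempt budget is exhausted.
wait_for_healthy() {
  local cmd="$1" attempts="${2:-10}" delay="${3:-5}" i
  for ((i = 1; i <= attempts; i++)); do
    if eval "$cmd" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempts"
  return 1
}

# Usage against the real endpoint:
# wait_for_healthy 'curl -fsS https://api.loh.game/health' 12 10
```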

2. Complete Server Failure

Detection: Server unresponsive, SSH fails, HTTP requests timeout
Recovery Steps:
# 1. Spin up new server from infrastructure-as-code
terraform apply -var="server_count=2"

# 2. Restore application from Git
git clone https://github.com/loh/backend.git
cd backend
cargo build --release

# 3. Restore database from latest backup
# (see Database Restoration section below)

# 4. Configure Cloudflare Tunnel to point to new server
cloudflared tunnel route dns loh-tunnel game.loh.game

# 5. Start services
sudo systemctl start loh-backend
sudo systemctl start loh-game-server
Duration: 2-3 hours
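
Step 5 assumes `loh-backend` is registered as a systemd unit on the fresh server. An illustrative unit file; the install path, service user, and environment values are assumptions about this deployment, not confirmed configuration:

```ini
# /etc/systemd/system/loh-backend.service (illustrative)
[Unit]
Description=LoH backend API
After=network-online.target postgresql.service

[Service]
User=loh
WorkingDirectory=/opt/loh/backend
ExecStart=/opt/loh/backend/target/release/loh-backend
Environment=DATABASE_URL=postgresql://db-primary.loh.internal:5432/loh_production
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After writing the unit, run `sudo systemctl daemon-reload` before the `systemctl start` commands above.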

3. Data Corruption

Detection: Inconsistent data, missing records, failed integrity checks
Recovery Steps:
# 1. Stop writes immediately
sudo systemctl stop loh-backend

# 2. Identify scope of corruption
psql -U loh_user -d loh_production -c \
  "SELECT COUNT(*) FROM players WHERE created_at > NOW() - INTERVAL '24 hours';"

# 3. Restore from point-in-time backup before corruption
# (see Point-in-Time Recovery below)

# 4. Replay transactions if possible
# (from WAL logs or transaction log backups)

# 5. Verify data integrity
psql -U loh_user -d loh_production -c "SELECT COUNT(*), MAX(id) FROM players;"

# 6. Restart application
sudo systemctl start loh-backend
Duration: 3-4 hours (depends on backup size)
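
Step 3 requires picking the newest backup taken strictly before the corruption. Given the dump naming scheme used in this runbook (`loh_prod_YYYYMMDD_HHMM.dump`), the choice can be scripted; this helper is a hypothetical sketch:

```shell
#!/bin/bash
# pick_pitr_base CUTOFF FILE... -- print the newest backup whose
# YYYYMMDD_HHMM stamp is strictly before CUTOFF (same format).
# Relies on the stamps sorting correctly as strings.
pick_pitr_base() {
  local cutoff="$1"; shift
  local best="" best_stamp="" f stamp
  for f in "$@"; do
    stamp="${f##*/}"            # strip any directory prefix
    stamp="${stamp#loh_prod_}"  # strip naming prefix
    stamp="${stamp%.dump}"      # strip extension
    if [[ "$stamp" < "$cutoff" ]] && { [[ -z "$best_stamp" ]] || [[ "$stamp" > "$best_stamp" ]]; }; then
      best="$f"
      best_stamp="$stamp"
    fi
  done
  echo "$best"
}

# Example: corruption detected at 2025-01-05 14:30
# pick_pitr_base "20250105_1430" /backups/postgresql/loh_prod_*.dump
```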

4. Ransomware / Security Breach

Detection: Files encrypted, unusual access patterns, data exfiltration alerts
Recovery Steps:
# 1. ISOLATE IMMEDIATELY
# - Disconnect from network
# - Preserve evidence (don't delete anything)
# Allow SSH from known admin IPs first, then drop everything else
sudo iptables -A INPUT -p tcp -s <known-admin-ip> --dport 22 -j ACCEPT
sudo iptables -A INPUT -j DROP

# 2. Notify security team and law enforcement

# 3. Assess damage scope
# - What data was accessed?
# - Was encryption applied?
# - Are backups compromised?

# 4. Restore from clean backup
# - Verify backup is from before breach
# - Scan restored system for malware

# 5. Rotate ALL credentials
# - Database passwords
# - API keys
# - JWT secrets
# - SSH keys

# 6. Apply security patches and harden system

# 7. Gradual restore of service
# - Start in read-only mode
# - Monitor for 24 hours
# - Full restore once confident
Duration: 1-3 days (includes forensics)
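
Step 5's credential rotation can be partly scripted. A sketch assuming `openssl` is available; where each new secret is stored and distributed (env files, a vault) is deployment-specific:

```shell
#!/bin/bash
# Generate replacement secrets for step 5 (rotate ALL credentials).
# Storage and distribution of the new values is deployment-specific.

NEW_JWT_SECRET=$(openssl rand -hex 32)      # 64 hex characters
NEW_DB_PASSWORD=$(openssl rand -base64 24)

echo "Generated JWT secret (${#NEW_JWT_SECRET} chars) and DB password."

# Apply the database password as a superuser (shown here, not executed):
# psql -c "ALTER USER loh_user WITH PASSWORD '${NEW_DB_PASSWORD}';"
```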

Database Restoration

From Daily Backup

# 1. Stop application
sudo systemctl stop loh-backend

# 2. List available backups
ls -lh /backups/postgresql/

# 3. Restore from backup
pg_restore -U loh_user -d loh_production --clean --if-exists \
  /backups/postgresql/loh_prod_20250104_0200.dump

# 4. Verify restoration
psql -U loh_user -d loh_production -c "SELECT COUNT(*) FROM players;"

# 5. Restart application
sudo systemctl start loh-backend
Data Loss: Up to 24 hours (since last daily backup)

Point-in-Time Recovery (PITR)

# 1. Stop database
sudo systemctl stop postgresql

# 2. Move the old data directory aside, then restore the base backup
sudo mv /var/lib/postgresql/14/main /var/lib/postgresql/14/main.old
sudo install -d -o postgres -g postgres -m 700 /var/lib/postgresql/14/main
sudo tar -xzf /backups/base_backup_20250104.tar.gz -C /var/lib/postgresql/14/main

# 3. Configure WAL replay up to the target timestamp
# In postgresql.conf (PostgreSQL 12+; formerly recovery.conf):
#   restore_command = 'cp /backups/wal/%f %p'   (adjust to the WAL archive path)
#   recovery_target_time = '2025-01-05 14:30:00'
# Then create the recovery signal file:
sudo -u postgres touch /var/lib/postgresql/14/main/recovery.signal

# 4. Start PostgreSQL in recovery mode
sudo systemctl start postgresql

# 5. Once recovery completes, promote to primary
sudo -u postgres psql -c "SELECT pg_promote();"

# 6. Verify
psql -c "SELECT NOW(), pg_last_wal_replay_lsn();"
Data Loss: Minimal (RPO < 1 hour if WAL archiving enabled)

Backup Strategy

Automated Backups

Schedule:
  • Hourly: Transaction logs (WAL) → S3
  • Daily: Full database dump → S3 (2 AM UTC)
  • Weekly: Server snapshots → AWS/GCP
Retention:
  • Hourly: 7 days
  • Daily: 30 days
  • Weekly: 90 days
Verification:
  • Restore test: Monthly
  • Integrity check: Weekly
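
The schedule above can be expressed as a system crontab. An illustrative fragment; note the hourly WAL shipping is normally driven by PostgreSQL's `archive_command` rather than cron, and the snapshot script path is hypothetical:

```
# /etc/cron.d/loh-backups (illustrative)
# m h dom mon dow  user      command
0 2 * * *          postgres  pg_dump -Fc loh_production > /backups/postgresql/loh_prod_$(date +\%Y\%m\%d_\%H\%M).dump
0 3 * * 0          root      /usr/local/bin/weekly_snapshot.sh
0 4 1 * *          postgres  /usr/local/bin/verify_backup.sh
```

The last line runs the monthly restore test using the verification script below.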

Backup Verification Script

#!/bin/bash
# verify_backup.sh
# Run monthly to ensure backups are restorable

BACKUP_FILE="/backups/postgresql/latest.dump"
TEST_DB="loh_test_restore"

# Create test database
psql -U postgres -c "DROP DATABASE IF EXISTS $TEST_DB;"
psql -U postgres -c "CREATE DATABASE $TEST_DB;"

# Restore into the test database
pg_restore -U postgres -d "$TEST_DB" "$BACKUP_FILE"

# Verify (-t -A: tuples only, unaligned, so $ROW_COUNT is a bare number)
ROW_COUNT=$(psql -U postgres -d "$TEST_DB" -t -A -c "SELECT COUNT(*) FROM players;")

if [ "$ROW_COUNT" -gt 0 ]; then
  echo "✅ Backup verified: $ROW_COUNT players restored"
  exit 0
else
  echo "❌ Backup verification FAILED"
  exit 1
fi

Recovery Contact List

Role                 Name      Contact            Responsibility
On-Call Engineer     Rotation  PagerDuty          First responder
Database Admin       ---       Slack @dba         Database recovery
Security Lead        ---       security@loh.game  Breach response
Infrastructure Lead  ---       Slack @infra       Server provisioning
CTO                  ---       Phone              Final authority

Post-Recovery Actions

  1. Incident Report: Document what happened, how long recovery took
  2. Root Cause Analysis: Why did the disaster occur? How can recurrence be prevented?
  3. Update Runbooks: Capture any new learnings
  4. Test Recovery Plan: Schedule DR drill within 30 days
  5. User Communication: Transparent post-mortem if users affected