Disaster Recovery Plan

Recovery Objectives

  • RTO (Recovery Time Objective): 4 hours
    • Maximum acceptable downtime
  • RPO (Recovery Point Objective): 1 hour
    • Maximum acceptable data loss
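
These targets can be checked mechanically after an incident. A minimal sketch (hypothetical helper, not part of the existing tooling; assumes GNU `date`):

```shell
#!/bin/bash
# rto_check.sh -- compare an outage window against the 4-hour RTO above.

RTO_SECONDS=$((4 * 3600))

# outage_within_rto START END -> prints OK or BREACH with elapsed seconds
outage_within_rto() {
  local start_epoch end_epoch elapsed
  start_epoch=$(date -d "$1" +%s)
  end_epoch=$(date -d "$2" +%s)
  elapsed=$((end_epoch - start_epoch))
  if [ "$elapsed" -le "$RTO_SECONDS" ]; then
    echo "OK: recovered in ${elapsed}s (within ${RTO_SECONDS}s RTO)"
  else
    echo "BREACH: ${elapsed}s exceeds ${RTO_SECONDS}s RTO"
  fi
}

outage_within_rto "2025-01-05 14:00:00" "2025-01-05 16:30:00"
```

The same pattern works for RPO: compare the timestamp of the last restorable backup against the incident start.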

Disaster Scenarios

1. Database Server Failure

Detection: Database health check fails, application can't connect
Recovery Steps:
# 1. Verify database is truly down
psql -h db-primary.loh.internal -U loh_user -d loh_production -c "SELECT 1;"

# 2. Promote read replica to primary
# (Automatic via Patroni, or manual if needed)
patronictl failover loh-cluster

# 3. Update application config to point to new primary
export DATABASE_URL="postgresql://new-primary:5432/loh_production"

# 4. Restart application
sudo systemctl restart loh-backend

# 5. Verify
curl https://api.loh.game/health
Duration: 15-30 minutes
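
The final curl in step 5 can fail transiently while the backend warms up after restart; a retry wrapper is more robust. This `wait_for_healthy` helper is a hypothetical addition, not existing tooling:

```shell
#!/bin/bash
# wait_for_healthy CMD [ATTEMPTS] [DELAY] -- retry a health probe until it
# succeeds or the attempt budget is exhausted.
wait_for_healthy() {
  local cmd="$1" attempts="${2:-10}" delay="${3:-5}" i
  for ((i = 1; i <= attempts; i++)); do
    if eval "$cmd" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempts"
  return 1
}

# Usage against the real endpoint:
# wait_for_healthy 'curl -fsS https://api.loh.game/health' 12 10
```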

2. Complete Server Failure

Detection: Server unresponsive, SSH fails, HTTP requests timeout
Recovery Steps:
# 1. Spin up new server from infrastructure-as-code
terraform apply -var="server_count=2"

# 2. Restore application from Git
git clone https://github.com/loh/backend.git
cd backend
cargo build --release

# 3. Restore database from latest backup
# (see Database Restoration section below)

# 4. Configure Cloudflare Tunnel to point to new server
cloudflared tunnel route dns loh-tunnel game.loh.game

# 5. Start services
sudo systemctl start loh-backend
sudo systemctl start loh-game-server
Duration: 2-3 hours
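
Step 5 assumes `loh-backend` is registered as a systemd unit on the fresh server. An illustrative unit file; the install path, service user, and environment values are assumptions about this deployment, not confirmed configuration:

```ini
# /etc/systemd/system/loh-backend.service (illustrative)
[Unit]
Description=LoH backend API
After=network-online.target postgresql.service

[Service]
User=loh
WorkingDirectory=/opt/loh/backend
ExecStart=/opt/loh/backend/target/release/loh-backend
Environment=DATABASE_URL=postgresql://db-primary.loh.internal:5432/loh_production
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After writing the unit, run `sudo systemctl daemon-reload` before the `systemctl start` commands above.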

3. Data Corruption

Detection: Inconsistent data, missing records, failed integrity checks
Recovery Steps:
# 1. Stop writes immediately
sudo systemctl stop loh-backend

# 2. Identify scope of corruption
psql -U loh_user -d loh_production -c \
  "SELECT COUNT(*) FROM players WHERE created_at > NOW() - INTERVAL '24 hours';"

# 3. Restore from point-in-time backup before corruption
# (see Point-in-Time Recovery below)

# 4. Replay transactions if possible
# (from WAL logs or transaction log backups)

# 5. Verify data integrity
psql -U loh_user -d loh_production -c "SELECT COUNT(*), MAX(id) FROM players;"

# 6. Restart application
sudo systemctl start loh-backend
Duration: 3-4 hours (depends on backup size)
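
Step 3 requires picking the newest backup taken strictly before the corruption. Given the dump naming scheme used in this runbook (`loh_prod_YYYYMMDD_HHMM.dump`), the choice can be scripted; this helper is a hypothetical sketch:

```shell
#!/bin/bash
# pick_pitr_base CUTOFF FILE... -- print the newest backup whose
# YYYYMMDD_HHMM stamp is strictly before CUTOFF (same format).
# Relies on the stamps sorting correctly as strings.
pick_pitr_base() {
  local cutoff="$1"; shift
  local best="" best_stamp="" f stamp
  for f in "$@"; do
    stamp="${f##*/}"            # strip any directory prefix
    stamp="${stamp#loh_prod_}"  # strip naming prefix
    stamp="${stamp%.dump}"      # strip extension
    if [[ "$stamp" < "$cutoff" ]] && { [[ -z "$best_stamp" ]] || [[ "$stamp" > "$best_stamp" ]]; }; then
      best="$f"
      best_stamp="$stamp"
    fi
  done
  echo "$best"
}

# Example: corruption detected at 2025-01-05 14:30
# pick_pitr_base "20250105_1430" /backups/postgresql/loh_prod_*.dump
```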

4. Ransomware / Security Breach

Detection: Files encrypted, unusual access patterns, data exfiltration alerts
Recovery Steps:
# 1. ISOLATE IMMEDIATELY
# - Disconnect from network
# - Preserve evidence (don't delete anything)
# Allow SSH from known admin IPs first, then drop everything else
sudo iptables -A INPUT -p tcp -s <known-admin-ip> --dport 22 -j ACCEPT
sudo iptables -A INPUT -j DROP

# 2. Notify security team and law enforcement

# 3. Assess damage scope
# - What data was accessed?
# - Was encryption applied?
# - Are backups compromised?

# 4. Restore from clean backup
# - Verify backup is from before breach
# - Scan restored system for malware

# 5. Rotate ALL credentials
# - Database passwords
# - API keys
# - JWT secrets
# - SSH keys

# 6. Apply security patches and harden system

# 7. Gradual restore of service
# - Start in read-only mode
# - Monitor for 24 hours
# - Full restore once confident
Duration: 1-3 days (includes forensics)
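
Step 5's credential rotation can be partly scripted. A sketch assuming `openssl` is available; where each new secret is stored and distributed (env files, a vault) is deployment-specific:

```shell
#!/bin/bash
# Generate replacement secrets for step 5 (rotate ALL credentials).
# Storage and distribution of the new values is deployment-specific.

NEW_JWT_SECRET=$(openssl rand -hex 32)      # 64 hex characters
NEW_DB_PASSWORD=$(openssl rand -base64 24)

echo "Generated JWT secret (${#NEW_JWT_SECRET} chars) and DB password."

# Apply the database password as a superuser (shown here, not executed):
# psql -c "ALTER USER loh_user WITH PASSWORD '${NEW_DB_PASSWORD}';"
```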

Database Restoration

From Daily Backup

# 1. Stop application
sudo systemctl stop loh-backend

# 2. List available backups
ls -lh /backups/postgresql/

# 3. Restore from backup
pg_restore -U loh_user -d loh_production --clean --if-exists \
  /backups/postgresql/loh_prod_20250104_0200.dump

# 4. Verify restoration
psql -U loh_user -d loh_production -c "SELECT COUNT(*) FROM players;"

# 5. Restart application
sudo systemctl start loh-backend
Data Loss: Up to 24 hours (since last daily backup)

Point-in-Time Recovery (PITR)

# 1. Stop database
sudo systemctl stop postgresql

# 2. Move the old data directory aside, then restore the base backup
sudo mv /var/lib/postgresql/14/main /var/lib/postgresql/14/main.old
sudo install -d -o postgres -g postgres -m 700 /var/lib/postgresql/14/main
sudo tar -xzf /backups/base_backup_20250104.tar.gz -C /var/lib/postgresql/14/main

# 3. Configure WAL replay up to the target timestamp
# In postgresql.conf (PostgreSQL 12+; formerly recovery.conf):
#   restore_command = 'cp /backups/wal/%f %p'   (adjust to the WAL archive path)
#   recovery_target_time = '2025-01-05 14:30:00'
# Then create the recovery signal file:
sudo -u postgres touch /var/lib/postgresql/14/main/recovery.signal

# 4. Start PostgreSQL in recovery mode
sudo systemctl start postgresql

# 5. Once recovery completes, promote to primary
sudo -u postgres psql -c "SELECT pg_promote();"

# 6. Verify
psql -c "SELECT NOW(), pg_last_wal_replay_lsn();"
Data Loss: Minimal (RPO < 1 hour if WAL archiving enabled)

Backup Strategy

Automated Backups

Schedule:
  • Hourly: Transaction logs (WAL) → S3
  • Daily: Full database dump → S3 (2 AM UTC)
  • Weekly: Server snapshots → AWS/GCP
Retention:
  • Hourly: 7 days
  • Daily: 30 days
  • Weekly: 90 days
Verification:
  • Restore test: Monthly
  • Integrity check: Weekly
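
The schedule above can be expressed as a system crontab. An illustrative fragment; note the hourly WAL shipping is normally driven by PostgreSQL's `archive_command` rather than cron, and the snapshot script path is hypothetical:

```
# /etc/cron.d/loh-backups (illustrative)
# m h dom mon dow  user      command
0 2 * * *          postgres  pg_dump -Fc loh_production > /backups/postgresql/loh_prod_$(date +\%Y\%m\%d_\%H\%M).dump
0 3 * * 0          root      /usr/local/bin/weekly_snapshot.sh
0 4 1 * *          postgres  /usr/local/bin/verify_backup.sh
```

The last line runs the monthly restore test using the verification script below.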

Backup Verification Script

#!/bin/bash
# verify_backup.sh
# Run monthly to ensure backups are restorable

BACKUP_FILE="/backups/postgresql/latest.dump"
TEST_DB="loh_test_restore"

# Create test database
psql -U postgres -c "DROP DATABASE IF EXISTS $TEST_DB;"
psql -U postgres -c "CREATE DATABASE $TEST_DB;"

# Restore into the test database
pg_restore -U postgres -d "$TEST_DB" "$BACKUP_FILE"

# Verify (-t -A: tuples only, unaligned, so $ROW_COUNT is a bare number)
ROW_COUNT=$(psql -U postgres -d "$TEST_DB" -t -A -c "SELECT COUNT(*) FROM players;")

if [ "$ROW_COUNT" -gt 0 ]; then
  echo "✅ Backup verified: $ROW_COUNT players restored"
  exit 0
else
  echo "❌ Backup verification FAILED"
  exit 1
fi

Recovery Contact List

Role                 Name      Contact            Responsibility
On-Call Engineer     Rotation  PagerDuty          First responder
Database Admin       ---       Slack @dba         Database recovery
Security Lead        ---       security@loh.game  Breach response
Infrastructure Lead  ---       Slack @infra       Server provisioning
CTO                  ---       Phone              Final authority

Post-Recovery Actions

  1. Incident Report: Document what happened, how long recovery took
  2. Root Cause Analysis: Why did the disaster occur? How can recurrence be prevented?
  3. Update Runbooks: Capture any new learnings
  4. Test Recovery Plan: Schedule DR drill within 30 days
  5. User Communication: Transparent post-mortem if users affected