Disaster Recovery Plan
Recovery Objectives
- RTO (Recovery Time Objective): 4 hours
  - Maximum acceptable downtime
- RPO (Recovery Point Objective): 1 hour
  - Maximum acceptable data loss
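The two targets above can be turned into a quick pass/fail check for a drill or a post-incident review. The following is a minimal sketch: the function names `within_rto` and `within_rpo` are illustrative (they are not part of any existing tooling), and all timestamps are assumed to be Unix epoch seconds.

```bash
#!/usr/bin/env bash
# Targets from the objectives above, expressed in seconds.
RTO_SECONDS=$((4 * 3600))
RPO_SECONDS=$((1 * 3600))

# within_rto OUTAGE_START_EPOCH SERVICE_RESTORED_EPOCH
# Succeeds if the outage lasted no longer than the RTO.
within_rto() {
  local start=$1 end=$2
  [ $((end - start)) -le "$RTO_SECONDS" ]
}

# within_rpo LAST_RECOVERABLE_EPOCH FAILURE_EPOCH
# Succeeds if the newest recoverable data is no older than the RPO.
within_rpo() {
  local last_recoverable=$1 failure=$2
  [ $((failure - last_recoverable)) -le "$RPO_SECONDS" ]
}
```

For example, `within_rto "$outage_start" "$(date +%s)" && echo "RTO met"` reports whether a recovery that has just finished stayed inside the 4-hour window, assuming `outage_start` was recorded when the incident began.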
Disaster Scenarios
1. Database Server Failure
Detection: Database health check fails, application can't connect
Recovery Steps:
# 1. Verify database is truly down
psql -h db-primary.loh.internal -U loh_user -d loh_production -c "SELECT 1;"
# 2. Promote read replica to primary
# (Automatic via Patroni, or manual if needed)
patronictl failover loh-cluster
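# 3. Update application config to point to new primary
# (export only affects the current shell; a systemd-managed service reads its own environment settings, so update those as well)
export DATABASE_URL="postgresql://new-primary:5432/loh_production"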
# 4. Restart application
sudo systemctl restart loh-backend
# 5. Verify
curl https://api.loh.game/health
Duration: 15-30 minutes
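After a failover the health endpoint may take a short while to recover, so a single `curl` can report a false failure. One way to handle this is a small retry helper, sketched below. `retry_until_ok` is a hypothetical name, and the attempt count and delay in the example are assumptions, not values from this plan.

```bash
#!/usr/bin/env bash
# retry_until_ok ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry_until_ok() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (not executed here): poll the health endpoint for up to 5 minutes.
# -f makes curl return non-zero on HTTP errors such as 503.
# retry_until_ok 30 10 curl -fsS https://api.loh.game/health
```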
2. Complete Server Failure
Detection: Server unresponsive, SSH fails, HTTP requests time out
Recovery Steps:
# 1. Spin up new server from infrastructure-as-code
terraform apply -var="server_count=2"
# 2. Restore application from Git
git clone https://github.com/loh/backend.git
cd backend
cargo build --release
# 3. Restore database from latest backup
# (see Database Restoration section below)
# 4. Configure Cloudflare Tunnel to point to new server
cloudflared tunnel route dns loh-tunnel game.loh.game
# 5. Start services
sudo systemctl start loh-backend
sudo systemctl start loh-game-server
Duration: 2-3 hours
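A freshly provisioned server may lack some of the tools these steps rely on. A pre-flight check along the following lines can surface that before the rebuild starts. This is a sketch only; `require_cmds` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# require_cmds CMD...
# Reports every missing command on stderr and fails if any is absent.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing required command: $cmd" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}

# Example (not executed here): the tools used in the steps above.
# require_cmds terraform git cargo cloudflared systemctl || exit 1
```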
3. Data Corruption
Detection: Inconsistent data, missing records, failed integrity checks
Recovery Steps:
# 1. Stop writes immediately
sudo systemctl stop loh-backend
# 2. Identify scope of corruption
psql -U loh_user -d loh_production -c "SELECT COUNT(*) FROM players WHERE created_at > NOW() - INTERVAL '24 hours';"
# 3. Restore from point-in-time backup before corruption
# (see Point-in-Time Recovery below)
# 4. Replay transactions if possible
# (from WAL logs or transaction log backups)
# 5. Verify data integrity
psql -U loh_user -d loh_production -c "SELECT COUNT(*), MAX(id) FROM players;"
# 6. Restart application
sudo systemctl start loh-backend
Duration: 3-4 hours (depends on backup size)
4. Ransomware / Security Breach
Detection: Files encrypted, unusual access patterns, data exfiltration alerts
Recovery Steps:
# 1. ISOLATE IMMEDIATELY
# - Disconnect from network
# - Preserve evidence (don't delete anything)
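# Allow SSH from a known admin address first, then drop all other inbound traffic.
# ADMIN_IP is a placeholder; set it to a trusted address before running.
sudo iptables -A INPUT -p tcp -s "$ADMIN_IP" --dport 22 -j ACCEPT
sudo iptables -A INPUT -j DROP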
# 2. Notify security team and law enforcement
# 3. Assess damage scope
# - What data was accessed?
# - Was encryption applied?
# - Are backups compromised?
# 4. Restore from clean backup
# - Verify backup is from before breach
# - Scan restored system for malware
# 5. Rotate ALL credentials
# - Database passwords
# - API keys
# - JWT secrets
# - SSH keys
# 6. Apply security patches and harden system
# 7. Gradual restore of service
# - Start in read-only mode
# - Monitor for 24 hours
# - Full restore once confident
Duration: 1-3 days (includes forensics)
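For the credential rotation step, replacement secrets should come from a cryptographic random source rather than being typed by hand. A minimal sketch, assuming a system that provides `/dev/urandom`; `generate_secret` is a hypothetical helper, and how the new value is installed depends on where each credential is stored, which this plan does not specify.

```bash
#!/usr/bin/env bash
# generate_secret [BYTES]
# Prints BYTES random bytes (default 32) as a lowercase hex string.
generate_secret() {
  local bytes=${1:-32}
  head -c "$bytes" /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}
```

Calling `generate_secret` with no argument yields a 64-character hex string, which could serve as a new JWT signing secret.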
Database Restoration
From Daily Backup
# 1. Stop application
sudo systemctl stop loh-backend
# 2. List available backups
ls -lh /backups/postgresql/
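# 3. Restore from backup
# (--clean --if-exists drops existing objects first, so the restore does not collide with what is already in the database)
pg_restore -U loh_user -d loh_production --clean --if-exists \
/backups/postgresql/loh_prod_20250104_0200.dump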
# 4. Verify restoration
psql -U loh_user -d loh_production -c "SELECT COUNT(*) FROM players;"
# 5. Restart application
sudo systemctl start loh-backend
Data Loss: Up to 24 hours (since last daily backup)
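When restoring after corruption or a breach, the right dump is the newest one taken strictly before the incident, not simply the latest file. The sketch below picks it by name, assuming every dump follows the `loh_prod_YYYYMMDD_HHMM.dump` pattern shown above. `latest_backup_before` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# latest_backup_before DIR CUTOFF
# CUTOFF uses the same YYYYMMDD_HHMM stamp as the file names.
# Prints the newest dump taken strictly before CUTOFF; fails if none exists.
latest_backup_before() {
  local dir=$1 cutoff=$2
  local best="" best_stamp="" f stamp
  for f in "$dir"/loh_prod_*.dump; do
    [ -e "$f" ] || continue   # the glob matched nothing
    stamp=${f##*/loh_prod_}
    stamp=${stamp%.dump}
    # Fixed-width numeric stamps compare correctly as strings.
    if [[ "$stamp" < "$cutoff" ]]; then
      if [[ -z "$best" || "$stamp" > "$best_stamp" ]]; then
        best=$f
        best_stamp=$stamp
      fi
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}
```

For example, `latest_backup_before /backups/postgresql 20250105_1430` prints the path of the most recent dump taken before 14:30 on 5 January 2025, which can then be passed to `pg_restore`.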
Point-in-Time Recovery (PITR)
# 1. Stop database
sudo systemctl stop postgresql
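# 2. Restore base backup into an empty data directory
# (move the old contents aside first; extracting over existing files leaves a mixed, unusable cluster)
tar -xzf /backups/base_backup_20250104.tar.gz -C /var/lib/postgresql/14/main
# 3. Apply WAL logs up to specific timestamp
# (configure in postgresql.conf; restore_command must also be set so PostgreSQL can fetch archived WAL)
recovery_target_time = '2025-01-05 14:30:00'
# PostgreSQL 12+ only enters targeted recovery if recovery.signal exists in the data directory
sudo -u postgres touch /var/lib/postgresql/14/main/recovery.signal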
# 4. Start PostgreSQL in recovery mode
sudo systemctl start postgresql
# 5. Once recovery complete, promote to primary
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/14/main
# 6. Verify
psql -c "SELECT NOW(), pg_last_wal_replay_lsn();"
Data Loss: Minimal (RPO < 1 hour if WAL archiving enabled)
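The recovery settings for step 3 are easy to mistype under pressure, so it can help to generate them. The following sketch prints the relevant lines for a given target time. `pitr_settings` is a hypothetical helper, and the `restore_command` in the example assumes WAL segments have already been copied to a local directory; this plan archives WAL to S3, so the real command will differ.

```bash
#!/usr/bin/env bash
# pitr_settings TARGET_TIME RESTORE_COMMAND
# Prints PostgreSQL settings for a point-in-time recovery.
# Note: single quotes inside either argument are not escaped.
pitr_settings() {
  local target_time=$1 restore_command=$2
  printf "restore_command = '%s'\n" "$restore_command"
  printf "recovery_target_time = '%s'\n" "$target_time"
  # 'pause' (the default) holds the server at the target so the data
  # can be inspected before promoting.
  printf "recovery_target_action = 'pause'\n"
}

# Example (hypothetical local archive path):
# pitr_settings '2025-01-05 14:30:00' 'cp /backups/wal/%f %p'
```

Review the printed lines before adding them to `postgresql.conf`. `%f` and `%p` are the placeholders PostgreSQL substitutes with the WAL file name and its destination path.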
Backup Strategy
Automated Backups
Schedule:
- Hourly: Transaction logs (WAL) → S3
- Daily: Full database dump → S3 (2 AM UTC)
- Weekly: Server snapshots → AWS/GCP
Retention:
- Hourly: 7 days
- Daily: 30 days
- Weekly: 90 days
Verification:
- Restore test: Monthly
- Integrity check: Weekly
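The retention periods above can be encoded once and reused by whatever job prunes old backups. A minimal sketch, with hypothetical function names; the tier names mirror the schedule above and ages are assumed to be whole days.

```bash
#!/usr/bin/env bash
# retention_days TIER
# Prints the retention period, in days, for a backup tier.
retention_days() {
  case $1 in
    hourly) echo 7 ;;
    daily)  echo 30 ;;
    weekly) echo 90 ;;
    *) echo "unknown backup tier: $1" >&2; return 1 ;;
  esac
}

# is_expired TIER AGE_DAYS
# Succeeds if a backup of that tier and age is past its retention period.
# Returns 2 for an unknown tier.
is_expired() {
  local limit
  limit=$(retention_days "$1") || return 2
  [ "$2" -gt "$limit" ]
}
```

A pruning job could, for instance, list candidates with `find /backups/postgresql -name '*.dump' -mtime +"$(retention_days daily)" -print` and review the output before deleting anything.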
Backup Verification Script
#!/bin/bash
# verify_backup.sh
# Run monthly to ensure backups are restorable
BACKUP_FILE="/backups/postgresql/latest.dump"
TEST_DB="loh_test_restore"
# Create test database
psql -U postgres -c "DROP DATABASE IF EXISTS $TEST_DB;"
psql -U postgres -c "CREATE DATABASE $TEST_DB;"
# Restore
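pg_restore -U postgres -d "$TEST_DB" "$BACKUP_FILE"
# Verify (-A -t prints the bare number, without padding or headers)
ROW_COUNT=$(psql -U postgres -d "$TEST_DB" -At -c "SELECT COUNT(*) FROM players;")
# Default to 0 so that a failed query counts as a failed verification
if [ "${ROW_COUNT:-0}" -gt 0 ]; then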
echo "✅ Backup verified: $ROW_COUNT players restored"
exit 0
else
echo "❌ Backup verification FAILED"
exit 1
fi
Recovery Contact List
Post-Recovery Actions
- Incident Report: Document what happened, how long recovery took
- Root Cause Analysis: Why did the disaster occur, and how can it be prevented?
- Update Runbooks: Capture any new learnings
- Test Recovery Plan: Schedule DR drill within 30 days
- User Communication: Transparent post-mortem if users affected