Guide for incident response

Incident Response Plan

Incident Classification

Severity      | Definition                                     | Response Time | Examples
--------------|------------------------------------------------|---------------|--------------------------------------
P0 - Critical | Complete service outage or active data breach  | <15 min       | Database down, active exploit
P1 - High     | Partial outage affecting many users            | <1 hour       | Server crash, payment processing down
P2 - Medium   | Degraded performance or isolated issues        | <4 hours      | High latency, minor feature broken
P3 - Low      | Minor bugs, cosmetic issues                    | <24 hours     | UI glitches, typos
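
The response-time targets above can be encoded so that tooling computes a deadline rather than relying on memory. The following is a minimal sketch, not an existing module; the names `Severity`, `RESPONSE_TARGETS`, and `response_deadline` are hypothetical, and only the time limits come from the table.

```python
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    P0 = "Critical"
    P1 = "High"
    P2 = "Medium"
    P3 = "Low"


# Response-time targets copied from the classification table.
RESPONSE_TARGETS = {
    Severity.P0: timedelta(minutes=15),
    Severity.P1: timedelta(hours=1),
    Severity.P2: timedelta(hours=4),
    Severity.P3: timedelta(hours=24),
}


def response_deadline(severity: Severity, detected_at: datetime) -> datetime:
    """Return the latest time by which a responder should have engaged."""
    return detected_at + RESPONSE_TARGETS[severity]
```

For example, a P0 detected at 12:00 would need a responder engaged before 12:15.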

Response Procedures

Phase 1: Detection & Triage (0-5 min)

  1. Alert Received: Via Sentry, Cloudflare monitoring, or user reports
  2. Initial Assessment:
    • Verify the issue is real
    • Determine severity (P0-P3)
    • Identify affected systems
  3. Declare Incident: Create incident channel/ticket
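
One way to make the "Declare Incident" step repeatable is to capture the triage result in a small record and derive the channel or ticket name from it. This is only a sketch: the `Incident` class and the `inc-<date>-<severity>-<summary>` naming scheme are assumptions for illustration, not an existing convention.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone

VALID_SEVERITIES = {"P0", "P1", "P2", "P3"}


@dataclass
class Incident:
    severity: str                      # one of P0..P3, set during triage
    summary: str                       # short description of the issue
    affected_systems: list[str]
    declared_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self) -> None:
        if self.severity not in VALID_SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")

    def channel_name(self) -> str:
        """Build a channel/ticket slug (hypothetical naming scheme)."""
        slug = re.sub(r"[^a-z0-9]+", "-", self.summary.lower()).strip("-")
        return f"inc-{self.declared_at:%Y%m%d}-{self.severity.lower()}-{slug}"
```

Rejecting unknown severities at construction time forces the classification decision to happen before the incident is declared.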

Phase 2: Containment (5-30 min)

For Security Incidents:
  1. Isolate affected systems (take offline if necessary)
  2. Rotate credentials if compromised
  3. Block malicious IPs via WAF
  4. Take database snapshots before changes
For Availability Incidents:
  1. Enable maintenance mode if needed
  2. Redirect traffic to backup servers
  3. Scale up resources if capacity issue
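
The two containment checklists can be tracked in code so that each completed step is timestamped for the later timeline. The sketch below is illustrative only; `CONTAINMENT_STEPS` and `ContainmentRun` are hypothetical names, and the step text is taken from the lists above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

CONTAINMENT_STEPS = {
    "security": [
        "Isolate affected systems",
        "Rotate credentials if compromised",
        "Block malicious IPs via WAF",
        "Take database snapshots before changes",
    ],
    "availability": [
        "Enable maintenance mode if needed",
        "Redirect traffic to backup servers",
        "Scale up resources if capacity issue",
    ],
}


@dataclass
class ContainmentRun:
    kind: str                                  # "security" or "availability"
    completed: dict[str, datetime] = field(default_factory=dict)

    def mark_done(self, step: str, at: Optional[datetime] = None) -> None:
        """Record when a step was finished; reject steps not on the list."""
        if step not in CONTAINMENT_STEPS[self.kind]:
            raise ValueError(f"not a {self.kind} containment step: {step}")
        self.completed[step] = at or datetime.now(timezone.utc)

    def pending(self) -> list[str]:
        """Return the steps not yet marked done, in checklist order."""
        return [s for s in CONTAINMENT_STEPS[self.kind] if s not in self.completed]
```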

Phase 3: Investigation (30 min - 4 hours)

  1. Gather Evidence:
    • Server logs (/var/log/loh-backend/)
    • Database query logs
    • Sentry error reports
    • Cloudflare analytics
  2. Root Cause Analysis:
    • Timeline of events
    • Identify trigger
    • Assess impact scope
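
Building the timeline usually means merging entries from several evidence sources into one chronological list. The sketch below assumes each log line starts with an ISO-8601 timestamp followed by a space and the message; real server, Sentry, or Cloudflare exports will likely need their own parsers. `Event`, `parse_line`, and `build_timeline` are hypothetical names.

```python
from datetime import datetime
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    at: datetime
    source: str
    message: str


def parse_line(source: str, line: str) -> Event:
    """Parse '<ISO-8601 timestamp> <message>' (assumed line format)."""
    stamp, _, message = line.strip().partition(" ")
    return Event(datetime.fromisoformat(stamp), source, message)


def build_timeline(sources: dict[str, Iterable[str]]) -> list[Event]:
    """Merge lines from all sources into one list sorted by timestamp."""
    events = [
        parse_line(name, line)
        for name, lines in sources.items()
        for line in lines
        if line.strip()  # skip blank lines
    ]
    return sorted(events, key=lambda event: event.at)
```

Keeping the source name on each event makes it easier to see which system reported the trigger first.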

Phase 4: Resolution (1-8 hours)

  1. Apply Fix:
    • Deploy hotfix (bypass CI for P0)
    • Roll back if a recent deployment caused the issue
    • Restore from backup if data is corrupted
  2. Verify Fix:
    • Run smoke tests
    • Monitor error rates for 1 hour
    • Confirm user reports resolved
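
The "monitor error rates for 1 hour" check can be made explicit. The following sketch assumes error-rate samples are available as `(timestamp, rate)` pairs and that a rate at or below 1% counts as healthy; both the data shape and the threshold are assumptions, as is the function name `fix_is_stable`.

```python
from datetime import datetime, timedelta


def fix_is_stable(
    samples: list[tuple[datetime, float]],
    fixed_at: datetime,
    threshold: float = 0.01,              # assumed healthy error rate (1%)
    window: timedelta = timedelta(hours=1),
) -> bool:
    """Return True only if the full window has been observed and every
    sample inside it stayed at or below the threshold."""
    window_end = fixed_at + window
    in_window = [(t, r) for t, r in samples if fixed_at <= t <= window_end]
    if not in_window:
        return False
    # The newest sample must reach the end of the window; otherwise the
    # monitoring period is not over yet.
    covered = max(t for t, _ in samples) >= window_end
    return covered and all(r <= threshold for _, r in in_window)
```

Returning `False` while the window is still open avoids declaring the incident resolved too early.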

Phase 5: Recovery & Communication

  1. Restore Normal Operations:
    • Disable maintenance mode
    • Resume normal traffic routing
  2. User Communication:
    • Post status update (Discord/Twitter/In-game)
    • Compensation if needed (e.g., free subscription days)
  3. Internal Debrief:
    • Schedule post-mortem within 48 hours of resolution
    • Document lessons learned

Communication Templates

P0 Incident Announcement

🚨 Service Alert 🚨
We are currently experiencing technical difficulties with [SYSTEM].
Our team is actively investigating and working on a fix.
ETA: [TIME]
We apologize for the inconvenience.
Status updates: https://status.loh.game

Resolution Announcement

✅ All Clear
The issue with [SYSTEM] has been resolved.
Root cause: [BRIEF EXPLANATION]
Thank you for your patience!
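
The bracketed placeholders in these templates can be filled programmatically, which helps avoid posting an announcement with a leftover `[SYSTEM]` in it. This is a minimal sketch; the `render` helper is hypothetical and simply treats any upper-case text in square brackets as a placeholder.

```python
import re

RESOLUTION_TEMPLATE = (
    "The issue with [SYSTEM] has been resolved.\n"
    "Root cause: [BRIEF EXPLANATION]\n"
    "Thank you for your patience!"
)


def render(template: str, values: dict[str, str]) -> str:
    """Replace each [PLACEHOLDER]; raise KeyError if a value is missing."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"missing value for [{key}]")
        return values[key]

    return re.sub(r"\[([A-Z ]+)\]", substitute, template)
```

Usage might look like `render(RESOLUTION_TEMPLATE, {"SYSTEM": "login", "BRIEF EXPLANATION": "expired certificate"})`.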

Escalation Matrix

Role             | Responsibility          | Contact
-----------------|-------------------------|-------------------
On-Call Engineer | First responder         | PagerDuty
Engineering Lead | Escalation for P0/P1    | Slack @eng-lead
Security Lead    | Security incidents      | security@loh.game
CTO              | P0 incidents >2 hours   | Direct phone
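
The matrix above reduces to a few rules, sketched here as a function. The function name and the way contacts are labeled are assumptions for illustration; it only returns who should be notified and does not call PagerDuty, Slack, or any other service.

```python
from datetime import timedelta


def contacts_to_notify(
    severity: str,
    elapsed: timedelta,
    is_security: bool = False,
) -> list[str]:
    """Return the roles to notify, following the escalation matrix."""
    contacts = ["On-Call Engineer (PagerDuty)"]  # always the first responder
    if severity in ("P0", "P1"):
        contacts.append("Engineering Lead (Slack @eng-lead)")
    if is_security:
        contacts.append("Security Lead (security@loh.game)")
    if severity == "P0" and elapsed > timedelta(hours=2):
        contacts.append("CTO (direct phone)")
    return contacts
```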

Post-Mortem Process

When Required: All P0 incidents, P1 if recurring
Timeline: Within 48 hours of resolution
Participants: Incident responders, affected team leads, product owner
Template:
  1. Incident Summary
    • What happened?
    • When did it happen?
    • Who was affected?
  2. Timeline: Detailed chronology
  3. Root Cause: 5 Whys analysis
  4. Impact Assessment:
    • Users affected: [COUNT]
    • Downtime: [DURATION]
    • Revenue impact: [$AMOUNT]
  5. Action Items:
    • Immediate fixes (done)
    • Preventive measures (TODO)
    • Process improvements (TODO)
  6. Lessons Learned: What went well, what didn't
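
The "When Required" and "Timeline" rules for post-mortems can be expressed as two small helpers. This is a sketch under the rules stated in this section; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def postmortem_required(severity: str, recurring: bool = False) -> bool:
    """All P0 incidents need a post-mortem; P1 only if recurring."""
    return severity == "P0" or (severity == "P1" and recurring)


def postmortem_deadline(resolved_at: datetime) -> datetime:
    """The post-mortem should take place within 48 hours of resolution."""
    return resolved_at + timedelta(hours=48)
```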