Incident Response Plan

Incident Classification

Severity	Definition	Response Time	Examples
P0 - Critical	Complete service outage or active data breach	<15 min	Database down, active exploit
P1 - High	Partial outage affecting many users	<1 hour	Server crash, payment processing down
P2 - Medium	Degraded performance or isolated issues	<4 hours	High latency, minor feature broken
P3 - Low	Minor bugs, cosmetic issues	<24 hours	UI glitches, typos
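
The response-time targets above can be encoded so that tooling can compute a deadline for each incident. The following is a minimal sketch, not part of any existing tooling; the names `Severity` and `response_deadline` are hypothetical and used for illustration only.

```python
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    """Severity levels mapped to the response-time targets in the table."""
    P0 = timedelta(minutes=15)
    P1 = timedelta(hours=1)
    P2 = timedelta(hours=4)
    P3 = timedelta(hours=24)


def response_deadline(severity: Severity, detected_at: datetime) -> datetime:
    """Return the time before which a responder should have engaged.

    The table states targets as "<15 min" and so on, so this value is an
    exclusive upper bound.
    """
    return detected_at + severity.value


if __name__ == "__main__":
    detected = datetime(2024, 1, 1, 12, 0)
    print(response_deadline(Severity.P0, detected))
```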

Response Procedures

Phase 1: Detection & Triage (0-5 min)

Alert Received: via Sentry, Cloudflare monitoring, or user reports
Initial Assessment:
- Verify the issue is real
- Determine severity (P0-P3)
- Identify affected systems
Declare Incident: Create incident channel/ticket

Phase 2: Containment (5-30 min)

For Security Incidents:

Isolate affected systems (take offline if necessary)
Rotate credentials if compromised
Block malicious IPs via WAF
Take database snapshots before making changes

For Availability Incidents:

Enable maintenance mode if needed
Redirect traffic to backup servers
Scale up resources if the cause is a capacity shortfall

Phase 3: Investigation (30 min - 4 hours)

Gather Evidence:
- Server logs (/var/log/loh-backend/)
- Database query logs
- Sentry error reports
- Cloudflare analytics
Root Cause Analysis:
- Timeline of events
- Identify trigger
- Assess impact scope
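
One way to build the timeline of events is to merge entries from each evidence source into a single chronological list. The sketch below assumes each source has already been parsed into timestamped records; the `Event` type, the source labels, and both function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str   # hypothetical labels, e.g. "server-log", "sentry", "cloudflare"
    message: str


def build_timeline(*sources: Iterable[Event]) -> List[Event]:
    """Merge events from several evidence sources into chronological order."""
    merged = [event for source in sources for event in source]
    # sorted() is stable, so events with equal timestamps keep their input order.
    return sorted(merged, key=lambda e: e.timestamp)


def format_timeline(events: Iterable[Event]) -> str:
    """Render one line per event for pasting into the incident ticket."""
    return "\n".join(
        f"{e.timestamp:%H:%M:%S}  [{e.source}]  {e.message}" for e in events
    )
```

Sorting by timestamp across sources makes it easier to spot which event preceded the first error, which is usually the starting point for identifying the trigger.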

Phase 4: Resolution (1-8 hours)

Apply Fix:
- Deploy a hotfix (CI may be bypassed for P0 only)
- Roll back if a recent deployment caused the issue
- Restore from backup if data is corrupted
Verify Fix:
- Run smoke tests
- Monitor error rates for 1 hour
- Confirm user reports resolved
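
The one-hour error-rate check can be automated along these lines. This is a sketch under stated assumptions: error-rate samples are already available as `(timestamp, rate)` pairs, and the 1% threshold is a placeholder, not a value defined in this plan. The function name `fix_is_stable` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def fix_is_stable(
    samples: List[Tuple[datetime, float]],
    fixed_at: datetime,
    threshold: float = 0.01,              # assumed acceptable error rate (1%)
    window: timedelta = timedelta(hours=1),
) -> bool:
    """Return True only if the full window has elapsed with every sample
    in the window at or below the threshold."""
    window_end = fixed_at + window
    if not samples or max(ts for ts, _ in samples) < window_end:
        return False  # the observation window has not fully elapsed yet
    in_window = [rate for ts, rate in samples if fixed_at <= ts <= window_end]
    return bool(in_window) and all(rate <= threshold for rate in in_window)
```

Returning `False` until the window has fully elapsed avoids declaring the fix verified on the basis of the first few quiet minutes.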

Phase 5: Recovery & Communication

Restore Normal Operations:
- Disable maintenance mode
- Resume normal traffic routing
User Communication:
- Post status update (Discord/Twitter/In-game)
- Offer compensation if needed (e.g., free subscription days)
Internal Debrief:
- Schedule post-mortem within 48 hours
- Document lessons learned

Communication Templates

P0 Incident Announcement

🚨 Service Alert 🚨
We are currently experiencing technical difficulties with [SYSTEM].
Our team is actively investigating and working on a fix.
ETA: [TIME]
We apologize for the inconvenience.
Status updates: https://status.loh.game

Resolution Announcement

✅ All Clear
The issue with [SYSTEM] has been resolved.
Root cause: [BRIEF EXPLANATION]
Thank you for your patience!
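
The bracketed placeholders in both templates can be filled programmatically so that an announcement is never posted with a field left blank. The following is a minimal sketch; the `render` helper and the keyword-argument convention (underscores stand for spaces) are assumptions made for illustration.

```python
import re

RESOLUTION_ANNOUNCEMENT = """✅ All Clear
The issue with [SYSTEM] has been resolved.
Root cause: [BRIEF EXPLANATION]
Thank you for your patience!"""


def render(template: str, **fields: str) -> str:
    """Fill [PLACEHOLDER] markers; raise ValueError if one has no value.

    Keyword names are upper-cased and underscores become spaces, so
    brief_explanation fills [BRIEF EXPLANATION].
    """
    values = {name.upper().replace("_", " "): value for name, value in fields.items()}

    def fill(match: "re.Match[str]") -> str:
        key = match.group(1)
        if key not in values:
            raise ValueError(f"no value supplied for [{key}]")
        return values[key]

    # Single pass over the template, so supplied values are never re-scanned.
    return re.sub(r"\[([A-Z ]+)\]", fill, template)


if __name__ == "__main__":
    print(render(RESOLUTION_ANNOUNCEMENT,
                 system="matchmaking",
                 brief_explanation="an expired TLS certificate"))
```

The same helper works for the P0 announcement by passing `system` and `time`.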

Escalation Matrix

Role	Responsibility	Contact
On-Call Engineer	First responder	PagerDuty
Engineering Lead	Escalation for P0/P1	Slack @eng-lead
Security Lead	Security incidents	security@loh.game
CTO	P0 incidents lasting more than 2 hours	Direct phone
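
The matrix can be expressed as a small rule function, which is useful if paging is ever automated. This sketch only decides which roles to notify; it does not contact anyone, and the function name `roles_to_notify` and its parameters are hypothetical.

```python
from datetime import timedelta
from typing import List


def roles_to_notify(severity: str, is_security: bool, elapsed: timedelta) -> List[str]:
    """Return the roles from the escalation matrix that apply to an incident."""
    roles = ["On-Call Engineer"]          # first responder for every incident
    if severity in ("P0", "P1"):
        roles.append("Engineering Lead")  # escalation for P0/P1
    if is_security:
        roles.append("Security Lead")     # any security incident
    if severity == "P0" and elapsed > timedelta(hours=2):
        roles.append("CTO")               # P0 running longer than 2 hours
    return roles
```

For example, a P0 security incident that has been open for three hours yields all four roles, while a P2 availability issue yields only the on-call engineer.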

Post-Mortem Process

When Required: All P0 incidents; P1 incidents if recurring

Timeline: Within 48 hours of resolution

Participants: Incident responders, affected team leads, product owner
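
The "when required" and "timeline" rules above reduce to a short check. The sketch below is illustrative only; `postmortem_due` is a hypothetical helper, and whether an incident counts as recurring is left to the caller.

```python
from datetime import datetime, timedelta
from typing import Optional


def postmortem_due(severity: str, recurring: bool, resolved_at: datetime) -> Optional[datetime]:
    """Return the post-mortem deadline, or None if no post-mortem is required.

    Required for every P0, and for a P1 only when it is recurring.
    The deadline is 48 hours after resolution.
    """
    required = severity == "P0" or (severity == "P1" and recurring)
    return resolved_at + timedelta(hours=48) if required else None
```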

Template:

Incident Summary
- What happened?
- When did it happen?
- Who was affected?
Timeline: Detailed chronology
Root Cause: 5 Whys analysis
Impact Assessment:
- Users affected: [COUNT]
- Downtime: [DURATION]
- Revenue impact: [$AMOUNT]
Action Items:
- Immediate fixes (done)
- Preventive measures (TODO)
- Process improvements (TODO)
Lessons Learned: What went well, what didn't