Better Practices of Site Reliability Engineering
General Practices
- Hire only coders.
- Have an SLA for your service.
- Measure and report performance against the SLA.
- Use Error Budgets and gate launches on them.
- Have a common staffing pool for SRE and Developers.
- Have excess Ops work overflow to the Dev team.
- Cap SRE operational load at 50 percent.
- Share 5 percent of Ops work with the Dev team.
- Oncall teams should have at least eight people at one location, or six people at each of multiple locations.
- Aim for a maximum of two events per oncall shift.
- Do a postmortem for every event.
- Postmortems are blameless and focus on process and technology, not people.
SLI = [Good events / Valid events] x 100
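As a minimal sketch of how the SLI formula above can feed the error-budget launch gate from the General Practices list: the function names (`sli`, `error_budget_remaining`, `launch_allowed`) and the idea of expressing the target as a percentage are assumptions made for illustration, not part of any particular tool.

```python
def sli(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events / valid events x 100."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events / valid_events * 100


def error_budget_remaining(good_events: int, valid_events: int,
                           target_percent: float) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the target has been missed.
    """
    if not 0 < target_percent < 100:
        raise ValueError("target_percent must be strictly between 0 and 100")
    # Hypothetical convention: the budget is the number of bad events
    # the target tolerates over the same set of valid events.
    allowed_bad = valid_events * (1 - target_percent / 100)
    actual_bad = valid_events - good_events
    return 1 - actual_bad / allowed_bad


def launch_allowed(good_events: int, valid_events: int,
                   target_percent: float) -> bool:
    """Gate launches: allow them only while some error budget remains."""
    return error_budget_remaining(good_events, valid_events, target_percent) > 0
```

For example, with 1,000,000 valid events, 999,500 good events, and a 99.9% target, about half of the budget is spent and `launch_allowed` returns `True`; with only 998,000 good events the budget is overspent and the gate closes.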
Reliability level | Per year | Per quarter | Per 30 days |
---|---|---|---|
90% | 36.5 days | 9 days | 3 days |
95% | 18.25 days | 4.5 days | 1.5 days |
99% | 3.65 days | 21.6 hours | 7.2 hours |
99.5% | 1.83 days | 10.8 hours | 3.6 hours |
99.9% | 8.76 hours | 2.16 hours | 43.2 minutes |
99.95% | 4.38 hours | 1.08 hours | 21.6 minutes |
99.99% | 52.6 minutes | 12.96 minutes | 4.32 minutes |
99.999% | 5.26 minutes | 1.30 minutes | 25.9 seconds |
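The figures in the table follow from multiplying the window length by the permitted unavailability (100% minus the reliability level). A small sketch, assuming a 365-day year and a 90-day quarter (which is what the table's numbers are consistent with); `allowed_downtime` and `WINDOW_DAYS` are illustrative names:

```python
from datetime import timedelta

# Window lengths in days; the 365-day year and 90-day quarter are assumptions
# that match the figures in the table.
WINDOW_DAYS = {"year": 365, "quarter": 90, "30 days": 30}


def allowed_downtime(reliability_percent: float, window_days: float) -> timedelta:
    """Return the downtime a reliability level permits over a window."""
    if not 0 <= reliability_percent <= 100:
        raise ValueError("reliability_percent must be between 0 and 100")
    unavailable_fraction = 1 - reliability_percent / 100
    return timedelta(days=window_days * unavailable_fraction)


if __name__ == "__main__":
    for level in (90, 95, 99, 99.5, 99.9, 99.95, 99.99, 99.999):
        cells = [str(allowed_downtime(level, days)) for days in WINDOW_DAYS.values()]
        print(f"{level}% | " + " | ".join(cells))
```

For instance, 99.9% over 30 days gives 30 × 0.001 = 0.03 days, or 43.2 minutes, which matches the table row.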
SOW (Scope of Work)
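- Organizational positioning
- Monitoring build-out
- Change management
- Incident response
- Stability governance
- Incident postmortems
- Capacity management
- Cost control
- Event assurance (support for campaigns and peak-traffic events)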
Reference