Better Practice of Site Reliability Engineering
General Practices
- Hire only coders.
- Have an SLA for your service.
- Measure and report performance against the SLA.
- Use Error Budgets and gate launches on them.
- Have a common staffing pool for SRE and Developers.
- Have excess Ops work overflow to the Dev team.
- Cap SRE operational load at 50 percent.
- Share 5 percent of Ops work with the Dev team.
- Oncall teams should have at least eight people at one location, or six people at each of multiple locations.
- Aim for a maximum of two events per oncall shift.
- Do a postmortem for every event.
- Postmortems are blameless and focus on process and technology, not people.
SLI
SLI = [Good events / Valid events] x 100
Reliability level | Per year | Per quarter | Per 30 days |
---|---|---|---|
90% | 36.5 days | 9 days | 3 days |
95% | 18.25 days | 4.5 days | 1.5 days |
99% | 3.65 days | 21.6 hours | 7.2 hours |
99.5% | 1.83 days | 10.8 hours | 3.6 hours |
99.9% | 8.76 hours | 2.16 hours | 43.2 minutes |
99.95% | 4.38 hours | 1.08 hours | 21.6 minutes |
99.99% | 52.6 minutes | 12.96 minutes | 4.32 minutes |
99.999% | 5.26 minutes | 1.30 minutes | 25.9 seconds |
SOW - Scope of Work
SOW of SRE
- 组织定位
- 监控建设
- 变更管理
- 异常响应
- 稳定性治理
- 事故复盘
- 容量管理
- 成本控制
- 活动保障
Infrastructure Life Cycle
Lifecycle
- Configuration
- Startup and shutdown
- Queue draining
- Software upgrades
- Backups and restores
- Redundancy
- Replicated databases
- Hot swaps
- Toggles for individual features
- Graceful degradation
- Access controls and rate limits
- Data import controls
- Monitoring
- Auditing
- Debug instrumentation
- Exception collection
Reference
https://www.infracloud.io/blogs/sre-best-practices
- 《大型网站运维:从系统管理到SRE》
Disclaimer
- License under
CC BY-NC 4.0
- Copyright issue feedback
me#imzye.me
, replace # with @ - Not all the commands and scripts are tested in production environment, use at your own risk
- No privacy information is collected here