
Better Practice of Site Reliability Engineering

General Practices

  1. Hire only coders.
  2. Have an SLA for your service.
  3. Measure and report performance against the SLA.
  4. Use Error Budgets and gate launches on them.
  5. Have a common staffing pool for SRE and Developers.
  6. Have excess Ops work overflow to the Dev team.
  7. Cap SRE operational load at 50 percent.
  8. Share 5 percent of Ops work with the Dev team.
  9. Oncall teams should have at least eight people at one location, or six people at each of multiple locations.
  10. Aim for a maximum of two events per oncall shift.
  11. Do a postmortem for every event.
  12. Postmortems are blameless and focus on process and technology, not people.
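
Practice 4 (gate launches on error budgets) can be made concrete with a small calculation. The following is a minimal sketch, not a standard API: the `ErrorBudget` class, its field names, and the rule "launch only while some budget remains" are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one SLO window, counted in events."""

    slo_target: float   # e.g. 0.999 means 99.9% of valid events must be good
    valid_events: int   # events that count toward the SLI in this window
    bad_events: int     # valid events that failed

    @property
    def allowed_bad_events(self) -> float:
        # The budget is the share of events the SLO permits to fail.
        return (1.0 - self.slo_target) * self.valid_events

    @property
    def remaining_events(self) -> float:
        # Positive: budget left. Zero or negative: budget spent or overspent.
        return self.allowed_bad_events - self.bad_events

    def launch_allowed(self) -> bool:
        # Assumed policy: launches proceed only while budget remains.
        return self.remaining_events > 0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, valid_events=1_000_000, bad_events=400)
    print(f"remaining budget: {budget.remaining_events:.0f} events")
    print(f"launch allowed: {budget.launch_allowed()}")
```

A real policy might freeze only risky launches, or use a burn-rate threshold instead of a hard zero; the point of the sketch is that the gate is computed from the same measurements that are reported against the SLA.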

SLI

SLI = [Good events / Valid events] x 100
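
As a small illustration of the formula above, the sketch below computes an SLI as a percentage. The function name `sli_percent` and the input checks are assumptions added for this example.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events / valid events x 100."""
    if valid_events <= 0:
        # With no valid events the ratio is undefined, so refuse to guess.
        raise ValueError("SLI is undefined without valid events")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events / valid_events * 100


if __name__ == "__main__":
    # 9,990 good requests out of 10,000 valid ones is an SLI of about 99.9%.
    print(f"{sli_percent(9_990, 10_000):.2f}%")
```

Which events count as "valid" is a design decision: for example, requests rejected for bad client input are often excluded so that they neither help nor hurt the SLI.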
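
| Reliability level | Allowed downtime per year | Allowed downtime per quarter | Allowed downtime per 30 days |
| --- | --- | --- | --- |
| 90% | 36.5 days | 9 days | 3 days |
| 95% | 18.25 days | 4.5 days | 1.5 days |
| 99% | 3.65 days | 21.6 hours | 7.2 hours |
| 99.5% | 1.83 days | 10.8 hours | 3.6 hours |
| 99.9% | 8.76 hours | 2.16 hours | 43.2 minutes |
| 99.95% | 4.38 hours | 1.08 hours | 21.6 minutes |
| 99.99% | 52.6 minutes | 12.96 minutes | 4.32 minutes |
| 99.999% | 5.26 minutes | 1.30 minutes | 25.9 seconds |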
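
The downtime figures above follow from multiplying the window length by the permitted failure fraction. Here is a minimal sketch of that arithmetic; the function name `allowed_downtime` is hypothetical, and the window lengths (365 days per year, 90 days per quarter) are the assumptions that reproduce the table.

```python
from datetime import timedelta


def allowed_downtime(reliability_percent: float, window: timedelta) -> timedelta:
    """Return how long a service may be down in `window` at the given reliability."""
    if not 0 <= reliability_percent <= 100:
        raise ValueError("reliability_percent must be between 0 and 100")
    # The failure fraction is whatever the reliability target leaves over.
    return window * (1 - reliability_percent / 100)


if __name__ == "__main__":
    windows = {
        "year": timedelta(days=365),
        "quarter": timedelta(days=90),
        "30 days": timedelta(days=30),
    }
    for name, window in windows.items():
        minutes = allowed_downtime(99.9, window).total_seconds() / 60
        print(f"99.9% over one {name}: about {minutes:.1f} minutes")
```

For example, `allowed_downtime(99.9, timedelta(days=30))` comes out at roughly 43.2 minutes, in line with the corresponding table entry.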

[Image: web terminology]

SOW - Scope of Work

SOW of SRE

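  1. Organizational positioning
  2. Monitoring build-out
  3. Change management
  4. Incident response
  5. Stability governance
  6. Incident postmortems
  7. Capacity management
  8. Cost control
  9. Support for major events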

Infrastructure Life Cycle

Lifecycle

  • Configuration
  • Startup and shutdown
  • Queue draining
  • Software upgrades
  • Backups and restores
  • Redundancy
  • Replicated databases
  • Hot swaps
  • Toggles for individual features
  • Graceful degradation
  • Access controls and rate limits
  • Data import controls
  • Monitoring
  • Auditing
  • Debug instrumentation
  • Exception collection
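
To make one item from the list above concrete, here is a minimal sketch of queue draining during shutdown: the worker stops accepting new items, finishes everything already queued, and only then exits. The `DrainingWorker` class and its method names are hypothetical, and a production version would also need error handling around the handler call.

```python
import queue
import threading

_STOP = object()  # sentinel that marks the end of the queue


class DrainingWorker:
    """Processes items on a background thread and drains them on shutdown."""

    def __init__(self, handler):
        self._queue = queue.Queue()
        self._handler = handler
        self._accepting = True
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, item) -> bool:
        """Enqueue an item; return False once shutdown has begun."""
        with self._lock:
            if not self._accepting:
                return False
            self._queue.put(item)
            return True

    def shutdown(self, timeout=None) -> bool:
        """Stop accepting work, wait for the queue to drain, report success."""
        with self._lock:
            if self._accepting:
                self._accepting = False
                # Queued under the lock, so the sentinel is always the last item.
                self._queue.put(_STOP)
        self._thread.join(timeout)
        return not self._thread.is_alive()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is _STOP:
                return
            self._handler(item)


if __name__ == "__main__":
    processed = []
    worker = DrainingWorker(processed.append)
    for i in range(5):
        worker.submit(i)
    worker.shutdown(timeout=5)
    print(f"processed {len(processed)} items before exit")
```

Taking the lock in both `submit` and `shutdown` is the design choice that matters here: it guarantees no item can slip in behind the sentinel and be silently dropped.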

Reference

  • https://www.infracloud.io/blogs/sre-best-practices
  • 《大型网站运维:从系统管理到SRE》 (literally, "Large-Scale Website Operations: From System Administration to SRE")

Disclaimer
  1. Licensed under CC BY-NC 4.0.
  2. For copyright issues, contact me#imzye.me (replace # with @).
  3. Not all commands and scripts have been tested in a production environment; use them at your own risk.
  4. No private information is collected here.