Best Practices of Site Reliability Engineering
- Hire only coders.
- Have an SLA for your service.
- Measure and report performance against the SLA.
- Use Error Budgets and gate launches on them.
- Have a common staffing pool for SRE and Developers.
- Have excess Ops work overflow to the Dev team.
- Cap SRE operational load at 50 percent.
- Share 5 percent of Ops work with the Dev team.
- Oncall teams should have at least eight people at one location, or six people at each of multiple locations.
- Aim for a maximum of two events per oncall shift.
- Do a postmortem for every event.
- Postmortems are blameless and focus on process and technology, not people.
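The error-budget rule in the list above can be illustrated with a small sketch. It assumes the budget is counted in events (the share of valid events allowed to fail under the target) and that a launch is blocked once the budget is spent. The function names `error_budget_remaining` and `launch_allowed` and the zero threshold are hypothetical choices for illustration, not a prescribed policy.

```python
def error_budget_remaining(slo_target: float, good_events: int, valid_events: int) -> float:
    """Return the unspent fraction of the error budget (negative when overspent).

    slo_target is a fraction, e.g. 0.999 for a 99.9% objective.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")

    allowed_bad = (1 - slo_target) * valid_events  # failures the target tolerates
    actual_bad = valid_events - good_events        # failures observed
    return 1 - actual_bad / allowed_bad


def launch_allowed(slo_target: float, good_events: int, valid_events: int) -> bool:
    """Gate a launch: allow it only while some error budget is left (assumed policy)."""
    return error_budget_remaining(slo_target, good_events, valid_events) > 0


if __name__ == "__main__":
    # 99.9% target, 1,000,000 valid events, 500 failures: about half the budget is left.
    print(launch_allowed(0.999, 999_500, 1_000_000))
    # 2,000 failures against a budget of about 1,000: the budget is overspent.
    print(launch_allowed(0.999, 998_000, 1_000_000))
```

In practice the threshold and the measurement window would be agreed between the SRE and Dev teams; the sketch only shows the arithmetic behind the gate.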
SLI = [Good events / Valid events] x 100
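As a minimal sketch, the SLI formula above translates directly into code. The function name `sli_percent` and the input validation are assumptions added for illustration.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Compute the SLI as a percentage: (good events / valid events) x 100."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events / valid_events * 100


if __name__ == "__main__":
    # 9,990 good responses out of 10,000 valid requests.
    print(round(sli_percent(9_990, 10_000), 3))  # prints 99.9
```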
| Reliability level | Allowed downtime per year | Per quarter | Per 30 days |
|---|---|---|---|
| 90% | 36.5 days | 9 days | 3 days |
| 95% | 18.25 days | 4.5 days | 1.5 days |
| 99% | 3.65 days | 21.6 hours | 7.2 hours |
| 99.5% | 1.83 days | 10.8 hours | 3.6 hours |
| 99.9% | 8.76 hours | 2.16 hours | 43.2 minutes |
| 99.95% | 4.38 hours | 1.08 hours | 21.6 minutes |
| 99.99% | 52.6 minutes | 12.96 minutes | 4.32 minutes |
| 99.999% | 5.26 minutes | 1.30 minutes | 25.9 seconds |
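The figures in the table follow from one multiplication: allowed downtime = window length x (1 - reliability). The sketch below reproduces that arithmetic, assuming a 365-day year and a 90-day quarter, which is what the table values imply. The function name `allowed_downtime` is hypothetical.

```python
from datetime import timedelta

# Window lengths implied by the table: 365-day year, 90-day quarter.
YEAR = timedelta(days=365)
QUARTER = timedelta(days=90)
THIRTY_DAYS = timedelta(days=30)


def allowed_downtime(reliability_percent: float, window: timedelta) -> timedelta:
    """Return how long a service may be down in `window` at the given reliability."""
    if not 0 <= reliability_percent <= 100:
        raise ValueError("reliability_percent must be between 0 and 100")
    return window * (1 - reliability_percent / 100)


if __name__ == "__main__":
    for level in (99.0, 99.9, 99.99):
        per_month = allowed_downtime(level, THIRTY_DAYS)
        print(f"{level}%: {per_month.total_seconds() / 60:.2f} minutes per 30 days")
```

For example, at 99.9% the 30-day window of 43,200 minutes leaves 43.2 minutes of downtime, matching the table row.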
Scope of Work (SOW) of SRE
Infrastructure Life Cycle
- Startup and shutdown
- Queue draining
- Software upgrades
- Backups and restores
- Replicated databases
- Hot swaps
- Toggles for individual features
- Graceful degradation
- Access controls and rate limits
- Data import controls
- Debug instrumentation
- Exception collection
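To make one item from the list above concrete, here is a rough sketch of a rate limit using a token bucket. This is one common technique, not something the list prescribes. The class name `TokenBucket`, its parameters, and the injectable `clock` are all assumptions for illustration.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: refills `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so tests can control time
        self._tokens = capacity      # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request fits the limit."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5.0, capacity=10.0)
    accepted = sum(bucket.allow() for _ in range(20))
    print(f"accepted {accepted} of 20 burst requests")
```

A request that is refused can be rejected outright or served a cheaper fallback, which ties the rate limit to the graceful-degradation item in the same list.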
- Licensed under CC BY-NC 4.0