
Handover Checklist from Dev team to SRE team


Preface

First, communicate with the R&D leader to establish a shared understanding of how operations and maintenance will work. To be clear, we will not provide nanny-style operations and maintenance. Instead, we will focus on improving the product from an operations perspective, committing to keeping the online service secure, stable, low-cost, and able to iterate quickly. R&D handles development machines and test environments; we can help when needed, but we will not directly make changes in the development environment.

Understanding of Business Overview

  • Identify the people involved in the business, such as the corresponding R&D colleagues, R&D leaders, testing colleagues, testing leaders, and product managers, and store their contact information. Create a group so that if any issues arise, we can contact the appropriate person.
  • Understand what the service does, what problems it solves, and whether there are comparable open-source products in the industry that can help us understand the product quickly.
  • Understand the upstream and downstream of the service, the services it depends on, the services that depend on it, and who the corresponding interface person is. Basic understanding is sufficient at this stage.
  • Understand the deployment of the service, which data centers it is deployed in, what programming language it is written in, whether the basic network, dedicated bandwidth, and data center exports are reliable, whether there have been any infrastructure problems in the past, and what the main pain points are.

Business Overview

Ask the R&D colleagues (or the previous operations and maintenance colleague) to prepare a slide deck and give a business overview. It should cover both what R&D wants operations to know and what operations needs from R&D: the detailed deployment topology, service architecture, data flow, testing and change process, monitoring methods, which machines the service is deployed on, how to log in to each machine, which modules run on each machine, whether OS parameters have been tuned and what considerations went into that, which third-party software is used and why (for example, why Tomcat rather than Resin). Other topics include relevant wikis, fault-handling contingency plans, common faults, and current online problems.

If the business has a single point of failure, do not accept it; ask R&D to remove it. If the head of the operations and maintenance department forces us to accept it, make it clear that we will not take responsibility for any problems caused by that single point of failure.

Asset Management

The first step of formal preparation is to comb through the assets: which domain names are used and which business each corresponds to, which virtual IPs are used and which services they provide, which machines are used and which modules are deployed on each, which data centers are involved, how much bandwidth is used, what the total bandwidth situation is, and whether other businesses contend for the same bandwidth.

Detailed information about machines is also needed, such as machine configuration, rack position, IP, management card IP, etc. The company should have a CMDB for querying. If not, operations and maintenance colleagues will need to build one.

Machine redundancy, spare parts, and machine models must also be considered.
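If no CMDB exists yet, even a minimal, queryable inventory is better than a spreadsheet nobody updates. Below is a minimal Python sketch, assuming hypothetical hostnames, fields, and module names; a real CMDB would live in a database behind an API:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class MachineRecord:
    """One CMDB entry per machine (all fields are illustrative)."""
    hostname: str
    ip: str
    mgmt_card_ip: str    # out-of-band management (e.g. IPMI) address
    datacenter: str
    rack_position: str
    model: str
    cpu_cores: int
    memory_gb: int
    modules: list        # which service modules are deployed on this machine

# Example inventory; in practice this would be loaded from a database.
inventory = [
    MachineRecord("web-01", "10.0.1.11", "10.254.1.11", "dc-a", "A3-U12",
                  "hypothetical-model-x", 32, 128, ["nginx", "api-gateway"]),
    MachineRecord("mq-01", "10.0.2.21", "10.254.2.21", "dc-b", "B1-U04",
                  "hypothetical-model-y", 16, 64, ["mq-broker"]),
]

def machines_running(module: str):
    """Answer the handover question: which machines does module X run on?"""
    return [m.hostname for m in inventory if module in m.modules]

if __name__ == "__main__":
    print(machines_running("mq-broker"))
    print(json.dumps([asdict(m) for m in inventory], indent=2))
```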

Basic Monitoring

Once we know what assets we have, we can start monitoring them. For example, monitoring domain name connectivity/delay, virtual IP connectivity/delay, machine downtime, hardware monitoring, sshd/crond and other system process monitoring, the total number of processes running, and system parameter configuration monitoring.
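As a concrete starting point, here is a minimal sketch in Python of such basic checks, assuming hypothetical hosts, ports, and daemon names; in practice these probes would feed the company's monitoring system rather than print to stdout:

```python
import socket
import subprocess
import time

def tcp_latency_ms(host: str, port: int, timeout: float = 2.0):
    """Measure TCP connect latency to host:port; return None if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000
    except OSError:
        return None

def process_running(name: str) -> bool:
    """Check that a system process such as sshd or crond is alive (uses pgrep)."""
    return subprocess.run(["pgrep", "-x", name],
                          capture_output=True).returncode == 0

if __name__ == "__main__":
    # Hypothetical targets: a domain name and a virtual IP, plus two system daemons.
    for host, port in [("example.com", 443), ("10.0.0.100", 80)]:
        latency = tcp_latency_ms(host, port)
        status = f"{latency:.1f} ms" if latency is not None else "UNREACHABLE"
        print(f"{host}:{port} -> {status}")
    for proc in ["sshd", "crond"]:
        print(f"{proc}: {'ok' if process_running(proc) else 'MISSING'}")
```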

Service Combing

Understand the architecture diagram, data flow diagram, and deployment topology diagram given during the overview presentation. From an operations and maintenance perspective, it’s best to also know the company’s network topology.

Understand each module’s situation: which machine it is deployed on, which directory it lives in, which account starts it, where the logs are stored, which programming language it is written in, and how it is deployed; whether it is mainly bound by CPU, memory, disk, or IO, how many resources it needs, and what the current utilization is; what thresholds its monitoring should use, whether a watchdog should restart the service automatically, which log keywords should trigger alerts, and any other issues that need attention.
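One way to keep these answers from going stale is to record them per module in machine-readable form so that monitoring and watchdog tooling can consume them. A minimal sketch, with all module names, paths, and thresholds being hypothetical placeholders:

```python
# Per-module "runbook" records captured during service combing.
# All names, paths, and thresholds below are hypothetical placeholders.
MODULES = {
    "api-gateway": {
        "hosts": ["web-01", "web-02"],
        "deploy_dir": "/opt/api-gateway",
        "run_as": "appuser",
        "log_dir": "/var/log/api-gateway",
        "language": "java",
        "bound_resource": "cpu",          # cpu / memory / disk / io
        "cpu_alert_threshold": 0.80,      # alert above 80% utilization
        "watchdog_restart": True,         # restart automatically if the process dies
        "log_alert_keywords": ["OutOfMemoryError", "Connection refused"],
    },
    "mq-broker": {
        "hosts": ["mq-01"],
        "deploy_dir": "/opt/mq",
        "run_as": "mquser",
        "log_dir": "/var/log/mq",
        "language": "java",
        "bound_resource": "disk",
        "disk_alert_threshold": 0.85,
        "watchdog_restart": False,        # stateful; prefer a human decision
        "log_alert_keywords": ["ERROR"],
    },
}

def modules_needing_watchdog():
    return [name for name, cfg in MODULES.items() if cfg["watchdog_restart"]]

if __name__ == "__main__":
    print(modules_needing_watchdog())
```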

Business Monitoring

Start with basic process/port liveness monitoring, machine utilization monitoring, log keyword monitoring, detection of logs that have stopped rolling, and monitoring of related services. Later, add API-level monitoring to drive business optimization.
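A minimal sketch of the log side of this, assuming a hypothetical log path and alert keywords: it flags configured keywords in recent log output and detects a log that has stopped rolling (no recent writes):

```python
import os
import time

LOG_PATH = "/var/log/api-gateway/app.log"        # hypothetical path
ALERT_KEYWORDS = ("ERROR", "OutOfMemoryError")   # hypothetical keywords
MAX_SILENCE_SECONDS = 600                        # no writes for 10 min => "log not rolling"

def check_log(path: str):
    alerts = []
    if not os.path.exists(path):
        return [f"{path} does not exist"]
    # "Log not rolling": the file has not been modified recently at all.
    if time.time() - os.path.getmtime(path) > MAX_SILENCE_SECONDS:
        alerts.append(f"{path} has not been written for over {MAX_SILENCE_SECONDS}s")
    # Keyword scan over the tail of the file (last ~64 KB, to stay cheap).
    with open(path, "rb") as f:
        f.seek(max(0, os.path.getsize(path) - 64 * 1024))
        tail = f.read().decode("utf-8", errors="replace")
    for keyword in ALERT_KEYWORDS:
        if keyword in tail:
            alerts.append(f"keyword '{keyword}' found in {path}")
    return alerts

if __name__ == "__main__":
    for alert in check_log(LOG_PATH):
        print("ALERT:", alert)
```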

Standardization

Machine naming conventions, operating system distributions, OS versions, and third-party software such as JDK, Tomcat, and Nginx should all be standardized and included in a standardization plan.

Service scaling, changes, and decommissioning should be automated so that each upgrade only requires a version number. R&D and operations and maintenance can both handle these upgrades, but permissions must be properly controlled.
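As a sketch of what "each upgrade only requires a version number" could look like, assuming a hypothetical artifact URL, host list, and systemd service name; a real pipeline would add permission checks, canary steps, and rollback:

```python
import subprocess

HOSTS = ["web-01", "web-02"]   # hypothetical target machines
ARTIFACT_URL = "https://artifacts.example.com/api-gateway/{version}.tar.gz"  # hypothetical

def deploy(version: str) -> None:
    """Roll a new version out host by host: fetch, unpack, restart, verify."""
    url = ARTIFACT_URL.format(version=version)
    for host in HOSTS:
        # Each step shells out over ssh; check=True aborts the rollout on failure.
        subprocess.run(["ssh", host, f"curl -fsSL {url} -o /tmp/pkg.tar.gz"], check=True)
        subprocess.run(["ssh", host, "tar -xzf /tmp/pkg.tar.gz -C /opt/api-gateway"], check=True)
        subprocess.run(["ssh", host, "systemctl restart api-gateway"], check=True)
        subprocess.run(["ssh", host, "curl -fsS http://localhost:8080/health"], check=True)
        print(f"{host}: version {version} deployed")

if __name__ == "__main__":
    deploy("1.4.2")   # the only input an operator (or R&D) needs to provide
```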

Repetitive routine operations should be standardized into scripts for one-click completion.

Identify fault scenarios that can self-heal, abstract the fixes into scripts, and trigger them automatically when the corresponding alert fires, so that these cases are handled without human intervention.
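A minimal sketch of alert-triggered self-healing, assuming hypothetical alert names and remediation commands; the key idea is a whitelist that maps known alert types to vetted remediation scripts and leaves everything else to a human:

```python
import subprocess

# Whitelist: only alerts with a known, safe remediation are handled automatically.
# Alert names and commands below are hypothetical placeholders.
REMEDIATIONS = {
    "api-gateway-process-down": ["systemctl", "restart", "api-gateway"],
    "disk-tmp-full":            ["find", "/tmp", "-mtime", "+7", "-delete"],
}

def handle_alert(alert_name: str, host: str) -> bool:
    """Return True if the alert was self-healed, False if a human is needed."""
    command = REMEDIATIONS.get(alert_name)
    if command is None:
        print(f"{alert_name} on {host}: no automatic remediation, paging on-call")
        return False
    result = subprocess.run(["ssh", host] + command, capture_output=True, text=True)
    if result.returncode == 0:
        print(f"{alert_name} on {host}: remediated automatically")
        return True
    print(f"{alert_name} on {host}: remediation failed, paging on-call")
    return False

if __name__ == "__main__":
    handle_alert("api-gateway-process-down", "web-01")
    handle_alert("unknown-alert", "web-02")
```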

If the company already has infrastructure such as a naming service, an MQ, or a log platform, push R&D to modify the service to use them. If the company does not have such infrastructure, operations and maintenance can start building it.

SOP

Contingency plans are very important. Before any online problems arise, we should anticipate possible issues and record how they should be handled in advance. When there is an issue, it’s best to follow the recorded steps. After all, people tend to be nervous during online problems, and following written procedures can help prevent mistakes.

Fault Simulation

A contingency plan that has never been tested is unreliable, so run fault simulations: deliberately break modules and machines to improve online stability.

This is especially worthwhile when R&D claims that a module going down will not affect availability: try it first, and it will likely spare everyone embarrassment later.

Some simulated scenarios may cause real damage; whether to run them must be decided case by case. In most cases simulation is still the better choice, because it is far better for a problem to surface while people are watching than in the middle of the night while everyone is asleep. However, avoid large-scale simulations of basic network faults, since most businesses do not have data-center-level disaster recovery.
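A minimal drill sketch, assuming a hypothetical health-check URL, target host, and service name: stop one instance of a module, verify the user-facing entry point still answers, then restore the instance. Run it only in an agreed time window, with R&D watching:

```python
import subprocess
import time
import urllib.request

HEALTH_URL = "https://service.example.com/health"   # hypothetical user-facing check
TARGET_HOST = "web-02"                               # hypothetical instance to break
TARGET_SERVICE = "api-gateway"                       # hypothetical systemd service

def entry_point_ok() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=3) as resp:
            return resp.status == 200
    except OSError:
        return False

def run_drill() -> None:
    assert entry_point_ok(), "service unhealthy before the drill; aborting"
    # Break one instance on purpose.
    subprocess.run(["ssh", TARGET_HOST, f"systemctl stop {TARGET_SERVICE}"], check=True)
    try:
        time.sleep(10)  # give load balancers / failover a moment to react
        print("entry point still OK:", entry_point_ok())
    finally:
        # Always restore the instance, even if the check above raised.
        subprocess.run(["ssh", TARGET_HOST, f"systemctl start {TARGET_SERVICE}"], check=True)

if __name__ == "__main__":
    run_drill()
```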

Once the above work is done, the basic handover is complete. Many of these tasks are one-time efforts, so what will operations and maintenance do going forward?

Besides spending some time resolving online problems, we should focus on improving the product through refined operations and maintenance. Remember the operations motto: safe, stable, efficient, and low-cost. That is the direction of our work. Here are a few examples:

Business Monitoring Enhancement

The previous section discussed business monitoring, which mainly focused on general monitoring indicators. After understanding the product well enough, we should do some business-specific monitoring. Pushing R&D to do this is also an option, as long as it achieves the desired effect.

For example, if you manage an MQ, you need to monitor the message backlog; if you manage an RPC service that provides three interfaces, you need to monitor the response time and success rate of these three interfaces; if you manage an S3 service, you need to monitor the short-term bandwidth increment of each bucket. Do you get the idea?
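As an illustration, here is a minimal sketch of a message-backlog check, assuming a hypothetical broker stats endpoint and threshold; the same pattern (pull a business metric, compare against a threshold, alert) applies to the RPC and S3 examples:

```python
import json
import urllib.request

# Hypothetical broker stats endpoint returning
# {"queues": [{"name": "...", "backlog": 12345}, ...]}
STATS_URL = "http://mq-01:8161/api/queue-stats"
BACKLOG_THRESHOLD = 100_000   # hypothetical: alert above 100k unconsumed messages

def check_backlog() -> list:
    with urllib.request.urlopen(STATS_URL, timeout=5) as resp:
        stats = json.load(resp)
    return [q for q in stats["queues"] if q["backlog"] > BACKLOG_THRESHOLD]

if __name__ == "__main__":
    for queue in check_backlog():
        print(f"ALERT: queue {queue['name']} backlog {queue['backlog']} exceeds threshold")
```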

API Success Rate and Delay Statistics

Use Nginx at the traffic ingress to compute success-rate and latency statistics for every API across all business lines. Find the TopN APIs with the lowest success rates and the TopN with the longest latencies, and push the business to optimize them. The boss will love this.
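A minimal sketch of how such statistics could be computed offline from Nginx access logs, assuming a hypothetical log format whose trailing fields are the status code and $request_time; at real traffic volumes this would be streamed into a log platform instead:

```python
import re
from collections import defaultdict

# Assumed (hypothetical) log line shape:
#   ... "GET /api/v1/orders HTTP/1.1" 200 0.123
LINE_RE = re.compile(
    r'"(?:GET|POST|PUT|DELETE) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) (?P<rt>[\d.]+)'
)

def top_n(log_path: str, n: int = 10):
    stats = defaultdict(lambda: {"total": 0, "errors": 0, "rt_sum": 0.0})
    with open(log_path) as f:
        for line in f:
            m = LINE_RE.search(line)
            if not m:
                continue
            # Strip query strings so /orders?id=1 and /orders?id=2 aggregate together.
            api = m.group("path").split("?")[0]
            s = stats[api]
            s["total"] += 1
            s["rt_sum"] += float(m.group("rt"))
            if m.group("status").startswith("5"):
                s["errors"] += 1
    rows = [(api, 1 - s["errors"] / s["total"], s["rt_sum"] / s["total"])
            for api, s in stats.items()]
    worst_success = sorted(rows, key=lambda r: r[1])[:n]
    slowest = sorted(rows, key=lambda r: r[2], reverse=True)[:n]
    return worst_success, slowest

if __name__ == "__main__":
    worst, slow = top_n("/var/log/nginx/access.log")   # hypothetical path
    for api, rate, avg_rt in worst:
        print(f"{api}: success {rate:.2%}, avg {avg_rt * 1000:.0f} ms")
```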

Online Problem Sorting

Collect all known online problems and resolve them one by one. Operations and maintenance handles whatever falls within its scope, always with the aim of improving the product.

Cost Optimization

Through service co-location (mixed deployment) or a unified resource-scheduling platform, machine resources can be saved. A single machine can cost tens of thousands of dollars, so it is relatively easy to show results here.

Capacity Planning

Capacity planning and cost optimization are closely related. The focus of capacity planning is to plan and provision capacity in advance based on natural growth and operational needs; capacity may include bandwidth, dedicated lines, network equipment, machines, and so on. When a business shrinks, its resources can be moved to support other business lines, so the hardware runs at full utilization and remains worth the investment.
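As a toy example of the arithmetic, with all numbers hypothetical: project how long the current dedicated-line capacity will last given natural growth and a safety headroom:

```python
# Hypothetical inputs: project when current capacity runs out.
current_peak_gbps = 6.0    # today's peak bandwidth usage
capacity_gbps = 10.0       # purchased dedicated-line capacity
monthly_growth = 0.08      # 8% natural growth per month
headroom = 0.30            # keep 30% headroom for bursts and failover

usable = capacity_gbps * (1 - headroom)
months = 0
peak = current_peak_gbps
while peak <= usable and months < 36:
    peak *= 1 + monthly_growth
    months += 1

print(f"usable capacity: {usable:.1f} Gbps")
print(f"projected to exceed it in about {months} month(s); order capacity before then")
```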

Fine-grained operations and maintenance will always surface more things to do. Beyond that, another long-term investment is building the basic operations platforms: a monitoring system, a deployment system, an artifact repository, a resource-utilization platform, domain-name management, an L4/L7 access-configuration platform, a log platform, a tracing system, and so on. Operations and maintenance is actually quite busy.

About Communication

Finally, taking over a new business inevitably involves a lot of communication with R&D. Write meeting minutes for every discussion and send them by email, clearly stating who is following up on each item and by what date; send the email to the team mailing list and cc everyone involved. When a key milestone is due, check it. If it is not done, communicate offline, reach an agreement, and then reply to the email with the conclusion, explaining the reason for the delay and the new date. If communication is not going smoothly, ask the boss to coordinate.

