Being on-call: why, what, and how.
I'm assuming you're in a software development team and you write software that gets deployed into production. I'm also assuming you have paying customers; otherwise, your on-call program wouldn't exist in the first place. If all this is true, you should be on-call, at least for the stuff you write.
Being on-call gives you a sense of responsibility and accountability for the code you deploy into production. You build it, you run it, you own it. You know you need to be super careful with your code; otherwise, you'll get paged at night. You don't want that, and neither do I.
Incidents are the best learning platform: you have to debug your system looking for clues. You get to truly know your system by running it in production. Setting up meaningful logs and tracing gives you a way to track what your system is doing. Setting up metrics and alerts gives you a way to track how your system is performing. If you have an incident in production, you will re-evaluate these things. You'll come up with more fine-grained logs and metrics that tell you even more about your system.
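To make that concrete, here is a minimal sketch of the kind of instrumentation I mean, using Python's standard logging module and the prometheus_client library. The metric names and the payment example are made up for illustration; the point is that every request leaves both a log line and a metric behind, so there is something to alert on and something to grep.

    import logging
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    # Hypothetical metric names, purely for illustration.
    REQUESTS = Counter("payment_requests_total", "Payment requests", ["status"])
    LATENCY = Histogram("payment_request_seconds", "Payment request latency")

    log = logging.getLogger("payments")
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )

    def handle_payment(order_id: str) -> None:
        start = time.perf_counter()
        try:
            # ... call the real payment backend here ...
            REQUESTS.labels(status="ok").inc()
            log.info("payment processed order_id=%s", order_id)
        except Exception:
            REQUESTS.labels(status="error").inc()
            # The stack trace in this log line is what you will grep for at night.
            log.exception("payment failed order_id=%s", order_id)
            raise
        finally:
            LATENCY.observe(time.perf_counter() - start)

    if __name__ == "__main__":
        start_http_server(8000)  # exposes /metrics for your alerting system to scrape
        handle_payment("demo-123")

An alert on the error counter plus a log line you can search by order id is usually enough to get a first investigation going.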
You shouldn't be on-call just for the money, but if you're in a company that pays you a stipend for being on-call, it's a win-win. At the end of the day, they pay you for writing great systems and for being responsible for them outside working hours.
I'll tell you what your organization expects from you: "Can you fix it?"
Even if you're a seasoned engineer, you shouldn't just land in an on-call rotation without first experiencing it alongside others. Each system is unique and has its own traits. It takes time to pinpoint the possible root causes in an unfamiliar system, so at first you should observe.
Someone has to coach you for one or more rotations, so try to shadow the on-call engineers. You'll get familiar with a given system if you receive the same alerts they do, watch them hunt for the root cause, and are the first to read the post-mortem.
After you get a sense of how things behave, you should be ready to start your first on-call rotation.
I call this observe, collect, act later.
You need to act fast, be accurate, and communicate a lot. During an incident, there is no such thing as too much text. Do yourself a favor and communicate a lot during an incident, even about the small things.
You shouldnât deal with incidents alone during working hours. Everyone should be able to contribute and you must keep them posted.
You shouldn't hide what you're doing to solve a production incident. Everyone interested should be able to jump on a call with you. Having extra eyes looking into the problem will help you solve the issue. Additionally, you'll get at least a free pair of eyes to review the post-mortem. Everyone involved in the incident should contribute to it.
Make sure you behave the same way during out-of-hours incidents. It may feel odd writing updates for yourself at 3 AM, but once your team is up, they will thank you for keeping them posted.
If you belong to an on-call rotation, you must be able to solve incidents. But if you only have a few incidents per rotation, how do you stay up to date with the system? Will you be able to solve an incident two months from now without practicing? Google calls this "operational underload":
"Being on-call for a quiet system is blissful, but what happens if the system is too quiet or when SREs are not on-call often enough? An operational underload is undesirable for an SRE team. Being out of touch with production for long periods can lead to confidence issues, both in terms of overconfidence and underconfidence, while knowledge gaps are discovered only when an incident occurs."
You build confidence in your system and in your on-call rotation if you practice a lot beforehand. A great way to build that confidence is to take part in fire drills. It shouldn't be difficult to schedule one every week to match each rotation; this way, everyone on-call gets to solve at least one incident per rotation.
Having weekly fire drills will also help you keep your runbooks updated. If your organization lacks runbooks, fire drills will get you started. Make sure you don't skip creating meaningful logs and metrics after an incident or a fire drill. All of this combined will boost your confidence to solve incidents after hours on your own.
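As a sketch of what a weekly fire drill could look like, here is a small script that deliberately stops a service so its alert fires and whoever is on-call practices the runbook end to end. The service name, the use of systemctl, and the staging environment are all assumptions; adapt it to your own platform, and never point it at something customers depend on.

    import subprocess
    import time

    # Hypothetical service name: use a staging service that carries the same
    # alerts as production, never a customer-facing one.
    SERVICE = "payments-staging"

    def break_it() -> None:
        # Stop the service so its health-check alert fires and pages the on-call engineer.
        subprocess.run(["systemctl", "stop", SERVICE], check=True)
        print(f"{SERVICE} stopped at {time.ctime()}; the drill starts now.")

    def restore_if_forgotten(timeout_minutes: int = 30) -> None:
        # Safety net: if nobody completes the runbook within the timeout, restore the service.
        time.sleep(timeout_minutes * 60)
        subprocess.run(["systemctl", "start", SERVICE], check=True)
        print(f"{SERVICE} restarted; note in the runbook whatever was missing.")

    if __name__ == "__main__":
        break_it()
        restore_if_forgotten()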
Sometimes your metrics and logs won't tell you everything about your system during an incident, and you'll have to debug a given service in production. Make sure you practice how to do that during fire drills; otherwise, you'll struggle. Know your stack from top to bottom and learn how to debug things in production; you won't regret it.
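One concrete skill worth rehearsing is getting a thread dump out of a live process without restarting it. Below is a minimal Python sketch using the standard faulthandler module: sending the process a signal prints every thread's stack to stderr. The choice of SIGUSR1 and the toy worker loop are assumptions; most stacks have an equivalent (jstack for the JVM, SIGQUIT for Go, and so on).

    import faulthandler
    import os
    import signal
    import sys
    import threading
    import time

    # Register a handler so `kill -USR1 <pid>` dumps every thread's stack to stderr,
    # without attaching a debugger or restarting the service (Unix only).
    faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

    def worker() -> None:
        while True:  # stand-in for the loop you suspect is stuck
            time.sleep(1)

    if __name__ == "__main__":
        threading.Thread(target=worker, daemon=True).start()
        print(f"Send `kill -USR1 {os.getpid()}` and read stderr for the thread dump.")
        time.sleep(3600)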
Good communicators benefit the whole organization, during and after an incident. Make sure you communicate frequently and clearly during the incident. Ask others to review your post-mortems. Learn from the best contributors you know: copy their writing style and keep improving. Everyone in the organization will thank you for being an excellent writer.
Keep practicing, keep writing.