At Dropbox, we view incident management as a central element of our reliability efforts. Though we also employ proactive techniques such as Chaos engineering, how we respond to incidents has a significant bearing on our users’ experience. Every minute counts for our users during a potential site outage or product issue.

The key components of our incident management process have been in place for several years, but we’ve also found constant opportunities to evolve in this area. The tweaks we’ve made over time include technological, organizational, and procedural improvements.

This post goes deeper into some of the lessons Dropbox has learned in incident management. You probably won’t find all of these in a textbook description of an incident command structure, and you shouldn’t view these improvements as a one-size-fits-all approach for every company. (Their usefulness will depend on your tech stack, org size, and other factors.) Instead, we hope this serves as a case study for how you can take a systematic view of your organization’s own incident response and evolve it to meet your users’ needs.

The basic framework for managing incidents at Dropbox, which we call SEVs (as in SEVerity), is similar to the ones employed by many other SaaS companies. (For those who are less familiar with the topic, we recommend this series of tutorials by Dropbox alum Tammy Butow to get an overview.)

Availability SEVs, though by no means the only type of critical incident, are a useful slice to explore in more depth. No online service is immune to these incidents, and that includes Dropbox. Critical availability incidents are often the ones most disruptive to the largest number of users (just think about the last time your favorite website or SaaS app was down), so we find that these SEVs put the highest strain on the timeliness of our incident response. Success means shaving every possible minute from that response.

Not only do we closely measure the impact time for our availability SEVs, but there are real business consequences for this metric. Every minute of impact means more unhappy users, increased churn, decreased signups, and reputation damage from social media and press coverage of an outage. Beyond this, Dropbox commits to an uptime SLA in contracts with some of our customers, particularly in mission-critical industries. We define this based on the overall availability of the systems that serve our users, and officially cross from “up” to “down” when our availability degrades past a certain threshold. To stay within our SLA of 99.9% uptime, we must limit any down periods to roughly 43 minutes total per month. We set the bar even higher for ourselves internally, targeting 99.95% (21 minutes per month).

To safeguard these targets, we have invested in a variety of incident prevention techniques. These include Chaos engineering, risk assessments, and systems to validate production requirements, to name a few. However, no matter how much we probe and work to understand our systems, SEVs will still happen. That is where incident management comes in.

The SEV process at Dropbox dictates how our various incident-response roles work together to mitigate a SEV, and what steps we should take to learn from the incident. Every SEV at Dropbox has several basic features:

- A SEV type, which categorizes the incident’s impact; well-known examples include Availability, Durability, Security, and Feature Degradation.
- A SEV level from 0-3, to indicate criticality; 0 is the most critical.
- An IMOC (Incident Manager On Call), who is responsible for spearheading a speedy mitigation, coordinating SEV respondents, and communicating the status of the incident.
- A TLOC (Tech Lead On Call), who drives the investigation and makes technical decisions.
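To make the numbers in this post concrete, here is a minimal Python sketch. It reproduces the downtime budgets implied by the 99.9% and 99.95% uptime targets, and models the basic features of a SEV (type, level, IMOC, TLOC) as a small record. Everything named here is hypothetical and for illustration only: `downtime_budget_minutes`, `SevType`, `Sev`, and the on-call names are not Dropbox tooling, and the 30-day month is an assumption, since the post does not say how the month is measured.

```python
from dataclasses import dataclass
from enum import Enum

# Assumption: a "month" is 30 days. The post does not state the period used.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def downtime_budget_minutes(availability: float,
                            period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the minutes of downtime allowed per period.

    `availability` is a fraction between 0 and 1 (0.999 means 99.9% uptime).
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_minutes


class SevType(Enum):
    # The four well-known examples named in the post; not an exhaustive list.
    AVAILABILITY = "Availability"
    DURABILITY = "Durability"
    SECURITY = "Security"
    FEATURE_DEGRADATION = "Feature Degradation"


@dataclass(frozen=True)
class Sev:
    """Hypothetical record holding the basic features of a SEV."""
    sev_type: SevType
    level: int  # 0-3, where 0 is the most critical
    imoc: str   # Incident Manager On Call
    tloc: str   # Tech Lead On Call

    def __post_init__(self) -> None:
        if self.level not in range(4):
            raise ValueError("SEV level must be between 0 and 3")

    def is_more_critical_than(self, other: "Sev") -> bool:
        # A lower level number means a more critical incident.
        return self.level < other.level


if __name__ == "__main__":
    for target in (0.999, 0.9995):
        budget = downtime_budget_minutes(target)
        print(f"{target:.2%} uptime allows about {budget:.1f} minutes of downtime")

    # The on-call names below are made up for the example.
    sev = Sev(SevType.AVAILABILITY, level=0, imoc="alice", tloc="bob")
    print(sev)
```

Under the 30-day assumption, the budgets come out to about 43.2 and 21.6 minutes, which match the "roughly 43" and "21" minutes quoted in the post once the fractional minute is dropped. Modeling the SEV level as a plain integer keeps the "lower is more critical" comparison explicit, which is easy to get backwards.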