The following document is my general trip report for SREcon 2022, held in hybrid mode (in person in San Francisco, CA, and virtually) from March 14–16, 2022. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.
Opening Remarks
As with every other USENIX conference I've attended in the past 39 years, the opening remarks started late. In fact, at 9:06am PDT we were told that the 9:00am session would start at 9:30am, that all the morning sessions would be 30 minutes later than originally scheduled, and that lunch would be shortened by 30 minutes to make up for it. (After the day ended remote participants were informed that the delay was due in part to the on-site daily COVID testing regimen. They planned to add staff to alleviate the situation for Tuesday and Wednesday.)
The revised session started at 9:34am. The chairs thanked the on-site participants for obeying the daily testing protocols, the 150 talk submitters, the program committee, USENIX staff and USENIX board, and all of our sponsors. They advised participants, both onsite and remote, to use the Slack channels for discussion. They reviewed the WiFi SSID and password, quiet space, evening in-person events, and the next conference (SREcon22 EMEA in October).
Plenary: The "Success" in SRE is Silent
[Direct slides link] [Direct video link]
Establishing the Return-on-Investment (ROI) for Availability work/projects/investments is really hard. There are some strategies for doing it, which the speaker explained in the meat of the presentation. If we don't change the course of common discourse regarding ROI, we're going to miss huge opportunities to invest in SRE, or worse, end up in a regulatory hell that sucks all the fun out of it and produces worse outcomes.
[Sadly, I couldn't take reliable notes for this talk because Vimeo continued to randomly and intermittently drop audio, freeze video, or both for anywhere from 1 second to 5 minutes... including the entire Q&A session. I'm hoping things (a) stabilize and (b) are available on YouTube later.]
Plenary: Building and Running a Diversity-focused Pre-internship Program for SREs
[Direct slides link] [Direct video link]
At Meta, almost all early-career Production Engineering hires are former university interns. But most university students don't even know that something like PE/SRE exists as a career, or have an opportunity to get any training or coursework toward an internship or career in SRE. To address these gaps, in 2021 they started their first SRE-focused, fully remote pre-internship program, with 96 college students, focusing on diverse participants who would not normally be recruited through their traditional recruiting pipelines.
In this talk, they explained why and how they started this program, gave extensive details about program administration, curriculum, and their learnings, and shared some preliminary results from the program. Their hope is that audience members will walk away with an understanding of what they did and be able to go back to their own organizations to implement a similar program.
Improving diversity is essential, and doing so requires early-career hiring. Recruiting from more diverse environments doesn't necessarily work:
- Interest — SRE isn't necessarily well-known or -understood as a career path.
- Theory — SRE concepts aren't necessarily taught in school.
- Practice — Students lack interview skills so passing a tech interview is (more) difficult.
- Access — You can't be what you can't see.
Most early-career hires are former interns who come from top CS schools and return to help recruit. This is a self-reinforcing loop which is great for those schools, but less so if those top CS schools aren't themselves diverse. So they created a diverse group of ~100 college students for a pre-internship fellowship program.
They wanted to include the "big three" aspects that an SRE needs to be successful:
- Knowledge — OSes, databases, containerization, coding, troubleshooting
- Culture — Reliability, release mgmt, incident mgmt, SLAs
- Mentorship — Career paths, encouragement, inspiration, interviewing tips
They created a 12-week fully-remote summer program combining a technical curriculum, professional mentoring, and interview practice, along with weekly industry speakers. They'd work in teams to learn more about collaborating. The fellows would have a final project to build a portfolio to use after the fellowship ends.
They had three pillars:
- Meta Team — A small core team with strong executive support and a larger supporting group (manager, 4 coordinators, sponsor, recruiting partners, speakers, panelists, interviewers, and a dozen mentors).
- Curriculum Partner — The Linux Foundation granted them a royalty-free license for key Linux instructional content for use in the program.
- Fellowship Administration Partner — Major League Hacking, the people who'd recruit students, manage the day-to-day running, and disburse funds.
The fellowship outline for 2021:
- Week 1 — Intro to SRE/basic tools
- Week 2 — bash scripting/shell primer
- Week 3 — Services and databases
- Week 4 — Containers
- Week 5 — Testing and CI/CD
- Week 6 — Monitoring
- Week 7 — Career week
- Week 8 — System troubleshooting
- Weeks 9–12 — Final project
They had 10 pods of 10 students, each with three groups of 3–4 students, meeting with the pod leader daily and the mentor weekly.
This was successful: 95% of the fellows graduated, 77% spent 25+ hours a week on the fellowship, and they had 95% daily attendance in sessions. Over the 12 weeks the knowledge of SRE as a career path increased substantially. ~50% have jobs/internships already lined up; while only 23% have SRE-type roles, 56% are using SRE skills in their jobs and 87% said the fellowships gave them advantages in the interview process... and not just working at Meta, but at other large companies.
What needs improvement for 2022?
- Less theory and more project work. They didn't understand why they were being taught some theories until the end. The project will continue throughout the whole process and use the concepts as they go along.
- More interview practice. Coding is essential for SRE-type roles so there'll be weekly focus on that.
- Faster identification of trouble spots. It's easy to have people, even an entire pod, slip through the cracks and fall behind. They'll institute weekly pod-level surveys to make sure people are getting the materials.
- Better recruiting support. Even when Meta fellows applied to Meta there was no way of tracking them through the recruiting process.
Thinking ahead, what's next? They want to scale the program 10x: Seeding new schools, running similar programs at other organizations, and considering open sourcing it under the USENIX banner.
How can we help? Hire their fellows! Support a similar project at our companies. Take advantage of speaking, panelist, or curriculum opportunities.
A Postmortem of SRE Interviewing
[Direct slides link] [Direct video link]
The speaker discussed how we can make the interviewing experience better for all parties involved (interviewers, interviewees, hiring managers, and recruiters). The session covered everything from application through offer acceptance and discussed actions we can all take to make SRE hiring better.
- Summary of the journey — Last March he applied to 37 companies and had 113 interviews of different formats. He recognizes his privilege in this situation (self-identifying as a straight white male applying for senior or principal roles and with relevant broad experience). Of the 37 applications, including four referrals, he got 12 non-responses (including some from the referrals), one generic rejection, and seven non-fit/salary-expectation mismatches. After the technical phone screens, two companies had already filled the position and two more went MIA, which left 12 on-site interviews. After the on-site interviews, there were three rejections and one cancellation, leading to eight offers, one of which he accepted. All of this was over a period of 2½ months.
- Behaviors companies should avoid:
- Not responding to applicants. We're all busy but a form letter thanking the applicant for applying and acknowledging its receipt (especially if there's an employee referral involved) is an easy win.
- Not following up after interviews. You may need to let the candidate know that the process may be slow but don't just ghost them.
- Not setting or following timeline expectations. Be clear about how long each step will take. Ask the candidates if they have any time pressures. You should be able to go from final interview to offer in less than a week; dragging it out is unfair to the candidates.
- Non-transparent interview process. Have a standard process that's disclosed to the candidate. Avoid adding extra rounds to the process.
- Unstructured interviews. Interview modules should be standard for the position. Don't repeat the same question multiple times in the process. If you can't tell whether a candidate is right for the position you need to make changes.
- Bad interview modules. They're a representation of your company or organization. Having practical, real-life questions shows you're serious about the candidate's real-life experience; it's not a CS exam.
- Better interviewing practices:
- Recruiter screening — They're a good first stage. Tell the candidate the compensation package and any policies on remote versus on-site work, and what percent of travel is expected, as well as set expectations on the interview process and timeline.
- Recruiter followup — Recap the details, preparation materials for first interviews, and company benefit package brochures.
- Technical interview — The interviewer should introduce themselves and their projects, describe the interview format, and explain what they're looking for as an interviewer. The candidate should be concise when it's a straightforward answer, ask clarifying questions if you're not sure of what's being asked, admit when unsure or if there's a knowledge gap, and use the interview to find out more about the role.
- Hiring manager — The interviewer should be crystal clear on team duties and responsibilities, with no bait-and-switch. Explain expectations and professional growth opportunities. Don't make promises you can't keep. The candidate should understand the role, expectations, and projects, and ensure there's work that will make them happy. Understand the annual review or compensation mechanisms and understand the oncall and work/life balance considerations.
- Things companies did that blew him away (positively):
- Preparation packages explaining the interviews, what the company is looking for, and suggested practice materials
- "About the team" documents explaining roles, leadership, and organization structure
- Benefit package documents with clear information about the company benefits
- Choice of communication methods
- Sell calls with C-suite or VPs to help answer candidate questions
- Welcome or care packages (and bonus points for handwritten notes)
- Fun and challenging interviews: In an architectural module, designing a real-world UDP server with real-life constraints; in coding, building a Prometheus exporter in 60 minutes; with the hiring manager, talking about professional leadership challenges and his accomplishments; in presentations or essays, demonstrating his writing and presenting it for peer discussion.
- Thoughts on preparing for SRE interviews:
- It's both an art and a science. Lots of different skill sets are tested.
- Practice makes perfect... ish. Use question repositories, flash cards, LeetCode/HackerRank, and example projects.
- Checklists: Practice how you want to approach system design or coding questions.
- Know what you're looking for in a job.
- Try to avoid your dream job being your first set of interviews.
- Once you get an offer:
- Understand the details of the compensation package including equity timelines/value and bonus structure. Don't be bullied into saying you're happy.
- Ensure you're given an official set of benefits.
- Don't agree to anything not in writing.
- Do your research (Glassdoor, levels.fyi).
- Don't be bullied into accepting immediately.
Self-Destructing Feature Flags
[Direct slides link] [Direct video link]
You deployed an optimization behind a feature flag, tested it yourself in production, and QA signed off. After enabling the feature flag, you monitor error rates for a while between this version versus the last and there has been no significant change, so you decide to go to lunch. Right after you walk out the door, the error rate skyrockets because a group of power users from your largest clients have all just come online.
Obviously the answer is to disable the feature flag, but you're AFK so you have to hop onto Slack to tell someone on your team that that feature flag is the likely culprit so they can disable it. What if they didn't have to disable it manually? What if we automated that? This would be a complete non-event. This approach is what this talk was about.
Note that the Ruby code in the slide is intentionally naive in order to fit and look simple enough for the conference talk.
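To make the pattern concrete, here is a minimal sketch of a self-disabling flag in Python (the talk's slides used Ruby); the class name, error threshold, and window size are my own illustrative choices, not the speaker's code:

```python
class SelfDestructingFlag:
    """A feature flag that disables itself when the error rate it guards
    spikes past a threshold (illustrative sketch only)."""

    def __init__(self, name, threshold=0.05, window=100, min_samples=20):
        self.name = name
        self.enabled = True
        self.threshold = threshold      # max tolerated error fraction
        self.window = window            # number of recent calls considered
        self.min_samples = min_samples  # don't judge on too little data
        self.results = []               # True = success, False = error

    def record(self, success):
        """Record one call outcome and self-destruct on a sustained spike."""
        self.results.append(success)
        self.results = self.results[-self.window:]
        errors = self.results.count(False)
        # Disable the flag automatically instead of waiting for a human
        # to notice the error-rate spike and flip it off by hand.
        if len(self.results) >= self.min_samples and \
                errors / len(self.results) > self.threshold:
            self.enabled = False

    def __bool__(self):
        return self.enabled


flag = SelfDestructingFlag("new-optimization")
for i in range(30):
    flag.record(success=(i % 3 != 0))  # ~33% errors, well over threshold
print(flag.enabled)  # the flag has turned itself off
```

In a real system the error counts would come from your metrics pipeline rather than an in-process list, but the shape is the same: the rollback decision lives next to the flag, so the scenario above becomes a non-event.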
Taken from the VOID: The Scary Truth about Incident Metrics
[Direct slides link] [Direct video link]
This talk presents research collected from the Verica Open Incident Database (VOID) — a new open database of public incident reports. Containing nearly 2,000 reports for 660 organizations, the database allows for more structured review and research about software-related incident reporting. Key results from their research challenge standard industry practices for incident response and analysis, like tracking Mean Time To Resolve (MTTR) and using Root Cause Analysis (RCA) methodology. In particular, they demonstrated how unreliable MTTR can be, and how RCA can lead to environments where people are less likely to admit mistakes and speak up about things that could lead to future incidents. They proposed alternate metrics (SLOs and cost of coordination data), practices (Near Miss analysis), and mindsets (humans are the solution, not the problem) to help organizations better learn from their incidents, and make their systems safer and more resilient.
We all have incidents, and we're solving a lot of similar problems in silos and not sharing what happened or (better yet) what we learned. The VOID has 1,856 publicly-available incident reports from 610 organizations from 2008 to date in a variety of input formats: Social media posts, status pages, blog posts, conference talks, news articles, and even some comprehensive retrospectives or postmortem reports. It also contains metadata on the organization, incident date, report date, report type (primary or secondary), duration (which can be fuzzy), technologies involved, impact type, and analysis format.
53% of incidents in the database were (externally) resolved in under two hours, so in general we're good at our jobs. The problem with MTTR is that if you don't have a bell-like distribution curve, the metric isn't necessarily meaningful. Looking at the data and applying Monte Carlo simulations, fully 49% of the cases had a longer recovery time. Longer time to recover isn't necessarily a problem for your customers.
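The MTTR critique is easy to reproduce with a toy simulation (this is my own illustration on synthetic data, not the VOID dataset): with a right-skewed duration distribution, the mean describes almost no actual incident, and resampled MTTRs swing widely.

```python
import random
import statistics

random.seed(42)

# Synthetic incident durations (minutes): right-skewed, like real incident data.
durations = [random.lognormvariate(3.0, 1.2) for _ in range(1000)]

mttr = statistics.mean(durations)
median = statistics.median(durations)

# The long tail drags the mean well above the median, so "mean time to
# resolve" sits above the typical incident's duration.
print(f"MTTR:   {mttr:.1f} min")
print(f"median: {median:.1f} min")

# Monte Carlo resampling: MTTR computed on small random subsamples
# (one "month" of incidents each) varies wildly from sample to sample.
resampled = [statistics.mean(random.choices(durations, k=30)) for _ in range(5)]
print("resampled MTTRs:", [round(m) for m in resampled])
```

The spread in the resampled values is the point: a month-over-month MTTR trend on skewed data is mostly noise.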
We aren't yet collecting socio-technical incident data: We don't know SLOs, customer feedback, or how many people across how many groups with how many tools are involved. Focusing on themes and narratives will help you identify patterns and similarities across incidents. Also, consider studying near misses (cf. Carl Macrae's Close Calls).
Most places (74%) don't do RCA, although Google and Microsoft do. One problem is that a "root cause" implies a single cause and becomes the point where we stop looking, creating an artificial stopping point that can lay a path to blame.
Thesis: We need a new approach, mindset, toolset, and skill set for talking about, analyzing, learning from, and sharing incidents:
- Treat incidents as opportunities to learn.
- Favor in-depth analysis over shallow methods.
- Treat humans as solutions not problems.
- Study what goes right along with what goes wrong.
Download the 2021 VOID report.
How We Survived (and Thrived) During the Pandemic and Helped Millions of Students Learn Remotely
[Direct slides link] [Direct video link]
Everything was going well until COVID-19 emerged. A pandemic-induced traffic surge across all of their platforms, which consist of hundreds of services, was unexpected. Not only did they survive this traffic surge, they thrived, without incurring huge infrastructure costs or compromising security. Was it adding capacity or implementing core SRE principles that saved the day? Learn how they applied "Upstream Thinking" and were proactive in dealing with crisis situations in the context of SRE.
Their success was silent: They didn't hire new employees to scale; they were built for scale and had prepared in advance for the growth. How?
- Non-technical:
- People — Don't leave anyone behind, bring cloud learnings into the whole company, form CoEs and advisory councils
- Processes — Smart processes, be agile not a bottleneck, govern via preventive controls
- And — THIRD THING MISSED
- Technical:
- Governance first — Rules and regulations, shared services, and preventative and detective controls.
- Everything as code — Infrastructure, cybersecurity, acceptance criteria; accountability, approvability, and traceability through Git; reusable building blocks, supported and shared.
- Security and networking — Think forward and plan for growth, and create a cloud security policy.
So what's next?
- Serverless
- Network segmentation
- Zero-trust architecture
- Deprecation of SSH
- Intent-based networking
- AI Ops
- Advanced automation
The Pandemic and the Classroom: Enabling Education for Millions
[Direct slides link] [Direct video link]
COVID-19 impacted education from kindergarten to college in unforeseen ways almost overnight. The world had to adapt to an online-only mode of education. What happened behind the scenes to make that happen is unlike any typical growth story for a large product or service. This talk described the problems, solutions, and design patterns that Microsoft Teams leveraged to be successful at re-enabling education for millions of remote students. They focused on education-specific scenarios ranging from seasonal tenant onboarding to optimizing for school hardware to changing customer expectations.
After a brief overview of Microsoft Teams, he talked about education audiences and the differences in educators, students, parental involvement, and IT administrators between K-12 and higher education. Pre-COVID, most scenarios were document sharing and note taking. Then in March 2020 everything closed and went full-remote... immediately. Scenarios changed to be more meetings-based. They made changes to the infrastructure to allow for brownouts, disabling features in different regions during their peak load until they met criteria for going back to normal operations. This bought time while they made other changes.
There wasn't anything here we didn't already learn ourselves.
Applied Science Fiction: Operating a Research-Led Product
[Direct slides link] [Direct video link]
Scientists have increasingly become a part of many product teams, bringing many new skills to the table. This has been especially true in their augmented reality (AR) products, with their core computer vision pipelines being entirely researcher-led. With new opportunities comes new challenges, and they have had to adapt many of their SRE and product engineering practices to ensure everyone feels productive and supported. This talk shares the lessons they have learned in building a healthy and happy product team in a researcher-first environment.
He was a very fast speaker and it was difficult to take notes. But generally, they worked to simplify things so the researchers could do what they needed within guardrails, and used containers to improve their CI/CD environment and code review process. They leave old versions intact but time-locked, because some researchers need the same versions of things for their research (think blue-green deployment with a larger rainbow). They've autoscaled where possible, and every version is running but with no resources until it's asked for. Most of their pipelines are the complex system defaults, but they do have one-offs, generally in staging, for researchers to build new ones.
(Monitoring tip: Anything with a metric should have a histogram associated with it.)
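That tip can be sketched with only the standard library; the bucket boundaries below are my own choice, loosely following the Prometheus cumulative-bucket style:

```python
import bisect
from collections import Counter

# Prometheus-style bucket upper bounds, in seconds (illustrative values).
BUCKETS = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, float("inf")]

def observe(histogram, value):
    """Record one latency observation into its histogram bucket."""
    # bisect_left gives the first bucket whose upper bound >= value.
    histogram[BUCKETS[bisect.bisect_left(BUCKETS, value)]] += 1

hist = Counter()
for latency in [0.03, 0.07, 0.2, 0.9, 3.1]:
    observe(hist, latency)

# A histogram preserves the distribution (and its outliers) that a single
# averaged metric erases.
for bound in BUCKETS:
    if hist[bound]:
        print(f"<= {bound}: {hist[bound]}")
```

In practice you'd use a metrics client's histogram type rather than hand-rolling buckets, but the principle is the same: the 3.1-second outlier stays visible instead of vanishing into a mean.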
Tools include Django, Grafana, Loki, Domo, and Slack notifications. Virtually all their operational platforms are open source. The science code is custom for the researchers.
In summary:
- Engineering tools work for research too... as long as you build them well.
- DAGs are your friend.
- New versions as new deploys.
- Retry failures everywhere (in production; they're disabled in testing/QA).
- Observability is for everyone.
- Plan together; succeed together.
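The "retry failures everywhere, but disabled in testing/QA" point above can be sketched as a small decorator; the `APP_ENV` variable name and attempt counts are my own illustrative assumptions, not the team's implementation:

```python
import functools
import os
import time

def retry(attempts=3, delay=0.1):
    """Retry a flaky call in production; fail fast elsewhere so tests
    surface errors instead of papering over them."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Hypothetical env switch: only production gets retries.
            tries = attempts if os.environ.get("APP_ENV") == "production" else 1
            for i in range(tries):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if i == tries - 1:
                        raise          # out of attempts: propagate
                    time.sleep(delay)  # brief pause before retrying
        return wrapper
    return decorator

calls = {"n": 0}

@retry(attempts=3, delay=0)
def flaky():
    """Fails twice, then succeeds (stand-in for a transient error)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

os.environ["APP_ENV"] = "production"
print(flaky())  # succeeds on the third attempt
```

With `APP_ENV` unset, the same call would raise on the first failure, which is exactly what you want in CI.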
Taking the 737 to the MAX
[Direct slides link] [Direct video link] [Trigger warning: Plane crashes discussed]
Ten years ago, Boeing faced a difficult choice: The Airbus A320neo was racking up orders faster than any plane in history because of its fuel efficiency improvements, and Boeing needed to compete. Should they design a new plane from scratch or just update the tried-and-true 737 with new engines? The 737 MAX entered service seven years later as the result of that and hundreds of other choices along the way. Let's look at some of those choices in context to understand how the 737 MAX went so very wrong. We'll learn a thing or two along the way about making better decisions ourselves and as teams.
The speaker advised us to sit back and enjoy the talk instead of taking notes because he also talks quickly. In summary, you have to think about your systems as a whole entity or, like the problems with the 737 MAX rollout, you may miss something.
Securing Your Software Delivery Chain with Process Auditing
[Direct slides link] [Direct video link] [Trigger warning: Mental health]
Tasked with "securing the supply chain" for your employer due to a high profile CVE or breach? Overwhelmed by vendor pitches and trying to find some data to start tackling the problem? Curious about what's happening when an application is executing for some other reason? Want to know what you can discover about un-instrumented applications? This talk went over how to use strace and eBPF to discover what applications are doing and how to improve our security posture with that knowledge. (View the demonstration on the recording.)
Vulnerability is more than just technical. Vulnerability, compassion, and self-care for humans are important as well. Remember to breathe and that we're all human. Over the past two years especially, many of us have felt or are feeling like the proverbial frog in a pot, and many of us are finding our emotional triggers easier to hit.
- Don't try to deal with everything at once.
- Don't try to do it all yourself.
- Don't forget to check that the mental health prereqs (hydration, hygiene, etc.) are taken care of as well.
- Know when to ask for help, even if you're not sure how.
Are We There Yet? Metrics-Driven Prioritization for Your Reliability Roadmap
[Direct slides link] [Direct video link]
Reliability is a journey for any organization and unfortunately, we can't just skip to the end. There is what feels like an endless amount of work that could be done to help improve reliability and we often find ourselves in situations with more things to do than there is time to do it. It can be overwhelming to figure out where to start, what to do next, and how to get support (and funding) from leadership for it. Prioritization and tradeoffs have to happen whether you are building out a new SRE team, diving in when everything is on fire, or even leading an existing SRE team that is performing well. This talk gave SRE leaders a metric-driven framework to prioritize what will make the most impact in our own environments based on real data and highlight the successes throughout the journey.
Categorize areas of impact for an SRE organization:
- Ask the right questions — Engineers can often identify opportunities for improvement, but there's almost always more to do than resources to do it. Assembling opportunities into themes lets us dig deeper. This lets people break down the work into more-digestible chunks. The themes tend to fall into four categories:
- Change management — Are we able to make changes confidently and quickly?
- Monitoring and detection — Do we know when there's a problem?
- Incident management — Do we respond quickly when there's a problem?
- Continuous improvement — Are we getting better?
(These are sample questions for a mature SaaS company.) Once you know the business needs (see below) you can determine which category or categories are most important to you right now.
- Measure what suits you most — Now that you know what matters most to the business right now, you can start choosing metrics. For example, change management has cycle time, deployment frequency, and change failure rate, while monitoring may look at SLIs, SLOs, error budget, MTTD, and the percentage of customer-detected issues. Be aware of unintended consequences (for example, "move faster" may lead to lower quality to game one metric).
- Perform gap assessment — Where are you, where do you want to be, and what's the difference between them? That's the gap.
- Create a reliability dashboard that rolls up the current metric, target metric, and gap — Include metrics you may not have current values for, so you're not hiding the fact that something may not be measured. You may want multiple views based on the audience, or allowing for drill-down into the details.
- Understand the business needs — Using the business' language makes your requests easier for them to understand you.
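The change-management metrics named above are cheap to compute once deploys are recorded somewhere; this sketch assumes a hypothetical deploy log, with my own made-up dates and outcomes:

```python
from datetime import date

# Hypothetical deploy log: (date, succeeded?). A False entry stands in
# for a deploy that caused an incident or rollback.
deploys = [
    (date(2022, 3, 1), True),
    (date(2022, 3, 2), True),
    (date(2022, 3, 4), False),
    (date(2022, 3, 7), True),
    (date(2022, 3, 8), True),
]

# Deployment frequency: deploys per calendar day over the window.
days = (deploys[-1][0] - deploys[0][0]).days + 1
frequency = len(deploys) / days

# Change failure rate: fraction of deploys that went bad.
failure_rate = sum(1 for _, ok in deploys if not ok) / len(deploys)

print(f"deployment frequency: {frequency:.2f}/day")
print(f"change failure rate:  {failure_rate:.0%}")
```

The point of the dashboard step is exactly this: current value, target value, and the gap between them, per metric, per audience.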
SRE Stands for... Skydiving Resilience Engineer
[Direct slides link] [Direct video link]
Skydiving, especially more than once, can be viewed as an exercise in risk management (and also a lot of fun!).
This talk took a look at the safety procedures practiced by the skydiving community as they relate to the guiding principles of SRE. We discussed the evolution of altimeters and software metric systems, highlighted the dangers of hero mentality and overconfidence, and drew parallels between pre-jump preparation and free-fall plan execution on one side and game days and incident management on the other. The talk shared concrete examples and parables to give attendees a visceral understanding of soft concepts such as the "bus factor," blameless postmortems, the Goldilocks approach to tribal knowledge, and the importance of developing emotional resilience.
"If you can't see through it, don't go through it. You don't want to meet high-speed aluminum."
Building a Path to the Future: Mentoring New SREs
[Direct slides link] [Direct video link]
Mentorship has long been an important part of bringing people into the operations and reliability space, especially people early in their careers, due to the relative lack of formal training available in these fields. However, being a good mentor is not always easy — and the pandemic, with many of us moving to remote work for the long term, has made it even more difficult. This talk discussed strategies you can use to be a more effective mentor, even in a remote work setting.
Why do we need to be better mentors?
- There aren't many formal programs for teaching SRE work.
- The hiring pipeline and workload are extremely top-heavy.
- Remote work is increasingly common so passively learning by osmosis is harder.
- Mentorship is a force multiplier.
What makes a good mentor?
- Time
- Empathy
- Patience (especially giving them a chance to fail)
- Organizational support
Some guiding principles:
- Be welcoming. The mentee needs to feel that you're glad to be there and to be mentoring them in order for them to feel psychologically safe.
- Be vulnerable. Show that you're not infallible and let them know it's okay to be wrong.
- Be explicit. Actually go deeper into things and not just on the surface-level description. New people in the field may not know the concepts, tools, or jargon that's "obvious" to people who've done it for a while. Explain the whys of the decisions and of best practices.
- Be available. Half an hour a week isn't going to be enough, especially at the start. Set aside time for pairing on tough problems. Check on them often to make sure they're doing okay — and you need to reach out since they won't necessarily trust you yet. But keep your boundaries about when you are or aren't available.
Some techniques to help be a better mentor:
- Documentation. Gives mentees something to use as a foundation or reference. Explain not just how but why; including context helps them apply knowledge to undocumented situations. It doesn't have to be your own documentation; referencing others', even vendors', is fine.
- Office hours. Sit on a regularly scheduled open Zoom call for 30–60 minutes, whether people stop by or not. This provides a time to let people know you're available for questions.
- Pairing. It's time consuming but one of the best ways to let the mentee develop muscle memory. It lets them drive. Alternatively you can "ride shotgun" and be available in real-time if they have questions or concerns. Just being there "just in case" is a big confidence booster for them.
- Talk to yourself. Use Slack threads in open channels to talk through your problems. New engineers can see you don't know everything and so it's okay they don't either. It also helps make your thought processes explicit and prompts discussion that can also illuminate thorny issues.
- Orienteering. Let them know about professional development. Know where they are, where they want to go, and how you can help them move up to the next level. What areas are they most curious about? In what projects do they find the most excitement or joy? Show progress landmarks and make sure others recognize them too. It may mean they leave your organization if what you do isn't what they want to do.
Some issues she's run into:
- Unknown unknowns. There are a lot of unknown unknowns in the SRE space as opposed to, say, application engineering. We often get unique situations, which can be frustrating and make mentees feel lost. Help the mentee time-box their exploration ("escalate after 10 minutes in an incident, 30 minutes otherwise"). If you pair with them, make sure you explain both how and why you got to the decision. (This is very similar to incident retrospectives.)
- Abstractions. SRE work, especially incidents, tends to be where abstractions break down or become obfuscations. Some things can run in parallel and others require destroy-and-create. Other examples: How DNS works when the TTL is ignored. Containers aren't really self-contained.
- The bubble. More of a meta-problem. It's easy to get locked into the "The way we do it is the right way" mentality. Encourage people to get involved in industry communities (like SREcon or Slack). Explain how what we do here isn't always how it's done in other places.
A note on sponsorship: Mentorship is only half the story. You can give them all the tools but without support they'll get frustrated. Be a fan of your mentees. Recommend them for projects or as a resource for their desired areas.
From the Q&A: Certifications aren't something she's worried about; studying for them and learning the material is more valuable than the actual certification. She mentioned some books in the slides that are more generic (including Tom Limoncelli's The Practice of Cloud System Administration). Some skills you need include a basic understanding of statistics, the ability to read multiple coding languages, and details about whatever tools are used in your organization. Learning principles and how things work is more important than specific tools or languages. A formal rubric about what level 1 versus 2 versus 3 means also helps.
A good question mentees can ask isn't a yes/no or how-to, but one asking about why, the context, and even the history.
How the Metrics Backend Works at Datadog
[Direct slides link] [Direct video link]
Datadog is a popular cloud monitoring service which operates at scale in all three major cloud providers, ingesting tens of GB/s of points across many billions of time series into PiBs of hot and cold storage. Naturally, reliability is paramount.
This talk showed how their very large distributed system works today, and how it grew from a very small not-distributed system. They shared the most interesting scaling and reliability challenges they faced along the way, how they solved them (for now), and some important lessons and strategies which emerged. They also shared a couple of bonus problems which are still very much unsolved today, and what they're planning next.
There are always more failure domains (nodes, zones, clusters, areas); how likely are they and how willing and able are you to handle those failures if you haven't mitigated them?
You might not need a fancy sharding scheme and unique ID; you might just be able to work with multiple levels of simple ones. Also, moving from one to two is harder than moving from two to more; consider starting with two (or more) if you can. Coarse sharding schemes don't work for outliers (and that's okay). You'll need a smart, simple sharding strategy... until the edge cases become too unmanageable. (Google has a paper on Slicer that covers the concepts, though the system itself isn't open source.)
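The "multiple levels of simple IDs" idea might look something like the following sketch. This is my illustration, not Datadog's actual scheme: route each point first to a coarse shard by customer, then to a fine shard by metric name, using a stable hash so writers and readers always agree.

```python
import hashlib

def stable_hash(s: str) -> int:
    # Stable across processes and runs, unlike Python's built-in hash().
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def shard_for(customer: str, metric: str,
              coarse_shards: int = 4, fine_shards: int = 16) -> tuple[int, int]:
    """Two-level sharding: coarse by customer, fine by metric within it."""
    coarse = stable_hash(customer) % coarse_shards
    fine = stable_hash(f"{customer}/{metric}") % fine_shards
    return (coarse, fine)
```

Because the same (customer, metric) pair always maps to the same shard pair, there's no central index to consult; the outlier problem the talk mentions shows up when one customer's coarse shard gets far more traffic than the others.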
Disney Global SRE: Creating Video Magic
[Direct slides link] [Direct video link]
Disney is one of the world's largest media companies and home to some of the most respected and beloved brands around the globe. Embracing the latest technology is an important strategic focus at Disney, allowing guests to better connect with Disney and allowing Disney to better connect with guests in innovative and delightful ways.
This talk told a story about a century-old organization that has scaled its SRE practice to ignite digital magic across the globe. This team of SRE Jedi Knights are on a mission to foster curiosity, communities of practice, and technology awesomeness while venturing where no SRE has gone before. They delivered epic stories of successes, setbacks, and failures while pushing large-scale platforms to their limit and delivering the best in-seat, digital experiences, products, and content to our guests and subscribers across the globe, showcasing some of the technology and automation we have built.
(Their director of SRE was at an 11 the whole time and was so Pollyanna-positive it set my teeth on edge. Nothing was spectacularly unusual or groundbreaking in terms of technology or politics.)
Ten-Year Journey to 10,000 Production Machines
[Direct slides link] [Direct video link]
What does it take to scale provisioning and automation management to 10,000 machines? They covered a decade of improvements, starting from humble wrappers around Chef with a 50-machine limit, through API optimizations, to multi-threaded maintenance systems and shared indexes that handle 10,000 concurrent operations.
He told the story of their 12-year growth, starting with a high-level overview of their growth and then with individual anecdotes later. They had version 1, added stuff they forgot, and that... made it scale worse. Moving from v2 to v3 (ca. 2016-2018) required a lot of rewriting and a lot of customer resets; v2 was doing too much so in v3 simplicity was critical. Their goal was to run the same code at every customer as an Infrastructure as Code product. After the v3 rewrite their CLI was effectively an agent. They had to close-source some of their intellectual property because prospects would say "Make these changes, we won't pay for it; the privilege of us using your software is enough."
One site had so many machines it generated 10M job logs, and loading them at startup took over two hours in their lab to bring the system up. Improving the pruning algorithms reduced the number of job logs and let them break the 10k-machine limit.
Dark Sky Camping: Reducing Alert Pollution with Modern Observability Practices
[Direct slides link] [Direct video link]
Over the course of the pandemic, several factors converged to create an amazing problem at Campspot: More traffic! Increased load stressed their applications, and unpleasant customer-facing incidents stressed their engineering teams. In response, they doubled down on existing tools and processes: Increased alerting, beefed-up on-call rotations, more dashboards, and more high-urgency Slack channels. They put spotlights on so many areas of the system it became hard to see where issues were.
Recognizing the chaos, they pivoted in spring 2021 to unify teams around a single observability tool and implemented Service Level Objectives. The result: Fewer alerts, faster troubleshooting, and clearer indicators of when to focus on performance versus features. They talked about how they cleared out the alert pollution so they could see the constellations they were actually searching for all along. If you're building a case for the move to observability, this talk is for you.
Using Serverless Functions for Real-time Observability
[Direct slides link] [Direct video link]
This talk was about how a sub-second latency query engine inspired by Facebook's Scuba was extended from running in RAM only, to querying the most recent data that could still fit on local SSD, to querying months of data at a time using cloud storage and serverless functions. They described the pitfalls of managing AWS Lambda at scale, including impatience, maximum concurrency, runtime and architecture configuration and experimentation, and the price/performance of renting 20,000 parallel workers.
Some takeaways: Serverless can benefit any workload involving people having to wait:
- Move state from local machines onto object storage.
- Shard list of objects into work units.
- Parallelize object processing.
- Reduce results outside Lambda afterwards.
But:
- Avoid latency-insensitive batch workloads (cost)
- Avoid tiny workloads (setup latency)
- Check cloud provider limits and state your intentions (capacity planning)
Before scaling:
- Ensure it's tuned properly (items/invoke, CPU/RAM ratio)
- Ensure your code is optimized properly (esp. if multi-architecture)
- Ensure you use observability layers (e.g., OTel layer)
- Measure metrics (esp. cost) carefully
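The shard/parallelize/reduce recipe above can be sketched in pure Python, with a thread pool standing in for the Lambda fan-out (the worker function and chunk size here are illustrative assumptions, not the speakers' code):

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(objects, size):
    """Shard a flat list of object keys into fixed-size work units."""
    return [objects[i:i + size] for i in range(0, len(objects), size)]

def process_unit(unit):
    # Stand-in for one serverless invocation: scan its objects, return a partial result.
    return sum(len(key) for key in unit)

def fan_out(objects, unit_size=2, max_workers=8):
    units = chunk(objects, unit_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(process_unit, units))
    # Reduce the partial results outside the workers, as the talk recommends.
    return sum(partials)
```

In the real system each work unit would be a Lambda invocation against object storage; the shape (shard, map, reduce outside) is the same.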
This was interesting and both speakers were entertaining, but the information is not of immediate use. RCI Research might want to view this talk if they're doing a lot with AWS Lambda.
Improving How We Observe Our Observability Data: Techniques for SREs
[Direct slides link] [Direct video link]
Time-series charts have been around for hundreds of years, yet were originally created with a narrative intent often missed by many engineers today. This session examined some historic time-series examples, explored what we can learn as SREs, and looked at concrete charting techniques we can use to improve the cognition of our SLO narratives, engineering reports and incident retrospectives using multivariate relationships, small multiples and sparklines, while avoiding some common pitfalls often found in engineering presentations.
Most of what we really want to see on our dashboards is multivariate: visualize your SLIs as a multivariate SLO narrative. Graphics reveal data; letting us see it is the point.
Common pitfalls that distract from good observability in visualizations:
- Use common scales when displaying similar metrics; don't autoscale/fit-to-range.
- Avoid obscuring your data with shaded and stacked data sets; the data in the foreground can obscure the data behind it. (Exception: Bar and pie charts.)
- Provide scale, keys, and other critical explanations with your graphics.
- Avoid truncated or torn graphs; show the whole scale or make it obvious that something is being elided.
- Link back to the source URL so the graphic can be found later (for inclusion in an incident report or retrospective).
- Percentages and averages. "30% CPU" is meaningless without knowing the number of threads and so on. "25% error rate" is bad except when you have only four events. "Average" is usually used when people mean "mean" or "median." Consider using p50, p75, p90, and p95 instead of min and max.
Other tricks: Use sparklines, small graphics the size of a word that can sit inline with text. A good example is a single security's high/low trading for the day. (brew install spark lets you use spark on the command line.)
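A sparkline like the ones spark prints is easy to approximate in a few lines of Python (a sketch of the idea, not the actual spark tool):

```python
BARS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Map a numeric series onto eight block heights, as spark does."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero on a flat series
    return "".join(BARS[int((v - lo) / span * (len(BARS) - 1))] for v in values)
```

For example, embedding sparkline(error_counts) inline in a retrospective sentence gives the reader shape without a full chart.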
Look at Tufte's writing on chartjunk for more on graphing data, and his "PowerPoint Is Evil" article for Wired.
In conclusion:
- Build data narratives.
- Use multivariate information displays to demonstrate relationships of critical metrics in incident retrospectives and dashboards.
- Increase data density and increase cardinality by relating your key metrics together on a single chart by utilizing orthogonal axes for different metric types.
- Increase data density in tables and paragraphs using sparklines.
- Avoid common pitfalls that distract from good observations of your data.
Note that observability means different things to different people. Is it a system with instrumentation so we can see what's happening, or is it the human-consumable form of visualization, or is it how we perceive or cognate that visualization?
Principled Performance Analytics
[Direct slides link] [Direct video link]
This talk presented an exciting analytical method that is successfully delivering high fidelity insights useful in analyzing and diagnosing distributed systems. It has been used in production in a variety of complex services at scale (up to 1.4T events/day), where traditional methods have failed, with good results. They sketched out the problem domain in detail and presented the statistical methods used, as well as the intuition behind the approach. Their goal was for us to gain an alternative lens through which we can analyze performance, as well as an understanding of pitfalls.
They started with a simple question: "Is it working?" They realized that they could tell if the service was on fire, doing something, or looked busy. But the customers care about whether it's working. We should be able to answer yes or no but most of us can't, definitively, today.
Terminology is fluid or mushy. For reliability during this talk they mean a combination of three things:
- Availability — Is the service available?
- Performance — How effectively is work performed?
- Correctness — Does the service do what's expected?
(And even those three terms are fluid or mushy.)
We started with "Page me if there's a problem." Service Level Objectives (SLO) make that less of a problem, but we've overloaded SLOs:
- Encode system goals
- Specify behavior expectations
- Determine when to page
- Bound emergency behavior
- Enable error budgets
- Indemnify for dependency problems
- Coordinate priorities between teams
- Estimate outage magnitude
- Signal service maturity
- Bound supported behavior
SLOs can see catastrophic behavior and some kinds of problems but not everything. The SLO idea is based on events that can be determined as good or bad, and the number of bad over the total is the percentage. Great, right? No; errors are shallow data and shouldn't always be rated equally. SLOs require recognized errors and errors are ambiguous as a convention (for example, the code reports an error if certain conditions are or aren't met). Bugs and calibration errors can result in over- or undercounting. This leaves lots of room for problems and there's no regular maintenance cycle. All of that results in poor data products.
So now what? Look at reliability via performance analytics.
- Consumers care if the service is working and meeting expectations.
- Providers care if the service is working as it should.
- When there's a problem, both consumer and provider care which side (if not both) the problem is.
Service providers see workload performance across all customers (unless they're tagging customers). Complications:
- Mix shifts in workload
- Environmental factors like contention
- Mixed environments, job priorities, etc.
Customers or consumers know if a workload is performant (even if we as providers don't). They want services to be consistent over time. (We aren't defining well- or poorly-behaved services).
Solving service reliability is now trivially easy:
- Partition workloads by intent.
- Analyze performance.
- Profit!
They have a technique called 2σ. The hypothesis is that self-similar workloads should have consistent performance (unless something like the workload or environment changes). Therefore they can:
- Partition workloads into cohorts or clusters (approximate intent).
- Build performance baselines (estimate distributional form, e.g. Normal).
- Estimate likelihood of delivered performance (test for stationarity).
This gives a result of a set of events with predicted likelihoods and a time series of summary statistics describing concentration of extreme outliers.
They use approximations to unlock leverage. They assume metric distributions can be approximated by a Normal distribution (which is known to be untrue but this works most of the time and makes the math easier) and modeling errors are excluded via baseline qualification. Then:
- Workload z-scores (reflecting how many standard deviations away from the mean) are a proxy for likelihood.
- Workload performance should be IID.
- Z-scores follow a standard Normal distribution.
- Baseline distribution computation is "embarrassingly parallelizable."
- Z-scores are combinable (across cohorts!).
So they can develop a strategy:
- Aggregate z-scores across workloads.
- Monitor fraction of workloads with z-scores ≥ 2 in windows.
- Expect 2-5% 2σ outliers in a window
- When >10% of workloads are > 2σ, there's a problem.
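Under the talk's Normal approximation, the alerting strategy above reduces to a few lines. This sketch assumes a per-cohort baseline series and uses the illustrative thresholds from the slides:

```python
from statistics import mean, stdev

def two_sigma_fraction(baseline, window):
    """Fraction of windowed samples more than 2 sigma from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    z_scores = [(x - mu) / sigma for x in window]
    return sum(abs(z) >= 2 for z in z_scores) / len(z_scores)

def looks_unhealthy(baseline, window, threshold=0.10):
    # ~2-5% of 2-sigma outliers is expected noise; >10% suggests a real problem.
    return two_sigma_fraction(baseline, window) > threshold
```

Because z-scores are just numbers, fractions like these can be aggregated across cohorts, which is the combinability property the speakers emphasize.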
Brian went through some common FAQs:
Q: Do performance metrics actually follow Normal distributions?
A: No. Log Normal is better in practice. You can use empirical or other parameterized distributions as well.
Q: How do you know if the approximations hold?
A: You can do testing.
Q: How do you define the cohorts?
A: They started with complicated clustering algorithms, but something simple like "the cross product of 3–5 features" works well enough.
Q: How do you deal with singletons or infrequent workloads?
A: By definition we don't know how long they should take, especially if they're new.
Q: Aren't there a lot of those?
A: It depends on the cohorts. Depending on the use case, only 10–15% of workloads aren't singletons. But even 3–4% coverage can be a big win.
What are some applications of these theories? Using 2σ instead of 3-5σ shows possible problems rather than catastrophic ones, and thus provides insight into possible problems before they become actual (customer-noticeable) problems or full-on catastrophes.
This can also streamline diagnosis. (An example of a 5-day I/O issue is in the slides.)
[Narayan went off on a rant here about correlations and kept reversing his words, larger vs smaller, does vs doesn't, etc., making it next to impossible to follow at speed.]
Conclusions:
- Reliability is a shared property between the consumer and the service.
- Reconstruction of end-to-end behavior is critical.
- Variability is what customers actually care about.
- Distributed systems often produce decorrelation.
- We can measure it and its absence.
- Workload correlation can identify proximate causes.
- Metric combinability is critical for analysis.
- Error recognition is a gestalt of human judgements over time.
- Due to the unrecognized problems in error recognition, SLOs aren't feasible.
Contributions:
- 2σ is a method that:
- Incorporates user intent in order to model expected performance
- Tests an IID hypothesis to infer when systems diverge from expected behavior
- To produce data products that are comparable and combinable
- We use these data products in order to:
- Perform change point detection when systems diverge from expected behavior
- Estimate the duration, severity, and specific impact of these excursions or incidents
- Localize subsystem performance problems
- Compare relative and absolute performance over time and arbitrary workload dimensions
- Directly measure correlation across subsystems and isolation domains
- Resulting in:
- Calibration-free insights that characterize the consistency of a system
- The ability to test system invariants continuously
- Data building blocks that can be reprocessed to answer many questions
Closing thoughts:
- We can do a lot better than SLOs... and we must.
- Performance data is much better than availability data.
- We need more models.
- We need help.
Modeling Alert Quality
[Direct slides link] [Direct video link]
What are good alerts? What are bad ones? The difference is important for reliability. But how do you measure it? What kind of trade-offs are possible? They presented a model of alert quality, including parameters like cost and accuracy.
If you want to improve your alerting you need to measure, fix, and iterate... and beware of optimizing for the wrong thing (like staff overtime).
There wasn't anything particularly new or interesting in this talk.
Emergent Organizational Failure: Five Disconnections
[Direct slides link] [Direct video link]
A look at five successive assumptions or mistakes that leaders can easily make when trying to guide teams building reliable systems. This talk focused on the people and organizational issues that can contribute to well meaning teams building software that is insufficiently reliable. This knowledge is helpful for both technical and managerial leaders guiding their organizations, as well as folks stretching to think more broadly about reliability at any level in their team.
Two items to level-set before we start:
- Reliability leadership isn't just at the executive level; it also includes infrastructure, customer experience, end-to-end product management, and contributors on all sides.
- Emergence is when subtle inputs combine to create a nonobvious outcome at the end of the day.
The five disconnections he talked about were:
- Believing there is a definitive production to be focused on. We try to focus on "critical" parts of our systems, but sometimes it's difficult to understand what's contributing to reliability. We need to avoid tunnel vision in our thinking. Thoughts:
- Is a deployment tool part of production?
- Is source control part of production?
- Everything contributes to emergent behavior.
- Missing how hard prioritization is. Reliability requires extra consideration. Poor planning creates cycles of apathy in teams. We tend to have a mental model of how things are supposed to work, but those models are all incomplete due to tunnel vision, lack of context, and the challenges of observing complex systems. We need to spend more time understanding our systems and each other, and plan to change priorities.
- Having unrealistic goals for reliability instead of sustainable ones. Many goals are unrealistic, and unrealistic goals are demotivating. The realism of a goal is not related to the need. Picking realistic goals is hard when we cannot calibrate from experience. Outcomes are often distant from our actions. Results may happen outside our window of patience; if the pagers go silent, is it because we fixed the problem or because we got lucky?
Sustainable goals are about progress instead. We can only move to this approach if we have teams effective in execution over time.
- Treating your organization as simple rather than complex. Humans are hard to predict, yet we disregard this complexity. We have command-and-control, documentation, and playbooks, but are we making room for and encouraging creative thinking in SRE? There's often a push for the former and less so for the latter.
The "captain of the ship" metaphor doesn't hold up in reality. Command-and-control is often needed, but instead think of yourself as a gardener: Your job is to create the right environment (soil, light, water) for the seed to do what it's meant to do. Create the right environment for your team's best work to emerge. "Moving the needle" means "moving the culture," and culture is a self-sustaining feedback loop for organizations.
- Using incentives as a replacement for dedication. Incentives are transactional rewards for doing specific things (and how do you get kudos for incidents that don't happen?). Dedication — doing the right thing — is a reaction to trust and will persist, and trust can only be earned through action and time. (Shopify uses the "trust battery" metaphor: There's a potential energy between any two parties that's charged or discharged over time based on the actions you take.)
Actions are imperatives to building trust. Does the team believe their services bring harm or good into the world? Do people trust that doing the right thing will be recognized?
We need to reward dedication, but not necessarily through incentives (do X and get Y). We need to be careful not to exploit workers by relying on their dedication, lest we risk burnout.
How do we avoid these disconnections?
- Take a holistic view; all things may contribute to the whole.
- Collaborate on mental models, make sharing and communication easy.
- Progress not absolutes; let goals shift instead of compromising quality.
- Use simple messages, accept differentiation and focus on the culture.
- Demonstrate trust through actions.
DO, RE, Me: Measuring the Effectiveness of Site Reliability Engineering
[Direct slides link] [Direct video link]
The DevOps Research and Assessment group, or DORA, has conducted broad research on engineering teams' use of DevOps for nearly a decade. Meanwhile, Site Reliability Engineering (SRE) has emerged as a methodology with similar values and goals to DevOps. How do these movements compare? In 2021, for the first time, DORA studied the use of SRE across technology teams, to evaluate its adoption and effectiveness. We found that SRE practices are widespread, with a majority of teams surveyed employing these techniques to some extent. We also found that SRE works: higher adoption of SRE practices predicts better results across the range of DevOps success metrics. In this talk, they explored the relationship between DevOps and SRE and how even elite software delivery teams can benefit through the continuous modernization of technical operations.
What is DevOps? There's no canonical definition or manifesto; it's an ongoing process of self defining. For Dave, "DevOps uses communication and automation to achieve velocity and stability in the process of making users happy." So how do we put it into practice?
DORA measures both speed (deployment frequency and lead time for changes) and stability (change fail rate and [median] time to restore service). See g.co/devops for more on capabilities. See the State of DevOps report at bit.ly/dora-sodr.
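The four DORA measurements can each be computed from a simple change log; this is my sketch, with invented field names and sample data, not DORA's tooling:

```python
from statistics import median

# Each change records lead time in hours from commit to deploy, whether it
# failed in production, and (for failures) hours to restore service.
changes = [
    {"lead_hours": 4,  "failed": False, "restore_hours": None},
    {"lead_hours": 30, "failed": True,  "restore_hours": 2},
    {"lead_hours": 6,  "failed": False, "restore_hours": None},
    {"lead_hours": 12, "failed": True,  "restore_hours": 1},
]

def dora(changes, days_observed=7):
    failures = [c for c in changes if c["failed"]]
    return {
        "deploys_per_day": len(changes) / days_observed,                     # speed
        "median_lead_hours": median(c["lead_hours"] for c in changes),       # speed
        "change_fail_rate": len(failures) / len(changes),                    # stability
        "median_restore_hours": median(c["restore_hours"] for c in failures),  # stability
    }
```

The speed/stability split is the point: a team can deploy daily with a terrible fail rate, or rarely but safely, and the four numbers expose both axes.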
What is SRE? We're at SREcon; if we're here and still don't know, there's a problem.
SRE vs. DevOps. Do they compete? Is one better than the other?
Given the cycle of Business — Product — Dev — Test — Deploy — Operate, DevOps is Dev-Test-Deploy and SRE is mainly Deploy-Operate, so there's some overlap. But most of the work is in the Business-through-Deploy area.
DORA recently added Operational Performance for Availability (→ Reliability).
How do we know if a team is doing SRE? If a team acts like it then they are it regardless of what they're called (e.g., sysadmin, prod engineers, etc.). Statistically significant findings:
- SRE is widely practiced, not only at big places like FAANG. 52% of respondents reported the use of SRE practices to some degree.
- SRE is good for humans and systems:
- Humans — Mitigates burnout and enables balance between coding and ops work.
- Systems — "Shared responsibility" for ops predicts better reliability outcomes.
- Business — Higher reliability predicts better business outcomes.
- Reliability and software delivery performance are orthogonal.
- There's room for growth.
Dave's hot takes:
- SRE implements part of DevOps.
- DevOps culture → SRE culture → Toyota Production System → psychological safety → [...].
- Ops is still Ops.
The Scientific Method for Resilience
[Direct slides link] [Direct video link]
Do you remember the scientific method from elementary school science class? It's time to dust off that knowledge and use it to your advantage to test your IT systems! In this session, we were reintroduced to the scientific method, and learned how Vanguard's software engineers and IT architects draw inspiration from it in their resilience testing efforts. They leveraged a "Failure Modes and Effects Analysis" technique, in which engineers ask themselves questions about the failure modes of various technical components and develop hypotheses based on their expectations of how the system would behave. They used these conjectures as inputs into experimentation, and selected chaos experiments accordingly to validate (or disprove!) their hypotheses.
For those needing a refresher, the method is:
- Observation
- Question
- Hypothesis
- Experiment
- Analysis
- Conclusion → go back to 1. Observation
The example in the presentation started with an architecture diagram. Observation would be referencing it, identifying any critical components, and considering the business flow. Questions could be how each component might fail and what the effect would be if any one (or more!) failed. Then you could discuss those answers, develop a hypothesis based on the group consensus, and experiment as to what happens if you force a component failure (in your dev environment or test lab!). You could then use the available telemetry and observability to see the actual effects and analyze whether or not the observations met the expectations in the hypothesis. You could then conclude by documenting your work, capturing the process and observations, planning what to do in the future, and even modifying your variables (making system changes) and repeating the process.
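The loop described above can be sketched as a tiny harness (the component name, hypothesis, and fault injector are all hypothetical, for illustration only): record a hypothesis per failure mode, run the experiment, and compare observation to expectation.

```python
def run_experiment(component, hypothesis, inject_failure, observe):
    """One pass of the method: experiment, analysis, conclusion."""
    inject_failure(component)            # Experiment (in dev/test only!)
    observation = observe(component)     # Analysis via existing telemetry
    return {
        "component": component,
        "hypothesis": hypothesis,
        "observation": observation,
        "matched": observation == hypothesis,  # Conclusion: did reality agree?
    }

# Hypothetical example: we expect the cache layer to degrade gracefully.
downed = set()
result = run_experiment(
    "cache",
    hypothesis="serves stale reads",
    inject_failure=downed.add,
    observe=lambda c: "serves stale reads" if c in downed else "healthy",
)
```

Disproved hypotheses (`matched` false) are the valuable output; they feed the next Observation pass, which is exactly the loop-back the method prescribes.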
A Fresh Look at Organizational Debt
[Direct slides link] [Direct video link]
This talk converged two perspectives on operational debt into a single model based on process gaps and risk. They defined operational debt as "work required to fix process gaps that present a risk to business operations." This talk was intended to inspire senior SREs and SRE managers to think more holistically about process problems and related automation opportunities. Weaving risk into the approach provides a mechanism to set priority on a risk/reward basis. Creating a backlog of operational debt is also helpful when collaborating with other engineering managers. They also re-examined Fowler's quadrants in the operational debt context, compared and contrasted operational debt with other forms of debt (financial and technical), and examined how much debt is also toil.
Toil and cost:
Small toil pile Large toil pile High cost Defer (perhaps forever?) Alligator v. swamp (may need intervention) Low cost No brainer (but also no pressure) Low-hanging fruit (also quite rare) Fowler's quadrants:
Careless Careful Intentional We'll use a brittle manual workaround to hit the deadline Defer the work until future growth benchmark Unintentional What's a runbook? We discovered it's already in production There's a lot we still don't know.
Closing Remarks
The conference ended with closing remarks from our conference chairs. They thanked us for attending (whether in person or remotely), and thanked our sponsors, USENIX, and the A/V staff again. Some statistics:
- 552 participants
- Only one positive COVID test (and they subsequently isolated)
- 17 program committee members
- 160 proposals → 34 talks
- 174 submitters → 46 speakers
Please fill out the attendee survey.
SREcon23 Americas will be in Santa Clara, CA (March 21-23, 2023). If you can't wait, SREcon22 EMEA will be in Amsterdam, Netherlands (October 25-27, 2022) and SREcon22 Asia/Pacific in Sydney, Australia (December 7-9, 2022).