The following document is intended as the general trip report for me at the 2025 SREcon Conference held in person in San Francisco CA from March 25–27, 2025. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.
Travel day! As usual I managed to wake up before the alarm. I'd showered the night before so I packed up the toiletries, CPAP, laptop, and phone; set the thermostat to heat to only 60F; loaded the car; and drove off to Detroit Metropolitan Airport.
Traffic was moving at or above posted speeds and I managed to get a spot on level 4 of the Big Blue Deck, so I didn't have to take an elevator in the garage itself. The bus to the McNamara terminal arrived just as I got there... but was going out of service. Once I got to the terminal I did self-serve bag tagging and dropped off the bag (delayed by having only one lane open). I was given a hard sell to join the CLEAR program for the trip (the agent gets her commission on the sale even if I cancel within the first two weeks). It did get me through security quickly, though.
Got to my gate and read through the 7-month backlog of magazines. Pre-boarded thanks to the cane. We had some interesting hoop-jumping because the front cabin phone wouldn't talk to the PA system (though it talked to the rear cabin phone fine, and both the rear cabin phone and flight deck could talk to the PA system). Since the flight attendants could make announcements and we could hear them, we took off anyhow instead of trying a last-minute equipment change (which incidentally would've made me miss my connection at SLC).
The flight to SLC was uneventful though occasionally bumpy. My arrival gate was literally across from my departure gate, so I hung around for about half an hour before boarding... and promptly got in the wrong seat (I sat in 1C not 1B). We got that straightened out quickly enough and the flight pushed back 14 minutes early. We got into SJC without incident, I got my checked luggage, stuffed my jacket and carry-on bags into it, and caught a Lyft to the hotel. My room was ready and I got checked in and unpacked before taking a much-needed nap.
Badge pickup opened at 5:00 p.m. and mine printed after I scanned the QR code they'd sent. Not having had lunch I grabbed a carne asada burrito from the hotel's Marketplace. There was a welcome reception from 6:00 to 7:00 p.m. with nibblies (mostly crackers, cheese, and nuts).
After the reception I headed back to the room to write up some of the trip report and crash.
The continental breakfast was on the vendor show floor. Today's selections were a variety of pastries, Greek yogurt, and fresh fruit. I had a single small berry danish, strawberry yogurt, and banana.
Opening Remarks
[No slides | No video]
The tradition of a late start continued; the 8:45 a.m. opening remarks didn't start until 8:48 a.m. Nevertheless, the conference began with remarks from the chairs, Dan Fainstein and Laura Maguire. They discussed the themes they saw with the rapid rise of AI and software disruptions. Navigating disruption is a feature of SRE work but it's uncomfortable. We've seen that community matters, now more than ever.
We have 550 people in attendance (down slightly from around 600 last year), a large percentage of whom are new. If you want a conversation started, they provided three questions:
- What talk are you most excited about attending?
- What have you learned today?
- How will you change your practice?
They reviewed the code of conduct; we want to be a safe environment for everyone. The Slack channels are where changes are announced and the talks' Q&A will take place (#25amer- is the prefix). Thanks to the sponsors; the showcase is open all three days, including during the conference reception this evening, so please visit. Birds-of-a-Feather sessions (BOFs) are tonight and tomorrow.
Like the past two years, for Q&A we'll be using the day-and-track channel with :question: as the question prefix, and moderators will ask them.
Plenary: Safe Evaluation and Rollout of AI Models
[Direct slides link | Direct video link]
More and more online services and systems depend on artificial intelligence and large language models to implement core user experiences. Consequently, the safe and reliable rollout of new models and new prompts is a critical part of maintaining the reliability and performance of the overall system. However, unlike traditional systems, there is rarely a clean "working" or "broken" signal from releases. Instead, the performance of new models and new prompts is based on probabilistic evaluation of the new system across many different user inputs. Any change to a model or prompt may make some responses better and some worse, so we need to be able to measure in aggregate across many experiences to determine whether there is a regression that needs to be fixed or rolled back. Brendan Burns' talk was a hands-on introduction to approaches they took during the development of Azure Copilot, describing both the problem of reliability in the world of AI models and real-world applications that are in use in production today.
The speaker suggested using AI to develop images for your presentations. While I commend the idea of using specific relevant images, the use of AI to generate images without compensation to the source materials' artists is a personal pet peeve. His suggestion made me lose any respect for him.
We're generally moving our models from "right versus wrong" to more of "good versus bad." In the former, we would expect either a success or an error response, but now we have to evaluate what "success" means since it could contain garbage (such as "2+2=5") instead of the correct answer. Users can provide feedback (as basic as a thumbs up/down) to the model, but users tend to (a) keep going (continuing the discussion through multiple prompts) and (b) provide that feedback only when they're dissatisfied with the result. You need to be careful with your assumptions about what thumbs up, thumbs down, and no response actually mean.
With AI, context is everything. However, the user needs to supply a lot of the context in their prompt, and it's not passed along with the up/down feedback. They did enable context preservation with feedback for their internal users' prompts.
He talked a little about metrics. Latency and error rate are still useful and worth collecting, but there are new ones that look more like social media. They pay attention to two additional signals to see if they're doing a good job overall:
- Net satisfaction (NSAT) — Using a scale of very positive, somewhat positive, somewhat negative, or very negative, they generate a satisfaction score by taking the positive percentages minus the negative percentages and adding 100 (so the possible score range is 0 to 200 instead of –100 to 100).
- Net Promoter Score (NPS) — The world is divided into promoters (positive), passives (neutral), and detractors (negatives), and the NPS is the percentage of promoters less the percentage of detractors.
NSAT and NPS always dip during any system outages independent of Copilot — Azure issues in any given region mean people get pissed and are less likely to vote very satisfied or satisfied in those cases. Cross-correlating to outages can clear some or all of the dips... but you need to be careful. Other things they've seen have been OS or browser specific for things outside Copilot's control.
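As a sketch of the scoring described above (my reading of the talk, not necessarily Microsoft's exact formula), the two signals might be computed like this:

```python
def nsat(very_pos: float, somewhat_pos: float,
         somewhat_neg: float, very_neg: float) -> float:
    """Net satisfaction on a 0-200 scale: positive percentages minus
    negative percentages, shifted by +100 so the floor is 0."""
    return (very_pos + somewhat_pos) - (somewhat_neg + very_neg) + 100

def nps(promoters: float, passives: float, detractors: float) -> float:
    """Net Promoter Score: % promoters minus % detractors (-100..100).
    Passives count toward the population but not the score."""
    return promoters - detractors
```

For example, a survey with 50% very positive, 30% somewhat positive, 15% somewhat negative, and 5% very negative yields an NSAT of 160; 60% promoters against 20% detractors yields an NPS of 40.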
Because the systems are dependent on user information, you need it during development... which implies developing in production. They run maybe 99% of the traffic through the production prompt and 1% through the in-development prompt. They leveraged the experimentation framework built into Azure. If they detect errors they disconnect the in-development feed.
It's also valuable to test in production. The 99/1% experiment can tell you if it's a good idea but they need to know if it's working. They dogfood their own; internal users get the "next" (dev, canary) version of the portal while everyone else gets the current (prod) version. The URL doesn't change so their users don't know. They include a break-glass to route the internal users to the current portal in case of emergency.
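The 99/1 split and break-glass above could be sketched as deterministic hash-based routing (my illustration; the talk used Azure's built-in experimentation framework rather than anything like this):

```python
import hashlib

DEV_FRACTION = 0.01   # 1% of external traffic goes to the in-development prompt
BREAK_GLASS = False   # emergency switch: route everyone to prod

def route(user_id: str, is_internal: bool = False) -> str:
    """Pick which variant serves this request.

    Internal users dogfood the 'next' (dev/canary) version unless the
    break-glass flag is set; external traffic is split deterministically
    by hashing the user id, so a given user always lands in the same bucket.
    """
    if BREAK_GLASS:
        return "prod"
    if is_internal:
        return "next"
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "dev" if bucket < DEV_FRACTION * 10_000 else "prod"
```

Hashing (rather than random choice) keeps each user's experience stable across requests, which matters when a conversation spans multiple prompts.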
How do they qualify a release? It used to be "Run all the tests and if they're all green you release it." Now it's different because every change you make can make some results better but some results worse. You can't really unit-test for that. Instead you need to run thousands of prompts through the system to get a statistically valid signal. That means you need to generate test cases (do you sample from end users and violate privacy?); they ask the LLM to generate 10K test cases itself. You still have to figure out if the result is good... you can ask the LLM "Given this prompt and this answer, do you think that's a good answer?"
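A minimal sketch of that qualification loop, with the model call and the LLM-as-judge question stubbed out as hypothetical callables (the threshold is an assumption for illustration):

```python
def qualify_release(prompts, answer_fn, judge_fn, threshold=0.95):
    """Run many prompts through the candidate system and score the pass rate.

    answer_fn(prompt) stands in for calling the new model/prompt;
    judge_fn(prompt, answer) stands in for asking an LLM "Given this
    prompt and this answer, do you think that's a good answer?"
    Returns (release_ok, pass_rate).
    """
    good = sum(1 for p in prompts if judge_fn(p, answer_fn(p)))
    rate = good / len(prompts)
    return rate >= threshold, rate
```

In real use the prompts themselves would also be LLM-generated (the 10K test cases from the talk), with a human spot-checking a small sample for quality.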
Models are code, too, so you need to do the 99/1 testing there too. They've seen regressions. You need the same canary/break glass approach here.
Note that switching models can be a major event. The switches need to be planned for in terms of compute, system modifications, etc. It's almost like switching the underlying database or OS, so rearchitecting is required.
AI in a large organization: All the observability turns out to be crucially important. In any big org there are a lot of different product owners. Is "How do I back up my database?" a question for the backup team or the database team? Which team's handler needs to handle that request (or both)? Having the observability into how it chooses which to ask helped defuse the "Why do you love them and hate us?" conversations.
Q: Given that the prompt is the code, that raises a really interesting parallel between the "invisible" code in your prompt and your dependencies in a software supply chain. Are there any lessons here that AI and traditional codebases can learn from each other regarding understanding all of the code, even the invisible parts?
A: Look at prompt injection attacks (a variant of SQL injection attacks)! Also, this is natural language programming (NLP).
Q: How do you prevent the models from "cheating" and saying "yes, the result is good" when it is not? or generating only the test cases they "know" the answers to?
A: The models don't have goals to want to say Yes. They won't generate purposefully-easy test cases. They may generate bad test cases, but human inspection of some (say, 50 of 10K) may let you determine if the prompts are good. Breadth is probably better than a much smaller sample size. They also do bug bashes ("Everybody beat on it at lunch and see if there's anything very stupid!")
There is unfortunately documented evidence suggesting LLMs do in fact lie, cheat, and hallucinate.
Plenary: Improving the SRE Experience for 10 Years as a Free, Open, and Automated Certificate Authority
[Direct slides link | Direct video link]
Ubiquitous HTTPS is an essential part of a secure and privacy-respecting Internet. To that end, the public benefit certificate authority Let's Encrypt has been issuing TLS certificates free of cost in a reliable, automated, and trustworthy manner for ten years. In that time, they've grown to servicing over 500,000,000 websites. In this talk Matthew McPherrin went into the history of Let's Encrypt and shared helpful context for those managing TLS certificates, as well as information about upcoming changes to Let's Encrypt and guidance for the future. They also covered how they have strived to make the working lives of SREs around the world easier, and how the SRE community has helped them in return.
Let's Encrypt is ten years old! Back then, most of the net was still http (including session cookies). Firesheep was scary; it let anyone steal your credentials. Many websites couldn't support TLS at all. How do we encrypt [most if not all of] the web? TLS performance was poor, CPUs needed better instruction sets, and so on. But certificates were the problem: They cost money (which smaller organizations and individuals might not be able to afford), provisioning required a lot of emails and copying files around and manually installing them, and renewing was entirely manual. The only real fix was creating a new certificate authority (CA): free, open to anybody, and 100% automated.
How did they become a trusted CA? The infrastructure is easy but how do you get others to trust you?
- Build the CA infrastructure. We do this a lot.
- Audits. The CA/Browser Forum identifies baseline requirements so you can get a WebTrust audit.
- Become trusted. Once you've passed the audit you're good... but you need to get the root programs to decide to trust you (Apple, Chrome, Microsoft, and Mozilla).
- Now every single device on the planet needs to have its software updated to include the new direct trust. Alternatively, you can get another CA to cross-sign you. The cross-signer takes on risk if you start doing bad things.
Getting trust and transparency right was an interesting problem. First, they had everything open source (their CA software, Boulder, is in their public git repository); it helps a lot with industry collaboration. Next, the Mozilla root program requires fully-open incident reporting. Over the last decade Let's Encrypt has led the industry in how to create and handle incident reports.
They also had an RFC (8555) about the standardized ACME protocol. That lets them focus on building the CA side and get analysis of the protocol, and less so on the client side. They started with certbot (Python), but it's now embedded in a lot of places... including many other CAs.
They said they would ONLY accept the ACME API and no other APIs. Having a single path prevents side-channel attacks or back doors.
They also have certificate transparency logs. Certificate transparency (CT) wasn't around in 2015. The logs are run by multiple operators, browsers require them, and LE was the first to integrate them. Monitoring the logs lets you know what's going on in your infrastructure.
The Let's Encrypt team at ISRG is 25 people total (4 developers and 9 SREs). ISRG has three projects: Let's Encrypt, Divvi Up, and Prossimo (memory safety). One of their biggest concerns is efficiency. Scaling up the organization without scaling up the staff is a challenge: How do you build stable services requiring minimal human intervention? They attend SREcon and they also focus on the important goal ("to encrypt the web so everything online is secure").
They only do SSL/TLS certificates, not others like S/MIME. Let's Encrypt costs $4.5M/yr to run (up from $1M a decade ago).
What's their infrastructure like?
- 2 datacenter locations
- 3 racks with 24 servers, network, HSMs, etc.
- Some cloud resources for CDNs, DDOS prevention, logging, and metrics
- Air gapped offline PKI
Lessons learned:
- Hardware is cheap (compared to staff)
- MariaDB hosts 20 TB of data at 10K reads/sec and 1.6K writes/sec (2x EPYC 7542, 2 TB RAM, 24x 6.4 TB NVMe, 4 of which they can hot-swap).
- May need to scale horizontally someday (Vitess?)
- Software stack: Linux, Nomad, Proxmox, MariaDB, Saltstack, Ansible, Prometheus, and Redis
- Prossimo memory safety lets them use ntpd-rs, Hickory DNS (coming soon), Rustls, and the River reverse proxy
What's next?
- Revocation changes: Ending OCSP. It's been problematic and has privacy issues that are difficult to resolve (Chrome ignores OCSP completely, and browsers have to fail open). OCSP replaced the older CRLs; browsers have since done work to make CRLs better, consolidating them and pushing them out to browsers without involving the CAs. (Let's Encrypt will end their OCSP service later this year.)
- ACME Renewal Information (ARI): Systems need to know when to renew their certificates, and telling people when to do it is a problem. We used to use emails, but with the 24-hour revocation requirement, a revocation on Saturday means you'll have an outage Sunday. There's a new ACME renewal endpoint (in the final draft of becoming an IETF RFC) that provides advance notice of impending revocation. (Remember, LE certificates have a 90-day lifetime.)
- Once ARI is a thing they will no longer send expiration emails. That means requiring less PII from the customers.
- Make the CT logs more efficient. ["Sunlight"]
- Short-lived certs (towards an opt-in 6-day limit) with the corresponding reduction in reliance on revocation... which may allow them to include IP addresses in the certificate.
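The renewal-timing point above can be sketched as a simple fraction-of-lifetime check (certbot's well-known default is to renew a 90-day cert with 30 days left; ARI refines this by letting the CA suggest or move up the window):

```python
from datetime import datetime, timedelta

def should_renew(not_before: datetime, not_after: datetime,
                 now: datetime, renew_fraction: float = 2 / 3) -> bool:
    """Renew once a certificate has used renew_fraction of its lifetime.

    For a 90-day Let's Encrypt cert with the default 2/3 fraction this
    means renewing with 30 days remaining, the common certbot behavior.
    An ARI-aware client would instead ask the CA's renewal endpoint for
    its suggested window and renew inside it.
    """
    lifetime = not_after - not_before
    return now >= not_before + lifetime * renew_fraction
```

With 6-day certificates the same logic renews every ~4 days, which is why that lifetime only makes sense with fully automated clients.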
The 6-day certs are opt-in and require a lot of automation in the client environment.
Q: SSL for websites can be argued to serve two purposes. The first being security, but the second being this fuzzier notion of "proof of authenticity". Back when SSL was expensive, that worked, so the "lots of money" was arguably a feature, not a bug. Now that SSL is effectively free, we've lost the ability to prove the authenticity of websites. Do you think there's a way for an evolution in web standards and certificates to help address this issue?
A: Yes. There's not a lot of great UX around exposing that. There are people interested in extending to new ways of doing this; the EU is looking at tying a business identity certificate to websites. There were EV certificates but that's not automatable and doesn't really work well in practice.
Q: Does LetsEncrypt's last decade have any generalizable lessons about how/what sorts of widely beneficial "plumbing" can be built without VC funding?
A: In general, before LE happened nobody was convinced it'd be possible. A small number of people decided it needed to exist and could make it exist so they did. It was more an organizational problem and they solved it. (You might still need VC or sponsorships though.)
Q: Establishing trust seems to be an age old question even before computing. How much was the process of getting "trusted" a human trust problem versus a software trust problem?
A: It's almost all human. The code is the easy part. Everything else is a relationship problem: Goals, requirements, benefits, and so on.
An SRE Approach to Monitoring ML in Production
[Direct slides link | Direct video link]
Machine Learning (ML) is becoming a part of many aspects of SRE life. As an SRE, we are (or will be soon) dealing with the challenge of serving ML models as part of a large distributed production system. Unfortunately the domain expertise required to build ML doesn't overlap with the expertise required to run large distributed systems. The SRE community lacks standard practices and experiences that would allow us to operationalize ML and help to answer a critical question: how exactly do we operate ML at scale reliably?
In Daria Barteneva's talk we explored the (lack of) overlap between ML and SRE domains and discussed how we can help practitioners to solve common challenges. Scoping this talk to ML Observability, we decomposed a complex system into its primary components helping engineers to bridge domain expertise gaps in making ML systems more observable.
But when our production system serves ML models, relying only on traditional observability practices is not enough. We reviewed the characteristics and requirements specific to serving ML in production and discussed mechanisms that will help us to understand the end to end system reliability and quality.
Very few attendees have ever built a model. Of those who have, most had more than three levels of if-then-else. Everyone recognizes the model cannot be 100% accurate. "All models are wrong but some are useful." The world changes and models need to change to adapt. (91% of ML models degrade over time.) We need to have an SRE for ML. About half of the survey respondents are not monitoring their ML in production.
SREs have a deep understanding of large-scale distributed systems running reliably in production: building scalable, observable, secure, compliant systems that are resilient to traffic fluctuations and changes, running in a permanent failure mode while dealing with incidents and fire drills.
SRE and AI/ML have some overlap in terms and scope and language, but terminology can differ. Both have common challenges in terms of scale (Moore's Law), performance (Wirth's), shortage of resources (Hoffstadter's), incidents or failure modes (Murphy's), no formal education to support developing cross-domain hybrid capabilities (John Dewey), system understandability (Russell Ackoff), and communication (George Bernard Shaw).
Building a good ML model is only the beginning. It's built to solve a specific problem, but is part of a complex distributed system. Customers don't care about high accuracy but that things work as expected.
Life cycle: Define tasks, get data, train and validate, deploy, monitor and manage, and go back to the data. Monitor business KPIs, infrastructure, service, data, and the model itself. You need to monitor what the customer thinks (such as the up/down feedback mentioned earlier).
Look at ingress and egress, long-term trends, absolute vs relative, oldest vs newest data, processing rates per data type, end-to-end data processing latency, and explainability and understandability. And know about your tradeoffs: was it built for completeness or latency (for example, freeway traffic may care more about latency but financials may care more about completeness)?
Know your expectations — especially privacy, security, compliance and governance requirements and ethics and bias (guardrails) — first.
Production monitoring has five blocks:
- Infrastructure (cpu, mem, disk, net)
- Service (transaction rates, response times, application errors)
- Data (freshness, volume, noise, variability, distribution drifts)
- Model (accuracy, precision, recall, prediction or concept drifts, uncertainty estimation)
- Deployment (DIY CI/CD support for experiments and test in production; see the keynote regarding 99/1% routing)
What's a distribution drift? It's the change in data distribution into the data store with respect to a baseline. Over time, user behavior changes lead to input data changes so the model performance decays. This affects the model's accuracy. Drift can be measured in many ways. If your drift is "too big" (which you have to define for your circumstances) you may need to retrain.
You need to look for data/feature drift, prediction drift, and concept drift.
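One common way to quantify drift against a baseline is the Population Stability Index; here is a minimal sketch (the thresholds in the comment are conventional rules of thumb, not from the talk — "too big" still has to be defined for your circumstances):

```python
import math

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.

    expected/actual are lists of bin proportions (each summing to ~1),
    e.g. histograms of an input feature at training time vs today.
    Rule of thumb (assumed here, tune per use case): < 0.1 stable,
    0.1-0.25 moderate drift, > 0.25 significant drift (consider retraining).
    """
    score = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)   # avoid log(0) on empty bins
        score += (a - e) * math.log(a / e)
    return score
```

Identical distributions score 0; the more the mass shifts between bins, the larger the score, which makes it easy to alert on a per-feature drift threshold.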
There's also human entropy:
- Telemetry hoarding (more telemetry is not the same as more observability)
- SLIs not aligned with business objectives or customer experience
- Never turn off alerts or monitors since you might need it someday
- 500k telemetry dashboards that nobody looks at or maintains
- Lack of shared understanding of different system failure modes
- Reactive culture where new monitors are only added after an incident
- Reward misalignment
- Observability hero (everyone should monitor)
- Monitoring individual components without end-to-end understanding
Transformers in SRE Land: Evolving to Manage AI Infrastructure
[Direct slides link | Direct video link]
The rapid advancement of AI has fundamentally transformed the technological landscape. As AI models grow in complexity and scale, the challenges of managing the underlying infrastructure have intensified commensurately. Qian Ding's presentation explores the unique demands of AI infrastructure and how SREs can adapt to this evolving environment.
He delved into the specific challenges of managing GPU-accelerated clusters, including anomaly detection, node lifecycle management, and the distinctive requirements of AI workloads. By sharing real-world experiences and lessons learned, they provided valuable insights into how SREs can effectively navigate this new frontier, ensuring the reliability, scalability, and performance of AI infrastructure.
Lunch
Lunch today had a middle eastern theme: za'atar spiced lentil soup, field greens salad with heirloom tomatoes and kalamata olives, mushroom saffron rice, grilled vegetables, chicken shawarma, and salmon, with mini lemon tartlets and baklava. Allowing for taste, the only complaint was that the salmon was dry (but you try cooking salmon for 300–500 people to hold and serve on a buffet line).
Live, Laugh, Log
[Direct slides link | Direct video link]
Telemetry pipelines are the unsung heroes that shepherd data from applications and infrastructure to your observability and monitoring systems. It's often up to SRE to ensure these pipelines are in tip-top shape, allowing logs to flow freely. However, a lot can go awry on the journey a log takes — from source issues and bad data formatting to misconfigured processing steps, congestion and under-provisioning. Paige Cruz's talk dove into operating and monitoring Fluent Bit, helping you live, laugh, and log reliably.
We started with storytime. On day one she had to start a service and do a pull request. The logging was a mess... "and that's just normal; ignore it." Tracing is good, telemetry is good, but logs are still essential, and:
- Logs are everywhere... with inconsistent formatting from multiple sources.
- Logs are growing... at 250% volume year over year (from a 127-respondent survey). People tend to move them to cold storage [too soon].
- Logs have varied usage and are used by different roles with different things they're looking for.
Logs are a mess but they have value. How can we get them from place to place in the most performant way possible? Fluent Bit is what they use... and it handles metrics and will soon handle traces. It enables you to collect event data (logs) from any source, enrich it with filters, and send to any destination.
Things are great on the happy path. What can go wrong? The four dreadful Ds:
- Dropped messages
- Delayed messages (e.g., alerting on metrics or on the content)
- Duplicated messages (Fluent Bit retries in chunks)
- Distorted messages (parser regex may not match, log formats change, etc.)
What causes those situations? Backpressure!
- Insufficient resources (CPU, memory, bandwidth)
- Data sources sending too much for ingestion
- Processing bottlenecks (filters run sequentially)
- Delivery issues (and bundling the continued processing with the retries)
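A toy illustration of backpressure producing dropped messages when a bounded buffer fills faster than the sink drains (my sketch of the general failure mode; Fluent Bit's actual buffering, paused-input, and retry policies are configurable and more sophisticated):

```python
from collections import deque

class BoundedBuffer:
    """Toy ingestion buffer: when the sink can't keep up, new log
    records are dropped rather than growing memory without bound --
    one simple backpressure policy among several possible ones."""

    def __init__(self, capacity: int):
        self.buf = deque()
        self.capacity = capacity
        self.dropped = 0

    def ingest(self, record) -> None:
        if len(self.buf) >= self.capacity:
            self.dropped += 1      # a "dropped message" in the four-Ds sense
        else:
            self.buf.append(record)

    def flush(self, n: int) -> list:
        """The sink drains up to n records per cycle; anything left
        behind is a 'delayed message' waiting for the next cycle."""
        out = []
        while self.buf and len(out) < n:
            out.append(self.buf.popleft())
        return out
```

Run 25 records into a 10-slot buffer with no flushing and 15 are dropped; drain slower than you ingest and the survivors arrive late. That is the dropped/delayed tradeoff in miniature.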
Multithreading is the best way to increase performance... assuming you don't rely on strictly chronological order. Stream processing lets you handle a stream in real time and is useful for real-time and complex analysis needs. Meta-monitoring means monitoring the telemetry pipeline itself; you can't handle Fluent Bit's logs with Fluent Bit alone.
Distributed Tracing in Action: Our Journey with OpenTelemetry
[Direct slides link | Direct video link]
Chris Detsicas spoke about their journey with Distributed Tracing, leveraging OpenTelemetry and Istio in a dynamic microservices landscape. An internal Observability team embarked on a mission to empower engineers with deep application insights.
This talk encapsulated their journey, challenges encountered, and critical decisions made during the adoption of OpenTelemetry tracing. They discussed context propagation hurdles, the significance of automatic instrumentation, and the importance of testing. Furthermore, they provided an overview of their pipeline implementation and shared key examples of how enabling their tracing solution has provided critical insights, helped them troubleshoot issues more effectively, and enhanced their understanding of application performance.
Before distributed tracing there were a lot of challenges in observability: manual correlation, lack of context (esp. in high-volume logs), tribal knowledge (made worse by turnover and reorganization), and people fatigue. These all lead to slow incident resolution.
Once you add distributed tracing, you have the same logs but with the trace ID added. Then you can look under the hood: you can dig down into the exception message and stack trace. In this case the UI was running a test that no longer existed, and it caused a 500 rather than a 404. They were able to submit a pull request to change things.
Context propagation is key. He gave an example: as each service calls the next, the traceID stays the same, and the parentSpanID and spanID follow along so you know A called B called C.
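The propagation described above can be sketched in the style of W3C trace context (my illustration of the idea, not their implementation):

```python
import secrets

def new_trace() -> dict:
    """Start a root span: a fresh 128-bit trace id, no parent."""
    return {"trace_id": secrets.token_hex(16),
            "span_id": secrets.token_hex(8),
            "parent_span_id": None}

def child_span(parent: dict) -> dict:
    """A downstream call keeps the trace id; the caller's span id
    becomes the child's parent_span_id, so A -> B -> C is recoverable."""
    return {"trace_id": parent["trace_id"],
            "span_id": secrets.token_hex(8),
            "parent_span_id": parent["span_id"]}
```

In practice this context rides along in a header (W3C `traceparent` over HTTP, metadata over gRPC), which is exactly what the automatic instrumentation handles for you.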
OpenTelemetry (AKA OTel) is a standard collection of SDKs, APIs, and agents to import and export data. There's an OTel K8s operator.
Once you have the code changed people have to start adopting it:
- Automatic instrumentation is for a wide range of languages and functions (http, rpc, etc.) and allows for context propagation.
- Manual instrumentation is where a developer adds the SDKs or APIs to their code but needs to configure them to get what you need.
- Injected automatic instrumentation uses the agents to inject automatic instrumentation. This is a good place to start.
They have instrumented their core services and have a rich set of documentation. Tech talks (like this one) are also helpful.
How are they driving usage?
- Update runbooks
- Combine signals (traceIDs in logs)
- Trace explorer dashboard
- Incident game days
- Evangelizing ("Can we solve this with tracing?") — for example, they treated a Jenkins build as a single trace, identified places for improvement (embed a file instead of calling out to a remote host/file)
What's next?
- Instrument more!
- Further mixing of signals with metrics exemplars.
- OTel in the front end to better understand the entire user experience.
- Advanced sampling strategies. They do 100% in Staging but 1% in Production.
Q: Roughly how much manual effort was needed to fully instrument your environment, did data/logging metadata have to be updated at the source, and how close did auto-instrumentation get you to your final state?
A: Some teams went manual (C++, Go, Rust); others went with automatic injection. They aren't 100% done but have been doing this for 18 months.
Q: Sampling question: you may want to sample tracing based on return status, or total time, which you'll get at the end of processing. By then, you will have already propagated the trace flags, and you can't take that decision back. Which technique/strategy do you deploy to tackle this?
A: Tail sampling is what they want to look at.
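A tail-sampling decision like the one raised in the Q&A might look like this sketch (the thresholds and policy are assumptions for illustration, not their pipeline):

```python
import hashlib

def keep_trace(trace_id: str, status: int, duration_ms: float,
               slow_ms: float = 1000, baseline: float = 0.01) -> bool:
    """Decide after a trace completes whether to keep it.

    Always keep errors and slow traces; keep a deterministic baseline
    fraction (here 1%, matching their production head-sampling rate)
    of the boring rest, chosen by hashing the trace id so the decision
    is consistent across collectors.
    """
    if status >= 500 or duration_ms >= slow_ms:
        return True
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < baseline * 10_000
```

The catch the questioner raised still applies: because the verdict arrives only after the trace ends, all spans must be buffered somewhere (typically a collector tier) until the decision is made, which is why tail sampling may mean more collector layers.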
Q: If you could go back in time and give your team one piece of advice about implementing OTel at thousandeyes, what would you say?
A: It will take time and adoption will be a hurdle, so "Don't give up."
Q: Does the auto instrument support gRPC, message queue? Or only http?
A: It depends on the language and the specific agent.
Q: Programming languages?
A: Java, Python, PHP, a few others.
Q: How have you found the upstream OTel community? How much support did you get from the community, and have you found things you wanted to fix upstream?
A: I wish we had more time. The community has tons of information and the Slack experience has been good.
Q: How did you measure success of the project? usage of tracing? improvement in TTR metrics? something else?
A: MTTR isn't useful as they've also changed processes. How many are using it? What questions are you getting? We see things from a performance basis more than an incident basis.
Q: How did you approach production readiness for the collector infrastructure itself? Have you had challenges scaling with demand?
A: With OTel we used tests, tried in staging, and managed to kill a database in staging once.
Q: How different is this from appdynamics?
A: This is an internal project.
Q: How much do egress charges from your cloud vendor affect your sampling strategy?
A: We're not egressing data.
Q: Do you operate a single layer of collectors for sampling, or multiple layers?
A: Just one for now, but tail sampling may mean more.
Lies Programmers Believe about Memory
[Direct slides link | Direct video link]
How does kernel memory management actually work? The Linux kernel provides a number of abstractions on top of physical memory, which, like most abstractions, can either be a blessing or a curse, especially when it comes to understanding application behavior. Some of these exist in conjunction with the hardware, like translation lookaside buffers, page tables, and the like, and some of them are Linux's own internal abstractions over memory, like different classes of memory within the operating system itself (with bonus special and often misunderstood properties).
Chris Down, a kernel engineer who works on the Linux memory management subsystem, went over things like the CPU's memory management internals, pages, the inner workings of virtual memory, and the complex tradeoffs made during modern memory management. Along the way, he tried to demystify the kernel and CPU behaviors around memory, went over how this might actually affect us as SREs, and hopefully enabled us to introspect and build more reliable systems as a result.
We often believe we know the answers about memory and swap and how they work. We're often wrong about that. For example, malloc(), despite its name, doesn't actually allocate memory. The operating system generally wants to extract the most possible use out of the available system memory without compromising safety, using minimal overhead resources, and do it all transparently to the application itself.
Linux employs the overcommit approach, letting it allocate more address space than the system has memory. Demand paging is what actually loads pages into memory, and swap is what moves them out to backing store. Allocating memory has nothing to do with actually using it; applications often request memory they never touch. The downside is that if everyone turns up and wants all their memory at once, there's a problem.
Linux does virtual memory under the hood, which adds one or more levels of abstraction to the memory. The program sees it as memory but it's really just an abstraction or pointer. It might be mapped to memory or disk, or not mapped at all. When something is unmapped, the kernel needs to figure out why: not yet faulted in, swapped out, or a genuinely invalid access.
Demand paging allocates the physical memory when it's actually used, when the CPU raises a page fault because there's no mapping from the requested address to physical memory. From user space, the app calls malloc() or mmap() or something similar in the user-space library. Internally that's more likely a brk(), sbrk(), or mmap() system call, or reuse of previously freed pages. Virtual (not physical) memory is now set up for the application. When the application tries to write to it, the CPU generates a page fault, and the kernel handles it by mapping the virtual address to a physical one.
Similarly, free() doesn't really free the memory. Three things can happen: (a) RSS is unchanged, because the allocator put the memory on its free list but didn't return it to the OS after all; (b) RSS goes down but not to zero, because the allocator returned some but not all of it to the OS; or (c) RSS goes UP, because the allocator had to realign memory boundaries. free() is just a signal to the memory allocator saying "I don't need this any more."
All this is why monitoring memory leaks in production is difficult.
He gave an example showing that RAM is very bad at random access. DRAM reads involve RAS (row address strobe), CAS (column address strobe), and PRE (precharge); RAS and PRE are expensive while CAS is cheap (from an electrical engineering standpoint), and sequential access within an already-charged row needs only CAS. RAM is laid out row-first, so doing a row-first traversal is basically 50x faster than column-first. Caches also help: sequential access uses a full cache line, where column-wise access wastes bandwidth. TLB entries are finite, so touching fewer pages means fewer TLB misses. Single Instruction Multiple Data (SIMD) vectorization of sequential access helps performance as well.
Remember too that there are different types of memory. The CPU doesn't care, but the OS thinks memory could be anonymous (not backed by a backing store), cache or buffer, slabs, and so on. Reclaimable versus unreclaimable is important but not guaranteed. Resident set size (RSS) is kind of bullshit; it skews attention towards anonymous and mapped memory and ignores buffers and caches, which many workloads depend on.
cgroups are a kernel hack to limit resources within a specific service. They solve a lot of problems and limitations with Linux memory management. They're about 14 years old now. In cgroup v2 they limit everything, all kinds of memory, as opposed to process (cgroup v1) limits.
/sys/fs/cgroup has a best-effort slice for the nice-to-have bits and a workload slice for the things you must absolutely keep running. Where do you set memory.max to prevent OOMs until everything has a limit? We use memory.low on the workload slice instead and let the OS sort it out with reclaim.
This brings us to swap. Swap isn't emergency memory; instead, it increases reclaim equality and the reliability of the system's forward progress. It also means maintaining a small positive memory pressure (like "make -j crores +1"). Avoiding swap doesn't magically make memory-induced I/O go away. There are tradeoffs (see the post).
How can you view memory usage for a process in Linux? "top" and the like really only measure RSS and have no idea about any caches you may be using. The problem is that caches (in a complex application) are not optional. cgroup v2 has a memory.current that tells the truth, but it's complicated... slack grows to fill up to the cgroup limits if there's no global pressure.
RSS is a stable (static if wrong) metric which is why the industry standardized on it.
senpai is a tool to tell you how much memory your application really needs. It's new to the kernel metrics space.
Why do this? A team thought they used 150MB but it really needed 2GB to run. On each machine. It also amortizes memory when things are okay.
Q: How does malloc work for JVM behind the scenes?
A: "I have no clue" as he doesn't use Java.
Q: Is senpai a tool we can use outside of meta?
A: Yes; it's open source. Meta uses one that's integrated with another internal tool. It's about 200 lines; the magic is inside the kernel itself.
Q: How does over-allocating memory (small positive pressure) impact system performance?
A: In the normal case it's good because we distrust what application developers say. Fudging the numbers when the developers don't know what they really want works well since they often overestimate... but when it works badly it works very badly. There is an oomd user-space daemon to manage OOM issues.
Q: What's the best thing I should alert on for server oversubscription? Alerting on system memory use can be noisy, but I also want to find out about problems before things start OOMing.
A: In general, alerting on memory is a bad idea. People alert on things they're afraid they'll forget about, which is a bad reason. You want to know about bad things that are actually bad things. Alert on throughput and errors.
"On-Call Is Ruining My Life" and Other Tales about Holding the Pager as an SRE
[Direct slides link | Direct video link]
There's no other part of SRE life that evokes such a strong reaction as being on-call. From the fear and anticipation of your first shift to the white-knuckle drama of a total system outage and the joy and satisfaction of debugging a particularly thorny issue — holding the pager is as much a human experience as a technical one.
They've done some surveys, pored over the literature, marinated in their experiences and have some findings. What models are in use? How do we collectively feel about this work? What impact does it have? Can we do better? Will I get a pony? (Okay, maybe not the last one.)
He presented some provocative findings that question the status quo around on-call and suggest some experiments you can take back and test out. Maybe there will be a pony?
[Narrator: There was no pony.]
He told a story about being on-call. They ingested a lot through Kafka, and by doing the wrong thing he broke Kafka (caused it to freeze) in the middle of the day. Since then he's always hated on-call.
Most organizations are not set up to support engineers handling the workload, interruptions, and stress of on-call... but the teams are doing it anyway. Leadership often thinks "heroics are a viable, long-term strategy."
He started reading papers and applying them to his job, and wanted research from across the industry. They reviewed 30+ academic papers and 40 industry materials, ran a 65-question survey, and got a lot of metrics. 87% of respondents were not happy with the industry standard in any category.
He organized his findings into what he called the no-good nine, and spoke in depth on seven of them:
- Onboarding and training is the worst and makes people the unhappiest, 2.5 out of 4 stars on average. 34% said it was worse than standard.
- Reactive improvements or fixing what breaks is the junk drawer of ops. No real strategy, no measurement, and no investment. The few with good experiences had regular meetings to discuss what to do about it.
- Limited agency: 20% of people can't or won't update alerts. Permissions block improvement, some can only escalate, stakeholders block... and only 30% sometimes felt supported when they were overwhelmed.
- Complex or no processes. 7 said improvements were never prioritized. Escalation paths were unclear, and people didn't know what to alert on. 34% said their team doesn't monitor how their on-call practices and tools are working. What are your response SLOs? Can you shower, or is that too long? The person on-call might not know about changes in the environment (deployments, new customers, etc.) that hint at what might break.
- Unsupported handoffs. 28% never expected a handoff, and 11% sometimes get one. One handoff was "Here you go, have a good night." We know they're valuable but so few do it well. Take a few minutes to hand off what you did when you go off-call so the new person has a clue.
- Idealistic scheduling. 2.85 out of 4 stars for shift frequency, 2.83 out of 4 for schedule management, but 53% sometimes or always feel anxious about being on-call. It's also difficult to change schedules. People often thought 24x7 rotations for one week was generally fair.
- Too many responsibilities. SLAs aren't sustainable. There's artificial, not actual, urgency. You can't get ahead of the workload. There are tradeoffs between efficiency and thoroughness. Engineers repeatedly deal with the same problems.
What do you do in addition to on-call? Average of four things, including:
- handling incidents
- internal support requests
- external support requests
- fixing bug tickets
- feature development
- maintenance of alerts
- updating runbooks or documentation
- attending regularly scheduled meetings
(Percentages are on the slides.)
Overwhelmed? It's not just you.
- Clunky tools [not covered]
- Noisy alerts [not covered]
On-call is high-risk and low-reward. It doesn't help your performance review (despite being 6–8 weeks of your work year) and won't get you promoted. On-call engineers and line managers carry the costs of poorly structured on-call programs and tooling. 74% of surveyed engineers reported experiencing overload, burnout, or both.
People work together to overcome the organization's shortcomings: 3.33 out of 4 stars for teammate support (83%) and 3.15 out of 4 stars for management support (87%).
It's an industry problem, not an individual engineer or manager problem.
So what are his recommendations?
- Talk about appropriate practices, not best practices. What's appropriate for your organization may be different than for someone else's.
- Focus on habits more than wholesale changes to processes, responsibilities, and agency.
- Bandaids are not real solutions. What you did at 3am to fix a crisis shouldn't be load bearing... though it often is.
Many of the recommendations assume there are on-call rotations. We don't have them anymore, so this is less useful to me in the short term.
What's next?
- For research, trace the development of on-call programs over time, measure the impact of on-call over time, have industry discussion groups, and share what the appropriate practices are.
- Get better and visit statuswoah.com.
Q: What are your views on on-call being a "warm body" vs people who are able to solve the issue. Just being a person to say "you are important, let me find the right people" to leave the Uber engineer to not deal with false alarms/stupid users?
A: There are situations where an on-call router person or triage person is helpful. Tooling may help with this. Finding people who are usefully awake in the relevant timezone is helpful.
Q: Could we get a show of hands, who's in an organization where being oncall is explicitly called out as one of your responsibilities (along with major projects) to deliver on in your annual/quarterly/... goals?
A: Three people raised their hands. We don't have any real research results yet.
Q: What does it mean to design schedules for "life"?
A: People may have children, have health issues, or be caretakers. That doesn't always lead to a "24x7 for a week" mandatory schedule.
Q: Have you done/are you aware of any work on the pain of on-call triggering or signifying Imposter Syndrome? How much of the pain is just "oh crap I don't measure up"? Might this be a horrible feedback loop?
A: We didn't survey that but it's a great question. Even a senior engineer can push the wrong button.
Conference Reception
After the sessions we had the conference reception on the vendor floor. The food included:
- Assorted breads, cheeses, crackers, candied pecans, crudités, dips, and grilled vegetables
- Black bean quesadillas
- Chicken quesadillas
- Sautéed vegetables
- Build your own tacos:
- Flour and corn tortillas
- Rice
- Beans
- Adobo chicken
- Barbacoa beef
- Cabbage
- Guacamole
- Onions
- Sour cream
I ate, chatted with some folks about SREcon and LISA, how the conference venues were chosen and some of the constraints involved, and about our jobs. I also chatted with Casey Henderson-Ross about life and stuff and things.
I woke up, did my physical therapy exercises, shaved, showered, and generally made myself presentable (wearing an LSA TS-branded collared shirt and slacks, even!) before heading down to the continental breakfast on the vendor show floor. Today's selections were a variety of scones (blueberry, chocolate chip, orange-cranberry, or vanilla), greek yogurt (blueberry, strawberry, and plain), and fresh fruit. I had a chocolate chip scone, strawberry yogurt, and banana (which was riper than yesterday's).
It was very cold in the plenary sessions so before they began I ran up to my room for a sweatshirt. As it happens, my sweatshirt, like my polo shirt, is university-branded... and our first plenary was by someone from Ohio State.
Plenary: SRE & Complexification: Where Verbs and Nouns Do Battle
[Direct slides link | Direct video link]
SRE is one proving ground for resilient performance in action (also known as SNAFU Catching). It is a critical contributor to the scientific foundations of Resilience Engineering. A new round of growth and change is producing new complexity penalties — complexification. How will/can SRE cope as the lines of tension change? The skills and expertise to do SRE well are verb-centric — "resilience — as adaptive capacity — is a verb in the future tense." The human push for advantage from technology change is noun-centric.
SRE is one arena where the two framings conflict, given the expanding layers and tangles of interdependencies. David Woods believes SRE can adapt by innovating new verb-based means to see ahead in order to anticipate, to see around in order to synchronize, and to see anew to reframe models.
David came to praise us, warn us, and offer us an opportunity. DevOps built growth to manage complexity penalties ten years ago. Can we build adaptive capacity now to manage complexity in the future? The opportunity to escape the complexity penalties is marrying computational power with representative power. We keep underutilizing parts of the new technologies.
There are noun people who think the world is nearly optimal (more autonomy since people are the limit): they're stuck on stuff and ask how much, links, counts, and linear. There are also verb people who think everything is messed up (build future adaptive capacity because brittleness and surprise are expected) and focus on activities, doing, adapting, revising, and keeping pace.
In 2015, DevOps was continuous adaptability at scale. The future was here and not as advertised. When complexity penalties grow we need more adaptive capacity to keep pace with change. It's almost a biological model. We see the adaptability during incidents or outages and how we manage and resolve them.
Systems are messy; people or agents provide resilient performance. Systems are always adapting, seeking opportunity and handling challenges. Resources are finite, change is continuous, conflict is ubiquitous, models become stale....
So what about 2025? We realized by now that noun-based metrics like MTTR are just plain wrong (or possibly "not even wrong").
Now, new waves of complexity penalties threaten us. Can DevOps create new opportunities to build adaptive capacity? We're being slammed by AI Gold Rush II, faster-better-complexity pressures, growth, complexity, interdependencies, etc.
People adapt when they have to. The direction matters: we want to revitalize, reprioritize, or reconfigure, not retrench or retreat. Last time (2015) we revitalized, so we've done it before. Can we again? Yes, but how? It's all about the line of representation: the cognitive work is done above the line; the stuff you build and operate is below the line. We control the tools and tooling, but we can only see and act through representations, and the tools we use are relatively crude and primitive. Our opportunity is to see the verbs: pace, tempos, tangles; saturation, lag, friction, congestion, cascades, and conflicts. The principles and techniques generalize, but he can't build the tools... we can.
We need to build tools that can show us what we don't see but need to.
Q: If we think of this like an evolutionary system, those who develop adaptive capacity will "survive" in this new world. What do you think the selection criteria will be as we respond to this new wave of complexification?
A: Guerilla work and underground adaptations are what'll be agile.
Q: Is noun vs verb a hard separation or is there a utopia where they coexist?
A: Both are required. The trick is being verb first and then noun, but right now we're noun only.
Plenary: The Perverse Incentives of Reliability
[Direct slides link | Direct video link]
Are you trying to improve reliability in your company, but coming up against it not being valued unless you're in an active SEV1? Struggling to build a reliability culture in a wider organization? Relying on heroics to keep the lights on?
The reality is that, for most of us, reliability work is not extrinsically rewarded: customers won't write in about the outage you didn't have, and investors aren't impressed that your site is still up. In today's "do more with less" world, increased pressure to deliver value (read: features) often comes at the expense of building resilient systems as we race to hit ever tighter deadlines. In the face of these perverse incentives, it's no wonder that having a reliability focus isn't the norm for so many engineering cultures. There is a better way: harnessing intrinsic motivation. Katie Wilde's talk covered approaches, tactics, and lessons learned to overcome the perverse incentive problem, and how tapping into the inherent pride, joy, and hilarity of incidents can transform reliability practices.
SRE is underfunded and overworked and trying to do more with less. It's harder to maintain the gains we've made, and we're worried things are going to get worse. We should let it burn! SRE pays better than arson.
We're a cost center, and in a world where there's a drive to have AI take over, it's hard to be a cost center. The organization's goals may not align with ours. Nobody cares about uptime until it's not there... and then it's your fault because you suck. The consequences of what you're doing are delayed into the future (because that's how time works). Application developers are highly (and perversely) incentivized to avoid helping operations in production. Bystander effect: some will do as much as they can to avoid dealing with an incident.
"Every single person who confuses correlation and causation ends up dying."
We run towards the fire (but we're not bystanders). Who broke us?
We measure, change, and analyze to understand the ROI of our reliability initiatives. But we don't need to measure to know things aren't good. (This is being a noun person, which is bad.)
How do we prevent outages from reaching production? Change is the problem. We can put a lot of effort into building the safety net... but is it worth the investment of time and resources?
Downtime is not the problem. (Preventing it isn't necessarily the right thing.) Extended downtime is our problem. There's a window of opportunity when (short term) downtime is okay. "Is my net down? Did twitbook go down?" If the users don't notice [or care] then downtime is okay... because they don't know yet whether they or you are the problem. Find the downtime window that works for you — it may be seconds, minutes, or even days. This is not your error budget.
One 44-minute outage that the users don't notice as downtime counts as zero outages, and thus infinite uptime.
Mission: time to mitigate needs to be less than your downtime tolerance window. But you need to keep rebalancing that equation, because the tolerance window may change.
"Do. Or do not. There is no try. ... But there is a rollback." Prioritize the generic mitigation. Consider testing rollback processes so you know they work in advance. Think about load shedding when the electrical grid is overloaded.
We deserve love and money (not the band). Greed and fear are two major drivers of humans. How do we exploit that to get love and money? It depends on what kind of org you're in: make friends with your sales team (if you have one) or product manager, and with legal (what if we get sued?). Your slightly greedy friends in sales, product management, and legal will get the love and money for you.
But what about the application developers who keep wreaking havoc and not caring? You still need them to care and not be incentivized to run away. One way is the nerd swipe (when things break, make it a tech whodunit). Another is to have fun; you can force them to give you time but not attention. You need to be kind — even when they page you at 3am to read them the manual — and have fun.
Q: Do you have any thoughts on the ethical nature or some other call to action SREs seem to have to correct even non technical systems? That is, not letting it burn down. Notable ones that come to mind are Susan Fowler and Liz Fong Jones, but many others as well.
A: Do not be in the business of what should be but in the business of what is.
Q: How can we safely compartmentalise our professional identity that tells us that the Right Thing is to stop fires before they start, without burning out ourselves?
A: Therapy.
Q: "Build fast and break things" is the bane of the SRE experience. Do you think the practice of making breaking easier to recover from further incentivizes poor input to begin with? How do you walk the line between aligning the incentives versus further perverting them?
A: The incentives are perverted. We can't control that people want to put junk in the system (and they're incentivized to do that). So how do you build a system that moves the crap out of production as fast as you can, as opposed to creating the ultimate filter in advance.
Q: How do you know when to not die on that hill, in regards to telling people their dumpster is on fire (and you just go fix it)? How do you keep perverse incentives on the dev side at bay? I'm assuming the "Downtime tolerance window" are your thresholds for developers?
A: Make it page the developers. "If you like it then you should've put an SLO in it?"
Q: What was the song?
A: There's a new one every week. Ask me afterwards.
Q: Many times SREs get pulled into an issue that isn't technical, more like a user-education issue: something rolled out to prod that users don't know how to use?
A: The system is working as intended, it's just not what you expected it to intend. It's an issue about boundaries; sometimes it's not an SRE problem. It's often hard for us to set those boundaries.
Q: Is the idea that we're rewarded for quickly mitigating incidents instead of outright preventing them? Is there no hope to get people to care about the incidents we've outright prevented?
A: It's hard to get people to care about an alternate reality they did not inherit. You can make entertaining stories about The Disaster That Didn't Happen, though since it didn't happen, telling it takes some imagination.
Q: Given how rare incidents are (~99.n% of the time aren't incidents) and incidents are prevented continuously, are there tricks you have to exploit that for the better?
A: The reason why incidents are rare is because your incidents don't hit the error budget. Talk about them a lot: here's what went well, here's what it prevented, and so on. Focus on the less-severe incidents and how what you did kept the severity low. You can also do incident reviews of near misses to celebrate what's working.
Q: How do we, as SREs with perverse incentives, somehow convince leadership to loosen their grip on the wheel and allow for smaller, more controlled breakages (i.e. doing disaster recovery drills or "chaos" engineering)?
A: First, don't use the phrase "chaos engineering" outside a room of only engineers; rebrand it. Talk to the legal and compliance people to use their wording to make it a compliance exercise.
Learning from Incidents at Scale; Actually Doing Cross-Incident Analysis
[Direct slides link | Direct video link]
For a few years we have discussed this idea of Learning from Incidents that encourages folks to deeply understand an incident through a thorough, in-depth investigation of how it came to be. Vanessa Huerta Granda personally led these investigations, wrote about them, and coached folks on them and while she stands by this process she has also seen how difficult it is to scale this process.
In her talk she described how her team (resiliency engineering) has been able to leverage their incident review program to learn from incidents at scale and how they've been able to analyze a universe of incidents broken out into quarters, years, products, and technologies and gain insights and make recommendations to improve our sociotechnical systems.
Measure a year in time, number of incidents, number of action items, how long the incidents lasted... but do they tell you anything? (Generally no because there's no context.) Instead we want to get insights and areas of opportunity.
Focus on learning from the incidents, with blame-aware (but not blameless) reviews and action items. How do you learn? Narrative-based approach: identify the data (what took place, where, and when), interview people, have a review meeting, and generate a timeline. Jot down questions and collaboratively meet to share experience. Finalize the report in a format that's specific to the audience.
We do that today, but doing it in a self-sustaining program at scale can be harder. You want to decide how to limit the inputs (maybe only sev0 and sev1) and look for themes (did you depend on senior engineers' knowledge? are they still there? is there an interaction between teams?).
It's a lot of work to create a narrative timeline. And engineering and analysis are different skill sets, especially when it comes to qualitative data. (What matters? What doesn't? Can you understand what the data is telling you?) Presenting the findings and recommendations in a way peers and external people with differing priorities can understand is a challenge.
She talked about a success story at Enova. They have a dedicated resilience team. They hold blame-aware learning reviews for every major incident and sensitive incident. They focus on learning. They do macro analysis of data quarterly with recommendations, monthly meetings with product leadership, and other teams bought into the process (Compliance, Legal, Marketing, etc.).
How did they get there? They started with one person (50% FTE). They proved it was valuable and added headcount to form a team. They were vocal about their successes and improvements, and they continued to evolve. It's all about creativity... and sometimes some boundary pushing.
Pretend to be an analyst:
- Be curious, not judgemental.
- Focus on the SOCIOtechnical system.
- Pull at threads.
- Look at the context. (Context is everything!)
- Talk to people. (Understand what it meant for their point of view.)
Success 1: Move away from being an action items factory. During or after an incident you can do quick fixes and action items, but the data analysis looks at possible longer-term changes (recommendation, priorities, head count).
Success 2: Transparency. Teams could learn from each other. Including the whole organization (customer success, marketing, legal, etc.).
Failure 1: MTTX. People cared about it early on, but it didn't tell them much, so they were able to (slowly, over time) stop using it by providing better insights instead.
Failure 2: Metrics and goals. Numbers of incidents and internally-caused incidents may sound good, but metrics then become goals and people game the numbers. They added context to show there are a lot of learnings from different kinds of incidents (including near misses for security- and risk-related incidents).
How can you do this? Start small, get others involved (including management and other organizations), and don't try to be a hero. It's a marathon not a sprint so take your time.
Running DRP Tabletop Exercises
[Direct slides link | Direct video link]
A disaster recovery plan (DRP) documents policies and detailed procedures for recovering your organization's critical technology infrastructure, systems, and applications after a disaster. Hopefully you have DRPs for your organization, but how complete are they really, and how and how often do you test them?
In my talk, I helped attendees get a better understanding of what a DRP is and contains, as well as why it's important to write, test, and maintain service-specific DRPs and affiliated documentation. I talked about how we're developing and using collaborative discussion-based thought experiments to test our DRPs, including things they should and shouldn't do when they write and test their own. They may even have gotten some insights on how to design their own services for reliability and recovery.
The talk was well-received. We had more questions in the Slack thread than I could answer in real-time, so we didn't need to use the questions I'd prepared in advance.
Q: I often see that "disaster recovery plans" are heavily (to use this morning's parlance) "noun-based", rather than being flexible to the many ways disaster can strike (and are therefore brittle in practice). Is this an issue you encountered with this work?
A: We hadn't thought about it in those terms, but yes, somewhat. The discussions, especially in our first exercise, showed us the need to have some degree of flexibility depending on the nature of the actual disaster.
Q: In your document template you didn't mention RPO and RTO, those definitions aren't important for the university use case?
A: We don't use the actual terms recovery point objective (RPO) and recovery time objective (RTO), but they're covered by our "minimal operational standard" and "maximum acceptable downtime" measurements.
Q: How are the tabletop exercise results recorded as materials that would survive staff change and responsibility/org change?
A: The working notes document for each exercise is in a Google Drive folder in the team's shared drive (with a date stamp), so if there's staff turnover, ownership will transfer to the shared drive account as part of our offboarding processes.
Q: How much time was invested in preparing the role-playing session? I imagine it is costly to craft a realistic scenario.
A: I worked on it intermittently for a few months so it probably took me a few days to a week total.
Q: Were the players aware of the scenario before entering it?
A: For each day, they knew coming in it'd be a DRP tabletop exercise, but not the scope or the specific service until things started.
Q: What does the exercise actually look/sound like? Hands on keyboard? Whiteboard? Do folks think aloud and imagine the progress of the infrastructure "coming back to life," and/or do you have props and representations of what you're talking about? Or was it all 'theatre of the mind"?
A: All of it was pretty much "theatre of the mind." It was entirely verbal discussions scribed into the working notes. We didn't use any props.
Q: One of the problems we have with tabletops is that it's all...in theory hand wavy "if this then that" verbal sparring. Do you use mirrored staging environments and actually simulate any of these tabletops by kicking things over, sending alerts, showing prepped dashboard widgets, etc.?
A: In our 2024 exercises, no. We just had the verbal component scribed into the working notes.
Q: How did you generate these scenarios? (They can seem obvious in retrospect. Were they obvious at the time?)
A: We took the service we knew we wanted to test (based on discussions between me-the-facilitator and my supervisor), thought "What would be a bad disaster for this thing?" and worked backwards from that.
Q: How are you identifying (in a ridiculously complex environment) the frequency at which you need to update the DRP (by doing exercises)? It sounds costly and necessary, and we need to put hooks into things like "if we update X then the DRP needs to be tested again in light of that".
A: We review our DRPs at least annually (most are "every Jan") and test them at least annually (usually one of Feb, Jul, or Oct), which is two opportunities to revise based on what's changed.
After the talk (on our way to lunch, at lunch, and then throughout the rest of the conference), I was regularly stopped in the hallway with positive comments about the talk or questions about the work.
Lunch
Lunch today was geographically confusing. They had a spinach salad (with candied pecans, blue cheese, and sweet onion dressing), a couscous salad (with peas and asparagus), small marble potatoes, yogurt turmeric chicken, sea bass, coconut chocolate apricot bars, and white chocolate blondies.
Handling the Largest Domain Migration Ever!
[Direct slides link | Direct video link]
Domains remain a critical part of web infrastructure, and an essential piece of the online presence of people and businesses. In 2023, Squarespace acquired the assets behind the Google Domains business, including more than 10 million domains. Franklin Angulo and Divya Kamat spoke about the challenges of executing a migration at a scale not seen before in the domain industry.
They started by explaining TLDs and SLDs, listing the options for the former. Then the four Rs: registrant (user), registrar, reseller, and registry. Registrars sit between the registry and [the reseller and] the registrant.
They talked through the challenges of building their API from scratch and of dealing with third parties to meet their 99.9% SLA.
They managed to move almost 100% of the domains successfully (9M+ domains, 99.9% reseller API SLA, 360+ supported TLDs). The exceptions were the ACME and DDNS customers, since they couldn't write those features in time.
Q: Will you ever provide DDNS support?
A: We'd like to, but it's a lot more complicated than we thought.
Q: If you could only pick one (or two, or maybe three) things: What would be the biggest cultural values that made the migration a success?
A: Everyone at Squarespace came together. They were able to get the resources they needed across the board. They also remained customer-obsessed to deliver a positive experience. Nobody shied away from having to work hard. They did have to change the corporate culture a bit so everyone would move faster.
Q: Were you required to have total feature parity at the end of the 10-month window? What was the consequence of missing your deadline for migration?
A: Contract negotiations and lawyers who didn't know the technical issues. Had they missed the deadline the servers would be shut off and the domains dropped on the floor.
Taming the Beast: Understanding and Harnessing the Power of HTTP Proxies
[Direct slides link | Direct video link]
Guillaume Quintard explored the often-overlooked power of HTTP and reverse-proxies in modern SRE and DevOps workflows. Starting with a fresh perspective on HTTP — its simplicity and quirks — the session delved into how reverse-proxies enhance observability, performance, and resilience. He spoke about how proxies can serve as invaluable tools for debugging, traffic manipulation, and active mitigation during production incidents. With a focus on actionable insights, the talk included code snippets, real-world examples, and guidance on leveraging tools like OpenTelemetry to equip SREs with practical strategies to manage complex systems effectively.
Guillaume is a developer and pre-sales/support engineer who develops in C, arch, and Rust. He talked about Varnish (whose community still meets over IRC). He didn't talk about caching much.
He explained what a reverse proxy is.
Networking is like an onion: It will make you cry.
Varnish logs everything and does nothing with it. Can be like an event log. Can convert it to JSON. Metrics can tell the same story. One size or model doesn't fit all. Consider that retention time, searchability and granularity are spectrums. Remember too you can store something more than once. One client request to a proxy is usually two HTTP transactions, but one client request to the service can be dozens of HTTP transactions... which is why tracing is cool. However, OpenTelemetry is not the same as tracing (you can include a UUID across the entire request life cycle), and traces are not the same as logs. Context is essential. Network timing is usually "negative space" data so you need timing on both peers.
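The point about including a UUID across the entire request life cycle can be sketched in a few lines. This is a hypothetical illustration, not Varnish's or OpenTelemetry's actual mechanism: a proxy reuses the caller's request ID (or mints one) and tags every downstream transaction with it, so the dozens of backend HTTP transactions behind one client request can be stitched back together in the logs. The header name and service names are made up.

```python
import uuid

def handle_client_request(headers: dict) -> dict:
    """Proxy-side sketch: reuse the caller's request ID or mint one,
    then tag every downstream transaction with it."""
    rid = headers.get("X-Request-ID") or str(uuid.uuid4())
    # One client request to the proxy can fan out into many backend
    # HTTP transactions; a shared ID lets you stitch the logs together.
    hops = [{"service": svc, "request_id": rid}
            for svc in ("auth", "inventory", "render")]
    return {"request_id": rid, "hops": hops}

result = handle_client_request({"X-Request-ID": "req-42"})
print(result["request_id"])  # → req-42
```

This is the core idea tracing builds on: context propagation first, spans and timing second.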
His demo with Grafana eventually worked.
Q: How much abstraction happens away from system inter-workings if you marry too many things into varnish? Do you have logging per "thingamabob" (like rate limiting logging, compression, etc.) so you know where all of the interplay is happening?
A: Varnish is extremely verbose. Usually the Varnish modules being imported will tell you all you need to know every step of the way.
Q: What are common pitfalls that you've seen developers make — asking for an SRE friend.
A: Every abstraction is leaky to an extent. If some tool appears too good to be true, it probably is. Varnish users often assume caching is easy and treat it as an afterthought, not realizing they can over- or under-cache, or deliver the wrong thing due to screwy keys. HTTP is legacy and has garbage.
Q: Can Varnish replace nginx for gRPC? Specifically can the HTTP error codes that Varnish supports for gRPC be modified to emulate NGINX?
A: At present, no. gRPC uses a weird flavor of HTTP that Varnish doesn't support yet. By the next release (they do every six months), probably.
Q: Does emitting more data lead to more storage costs? or is it compressed or packed when ingested to save space? something else?
A: It's not specific to Varnish but yes. It will give you a LOT of data. JSON text is highly compressible.
Chaos Experiments: Datacenter Stress Testing
[Direct slides link | Direct video link]
In this session, Clayton Krueger explored how a financial services provider has developed a comprehensive, automated chaos engineering program, supported by strong leadership. While chaos testing is commonly done with individual applications, they've elevated the practice by applying it to an entire data center. This journey didn't happen overnight, and he took us through the key stages of their progress. He discussed the major challenges they faced specifically around fear, uncertainty, and doubt. He shared insights into the tools and strategies they used to overcome obstacles and the lessons learned along the way. Additionally, he shared their plans for future efforts and how they aim to further enhance the robustness of their infrastructure. This session was intended to deepen our understanding of large-scale chaos engineering in a complex environment.
They had a disaster recovery scare event (but he ran overtime and couldn't talk about it).
He showed various forms of testing, including both manual and automated failover and triggering. Eventually the whole process was automated: making sure that nobody had unnecessary privileges, pre-checks to make sure things wouldn't blow up and to enforce SLOs, and draining the right data center. There is a "Whoops" button to manually stop the testing and revert everything.
Business units can set some maximum down times for their services. Alerts become incidents after a certain amount of elapsed time. They used SLO burn rates to drive people to SLOs.
They did a lot of this before creating their own SRE team.
What about large findings? Use the existing process (align), consider what went wrong (would testing have caught this?), and use data not emotions. Assign ownership of incidents (not just actions) to more than one owner. Can pull in the BC people to get additional data center resources.
SRE is not "Should Repair Everything."
They built a suite to let other teams do stress testing with a single local stack and a monitoring dashboard. That gave them all a way to load test, look into a sample app, edit it to test their own services, and implement it in their own CI/CD pipelines. They also taught folks how to take Java dumps to troubleshoot in more depth.
Use future risky (or riskier) activities to keep resilience testing going.
Measuring Availability the Player Focused Way: How Riot Games Changed Its Availability Culture
[Direct slides link | Direct video link]
Riot Games started its journey to building out SRE culture in 2020. The number one problem they had to solve first was a unified language across all teams and games about what availability was. In other words, they had to define "uptime." Maxfield Stewart's talk walked through how they developed their availability measurements by simple modifications to their incident management process and aligned leadership and engineers on being held accountable to availability using our most popular core value, Player Focus.
They want to be the most beloved game company. Everything they do is in service of the players.
He wanted to bring SRE to Riot, but without measuring availability (which they didn't do) it wouldn't be clear whether they were successful. So he had 12 months to come up with a metric and get leadership to buy in. That meant they needed a unified SLO (or OKR) across all the games.
Estimated time to achieve 70–80% adoption of a new technical standard across all 800 services at Riot is over 2 years. That's too long for the timeframe they were given. Also, users don't care if a microservice is down; they want to engage with the game how and when they want to. Since they couldn't solve this with technology alone, they took advantage of the cultural sense of ownership; they also had alerts, reporting, and an incident management team that, while burning out, was very responsive. They also had good real-time metrics.
What did they decide to focus on?
First, incident prioritization. They recorded them all, but everything was treated the same way "because the players were hurting." They used to have severities (1/critical outage, 2/critical impaired, 3/non-critical outage, 4/non-critical impaired). They found that things deemed non-critical had been rolled into production anyway. They threw that out and used priorities:
- p1 = impacts >50% of the player base on a single shard or multiple shards.
- p2 = impacts 15–50% of the player base on a single or multiple shards.
- p3 = impacts 1–15% of the player base on a single shard.
- p4 = impacts <1% of the player base on a single shard.
Now "priority" is just "number of users."
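The priority buckets above reduce to a simple threshold function. This is my sketch of those rules, not Riot's code; the boundary handling (which side of 50/15/1 a value falls on) is my assumption, since the talk only gave the ranges.

```python
def incident_priority(impact_pct: float) -> str:
    """Map percent of the player base impacted to a priority bucket,
    following the four ranges from the talk. Boundary handling (>=)
    is an assumption; the talk only gave the ranges."""
    if impact_pct > 50:
        return "p1"   # majority of the player base, one or more shards
    if impact_pct >= 15:
        return "p2"
    if impact_pct >= 1:
        return "p3"
    return "p4"       # under 1% of the player base on a single shard

print(incident_priority(60))   # → p1
print(incident_priority(0.5))  # → p4
```

The payoff of reducing "priority" to "number of users" is exactly this: the rule is mechanical, so nobody argues severity during an incident.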
Next, they needed to have a common definition for impact. The solution must be something players would agree with. Their problems fell into three buckets:
- Connecting to the game (logging in or patching)
- Purchasing store (either purchasing content or purchasing currency)
- Playing the game (a small number of sub-buckets):
- Retrieve Inventory
- Match Making
- Chat and Voice
- Form a Party
- Playing the Game
- End of Game Rewards
All six need to be working. Great... but none of this is an SLO. It's establishing the baseline to create one that's both measurable and something leadership cares about.
Look at serving power to households. There are 43,800 minutes per month. 2 hours (120 minutes) of failed delivery means 100 × (1 – 120/43,800) ≈ 99.73% available (about 0.27% down). One house? No big deal. 100K houses? Big deal. They don't have measures like player minutes yet. You can sum the number of players connected for a minute. Take the rolling average and ask "At 2pm Friday, how many player minutes should you be serving?" That gives you an approximation of service and of what the outage was. Combine that miss or outage with:
(Total player minutes per month – Player minutes impacted) / Total player minutes per month

Results are weighted and relative!
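The player-minute availability ratio is a one-liner. A minimal sketch, with hypothetical numbers (the 1.5 billion impacted minutes are made up for illustration; the 200-billion scale matches the talk):

```python
def availability_pct(total_player_minutes: float, impacted: float) -> float:
    """Player-minute availability: (total - impacted) / total, as a percent.
    Weighted by usage, so an outage at peak hurts more than one at 4am."""
    return 100.0 * (total_player_minutes - impacted) / total_player_minutes

# Hypothetical month: 200 billion player minutes served, 1.5 billion impacted.
print(round(availability_pct(200e9, 1.5e9), 2))  # → 99.25
```

Because both numerator and denominator are in player minutes, a small shard and a huge shard contribute proportionally to their actual player populations.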
Their target: 99% availability as measured by the player journey. They serve over 200 billion player minutes a month across all games. That was contentious; given what they measured and the number of incidents they weren't sure they could make it.
Dec 2023 they had a problem with the store in Korea, and with end-of-game in EUW1 and matchmaking in Vietnam. The heat map showed what the likely problems were. Sum it up for leadership and they were told "98.97% but 13/16 shards" as a near miss.
How hard was it to make? A few weeks to design, 4 weeks to implement v0.5, 12 to implement v1.0, with 2 data analysts and one software engineer. Another 2–3 months of training, education, and information sharing. By mid-Q2 2021 all games were getting reports; by the end of Q3 the process was stabilized. That gave them a target within 9 months of the 12.
The CTO bought in. The studio leads bought in via executive production training. The technical leads ran grassroots efforts. That led to a company-level OKR enforced by the CEO.
It's not real-time; it's a month delayed. Incident impact data is available daily. Each report includes full breakdowns.
So what changed? Games averaged 97–98% availability in 2021. They now average about 99% (2024 EOY). They overhauled their RCA reporting. They funded an SRE program to have engineers available to work on specific observability or reliability projects (which is what he wanted; yay!). The focus on RCAs and availability reporting identified major gaps in observability and led them to migrate observability platforms.
Internal morale survey went from 1.5 to 4.3 (out of 5) in 18 months (by giving them a vision). Live Ops as an org grew from 30–35 people to about 80 in 3 years. Riot started changing to match the reporting in 2023. In 2024 they evolved the player journey categories based on feedback (it's now over a dozen).
Self-Q: What are the real keys to success?
A: Top-down mandate (OKR/SLO), C-suite aligned and enforced; a reliable data source and analysts that can make it plain speak; a willingness to fail and iterate; and a passionate, credible owner with a passionate team (possibly with pre-existing relationships).
Q: How do you talk about sev classification when different classes of players are more vocal than others? For example, common complaints about match making latency at high elo are echoed more loudly because they're typically high visibility in the community.
A: Internet-gathered data could be from a vocal minority. They have to use the methodology. 5% of the players are not a p1 just because the player is famous. p3 doesn't mean we aren't working the issue, just that p1 and p2 take precedence.
Q: You sold the idea to C-suite, but this all seems to point right back at your team. That's both good (you get more resources you need for SLO three 9s) but also bad (more pressure on you). Am I correct in that thinking, or were other teams beyond SRE responsible for SLO somehow?
A: Yes, other teams were also responsible. Executive leadership of the game teams were accountable for the balance of cool new stuff and for meeting the metrics. Miss the metric? Make a plan to fix it.
Q: How would you account in these reports for non-player incidents, like blowing through your monitoring budgets?
A: The p4 bucket holds a lot of those. There's a whole class of incidents that don't match this. Internal incidents follow the same plan, but player-facing is going to take priority.
Dinner
I didn't feel like dealing with a car (be it Lyft or taxi) to head to a vendor happy hour which may or may not have provided either food or drink, so I wound up having a spicy tuna roll and an East West roll (inside: shrimp tempura and avocado; outside: tuna, unagi w/ unagi sauce and tobiko). Both were very good.
After dinner none of the BOF sessions appealed so I headed up to bed (mostly to get off my feet).
For some reason my brain woke up for good at 3:30 a.m. and I could not fall back asleep. I wound up working on this trip report, checked my flight information for tomorrow (which is good as both legs' timing was adjusted), and determined that I would not have enough time between when the buffet opened at 6:30 a.m. and my ride to the airport at 6:45 a.m. to have breakfast tomorrow morning. Oh well, spending $40 with tax and tip on the hotel breakfast buffet wasn't high on my to-do list anyhow.
After showering and dressing I headed down to the vendor floor for the last continental breakfast of the conference. Today's options were a variety of bagels, again with Greek yogurt and fresh fruit. I had an everything bagel with whipped cream cheese, a strawberry yogurt, and a banana.
While waiting for the sessions to start I had a lovely conversation with someone who was considering moving into an academic IT role and had some questions. I hope I answered them to his satisfaction.
Stopping Performance Regression via Changepoint Detection
[Direct slides link | Direct video link]
Bloomberg's Ticker Plant infrastructure provides real-time market data to almost all internal and external clients; any increase in latency impacts much of the company's real-time products. This talk discusses how statistical changepoint detection is used to identify when our complex system's performance characteristics have significantly changed. We will discuss the challenges of deploying this, such as dealing with "expected" changepoints like market open/close and downtime, relaying the change information to engineers in an effective manner, and establishing a feedback loop.
We started by defining a changepoint: An instance of abrupt and persistent change in behavior of the system. Their Ticker Plant application feeds the vast majority of market data to customers (Bloomberg terminal, API, etc.). Low latency is crucial to the success of market participants. They need to be able to know whether their teams' efforts are improving performance or not.
Latency is often someone else's problem: News and politics, black swans, and data feed bugs contribute, as do changing expectations by instrument or event, and rarely-traveled code paths with bursty activity. When the market is volatile there will be increased latency.
The problem was decreasing performance with no clear correlation to stages or code rollouts, and not even a clear start time. They only found it once they rolled out a basic changepoint monitor and looked at the metadata before and after. Even the simple example is hard for a human to read; complex examples were nigh impossible. They decided to go with Bayesian Online Change Point Detection because it fit their expectations for changepoints, despite their data not obviously being in the exponential family.
One challenge was getting the algorithm to understand market open/close and weekends and holidays. Clutter and noise were also a challenge.
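To make the "abrupt and persistent change" definition concrete, here is a deliberately simple threshold detector. It is a toy stand-in, not the Bayesian Online Change Point Detection Bloomberg used: it just flags the first point where values shift well outside a baseline window and stay shifted, which also shows why a transient spike (noise, market open) shouldn't count.

```python
import statistics

def detect_changepoint(series, baseline=5, persist=3, threshold=3.0):
    """Return the first index where values shift by more than `threshold`
    baseline standard deviations AND stay shifted for `persist` samples
    (abrupt + persistent, per the talk's definition). Toy illustration,
    not BOCPD."""
    mu = statistics.mean(series[:baseline])
    sd = statistics.stdev(series[:baseline]) or 1e-9
    run = 0
    for i in range(baseline, len(series)):
        if abs(series[i] - mu) > threshold * sd:
            run += 1
            if run == persist:
                return i - persist + 1  # change began `persist` samples ago
        else:
            run = 0  # a transient spike is noise, not a changepoint
    return None

latency_ms = [10, 11, 10, 12, 11, 10, 11, 25, 26, 27, 25, 26]
print(detect_changepoint(latency_ms))  # → 7
```

The hard parts the team described, such as expected changepoints at market open/close, amount to suppressing detections at known times, which this sketch ignores.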
Q: How do different market activities affect performance?
A: Diversity of exchanges and how they deal with it. Commodities are different from equities.
Q: What percentiles did you use to calculate latency (p99, p50, ...)?
A: For changepoint detection they mostly looked at work time. They have internal SLOs for latency (et al.).
Per Aspera ad Productum: Turning Processes into Products
[Direct slides link | Direct video link]
Yuri Bernstein from Medallia talked about how to increase your team throughput by applying product management principles to SRE tools.
Yuri contends that reliability is automation plus consistency. Complex things couldn't be automated... but what if you made the process the product? They're a mediator between people and technology. They took an engineering approach: Any process can be automated. Most operational workflows boil down to, at a high level:
- Ask; request input or data.
- Act; take an action.
- Gate; get approval.
- Wait for something (async).
- State; report the status.
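The five primitives above can be sketched as a tiny workflow engine. This is a hypothetical illustration of "making the process the product", not Medallia's implementation; the step names and the domain value are made up (the domain echoes the masking example from the talk).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    kind: str        # one of: ask, act, gate, wait, state
    description: str
    run: Callable[[dict], dict]

# Hypothetical domain-masking workflow composed from the five primitives.
workflow = [
    Step("ask",   "collect target domain",
         lambda ctx: {**ctx, "domain": "feedback.customer.domain"}),
    Step("gate",  "get approval",
         lambda ctx: {**ctx, "approved": True}),
    Step("act",   "create the DNS alias",
         lambda ctx: {**ctx, "aliased": ctx["approved"]}),
    Step("state", "report the status",
         lambda ctx: {**ctx, "status": "done" if ctx.get("aliased") else "blocked"}),
]

ctx: dict = {}
for step in workflow:
    ctx = step.run(ctx)
print(ctx["status"])  # → done
```

Once a process is expressed as data like this, the gate and wait steps are where the human approvals and async hand-offs plug in.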
They first tried this with masking domains (so it would appear as feedback.customer.domain not feedback-customer.medallia.domain).
They needed to understand:
- What are the exact boundaries of the process?
- Who are the users and beneficiaries?
- What are the limitations?
And they needed to improve by collecting and acting on feedback.
Incident Management Metrics That Matter
[Direct slides link | Direct video link]
Businesses run on metrics. They use them to judge success, identify areas for investment, and reward employees. Unfortunately, naive metrics can do more harm than good, especially in the context of low-frequency events like incidents. Management teams often reach for mean time to recovery (MTTR) or raw incident counts to judge the success of reliability and resilience programs, but these metrics generate spurious insights and perverse incentives. As SREs we can't simply tell the business not to measure them — we need to offer alternatives. This talk explores a starting list of things to measure instead (and how to build your own list), as well as a framework to educate less technical people on what the actual value proposition of incident management is.
They role-played with Laura as a new engineer and Jamie as the senior engineer, and they did the talk as a dialog between them. She said low-MTTR was good, but research shows that it's really a poor measurement. MTTM et al. also have their own perverse incentives.
What if we measure incident count instead? It's logical and easy to measure, but what does a change in count mean? Is it periodicity in incidents? Code freeze? Holidays? Again, there are perverse incentives against filing incidents.
What about the change failure rate? The terms change and failure aren't well defined. What about action item count? Leads to low-value actions.
Measured by week? By hour? Team by team? That doesn't give you what you're looking for.
[Much of this felt forced and I didn't take detailed notes.]
Systems Thinking with Poisoned Systems
[Direct slides link | Direct video link]
AI is often said to be a "garbage in, garbage out" solution. So what happens when you take a carefully tuned system and try to operate it with AI? AI assistance has some studied drawbacks: data poisoning, bias, inaccessibility, deskilling, and more. We could very well end up in a world that is run by inaccessible and inscrutable black box AI systems. But the situation isn't hopeless!
AI seems to be here to stay, but the drawbacks don't have to be. Hazel Weakly and Sandeep Kanabar took us on a journey through their personal experiences with biased and broken systems, how they worked around them, and strategies they have for addressing these issues as well as preventing future ones. Together, we discovered how to transform AI into a transparent and reliable tool that helps enable innovation rather than chaos.
AI is a flawed tool of understanding. Three major flaws:
- Data poisoning — A type of cyberattack where an adversary manipulates or corrupts the training data used to develop AI and machine learning (ML) models, aiming to influence the model's behavior or outputs.
- AI bias — AIs tend to reflect their human programmers' biases. They are systematic and repeatable errors in a computer system that create "unfair" outcomes, such as "privileging" one category over another in ways different from the intended function of the algorithm. They may under- or over-prioritize some information based on the biases in the training data.
- AI-related deskilling — Just because an AI says it doesn't make it true. Humans have skills and capabilities, even advanced ones, that inform a culture. We often lose skills once we think they're no longer important (for example, rotary phones or critical thinking).
Having AI complicates understanding the systems. AI often makes systems opaque so we can't inspect them. And it'll probably get worse (especially as we increase automation and that automation relies on these AI responses). We need to apply systems thinking even when the system is poisoned.
The Dread Pirate Roberts built up his immunity to iocane so he could drink from either glass and Vizzini would die. Resilience is working with what you have, not necessarily trusting what you're told (much less Westley and Vizzini). If the tools aren't resilient, your processes must become resilient. Humans build tools to understand the world. That's true even with AI-ocaine powder. (Get really good at drinking poison.)
"It costs $1 to tap the thing, but $99 to know where to tap the thing."
No Time to Do It All! Approaching Overload on DevOps Teams
[Direct slides link | Direct video link]
There's always more work to be done. Alex Wise took a look at signs of overload in our organization, how to identify them, and strategies for managing it. He covered concepts including Overload in Joint Cognitive Systems, WIP Spirals, the Utilization Trap, and how they can be applied to our organization.
Overload is where the flow of work into the system is greater than the rate of work it can perform. Production pressures often cause some things (like quality or testing) to fall by the wayside.
There are two fundamental truths (and Dr. Woods said them on Tuesday): Resources are finite and change continues.
If overloaded you can do one of four things:
- Shed load
- Reduce thoroughness (do it crappier)
- Shift work in time (do it later)
- Recruit resources
The latter two tend to be costly but give better outcomes.
If you start hearing a lot of the first two, you may be overloading the team. Responding with the latter two should be rewarded.
The problem is exacerbated by knowledge decay and queue management.
Knowledge decay: when your knowledge of the system degrades, you're less adaptable and less resilient when it comes to fixing it. How can you prevent this? Retain employees (and thus skills and knowledge). If that's not possible, you need to build robustness for people leaving (which is a superpower).
The backlog can grow without bound. We've seen that once you get above 75% utilization things start thrashing (throughput tanks and variability increases). (If the team handles more interrupts, that number might be lower.)
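The 75% figure matches basic queueing theory. As a sketch, assuming a simple M/M/1 queue (my simplification, not the speaker's model): the average time a work item spends in the system, relative to how long the work itself takes, is 1/(1 − utilization), which is why throughput feels fine at 50% and thrashes past 75%.

```python
def mm1_slowdown(utilization: float) -> float:
    """M/M/1 queueing: average time in system relative to bare service
    time is 1 / (1 - rho), which explodes as utilization nears 100%."""
    assert 0 <= utilization < 1
    return 1.0 / (1.0 - utilization)

for rho in (0.50, 0.75, 0.90, 0.95):
    print(f"{rho:.0%} utilized -> {mm1_slowdown(rho):.0f}x slower")
```

At 50% utilization work takes twice its bare duration end to end; at 90% it takes ten times. Interrupt-heavy teams hit the wall even earlier, as the talk noted.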
You need to limit the amount of work in progress (WIP) to improve your flow rate and reduce variability.
Lunch
Lunch today was barbecue-themed:
- Mixed field greens, strawberries, radish chips, bacon, goat cheese, and sweet onion dressing
- Potato salad
- Baked beans
- Succotash
- Spiced grilled chicken thighs
- Smoked beef brisket
- Peanut butter brownies
This was my favorite of the three lunches, and the most to my tastes and preferences, though none of them were bad.
One Million Builds per Year, Only One Page: Operating Internal Services Without Heroics
[Direct slides link | Direct video link]
A nuts-and-bolts examination of how a small team at Octopus Deploy was able to deliver a set of internal services that enabled in excess of one million builds in a calendar year, with only one out-of-hours page in that time! Cail Young covered the technical and social aspects of what was involved, and discussed some of the downsides of having what appears to be a stable system.
Cail works on an internal tools team that manages product builds. The "one" page was the one that "woke someone up." Some teams are on-call for external services, some are on-call for internal services, and there's a secret third kind that's not on-call for anything. They manage everything from GitHub onwards.
In 2021, there were tests running and they had a team of about 70 people... but the CI/CD system wasn't really managed. They took it over (with permission) to patch the OS (manually). Developers don't like it when the build system goes away for six hours. Admins didn't want to work outside of working hours. What did they do?
- Expectation management — You have to talk to people. Their challenge was being a remote-first organization with "home" being the eastern coast of Australia. They were able to get agreement on when they could have the system unavailable, or dedicate a person to keeping it up and running during working hours. They also built a paging policy so they would only be paged if the supported tool was required to solve an outage. This is a de facto SLA.
- Reduction in Force (RIF) — Their profitability was dropping so they laid off a lot of people, shrinking the team to three people. This makes an on-call schedule even more difficult.
- Technical choices — They leaned into the cloud, using the highest level of abstraction they could as often as possible.
They chose to use HA mode, replicated mode, or both, even if it wasn't necessary (risk: byzantine failure modes), for consistency. There was an associated cost, but leadership understood they had to spend the money to prevent burning out the people remaining ("cost-aware but not cost-afraid").
They chose to auto-upgrade all the time (at least for download/prep; they definitely put in guardrails). Their compliance team understands the goal is to build and sell the product, so changes to production needed to be authorized, and their quality gate was sufficient to be approved. Sometimes it's a Monday: most of the time the upgrades went smoothly, but sometimes they didn't. So they actually managed 1.5M builds with only one page [that woke someone up].
Problem: Things are now too stable to fail and show new people what failure modes look like. Incidents were no longer sufficient to be a learning case.
Challenge: Global growth of the company makes daytime versus on-call difficult. Not big enough for follow-the-sun.
Going Multi Cloud in a Hurry with Quality and Style
[Direct slides link | Direct video link]
How would you extend a Kubernetes-based platform to support a second cloud provider? What if no one on your team knew the second platform well? Geoff Oakham talked about the soft skills and techniques he tried while delivering the product on time, meeting compliance standards, and training up his co-workers.
They manage a platform built on Kubernetes on GCP with Helm, ArgoCD, Prometheus, Grafana, etc. Their tenants generally love them. They also handle observability, reliability, Infrastructure as Code, and so on. The team's values include quality, reliability, automation over toil, empathy, work-life balance, and it being a team sport. Mostly it's on GCP, but they need to use AWS too.
Management reprioritized work so the team could provide more than advice to those needing AWS... without any AWS experience.
They tried pair programming, but not everyone was engaged and it wasn't the best use of time. Hurdle: The new cluster needed a higher level of compliance, including sign-off and documentation, and Compliance wanted architecture diagrams in advance of doing the work... with a four-week launch window.
They regrouped to get more help from leadership and clarity on requirements, and opened communications with Product, Engineering, and Compliance (including asking for a consistent contact in Compliance). They had a concept of 2-hour "jam sessions" for the team to meet and talk about whatever they needed to. Struggling with something? Bring it! Learn something new? Share it! A couple of problems: focus drifted and there was no agenda. For v2.0, they added an agenda. They began inviting others who were struggling (developers and compliance). They formed ad hoc subgroups to work on tasks.
Problem: Meeting twice a week was too frequent. So in version 3, they had a thread for upvoting topics to include on the agenda. They'd have guests go first, then topics of common interest, then topics of niche interest.
In the end, they had a prod quality AWS product in six months... which is longer than the four weeks, but it was what they needed before their real deadline.
Results: Happy customers, shared AWS skills, reliable platform, and no outages.
Reflections: Why was this project successful?
- The jam sessions were useful mostly for rapid learning.
- They succeeded because the team was willing to try things with him and willing to make mistakes. (Psychological safety all around!)
- Leadership gave them autonomy to run the project how they like.
- To succeed, he needed self-awareness, a balance of confidence and humility, being able to listen, being able to negotiate, and determination to keep going. (Note the "soft skills" here are incredibly important.)
Q: Has hiring for emotional intelligence been there at your employer for a long time, or was it something that happened in response to issues with psychological safety?
A: It's a key component of his director's hiring practices, and he'd done this before ("Don't hire talented assholes; it's not worth it").
Q: Going off your point about soft skills in SRE being so critical, how do you mentor those skills in a team with folks of varying levels of experience?
A: He's learned that people won't grow in that direction unless they want to. If they're showing signs they want to, encourage and nurture that.
Q: Can you please share your advice on moving up to a higher abstraction layer based platforms that might enable teams who are only familiar with one cloud platform to deploy to multiple clouds without having to learn Cloud #2 (AWS)? For example, as a thought experiment, what would you do now if you also had to deploy to Microsoft Azure?
A: Having a description of what they needed, already abstracted, with usage documentation (building to the spec) helped. He'd do the same for Azure.
Mitigating Against Large Scale Systemic Failures in E-Trading
[Direct slides link | Direct video link]
Electronic trading systems are inherently complex and operate within narrow, high-stakes time windows, making their availability critical. Despite employing various resiliency patterns, these systems remain vulnerable to tail risks that could lead to widespread failures with significant consequences.
Chris Hawley's presentation explored real-world examples to uncover the nature of these risks, examined the limitations of common resiliency strategies, and discussed alternative approaches to enhance system robustness and reliability.
Time of day controls when markets trade. Most of the ones they deal with are narrow (such as 8am-4pm in both London and New York). For example, in Dec 2023 a London Stock Exchange disk failure caused trading to stop twice. Risks for them include:
- Financial
- Regulatory
- Reputational
Availability: Multiple locations, multiple instances to scale, staggered deployment rollouts with significant testing. But even then things can still go wrong:
- Complex systems will always be exposed somewhere.
- Problems can go years without being triggered.
- When triggered, the impact is widespread.
Risks:
- Latent bug (finally triggered)
- Poison pill (bad data from external source)
- Interdependency failures
- Environmental change (someone else made, or external news affecting volatility)
Incident example: A triggering event caused an incident due to a long-existing latent bug. What's the impact? What's the fix? How do you debug the issue when the developer left ages ago?
Fundamentally you can mitigate against systemic issues by looking at, or even changing, the base architecture, on which sits the code, on which sit the tests. You can reduce the blast radius: use different variants of the service, each of which has only the bits it needs. That's more complexity but a limited blast scope. Alternatively, they can look at modular functionality with an isolated core. An alternative system may overlap with what you have, so consider using another business unit's system to keep going... if everyone agrees and knows how to do it.
Summary: There's no silver bullet or quick fix. You need to invest in the architecture and expect that things will go wrong, so you need to build to minimize the risks.
Q: Have you considered using some type of resiliency modeling (STPA) to better understand the risks posed to the system?
A: Yes. The talk the other day was great. More and more they're inserting themselves in the design process before the thing is built; getting involved earlier saves time and effort.
Q: Are you seeing some categorization of outages lead to more incidents than others (e.g. more poison pills than latent bugs)? If so, are you targeting those categories more than others?
A: Fortunately these are rare so there's not a lot of data. They tackle them as they see them.
Q: How much blameless vs. blame-aware balance do you have when so much money, regulation, and reputation is on the line?
A: We're pretty good on that. Nobody likes incidents but the first question is not "Who's to blame?" but "How are you?"
Q: How do you manage the trade-off between service variant maintenance and blast radius? Is it managed as configuration?
A: There's a lot of config mgmt tooling (which has its own problems). There's a tradeoff. The advantage of being in Finance is that they understand ROI pretty quickly.
Q: You talked a lot about how to architect around failures where systems go down entirely, but what about when they get into a runaway state like happened with Knight Capital?
A: That's a different type of problem; you need to be more of a detective and monitor the traffic pattern, with a virtual axe to put through the cable. There are now mandatory kill switches that need to be regularly tested.
Q: What's for dinner?
A: That's on you.
Hijacking Service Discovery to Simulate Dependency Degradation
[Direct slides link | Direct video link]
Services have dependencies, and dependencies degrade: they can slow down, limit the bandwidth, or go entirely offline. Services should have mechanisms to deal with that: circuit breaking, bulkheading, and graceful degradation are some of the mechanisms developers might want to implement. But how can they confirm that these mechanisms work without waiting for an incident to happen? Simulation!
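To make the mechanisms above concrete, here's a minimal circuit breaker in Python. This is my own illustrative sketch, not code from the talk; the class and parameter names (`CircuitBreaker`, `max_failures`, `reset_after`) are assumptions. After a run of consecutive failures the breaker "opens" and fails fast instead of hammering a degraded dependency, then allows a trial call once a cooldown elapses.

```python
import time


class CircuitBreaker:
    """Minimal circuit-breaker sketch (illustrative only)."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds before a trial call is allowed
        self.failures = 0
        self.opened_at = None             # monotonic timestamp when breaker opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Open: fail fast without touching the dependency.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: cooldown expired, allow one trial call.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

A gameday that injects latency or errors into a dependency is exactly how you'd verify that logic like this actually trips (and resets) as intended.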
There are a few solutions for simulating dependency degradation, but a majority of them require traffic to be forwarded through a proxy. In Abdulrahman Alhamali's talk, he presented a few ways to streamline this traffic forwarding, by hijacking service discovery.
Shopify is an 8k+ employee company supporting millions of merchants with $1T/quarter (and $11.5B GMV last Black Friday). The resiliency organization has several teams (SRE, tooling, and incident response who communicate with the merchants).
Rather than wait for incidents to happen they run gamedays: They build a resiliency matrix (what are the flows your services have), then look at where things might fail, and what happens in that case... then you inject the fault to simulate the incident.
Should you shut off the dependency to simulate the failure? Pros are it's more realistic; cons are it could affect multiple services and abstraction might be better for feature teams. The short answer is "It depends." They simulate it with fault injection proxies; they use toxiproxy (Shopify-developed open source) and envoy (allows for fault injection).
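For reference, envoy's built-in HTTP fault filter can inject delays with configuration alone, no code changes. A minimal sketch of such a filter chain might look like the following (my own hedged example, not from the talk; the 2-second delay and 100% injection rate are made-up values):

```yaml
# Illustrative envoy fault-injection sketch: delay every request through
# this listener by 2s to simulate a slow dependency during a gameday.
http_filters:
- name: envoy.filters.http.fault
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.fault.v3.HTTPFault
    delay:
      fixed_delay: 2s
      percentage:
        numerator: 100
        denominator: HUNDRED
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```

Lowering the `numerator` injects the fault into only a fraction of requests, which is useful for testing partial degradation rather than a full outage.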
If you use toxiproxy, how do you force traffic through it? You could change the code or config files to route traffic to db2 instead of db1... but that sounds like a bad idea. We don't want teams to mess with code and configs and push a new image just to run a gameday.
They provide GamedayBuddy, a k8s operator, that spins up toxiproxy and routes traffic through it, basically doing the whole thing for you.
How do they hijack service discovery?
- Locally, using hostAliases (talk to B instead of A)... which doesn't work if clients include the trailing dot on the host name (foo.com. instead of foo.com).
- Globally, using CoreDNS (everyone talking to redis should go through toxiproxy). They use this only in limited circumstances.
- On the way, for services using Zookeeper for service discovery: Zoocreeper sits between the client and Zookeeper and replaces the information in the stream. It speaks multiple protocols (new ones can be added as plugins), and you can hijack exactly one or multiple dependencies.
They support all three methods. Zoocreeper is not yet open source (though they're hopeful it will be someday).
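The local method maps onto a standard Kubernetes pod-spec feature. A hedged sketch (the IP and hostnames are made-up examples, not values from the talk):

```yaml
# Local hijack: a pod-level hostAliases entry adds an /etc/hosts line
# pointing the dependency's hostname at a toxiproxy instance instead.
# (As noted above, clients that resolve a fully qualified name with a
# trailing dot bypass /etc/hosts, so this method won't catch them.)
spec:
  hostAliases:
  - ip: "10.0.0.42"           # hypothetical toxiproxy address
    hostnames:
    - "redis.example.com"     # the dependency being hijacked
```

On the global side, CoreDNS's `rewrite` plugin can achieve the cluster-wide equivalent with a Corefile line along the lines of `rewrite name redis.example.com toxiproxy.gameday.svc.cluster.local` (again, illustrative names only).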
Summary:
- Running gamedays is important (even if in a tabletop exercise).
- You need to provide the right tooling to decrease friction and empower them.
- Service discovery hijacking helps greatly with that.
Q: Do these gamedays surface findings that the service owners couldn't easily predict?
A: Yes. Service owners don't realize they're dependent on some things.
Q: Do you run gamedays in staging or prod or both?
A: Both, depending on the service and its criticality. For example, we used to manage the status page and we could afford for it to be "wrong," but the checkout logic cannot be.
Q: How do you avoid Zoocreeper becoming a new critical point of failure that your team owns?
A: Don't keep it in production. Once you remove the gameday config it uninstalls everything and it resets everything back to normal.
Q: What do you do to avoid causing major outages because of gamedays?
A: Make it easily roll-back-able. We saw it cause a major outage once, and the team didn't think to roll it back (oops).
Q: Do you incorporate these gamedays into your error budget? How do you get leadership buy-in for something that can potentially have wide-reaching consequences?
A: It's a conversation. Our leadership is heavy on reliability and if it causes errors beyond its intended scope, you can roll it back and now you know what is really in scope. Also, if something is too critical to run in prod, you can run it in staging. Spin up a new cluster, isolate it, and test there if it's super critical.
Q: How do you gameday Zookeeper failures?
A: We don't own Zookeeper so we can't really answer that. The biggest failures we've seen in the past are leader election failures.
Plenary: Technical Debt as Theory Building and Practice
[Direct slides link | Direct video link]
Yvonne Lam examined the connections between technical debt, housework/care work, and infrastructure in order to talk through strategies for understanding the shape of our technical debt, picking pieces to pay down, and building narratives with conceptual integrity around technical debt.
There's no one true way or magic bullet to addressing technical debt. Her background is in smaller to midsize companies. She's "disastrously forthright" and "the most academic non-academic."
This talk is about using metaphor to talk about your tech debt so you can then think about how to convey it (and possible solutions for reducing it) to someone else, like your manager or CTO.
One metaphor for tech debt is housework (keeping your home environment under control, possibly even if you have issues doing so (cf. neurodiversity, disabilities, and so on))... "but we can't use it because tech has been made up of people who don't do housework, or manage housework being done."
Tech debt as a concept was coined by Ward Cunningham in a 1992 interview. He was influenced by metaphors, a job in finance, and using debt as a way to explain the need to refactor the product in light of new knowledge gained after users encountered the product. Gene Kim said:
There is a great definition of technical debt that was given by Ward Cunningham: "Shipping first time code is like going into debt. A little debt speeds development so long as it is paid back promptly with a rewrite. ... The danger occurs when the debt is not repaid. Every minute spent on not-quite-right code counts as interest on that debt. Entire engineering organizations can be brought to a standstill under the debt load of an unconsolidated implementation, object-oriented or otherwise."
My simple take on that quote is that technical debt is what you feel the next time you want to make a change. It accrues not only from all the shortcuts you take when a project is rushed, but from every time the developers don't write an automated test, every time you don't do static code analysis. When you skip these things, the debt builds up every day. It's basically all the variance between the right way to do things and the way we actually do things. The more of those that accumulate, the greater the ever-increasing amount of technical debt.
Tech debt is a social concept; it's powerful enough that when we hear the words we know what social response is expected, but not powerful enough to communicate on its own with more specifics. It's not only a social concept.
Yvonne's thesis: Tech debt and housework share qualities with infrastructure that make housework a powerful metaphor for thinking about tech debt.
Yvonne shared several examples and horror stories. (See the video if you want the details.)
Plenary: AIOps: Prove It! An Open Letter to Vendors Selling AI for SREs
[Direct slides link | Direct video link]
SREs are not known for being eager, optimistic early adopters of shiny new technologies. We are much more likely to subject you to lengthy monologuing about all of the ways said technologies are overhyped, under-delivered, and prone to spectacular, catastrophic systems failures. Which brings us to the topic of AI.
It's easy to be cynical when there's this much hype and easy money flying around, but generative AI is not a fad; it's here to stay. Which means that even operators and cynics — no, especially operators and cynics — need to get off the sidelines and engage with it. How should responsible, forward-looking SREs evaluate the truth claims being made in the market without being reflexively antagonistic? How can we help our orgs steer into change, leveraging AI technologies to help our teams ship better software, faster? And for the vendors out there using AI to try and help solve traditional SRE domain problems, how should they demonstrate that they are engaging with these problems in good faith, that they are more than just hype and snake oil?
There's a lot of hype about AI, and as a generalization we find hype off-putting. Under the hype there's some reality... and this new type of computing means we need to change (at least a little). Vendors trying to sell AI-enabled SRE tools sound like marketing and when he asks for details (examples and data) they go radio silent.
So what are we being sold? "SRE companion that will do everything and slice bread."
"Anomaly detection in software is, and always will be, an unsolved problem." (Allspaw, 2015) And nine out of 10 times you can replace "automation" with "AI" in his open letter and it will still hold true. (Not quite all 10/10.)
The important questions to ask: The way agents are designed and where humans are in the loop matter:
- Are you (that is, is the human) supervising or being augmented? (Supervision is the default. Example: Autopilot needs your hands on the wheel anyhow.)
- What perspectives does it bake in? What's within its universe? What data sources are accessible? What operations can it do? What parts of the sociotechnical system are within its context? Is it exposing the right type of interface? (Example: Drive a bicycle with a PS5 controller.)
- How good is it at delivering what it promises? Things often fall short. How closely does expectation match reality? (Example: any mass-market burger.) Is it going to become a hero? What work does it displace? What work does it create? What happens when it has outages or becomes too slow?
- Who is ultimately accountable for it? Is it walking you down a specific path? Does its flexibility limit what you can do or force you down specific paths? Does it let you explore better or tell you where to look? Does it show only what [it thinks] is important? Does it suggest what to do? Does it force or take specific actions?
Who does the adapting? Who learns something? Who does the fixing? Who decides an incident is ongoing? Who can change that definition? Who adapts when new information comes up?
(Responsibility versus authority.)
tl;dr for vendors:
- Join — Build a team player or otherwise be predictable.
- Extend — Give more powers to existing agents and improve their abilities.
- Ground — Understand existing known patterns to serve your users best.
We don't expect vendors to listen or care about this... but we do expect SREs to care.
SREs have long been the guardians or shepherds of risk for complex software systems. And AI is a bunch of very complex software systems. We need to guard against the risks. We also have to evolve. The biggest low level existential risk is not innovating or not moving fast enough. Organizations can fail in many ways (including financial).
This is a great change, and change brings opportunities as well as new risks. And even if you don't believe it or want it to be true, executives want it to be true. They're facing a completely different set of pressures.
Whenever you're being asked to do something from up above that doesn't make sense to you, they're probably facing a set of pressures or expectations that you don't understand or know about. Be empathetic and become an ally to do what they need in a non-stupid way. As a corollary, transparency can be a huge risk mitigator. Don't fight the seems-stupid; ask for details (transparency) so you can help, not hinder.
SRE is a high-context role. We thrive on context and can be trusted. If we can't be trusted you don't deserve our labor. We're good at figuring it out.
If the history of technology tells us anything, the skills we spend a lifetime enhancing will still be valuable if we extend them to new technologies. Founders often need to suspend disbelief and that's hard for a lot of SREs or engineers. We see (and focus on) the risk while the founders see (and focus on) the rewards, because that's what both classes of people do.
Do you have moral or ethical objections to AI (e.g., about the environmental impact thereof)? You can't help solve collective action problems by opting out of them. If you want your critique to be heard it has to be an informed critique. We expect the worst but keep working for the best.
Closing Remarks
Thank you all for being here.
Over the last three days we've seen a lot of tools, tips, techniques, and practices for navigating disruption and maintaining reliability. We've heard from a wide variety of disciplines from all around the world. We hope you're leaving with new knowledge and relationships to keep things moving when you're home.
We are better together. The last three days prove it.
Thank you again to all our sponsors, program committee, lightning talk coordinators, steering committee, Usenix staff, Usenix board, and all our speakers and discussion hosts. We had over 550 participants, from 20 countries, and 228 companies.
Videos of and slides from all talks will be online in a few weeks.
- SREcon25 EMEA is in Dublin IE (7–9 Oct); CFP is open.
- SREcon26 Americas is Mar xx–yy in Seattle WA; the CFP will open around August.
Please fill out the survey!
Dinner
My original plans were to go to the Brazilian steakhouse with a local friend, but he got sick and had to cancel. My fallback plans were to go there with a friend attending the conference (and possibly others), but he decided he needed more time to pack before heading to the airport at 6:00 a.m. Later on he decided he'd rather do the steakhouse after all, so after the final sessions wrapped he, I, and 11 other people headed off in several vehicles to Taurinus Brazilian Steakhouse in downtown San Jose. I had a caipirinha and asparagus, caprese salad, marinated artichokes, potato salad, prosciutto, salami, and smoked salmon from the cold bar; shrimp from the hot bar; and bacon-wrapped chicken, bacon-wrapped filet, beef rib, flank steak, grilled pineapple, leg of lamb, parmesan pork, pork rib, pork sausage, and top sirloin from the gauchos. About three hours later the 13 of us headed either back to the hotel or (in the case of the locals) home.
I was tired so I went upstairs to do the initial packing and go to bed.
It's time to travel home! I realized I kept waking up to check the clock so I gave up trying to sleep around 4:00 a.m. I showered, finished packing, and checked out, then caught my Lyft to the airport.
Unlike at DTW, there was no wait to drop my bag and almost no wait at Security. Grabbed a sub sandwich for breakfast (mainly because it was just across from Security and right next to my gate) and settled in to charge my electronics for a couple of hours before boarding. It's just as well I was early; the jet bridge at my original gate wasn't working correctly so we had to change to another gate. Pre-boarded ahead of the see you next Tuesday that wanted to block the Sky Priority lane for half an hour before boarding.
The first flight was uneventful, though my window seat's window was permanently blocked off and the Wi-Fi never actually worked beyond messaging. I'll probably call and complain to see if they'll give me a credit. I made it to the revised gate for my second flight as the arriving plane was discharging passengers and an hour or so before my boarding was scheduled to begin. Once again I preboarded first and settled into my seat.
The second flight was also uneventful. There was some kerfuffle over my checked bag — somehow it didn't come out on the conveyor when I was there, but they were able to find it in the back room — which delayed my leaving DTW until after sunset. In and of itself this wasn't a bad thing, but there was a dead deer in my travel lane on US 12. I managed to swerve enough to miss most of it but still did some minor front-end damage to the car. I have an appointment scheduled to get a repair estimate before involving the insurer.