The following document is my general trip report for the 2025 SREcon conference, held in person in Seattle, WA, from March 24–26, 2025. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.
Travel day! As usual I managed to wake up before the alarm. I'd showered the night before so I packed up the toiletries, CPAP, laptop, and phone; set the thermostat to heat to only 60F; loaded the car; and drove off to Detroit Metropolitan Airport.
Traffic was moving at or above posted speeds. The bus to the McNamara terminal took a little while to arrive. Once I got to the right terminal I did self-serve bag tagging and dropped off the bag.
Badge pickup opened at 5:00 p.m. and mine printed after I scanned the QR code they'd sent. There was a welcome reception from then until 7:00 p.m. with nibblies (mostly crackers, cheese, and nuts).
After the reception I headed back to the room to write up some of the trip report and crash.
Opening Remarks
[No slides | No video]
Despite my challenge to the co-chairs to be the first in recorded history to begin the opening remarks on time at 8:45 a.m., the tradition of a late start continued; they didn't start until nearly 8:47 a.m. Nevertheless, the conference began with the usual remarks from the co-chairs, Patrick Cable from DraftKings and Laura Maguire from Trace Cognitive Engineering. They reviewed the code of conduct; they want the conference to be a safe environment for everyone. The Slack channels are where changes are announced and the talks' Q&A takes place (#26amer- was the channel prefix); we used the :question: emoji (❓) as the question prefix, and moderators asked the questions on our behalf.
Thanks to the program committee, track co-chairs, and room captains for putting together the program; the USENIX conference office for all the logistics; and the sponsors (the showcase will be open all three days, including a happy-hour this evening). Birds-of-a-Feather sessions (BOFs) are tonight and tomorrow.
We have 814 people in attendance (up significantly from around 550 in 2023 and around 600 in 2024), 81% of whom are new to SREcon, and only 33% of whom are here with one or more colleagues, representing 17 countries and 274 institutions.
Plenary: Taming the Unpredictable: Reliability in Chaos
[Direct slides link | Direct video link]
Michelle Brush, Engineering Director at Google, said that the increasing volume of code written by AI will only accelerate the complexity of our systems. We are moving beyond predictable systems that can be managed with traditional methods like thorough project plans, runbooks, and unit tests, into an era of truly complex systems that are vast and difficult to comprehend fully. These immensely complex systems will behave almost nondeterministically. We need new strategies.
Her presentation delved into why robust reliability practices are not just helpful but essential for navigating this explosion in complexity. It shared strategies for conquering unpredictability including building rigorous evaluations, implementing generic mitigations, recreating reproducibility, and developing software with a risk-first approach. Most importantly, it discussed why humans will always be critical to taming the ever-growing complexity.
Our job as SREs is to worry about how things will fail and recover. Any line of code can be load bearing, even if written by LLMs/AI, with all the assumptions the engineer made when they wrote it (which may be valid at code time but not still valid months or years later).
Systems will get bigger and more complex faster than ever. That leads to having to deal with emergent behaviour in systems and organizations, making it harder to understand cause and effect, or why something happened. This can feel nondeterministic or chaotic.
As systems fail we document undocumented assumptions and rewrite policy, causing the systems to grow, and eventually the inner "small" system is a black box.
We can ask agents to find and fix things. For example, a playbook may contain outdated commands, so we could run an agent across all our playbooks to fix them all at once.
Treat the complex work as hypotheses/experiments. Our jobs will move from being the human feedback loop to building feedback loops that scale with change.
Test your generic mitigations (e.g., turn it off and on again) as if they're fitness functions. Other fitness-function examples:
- Add an arbitrary number of clusters.
- Support multiple OSes.
- Survive live migrations.
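The idea of treating a generic mitigation as a fitness function can be sketched as a periodic automated check. This is a minimal illustration, not anything from the talk; the `restart_service` and `is_healthy` helpers are hypothetical stand-ins for your own tooling:

```python
import time


def restart_service(name: str) -> None:
    """Hypothetical helper: apply the generic mitigation (restart)."""
    print(f"restarting {name}")


def is_healthy(name: str) -> bool:
    """Hypothetical helper: probe the service's health check."""
    return True


def fitness_restart(name: str, timeout_s: float = 60.0) -> bool:
    """Fitness function: 'turn it off and on again' must still work.

    Run this regularly, not just during incidents, so you find out the
    mitigation is broken *before* you need it in anger.
    """
    restart_service(name)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_healthy(name):
            return True
        time.sleep(1)
    return False
```

The same shape works for the other examples: a fitness function that adds a throwaway cluster, or one that triggers a live migration and asserts the workload survives.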
What isn't changing? We still need to be the people to show up, respond when things fail, and get things working again.
Plenary: Mean Time to WTF: Why Developer Experience Frameworks Belong in Your Incident Retrospectives
[Direct slides link | Direct video link]
Dr. Nicole Forsgren defined SREs as developers who happen to focus on reliability. You write code, build tools, manage complex workflows, and experience friction every single day. That friction has a measurable cost: slower incident response, more mistakes under pressure, higher on-call burden, and ultimately, degraded system reliability.
She also wanted to address the elephant: AI is generating way more code, leading to more deployments and bigger changes to your systems. If your operational friction is already high, AI is about to amplify it exponentially.
This talk applied developer experience frameworks (like SPACE and DORA) to operational work. She discussed how to measure SRE experience, connect it to reliability outcomes, and identify high-leverage changes to reduce friction. We could walk away with practical things to do tomorrow and a framework for making the business case that investing in SRE experience is a reliability strategy.
Friction (error under pressure) is a reliability risk. Failures — like cognitive overload, tool failures, and process bottlenecks — can affect every incident the SRE is working on. She used the 2012 $440M loss from Knight Capital Group as an example. Is observability built into your new systems?
Technology is easy; people are hard. Technology moves fast... but people don't.
AI multiplies friction: Deployment frequency is up, incident rate also goes up, and on-call pressure compounds. AI can make our friction more expensive (for example, deleting a database despite being told not to). It changes the incident types we see. AI-generated systems are opaque by default (as opposed to human-understandable because they were human-written). Runbooks accounted for known failure patterns, but as systems change there are more and unique failure modes. diff output was readable in the past to see what changed, but now it's hard to reason about 1000+ line diffs under pressure.
The WTF has to be visible (if not measurable) to be fixable. "Mean Time to WTF" is a new metric: How long does it take to go from receiving an alert to "I understand what's happening." It can act like a leading indicator of how brittle a system is. Some low-cost high-signal measurement starting points are:
- Add a friction contribution field to the retrospective template.
- On-call survey regularly (e.g., quarterly): What slowed you down, what would have helped? What did you have to invent? How much did you swear?
- Toil tracking system: Time-box a sprint and ask everyone to log manual repetitive work just to inventory it. (AI can help here.)
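The "Mean Time to WTF" idea above can be made concrete by logging, per incident, the alert time and the moment the responder first felt they understood the problem, then averaging. A minimal sketch (the `Incident` fields and sample timestamps are my own illustration, not from the talk):

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    alert_at: datetime       # when the alert fired
    understood_at: datetime  # self-reported "I understand what's happening"


def mean_time_to_wtf(incidents: list) -> float:
    """Average minutes from alert to understanding.

    A rising value is a leading indicator of system brittleness.
    """
    return mean(
        (i.understood_at - i.alert_at).total_seconds() / 60
        for i in incidents
    )


incidents = [
    Incident(datetime(2025, 3, 1, 2, 0), datetime(2025, 3, 1, 2, 18)),
    Incident(datetime(2025, 3, 9, 14, 5), datetime(2025, 3, 9, 14, 47)),
]
print(mean_time_to_wtf(incidents))  # 30.0
```

Even a hand-maintained spreadsheet with these two columns is enough to start trending the metric.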
Now that we have an understanding of our opportunities, we can make the case for funding (timing matters: do it near an incident, while the cost is visible). CISQ estimated the annual cost of technical debt in the US at $1.52T. Developers feel about 68% productive; the remaining 32% represents about $300B in lost GDP. It's not about "making the developer happy"; it's about staying competitive.
Productivity metrics can be gamed; experience metrics less so. It's not about measuring output but removing obstacles.
Stakeholders will often fund reliability if it's properly framed. They're less likely to fund happiness. To make the business case:
- Visibility and accountability. Use competitive dynamics and public comparison.
- Simple data, clear action. Progressive disclosure, right detail at the right time.
- Dollar impact. Speak the language stakeholders understand.
One example is Amazon's Dave Anderson. He had the mandate and data but no authority to change the VPs' roadmaps or priorities. He generated monthly S-Team reports that included the VP or director in charge of where there were gaps. Directors and VPs wanted to get off the report and adjusted their roadmaps and priorities to do so.
Some tasks we can do right now:
- Open a runbook and see if it still reflects reality. Can a newbie understand it? (If not, that's one piece of friction to improve.)
- Talk to a peer and ask what friction impacts them most. If their answer differs from yours, that gap is an opportunity.
- Start a personal friction log, time your MTWTF for a real incident just once, and declare a "toil hour" just to raise visibility (not to fix yet).
- Find our cost-of-downtime number, how it's calculated, and use that to build the business case. "Every hour of downtime costs us $X, and friction adds Y minutes per incident. That's our annual friction tax. Here's how we fix it."
Also:
- Individual contributors can map their workflow for a week to see where they lose 30+ minutes.
- Team leads can do a 30-min retrospective asking "What slowed us down that wasn't the work?" Start a friction log.
- Leaders can set up 3 listening sessions ("go grab coffee") to napkin-math stuff.
Varieties of SRE
[No slides | No video]
Discussion track talks were ephemeral with no slides or video recordings and run under the Chatham House Rule which dictates that participants in a meeting are free to use information gathered, but cannot reveal the identity or affiliation of the speakers or any other participants.
After the morning break, Kurt Andersen and Sebastian Vietz spoke about the varieties of SREs. Ask 10 people what SRE is, and you'll get 11 answers. There is no single "correct" way to do SRE. From embedded engineers to centralized consulting teams, every organization adapts the core concepts to fit their reality. This unconference discussion session invited us to peel back the label and examine the reality of the role. It was a facilitated discussion about the different flavors of SRE and how our organizations define it, and we debated whether the title still fits the work we do today.
Goals: Make one connection with someone at the table and take one insight away.
We had eight 10-top tables in a mostly-full room, and ran through three exercises:
- Declared versus actual — What does the organization say your role is, versus what your actual role is, as an SRE?
- Explore the aspects of varieties — Structure (centralized, embedded, consultancy, or platform) versus influence (advocate, champion, ambassador)
- Kill, keep, or evolve — What should we remove from the SRE scope, keep in the scope, or evolve? Should we keep the name "SRE" (especially since "S" is overloaded and can mean site, solution, system, software, or strategy)?
Executing Chaos Engineering in Production at a Critical Financial Institution
[Direct slides link | Direct video link]
After lunch I went to Leonardo Marques' talk about Chaos Engineering. He wanted us to discover how Chaos Engineering transformed a high-stakes financial ecosystem at Bradesco processing thousands of transactions per second. This real-world case study unveiled a reproducible framework for risk-averse organizations, blending fault injection, automation, and observability.
Key takeaways included safe experiment design with governance guardrails, automated chaos workflows, and multidisciplinary GameDays. Results: 73% reduction in MTTD, 10 hidden vulnerabilities exposed, five new metrics, and a shift to proactive reliability.
He provided a compliance-friendly methodology to turn failures into insights, bridging theory and measurable business impact in critical systems.
From Thundering Herd to Zero Outages: Building Reliable Inventory Sync
[Direct slides link | Direct video link]
Rushikesh Ghatpande from Broadcom (VMware before the buyout) said that managing accurate inventory across distributed infrastructure is critical for security policy enforcement and operational reliability. Enterprise datacenter software requires centralized policy management across thousands of servers and hundreds of thousands of VMs and containers, yet resources are distributed across multiple data centers.
This talk shared a battle-tested inventory synchronization protocol that evolved over six years of production experience, handling real-world challenges from thundering herd problems during full datacenter restarts to fairness in queue processing. The protocol uses a 5-stage finite state machine to ensure reliable, consistent inventory sync while preventing system overload.
He explained how they evolved over two years from a naive 3-step process to a robust 5-stage protocol, how they solved the thundering herd problem, ensured fairness in queue processing, and separated connection establishment from application readiness. He shared empirical analysis that led to specific timeout values and demonstrated how bidirectional communication patterns eliminated message ordering complexity.
This protocol has been validated at scale across 10,000+ servers with zero customer escalations over four years. The patterns are immediately applicable to any distributed state synchronization challenge — whether managing VMs, containers, network devices, or any distributed resources.
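The talk didn't enumerate the five stages, but the shape of such a protocol can be sketched as an explicit state machine with a legal-transition table. The stage names below are my invention for illustration; the split between "connected" and "ready" mirrors the talk's point about separating connection establishment from application readiness:

```python
from enum import Enum, auto


class SyncState(Enum):
    # Illustrative stage names; the actual protocol's stages weren't listed.
    DISCONNECTED = auto()
    CONNECTED = auto()   # transport is up
    READY = auto()       # application-level readiness confirmed
    SYNCING = auto()     # inventory transfer in progress
    IN_SYNC = auto()     # steady state


# Every stage can fall back to DISCONNECTED; forward progress is linear.
TRANSITIONS = {
    SyncState.DISCONNECTED: {SyncState.CONNECTED},
    SyncState.CONNECTED: {SyncState.READY, SyncState.DISCONNECTED},
    SyncState.READY: {SyncState.SYNCING, SyncState.DISCONNECTED},
    SyncState.SYNCING: {SyncState.IN_SYNC, SyncState.DISCONNECTED},
    SyncState.IN_SYNC: {SyncState.SYNCING, SyncState.DISCONNECTED},
}


def advance(current: SyncState, target: SyncState) -> SyncState:
    """Reject illegal transitions instead of silently corrupting state."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Making the transition table explicit is also where thundering-herd control hooks in: a server can rate-limit how many clients it admits from DISCONNECTED to CONNECTED per second.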
From ESXi to Kubernetes at the Edge: Modernizing 1,400 Edge Locations with Open Source
[Direct slides link | Direct video link]
After the afternoon break, Jasjit Singh and Vishal Jethnani, both from Loblaw Companies Limited, Canada's largest food and pharmacy retailer, talked about their migration from ESXi to Kubernetes. When virtualization licensing costs increased tenfold, the organization rethought how it ran infrastructure across 1,400 edge locations. This session shared how existing store hardware was transformed from a VMware-based setup into a fully open-source, Kubernetes-powered platform, without adding new hardware or disrupting business operations. They explained how ESXi was replaced with Kubernetes clusters, centrally managed using GitOps with FluxCD, and how open-source tools like Dex, Helm, Grafana, and KubeVirt were used to deploy and observe workloads at scale. The talk focused on practical SRE lessons around automation, resilience at the edge, and cost-efficient modernization that attendees could apply to their own distributed environments.
Before: A typical store had a self-contained ESXi environment of two nodes, supporting the pharmacy and POS applications and DB2, with shared storage. They were monitored by IBM's ITCAM. It was optimized for stability not flexibility. The migration required touching every store.
Constraints: No budget, didn't want proprietary software, and no tolerance for downtime. Also, the Pharmacy Management System was in-house legacy code, and they had no time to modernize apps to microservices. Edge realities meant limited compute and storage, constrained network connectivity, no on-site support, failures must be recoverable remotely, and every store had to operate independently.
Answer: Simplify, eliminate custom code, and go open source: openSUSE as the foundation OS, Kubernetes on top of it, KubeVirt for VM management, and the applications on top of that. Every store is now a small, declaratively managed data center.
The migrations took place in four steps, each of which was reversible:
- Pre-cutover — Introduce a third node in the store (whose prior workload was migrated to the cloud) and set it up with Kubernetes. Move individual components over and modernize: validate workloads, and move some of the applications to microservices to reduce dependencies.
- Cutover night — After the store closed, consolidate ESXi workloads on one node, expand the Kubernetes cluster to a 2-node cluster, sync data, switch IP addresses. (The old VMs on the shut-down ESXi node remained.)
- Safe end state — Apps are on Kubernetes and the remaining ESXi node was intact.
- Day-2 state — Once the store was happy, convert the final ESXi node to a third Kubernetes node in the cluster.
How did they scale to 1,400? A small (5-person) focused team: a TPM and four deployment specialists. Having repeatable automated processes was essential. They started slow, with one store per night, refining runbooks and automation. Once they were happy they went to 10 stores per night. At peak they were doing 40 stores per week.
The impacts were measurable: no OS, virtualization, or monitoring license costs, and a standard platform that supports modernization without another migration.
Key takeaways: Apply SRE principles at the edge, make small architectural bets for the large-scale impact, and have a framework for modernization readiness. Modernization isn't about new tools but reducing operational surface area. Execution is more important than the technology stack.
Beyond Blanket Freezes: Enabling Safe Innovation During Critical Events at Netflix
[Direct slides link | Direct video link]
Prachi Jain and Sandhya Narayan shared how Netflix is replacing blanket deployment freezes, a common safety lever during high-risk periods that often slows teams down and creates bottlenecks, with data-driven, service-specific risk management that enables safe, continuous delivery even during critical events. They discussed how to classify services by risk, integrate this into CI/CD, and empower teams to ship quickly without sacrificing reliability.
Not all systems play the same role. For example, customer-facing systems are riskier to update than internal-only systems. An all-or-nothing freeze can delay security updates and stack critical updates (bug fixes and feature requests) in a backlog that all deploys once the freeze lifts, making it harder to identify any breaking changes.
Rather than an all-or-nothing freeze they categorize based on risks:
- Tier 0/critical — customer definitely notices (e.g., the play button breaks)
- Tier 1/high risk — tightly coupled to core experiences
- Tier 2/medium risk — important infrastructure/platform, with more indirect customer impact, so you have some time to fix before the customer notices
- Tier 3/low risk — internal only and not affecting customers
Classify risk so not everything is treated as being as important as the play button.
The four risk signals are:
- Deployment confidence
- Test confidence
- Change type (and scope)
- Historical behavior
This gives a more nuanced decision as to risk (low, medium, high) to reach a better decision.
Bypassing a freeze becomes a business decision: Look at event type, service tier, risk signals, and resilience tactics and compare it against the rubric. It's built into the deployment pipeline for consistent, data-driven decisions.
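A rubric combining service tier and risk signals might look something like the sketch below. The scoring weights, thresholds, and decision labels are invented for illustration; the talk described the inputs, not the exact formula:

```python
RISK_SCORE = {"low": 0, "medium": 1, "high": 2}


def deploy_decision(tier: int, signals: dict, during_event: bool) -> str:
    """Combine service tier and risk signals into a deploy decision.

    tier: 0 (critical, "the play button") .. 3 (internal only)
    signals: 'deployment_confidence', 'test_confidence', 'change_type',
             and 'historical_behavior', each rated 'low'|'medium'|'high' risk.
    """
    score = sum(RISK_SCORE[v] for v in signals.values())
    if not during_event:
        return "allow"            # no critical event, normal delivery
    if tier <= 1 and score >= 2:
        return "block"            # critical service + risky change: wait
    if tier <= 1 or score >= 4:
        return "needs-approval"   # bypassing the freeze is a business decision
    return "allow"

print(deploy_decision(
    tier=3,
    signals={"deployment_confidence": "low", "test_confidence": "low",
             "change_type": "low", "historical_behavior": "low"},
    during_event=True,
))  # allow
```

The point of wiring this into the pipeline is consistency: every team gets the same answer for the same inputs, and the rubric itself can be tuned in the post-event review.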
That leads to a feedback loop: After big events they review everything to see what worked, what broke, and what was unnecessarily stressful. Was it mis-tiered? Did the risk signals fire too often or not often enough? Wire any changes into the CI/CD workflow rubrics.
In summary: Make intelligent risk- and data-based decisions. Automate the risk signals (and maybe include a security flag). Adopt essential resilience tactics to detect early before a regional or global outage. Listen to your team. Start small and incrementally build.
Reliability in the Big Leagues: How SRE Powers America's Pastime
[Direct slides link | Direct video link]
Jessica Johnson and Chris Alexander said that in professional sports a "timeout" is strategic, but "downtime" is a disaster. At Major League Baseball, SRE covers the whole field — ensuring real-time data integrity for millions of fans and powering the critical on-field technology that impacts every play. This session went beyond the dugout to reveal how they built a Major League SRE practice from the ground up. They shared their journey of rebuilding trust through psychological safety, how they "score the game" using advanced SLOs, and how they are now going on the offensive as Opening Day approaches. They gave us a look at the unique hurdles of sports tech and shared a game plan for fielding a championship-caliber SRE team.
Building reliable and resilient systems is a shared goal; the SRE team is just one piece of the puzzle. They work closely with the platform team. The talk is about their reliability challenges and technologies.
MLB operates at the intersection of live entertainment and high-scale technology, servicing a massive interconnected ecosystem. Their technology innovations include:
- mlb.tv launched in 2003, before YouTube and Netflix streaming, with a Yankees vs. Rangers game streamed to 30k viewers. From an SRE perspective, this is what established them as an innovative tech company; the technology eventually helped power HBO and Disney streaming.
- Replay reviews expanded beyond just home runs in 2014. 10 isolated and 4 high-frame-rate cameras provide a variety of angles to the umpires in the Remote Operations Center, and it keeps the reviews to <=90 seconds, providing a live look into rationale.
- Statcast in 2015 made tracking the ball, bat, and players into a data-driven science. It looks at the process, not just the outcome of the play. Angle of hit, bat speed, spin rate and rotation of the pitch... all generating about 7TB of data per game. (This technology is Emmy®-winning.)
- Automated balls and strikes: starting Wednesday, March 25 (tomorrow), ahead of a planned 2026 rollout, they're using technology to define the strike zone consistently; the pitcher, catcher, or batter can challenge a call and get it reviewed within 14 seconds. Technology is now responsible for one of the biggest decisions in the game (ball or strike).
Reliability hurdles included:
- In-season versus off-season traffic swings make defining a normal baseline difficult. On Opening Day they had 10.8M unique users with up to 6.7M requests per minute. They deliver about 40k live data packets per game in real time. Partners require 99% of all game updates to be delivered within 2 seconds of the event occurring on the field. There are few events like Opening Day, the draft, the Home Run Derby, and the World Series.
- Their technology is distributed and tiered. Tier 1 (global) is 24x7, but tier 2 (event) is only critical during a live game. Every setup is different.
- They manage it with schedule-aware monitoring: 10,232 monitors without alert fatigue. They leverage game-day metadata to drive context-aware thresholds, enabling alerts 13.5 hours before first pitch, muting 1 hour after the final out, and muting at night based on latency/error thresholds.
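Schedule-aware alerting of this sort can be sketched as a gate that consults the game schedule before paging. The window sizes and schedule fields below are my assumptions, not MLB's actual configuration:

```python
from datetime import datetime, timedelta
from typing import Optional


def alerting_enabled(
    now: datetime,
    first_pitch: datetime,
    final_out: Optional[datetime],
    pre_window: timedelta = timedelta(hours=1, minutes=30),   # assumed
    post_window: timedelta = timedelta(hours=1),              # assumed
) -> bool:
    """Gate a tier-2 (event-critical) monitor on the game schedule.

    Page only in a window around game time; outside it, the monitor is
    muted and only hard latency/error thresholds would escalate.
    """
    if now < first_pitch - pre_window:
        return False   # too early: pre-game monitors still muted
    if final_out is not None and now > final_out + post_window:
        return False   # game over and traffic drained: mute again
    return True
```

The key design point from the talk is that the schedule metadata, not a static cron window, drives the thresholds, so every venue's different start time is handled automatically.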
How did we get here?
- Rebuilding years: In 2020 the organization went through a transition in which Disney acquired the technology, leading to leadership changes.
- By 2022, "SRE" was just Jessica and a team of 3 observability engineers. They had a lot of operational toil and no time for proactive improvements. They had to pivot to (a) focus heavily on solving their evolving tool needs and (b) build trust by creating psychological safety and Learning from Incidents (LFI).
- Now, they have a front office: technology is 400 people, including Infrastructure Engineering (Jessica's team) and 7 other departments reporting to the CTO. Her team is 6 FTEs and 2 consultants. They cover the whole field, acting as a force multiplier.
What is (and isn't) SRE at MLB? They wear many hats: engineers, evangelists, educators, change agents, and consultants, but they are not relievers or other teams' on-call support. Their vision is to improve the fan and partner experience by making reliability and performance bedrock features of all their technologies. They have four core values: integrity, leadership, collaboration, and empathy.
They track progress with blameless incident metrics. They needed standardization and data quality. You need context to figure out the "why."
- In 2025, they established SLI coverage for tier 0/1 services; they're now evolving the SLO targets and working with teams to hit them consistently.
- Dialing in the thresholds for SLOs and measuring the front-end user experience is a challenge.
- Use the off-season to innovate; hold an "Operational Excellence" meeting monthly.
What's coming up?
- Expand the service offerings, including a testing suite focused on chaos and performance testing.
- Leverage AI to maximize tooling, reduce toil, and move faster.
- Embed SREs with teams that support critical technologies for targeted short engagements, helping them quickly improve their reliability posture.
- Shift left to work with development teams much earlier in the process to ensure reliability best practices are embraced from the ground up.
Key takeaways:
- Build your reputation.
- Establish psychological safety.
- Get good at driving organizational change.
- Be data driven... even if you don't have the data you want.
- Be strategic.
- Hire the right coaches.
Ghosts in the Interview Loop and Avoiding AI Taylorism
[Direct slides link | Direct video link]
Andrew Hatch from Cisco ThousandEyes said that AI tools now live in our interview loops, reducing once-rigorous SRE coding interviews to something solved by an LLM in seconds. Remote interviewing only makes cheating easier and harder to detect, and system design interviews are fast following as the tools grow in sophistication.
In 2025 ThousandEyes collided with this reality: AI-assisted cheating broke their coding rounds, requiring them to pivot to take-home challenges that test thinking, not just output. This talk discussed interviewing and the emerging threat these tools are leading us to — a new dawn of Taylorism, albeit wrapped in cheerful agentic/LLM chatter (as opposed to checklists and a stopwatch), reducing skilled engineers to mindless prompt operators, eroding expertise, lowering morale, and leaving fragile systems and codebases drowning in a sea of AI slop.
He wove humor, critique, and lived experience into his talk to encourage debate on what we are hiring for in this new era of our industry.
Hiring used to involve some kind of coding, networking, hardware, distributed systems design, human factors, and incident management. Interviews were almost always in person, at least in part. You could gauge how candidates thought, how personable they were, how shifty they seemed, whether they were who they said they were, and so on. Often there would be some phone or Zoom screens; in his case he then had to fly from Australia to California for six hours of back-to-back in-person interviews.
And then there was COVID. We kept doing the same process, but entirely online. And now we have AI, so the previous structured interviews don't work as well: when candidates got to the coding round, strange and unexpected things would happen. Interviewers would say "Please don't use AI tooling for the coding questions." A candidate would be idle for 10 minutes but their eyes would move; they weren't typing on the screen. That made the interviewers suspicious... and then the code came out perfect, because they'd put the question into ChatGPT.
They changed their interview process: they watched candidates' movements, ran problems through multiple AI tools, and sent candidates off with a take-home assignment.
This needed more thought:
- Do we ask questions that used to take a week but now take 30 minutes with AI?
- Do we bring people back in for coding and systems design interviewing?
The irony is that there's a lot of top-down pressure to use AI, and no one wants to be left behind. But if we tell candidates not to use AI in the interview and then demand they use it on day 1, how do we assess actual expertise? They hire hard for expertise and don't want to engineer it out of the role. Also, what happens to creativity, innovation, and critical thinking? They want those, not just technical skills.
This leads to the war on expertise. We've engineered it out for centuries. The printing press is a great example: only a few could write calligraphy, but the printing press spread literacy to the world. The spinning jenny, arc welder, power mitre saw, and GPS navigation are all tools that made things more efficient. But our distributed complex-systems expertise is different: these systems are dynamic, stochastic, emergent, unstable, non-linear, heterogeneous, evolving, context-sensitive, and so on.
Complexity is hard, so we simplify. The 5 Whys is a way to diagnose an issue... some of the time, and the MTTR metric is garbage; both are good only for simplifying. We've relied on simplified versions of complexity for a long time, and it does help... but someone (like SREs) still needs to understand the complexity.
What if something could reduce the reliance on expertise... like AI?
It's not really a silver bullet. AI can handle the simple case but not the actual complexity. See Taylor's 1911 The Principles of Scientific Management. A lot's been written about how separating thinking from doing isn't really workable in the long run. If we blindly accept what the technology says to do we may have problems.
Taylorism works for linear repeatable processes (if you don't care about worker engagement, happiness, or innovation). That doesn't describe modern software systems, and hasn't for decades, especially with the real life complexity.
AI is really good at generating plausible bullet points. The cost: the gap between apparent certainty and actual understanding will continue to grow, leading to a true knowledge gap.
AI gives many workers the illusion of expertise. We're getting a lot of pull requests that are junk, turning infrastructure-as-code into a giant bucket of slop. We've seen cognitive offloading in schools: not building mental models and delegating cognition to a tool means you'll do worse on the exam a few months later. But learning requires real-world experience; it's messy and it breaks, but it's actual learning. Understanding comes through trial and error, adaptation at the edge, and testing. This is the fundamental mindset of good SREs.
When does AI stop being a tool and become the operator? And how can we recognize this? We need to know when the tool is helping versus just giving us the answer.
Who's responsible when the operator (now AI) is hallucinating? And who has the expertise to fix it (especially since we outsourced all the expertise to AI)?
Complex System control planes become as complex as the systems they represent (even if they look simple).
SREs still need complex system skills:
- Understand interactions between components.
- Adapt to dynamic conditions.
- Think critically and make tradeoffs under pressure.
AI is a great tool... but it's only a tool. It must not be the operator lest it enshittify the production system you're accountable for. We have to adapt:
- Understand the impact on hiring and recruiting.
- Be clear on what we cultivate and incentivize with AI and its impacts.
- Know distinctly what the value of AI is to your role, teams, and business.
- What matters is how we adapt to it, not how we fight it.
Humans are and have always been the most adaptable components of any complex socio-technical system. The ability to learn and build expertise is critical.
So You Want a New Incident Commander: Lessons from Building Incident Response Teams
[Direct slides link | Direct video link]
Vanessa Huerta Granda, Technology Manager for Resilience Engineering at Enova International, spoke on Incident Command (IC). It isn't a badge for the most senior engineer but rather a sociotechnical leadership skill that keeps teams aligned, reduces cognitive load, and builds trust during outages. This talk shared lessons from a decade building IC programs across SRE organizations, including how to identify, train, and support effective Incident Commanders without burning out your best responders.
Vanessa was once the only incident commander and the only one doing post-mortems at her company. It was fun at the time, but looking back, not so much: she was a single point of failure. How do you dig your way out of that, especially given that incident commander is a sociotechnical role? Their incident response system is different now, with many trained commanders. Incidents are calmer, communications are clearer, and there are no single points of failure. It took time, because they had to rethink the role.
Incident command is a sociotechnical role for a sociotechnical problem... or what happens when humans interact with a complex system under pressure. Incidents often start as a technical failure, but they can quickly become an organizational event, with different people wanting updates, suggesting changes, or communicating different things to different people. You're coordinating a group of humans interacting with a complex system under pressure.
Why do we need them? To create the conditions for the technical experts to do their work effectively. They coordinate the people, information, and decisions so the engineers can focus on the remediation. They're more like an orchestra conductor: they keep the tempo, making the performance work as a whole, despite not playing the brass and the strings.
The incident commander role sits at the intersection of people (focus and avoid duplicate work, communication, and decision flow), systems (shared situational awareness), and the business (knowing what matters to the organization: financials, customer service, regulatory and compliance, and so on).
Anti-patterns: The commander doesn't need to be (and in fact likely shouldn't be) the strongest engineer (who should be working on the problem), anyone on-call (the incident commander is a skill not an operational responsibility), or the most senior person (who can be perceived as scary and who may have strong opinions about technical or organizational solutions). The incident commander should be facilitating the discussion.
So what should we look for? Skills in communication, sociotechnical leadership, and cognitive load management. And you need more than one person!
How do you convince leadership to do this? It's a business decision, so speak business: what's the cost of uncoordinated events (not just to the bottom line of revenue, but the engineering cost), evidence for what happened in events with and without an incident commander, and start small and prove your value (you don't need five new headcount on day one).
How do you build a sustainable incident response team?
- Choose the right structure; without that it becomes an extra and leads to burnout. It can be based on company size, team structure, and frequency of P0 events. The structure may need to change over time as the conditions change.
One is a deliberate team with explicit responsibility. Allows for intentionality, builds institutional experience (the more you do the better you are), but it's a trade off with overload; you need enough people for enough incidents.
Another is a per-domain team. Teams run their own incidents. It's good for strong ownership boundaries. The commander has the extra implied context. The trade off here is consistency across teams (so look at shared training and tooling). Also, how often are the ownership boundaries really that strong?
You can also ask for volunteers. It helps spread skills and strengthen the culture of shared responsibility. The trade off here is that it can be very stressful, especially since people come in with different skills and expertise levels, leading to inconsistent coordination.
Regardless of the structure, the team needs to know this is a priority (part of their "real job").
- Choose the right people: it's about finding communicators and coordinators who can keep the situational awareness.
- Communication to different audiences (customers, internal, and leadership); can people explain their work? can they talk both tech and non-tech (business) language?
- Sociotechnical leadership (often underestimated; ability to bring up issues to different groups, to identify problems outside of the purely technical)
- Cognitive load in keeping the extended context and status at hand (to answer the "what's the status" on demand; can they switch work stream and still know what's happening where)
How do you identify people with those core competencies? It can be from inside or outside your organization. The former you can look at their actual results; the latter requires some interviewing, like "Given this incident report, write updates to different stakeholders," or a tabletop exercise with a mock incident to see how they structure their responses, and so on.
- Train the people: you need to allow the new commander to develop the necessary skills. They need the basics about role clarity and the technical and business basics, and hands on with the tools (shadowing, reverse shadowing, going on call, having guidelines about what things to think about).
The goal is to build sustainable capacity over time, not to be perfect.
Epistemology of Incidents and Problem Solving
[Direct slides link | Direct video link]
After the morning break SRE Jack Kingsman from Atlassian spoke about incidents. In high-pressure incident response, critical fundamentals of thought, action, and communication matter more than ever. Engineers need concrete criteria and examples of how to think and problem-solve during incidents, and answers to big questions: What fundamental decision-making loops should we orient ourselves in when disaster strikes? How do we reason about trapping the location and cause of unknown issues? What specific qualities make a hypothesis good or worthwhile, and how do we construct effective tests to prove or disprove them? How can we structure notes and progress updates to provide the most signal and the least noise in fast-paced situations?
Drawn from nearly a decade of experience in Site Reliability Engineering, from small startup to publicly traded SaaS firm, this talk helped attendees level up how we think and act when it matters, and equipped us with the concepts to teach those skills to others.
The core of good incident management is knowledge. This talk was about how we know things (epistemology is the study of knowledge):
- How do we find truth in an incident?
- How can we be certain it's the truth?
- How can we level up our communication and teammates in the process?
Takeaway: a new paradigm or script for yourself or your teams.
An incident loop gives a road map about what to do at a given stage.
Phase 0 is the detection and declaration of an incident. It boils down to two universal things:
- Codification: write things down to define the standards (alert fatigue, escalations, appropriateness for the receivers (what teams should get which alerts), severity matrices and escalation criteria, and philosophy as to the whys of the organization)
- Teamwork culture (collaboration)
Phase 1 is survival and triage. Keep the boat afloat (survive first): Buy time in reversible ways: rollbacks, scale-ups, circuit breakers, access denial, load shifting, failover, etc.
Phase 2 is examination. Gather evidence, organize our thoughts and observations, laying down the right information in the right way to make the solution obvious:
- Think in systems and chunks. (Architectural details). Network, edge, up- and downstreams, servers and services in silos, compute or storage, and so on.
- Look at your observability and error reporting AND its hierarchies. Don't overlook the inconsistencies and what they tell you; two different sources telling you different things is itself useful information.
- Keep great personal notes (scratch pads). Record-keeping is vital during an incident. Keep them in public too. Notes should be standardized and include timestamps and hyperlinks. (State doc? Google? A Markdown editor? Just lists?) He uses a grid that classifies each piece of evidence as correlated/causal, nominal, or odd but uncorrelated.
Not seeing some common thread? Consider broadening thought patterns and getting others in the room to get their opinions. "I don't know" can be very powerful. Can AI help in a mechanical sense (do the collected facts imply something useful for the diagnosis/hypothesize step)?
Phase 3 is diagnosis and hypotheses:
- What are our search patterns? Linear (edge > auth > LB > compute > cache > storage > egress), binary search between chunks to find where the problem is, or induce a change (more of a test) to get more information.
- Think in proximity and correlation: time (when did it start? did something happen at the same time? is it midnight UTC?), space (what's happening where), and action (did someone deploy something? scale up or down? change a feature flag?)
- Don't stop early. Complete the search before stopping the diagnosis step. Investigate what you find (e.g., X in auth is a possibility, but keep looking at the rest of the list).
- What makes a good hypothesis? It must be testable and it must be relevant and specific (based on available information).
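The linear versus binary search patterns above can be sketched in code. This is a minimal, hypothetical illustration (the stage names and probe function are invented, not from the talk): instead of walking the request path stage by stage, bisect it to find the first unhealthy component in O(log n) probes.

```python
# Hypothetical sketch of "binary search between chunks" fault localization.
# Stage names and the health probe are invented for illustration; a real
# incident would consult per-stage dashboards or synthetic probes.

# Ordered stages of the request path, edge to egress.
STAGES = ["edge", "auth", "lb", "compute", "cache", "storage", "egress"]

def healthy_through(stage_index, failing_at):
    """Stand-in health probe: True if every stage up to and including
    stage_index is healthy. `failing_at` simulates the real fault."""
    return stage_index < failing_at

def bisect_fault(stages, probe):
    """Binary-search for the first unhealthy stage instead of a linear walk."""
    lo, hi = 0, len(stages) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(mid):       # everything through `mid` looks fine
            lo = mid + 1     # fault must be later in the path
        else:
            hi = mid         # fault is at or before `mid`
    return stages[lo]

# Simulate a fault in the cache layer (index 4):
print(bisect_fault(STAGES, lambda i: healthy_through(i, 4)))  # -> cache
```

Three probes instead of up to seven; the savings grow with the depth of the path.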
Can AI help in a mechanical sense (what sense can it make for diagnosis)?
Phase 4 is testing and treating. What makes a good test?
- It must work on a hypothesis.
- It should favor the likely over the unlikely and favor fast over slow. ("When you hear hoofbeats, think horses not zebras.")
- It should be mutually exclusive: A positive result should prove and a negative result disprove the hypothesis (or the reverse).
- A test should be free of compounding factors or dependencies, testing the thing it says it's testing.
- A test should be as free from side effects as possible. Verbose logging is wonderful until you max out your Splunk limit.
- A test should have clear outcomes and markers of success. Don't move your goalposts on the fly.
Track your results. Data is gold, even the failures. Put it in the Slack war room, your incident state document, your own timestamped notes. (You'll need it to bring new people up to speed and for the post-mortem.)
Be proactive. When you're waiting on a test result, don't just twiddle your thumbs. Stop and think through the possible outcomes and what you'll do next. What if it succeeds? What if it fails? Who will you bring into the war room? What will you do next? Consider doing it now.
Can AI help in a mechanical sense (develop a test)?
Incident communication skills include:
- Annotate your waits in timestamped forms. (Slack is great... if the participants used it and not just a video call you weren't on.)
- Make plans with momentum. Lots of people have great ideas. It takes work to put an idea into motion. If it needs input or validation or permissions, include the specific unknown and action, and direct the request to a specific person with a time-window for a response.
- Ask questions that drive answers, not silence. They should be psychologically safe. Phrase open-ended questions in ways that invite answers, and encourage people to respond. Instead of "Should we scale up?" ask "What are the downsides of scaling up?" "Any questions?" takes a lot of bravery to answer; what about "What should I clarify?" instead?
Incidents are all about knowledge: What we know, how to get that knowledge, and possibly asymmetric knowledge.
Note: This is not prescriptive, but a good framework.
Human Factors in the Age of AI Ops: Re-Engineering Trust between Humans and Machines
[Direct slides link | Direct video link]
Eddie Redick from CTC Ops was up next. When everything fails at once... cascading service degradation, overlapping automations, and an over-eager AI auto-remediator — you don't rise to the level of your architecture; you fall to the level of your systems thinking.
As AI and automation become deeply woven into the fabric of reliability engineering, teams are learning that convergence isn't just technical — it's cultural, cognitive, and procedural. What happens when human intuition collides with machine logic in the middle of a P1?
He has witnessed countless times where precious outage minutes are wasted chasing false positives. AI is garbage-in, garbage-out, or, as he always says, "Only as smart as you feed it."
Engineers can spend countless Scrum hours planning and building the best next-gen tool, only to have it fall flat. Non-structured data is tricky: if the tool isn't architected effectively, or is starved of good data, its ROI wastes away.
Did the AI auto-remediator make it worse? Were you chasing false positives? Are there cascading degradations? What if there are overlapping (and possibly contradictory) automations?
There's a gap between adoption velocity and building trust. 68% of organizations want AI but only 16% trust it. 62% say that trust has increased over the past year. Fewer than 1/5 fully trust AI to act autonomously.
The trust triangle is between logic, empathy, and authenticity, but:
- Logic breaks when AI can't explain itself.
- Empathy breaks with too much cognitive load.
- Authenticity breaks when it fails silently.
For example: AI softens an image, then changes the background (and the person), then revamps the whole image... badly, and all without telling you.
65% of employees think they're building on solid data, but are they?
75% of leaders think their teams need more training.
Tech debt is a silent killer; old once-accurate docs may no longer be. What if archives get pulled into the current production environment?
Alert fatigue is another silent killer: 73% of alerts are false positives, up a lot in the past year; more AI and more automation mean more noise. 30% of alerts are not even investigated.
Urgency needs to be appropriately scoped; urgency x impact should inform the engagement model, so you don't run a P3 alert as if it were really a P1.
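The urgency x impact idea can be made concrete with a tiny severity function. This is a hypothetical sketch; the scales, thresholds, and engagement models below are invented for illustration, since real severity matrices are organization-specific.

```python
# Hypothetical severity mapping: urgency x impact -> engagement model.
# Scales and thresholds are invented for illustration only.

URGENCY = {"low": 1, "medium": 2, "high": 3}
IMPACT = {"minor": 1, "major": 2, "critical": 3}

def priority(urgency, impact):
    """Map urgency x impact onto a P1-P4 scale."""
    score = URGENCY[urgency] * IMPACT[impact]  # ranges 1..9
    if score >= 7:
        return "P1"  # all-hands engagement, incident commander paged
    if score >= 4:
        return "P2"  # owning team engaged immediately
    if score >= 2:
        return "P3"  # handled during business hours
    return "P4"      # backlog / next sprint

print(priority("high", "critical"))  # -> P1
print(priority("low", "major"))      # -> P3
```

The point isn't the specific numbers: it's that the engagement model is a deliberate function of both axes, so a noisy high-urgency/low-impact alert can't drag the whole team into P1 mode.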
Toil is increasing (in 2025 for the first time in 5 years):
- 24% of time spent on toil
- 49% said AI reduced their workload
- 70% are stressed from on-call and burning out.
Commanding the chaos (CTC) is his framework for resilience when humans and machines converge:
- Psychological composure. Fatigue is real. Leave emotions and titles at the door.
- Systems thinking. Consider what's where, what might be having maintenance and reducing capacity, what hardware failures might happen, etc.
- Automation at scale.
Anyone can run a DRP TTE (disaster recovery plan tabletop exercise)... but could you run one for real? Even as a drill?
We need to move from reactive mode — chasing each alert individually, letting the auto-remediator run unchecked, assuming the AI knows best, skipping correlation and going straight to the fix, making panic-driven decisions — to systems-thinking mode — mapping the blast radius first, correlating signals before acting, having a human validate the AI hypothesis, having a single incident commander own the process, and understanding dependencies before acting.
80/20 rule: 20% is technology (tools, platforms, AI models, automation, and so on) and 80% is redesigning the work. You can't just drop a new tool on a broken process; the divergence can involve culture, process, people, and governance. Shiny and new is fun, but be intentional about it.
Building a trust bridge involves:
- Observe (AI monitors and alerts; humans decide and act)
- Advise (AI recommends actions; humans validate and execute)
- Assist (AI acts with guardrails; humans supervise and override)
- Partner (AI drives; humans trust)
How is trust earned?
- Transparency and clarity
- Override paths
- Guardrails (!!)
- Feedback logs (learn from actions, even mistakes)
Playbook to reengineer trust:
- Audit your signal to noise ratio in alerts.
- Define your AI autonomy levels. What phase/s are you in?
- Build human override protocols.
- Invest in systems thinking training.
- Measure trust, not just MTTR. (This and MTWTF are new useful metrics.)
Takeaways:
- Trust is earned not bought.
- Alert fatigue is real.
- The 80/20 rule in decision making is real.
- Human intuition and machine logic are complements not competitors.
- Composure + System Thinking + Scaled Automation = Commanding the Chaos.
From Chaos to Confidence: How SREs Can Leverage 50 (and Counting) Failure Scenarios to Test AI Readiness
[Direct slides link | Direct video link]
After lunch, Rohan R. Arora and Bhavya from IBM Research talked about how they used sandboxed Kubernetes environments to create over 50 production-inspired failure scenarios that put AI assistants to the test across the full SRE toolkit. The results? Current AI agents like ReAct resolve only 13.8% of scenarios — a reality check for anyone evaluating these tools. This session introduced their evaluation framework and showed how they use it to benchmark AI assistants against real failure patterns, chaos-test their own applications with production-inspired scenarios, and assess whether AI-assisted approaches fit their operational needs. They're building a community-driven repository where SREs contribute real incidents and advance the field together. They wanted us to learn what AI can (and can't) do today — and to help shape what it could do tomorrow.
AI makes a lot of marketing promises... which don't come true. Vendors claim it's 80–90% faster, but the agents are at best 42% accurate with 30% more operational toil. LLM context is key; throwing data at LLMs isn't useful, especially when observability data is noisy or irrelevant. The bottleneck is context quality, not model capacity. Garbage in gives confident garbage out. If you can't replay a fault, you can't measure progress; synthetic, replayable scenarios let you test against what really happens in production.
When the Cure Is Worse than the Disease: Metastability in Recovery
[Direct slides link | Direct video link]
Todd Porter from Meta and Aleksey Charapko from the University of New Hampshire were up next. Dealing with failures is an inevitable part of operating large distributed systems. Luckily, such systems are designed to handle failures and recover from their effects. In this talk, they explored the unfortunate cases in which recovery actions intended to address problems, unbeknownst to the operators, become the cause of even larger failures. This process occurs through natural recovery cascades in large systems, in which the recovery of one system or component triggers recovery in the next. They showed that, via recovery cascades, systems may amplify the recovery cost at each step as the process crosses from one system to another. Moreover, these amplifications can propagate backwards into the systems that have already recovered, creating positive feedback loops that reintroduce and reinforce the failure.
They explained their findings, failure causes and contributing factors, and mitigation strategies using a global-scale message bus that experienced such problems as an example.
They have a process and requirements. There was an incident. They attempted to recover but it didn't work as expected. They issued backfills to correct the corrupted data, but in this case things got worse and it eventually escalated to a widespread incident.
So why weren't the backfills actually working and letting them recover? That was a metastable failure — a performance failure sustained by a positive feedback loop, where the initial trigger started things off and led to an overload or exhaustion of some resource. "Retry" is a common one, often baked into the protocols; it added more and more retries which slowed things down more and more as the load went up and up. Eventual recovery may require literally turning it off and on again.
Given all that, we build fault-tolerant systems... but even their recovery mechanisms may be metastable. A retry is good for intermittent failures, but once it's overloaded the retries only make things worse and amplify the situation, leading to cascading failures.
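The retry feedback loop they described can be shown with a toy simulation. All numbers here are invented for illustration: a transient spike pushes offered load over capacity, failed requests are retried, and the retries keep the system overloaded long after the original trigger is gone.

```python
# Toy simulation (invented numbers) of a metastable retry loop: once
# offered load exceeds capacity, failures generate retries, which raise
# the load further, and the system never recovers on its own.

def simulate(capacity, base_load, spike, retries_per_failure, steps=20):
    """Return the offered load at each step. A one-time spike at step 0
    pushes load over capacity; retries then sustain the overload."""
    loads = []
    carried_retries = 0.0
    for step in range(steps):
        offered = base_load + carried_retries + (spike if step == 0 else 0)
        loads.append(offered)
        failed = max(0.0, offered - capacity)      # work the system drops
        carried_retries = failed * retries_per_failure
    return loads

loads = simulate(capacity=100, base_load=90, spike=40, retries_per_failure=2)
# The spike is gone after step 0, yet load stays above capacity forever:
print(loads[0], loads[1], loads[-1])
```

With retries removed (or shed), the same simulation drops back to the sustainable base load after step 0, which is exactly why load shedding and turning-it-off-and-on-again work.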
Aleksey notes that the dependee provides work to a dependent system, and that dependent relies on work or data supplied by the dependee. When the dependee fails or slows down, it collects a deficit (work expected by the dependent but not delivered). Once the dependee recovers the dependent has to do its own recovery. In their case, the data has to be recovered all at once (granularity problem). The older data is prioritized over newer data for recovery purposes (thermal mismatch)... and there was more than expected (expectation mismatch).
They had to add capacity, manually paced and prioritized recovery traffic, and did recovery shedding (and notified customers about it).
What helps?
- Resources must match replay requirements, especially end to end
- Stream shaping to spread out the rework, have a special recovery rate, or both
- Decide to cut or delay rework
- Design for incremental recovery end-to-end
- In general, avoid the mismatches
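The "stream shaping" item above can be sketched as a simple pacing loop. This is a hypothetical illustration (names and numbers invented, not from the talk): replay the backfill backlog using only the headroom left over after live traffic, so recovery can't overwhelm downstream capacity.

```python
# Hypothetical sketch of stream shaping: pace backfill/recovery traffic
# at a dedicated recovery rate (the headroom above live load) so replay
# doesn't overload the system. Numbers and names are invented.

def paced_replay(backlog, live_rate, capacity):
    """Yield (step, live, recovery) work per step, spending only the
    headroom (capacity - live_rate) on recovery each step."""
    headroom = max(0, capacity - live_rate)
    if headroom == 0:
        raise ValueError("no headroom: add capacity or shed load first")
    step = 0
    while backlog > 0:
        recovery = min(headroom, backlog)
        backlog -= recovery
        yield step, live_rate, recovery
        step += 1

# Replay a backlog of 500 units with capacity 100 and live load 80:
schedule = list(paced_replay(backlog=500, live_rate=80, capacity=100))
print(len(schedule))   # -> 25 steps at 20 recovery units per step
print(schedule[0])     # -> (0, 80, 20)
```

The `ValueError` branch encodes the first bullet: if resources don't leave headroom for replay, pacing alone can't help; you must add capacity or cut rework.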
Longer term, look into the operational and design considerations to prevent the problem from happening to begin with.
Wellbeing and Burnout
[No slides | No video]
Discussion track talks were ephemeral with no slides or video recordings and run under the Chatham House Rule which dictates that participants in a meeting are free to use information gathered, but cannot reveal the identity or affiliation of the speakers or any other participants.
Is your on-call rotation sustainable, or is it just "fine for now"? What happens after the incident is resolved and the adrenaline wears off? How do we avoid praising heroics but ignoring the toll it takes on the humans behind the screens? After the afternoon break, facilitators Beth Adele Long from Adaptive Capacity Labs and Sarah Butt from Salesforce led an open and honest discussion about wellness, burnout, and caring for the humans in our systems.
We had four discussion topics:
- Burnout:
- What does burnout mean to you?
- What are signs you're near burnout?
- Have you ever recovered from burnout?
- What strategies helped you?
- Did you make major lifestyle changes?
- How are stress and burnout related?
- What "unmet needs" do you have? How does that manifest as stress in your life?
- Individual Wellbeing:
- How have you been able to build capacity to handle more stress?
- What proactive wellness strategies have you found useful?
- What tends to threaten your wellbeing? What are early warning signals?
- What helps you recover after a big incident? After a tough on-call rotation?
- Team Wellbeing:
- Does your team openly discuss wellbeing? How do you cultivate an environment where people feel comfortable sharing on that topic?
- What strategies have supported your team's wellbeing, especially for high-stress operational roles?
- What does a sustainable on-call rotation look like to you? What doesn't work?
- What warning signs do you look for to know your team's wellbeing might be threatened? How do you monitor that?
- Organizational Wellbeing:
- What have you seen support psychological safety in an organization? What undermines it?
- What practical strategies support the wellbeing of responders on long-running or stressful incidents?
- How do you help reduce anxiety about incident retrospectives?
- If you notice someone who feels distressed during an incident, how would you help them?
Each table discussed one then reported out to the group, and then we shuffled and had another discussion with other people but without a second report-out.
The Gashlycrumb Tinies of AI Networking You Must Know (or Languish!)
[Direct slides link | Direct video link]
As the worlds of AI workloads and SRE practices converge, more and more teams are expected to ramp up on a whole new vocabulary. Terms like "NCCL", "PXN", "MoE" and "queue pairs" are thrown around, yet these concepts can remain elusive, buried in dense academic papers or scattered vendor documentation.
Inspired by Edward Gorey's whimsical illustrated alphabets, Lerna Ekmekcioglu's talk provided a structured, approachable tour through essential AI networking concepts. She demystified the specialized terminology and building blocks of AI networking, explained the concepts in clear language with practical context and a touch of darkly playful illustration to help the concepts stick.
AI in SRE
[No slides | No video]
Discussion track talks were ephemeral with no slides or video recordings and run under the Chatham House Rule which dictates that participants in a meeting are free to use information gathered, but cannot reveal the identity or affiliation of the speakers or any other participants.
After the morning break I went to the discussion session on AI in SRE facilitated by Courtney Nash of The VOID and Robbie Ostrow of OpenAI. From writing scripts to analyzing incidents, AI promises to change how SREs work. But how much of that is practical, and how much is vaporware? Attendees engaged with other conference participants to talk through what is actually working (real tools and use cases), what is failing (where it creates complexity and noise), and the messy reality of interacting with non-deterministic systems while supporting production environments.
Infrastructure Management
[No slides | No video]
Discussion track talks were ephemeral with no slides or video recordings and run under the Chatham House Rule which dictates that participants in a meeting are free to use information gathered, but cannot reveal the identity or affiliation of the speakers or any other participants.
After lunch I stayed in the discussion track for the Infrastructure Management session.
Whether you're wrestling with the eternal "buy vs. build" debate, fighting configuration drift, or figuring out how to scale without setting money on fire, there's a seat for you in this loose, open gathering with facilitators Clint Byrum from HashiCorp and Chris Jones from Google. Conversation ranged from wrangling Kubernetes to the depths of networking and anything in between, in an engaging community conversation across multiple tables with space to cover the full spectrum of infrastructure challenges.
We had two discussions. My first group discussed build versus buy, with the consensus being "It depends." What are your requirements to purchase or lease, and to support it long-term? Does something exist that meets them? If you build it, how will you support it as times change?
My second group discussed community and how to create, develop, maintain, and grow one. The general consensus is that communities, especially online communities, move as needed (for example, from Usenet and IRC to email to social media like Facebook or X FKA Twitter or even to Discord and Slack).
Plenary: Reliability Equilibrium: The Hidden Playbook behind SRE Influence
[Direct slides link | Direct video link]
Reliability and velocity often feel like opposing forces, but what if we treat them as strategic games? In her talk, Microsoft Azure's Daria Barteneva reframed sociotechnical trade-offs through a game theory lens, using Nash equilibria, Stag Hunt, Public Goods, and Shapley value to model real-world SRE dynamics.
We explored why, without shared decision models, teams default to fragile equilibria like "freeze all changes," and how mechanism design — error budgets, canary deployments, and progressive rollouts — can shift incentives toward safer, higher-utility outcomes.
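The error-budget mechanism mentioned above is easy to make concrete with arithmetic. A small worked example (numbers invented): the budget turns "freeze all changes" into a quantified tradeoff between reliability and velocity.

```python
# Worked example (invented numbers) of the error-budget mechanism:
# an availability SLO implies an allowed-downtime budget per period.

def error_budget_minutes(slo, period_days=30):
    """Allowed downtime in minutes for a given availability SLO."""
    return (1 - slo) * period_days * 24 * 60

budget = error_budget_minutes(0.999)  # 99.9% over 30 days
print(round(budget, 1))               # -> 43.2 minutes

# If 30 minutes are already burned, 13.2 remain; teams can keep shipping
# (canaries, progressive rollouts) until the budget is spent, rather
# than defaulting to the fragile "freeze all changes" equilibrium.
print(round(budget - 30, 1))          # -> 13.2
```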
Grounded in SRE practice and backed by DevOps research, this session was intended to equip us to diagnose bad equilibria, design guardrails, and influence system-level behavior — not just symptoms, and to learn how to apply cooperative and non-cooperative game theory to reliability engineering and craft strategies that scale across teams, products, and platforms.
Plenary: The Power of Stories
[Direct slides link | Direct video link]
We humans love stories, so much so that we both tell them and listen to them for fun. As SREs, our community has a particular fondness for incident stories. In his talk, Lorin Hochstein, Staff Software Engineer for Reliability at Airbnb, discussed how these kinds of incident stories make for a more effective tool for learning from incidents than bullet points or metric trends. We saw how stories provide us with glimpses into the complexity of our system that we'd otherwise never see, and enable us to learn from the experiences of others. We explored what makes for an effective story, as well the dangers of stories that oversimplify the nature of complex system failure. And we looked at how to foster an internal incident storytelling culture within an organization.
Lorin is a regimented order muppet type, not a fluid chaos muppet type.
Asking "How could this have happened?" is a common human response to (e.g.) an incident. Stories are how humans make sense of events like those. They're often memorable, more than a short bullet-point list.
The biggest problem in software engineering is how to get the right information into the heads of the people who need it. (This ties into Daria's note about game theory too.)
Stories are useful in the context of incidents. "Only a fool learns from his own mistakes. The wise man learns from the mistakes of others." (Attributed to Otto von Bismarck)
Direct experience is the best way to learn. The next best way is to learn through someone else and their experience: Watch someone else's expertise in action (e.g., shoulder-surf the troubleshooting back in the Before Times). Can't watch in real time? Tell a story.
A good story for social science has two criteria: It has to be anomalous or unexpected ("something's wrong") and it has to be immutable (important details are preserved). As an example, look at the Therac-25 story from the 1980s. It involved race conditions but also a terrible UI that let the operator make mistakes too easily.
There are different kinds or styles of incident stories you can tell. His favorite is the horror story ("And then after we failed over... the problem followed us into the other region!"). Another common one is the mystery ("But nothing has changed!"). His least favorite is the morality tale ("And because the failing test was ignored, the bad code made it to production!"), in part because they can be dangerously oversimplified. The solution or antidote to bad initial stories is second stories (followups).
Stories are never the whole truth; the story can be told from various different perspectives. For example, Richard Feynman believed that engineers and managers had different views on the risks involved in the Challenger explosion. Edward Tufte thought it was bad data visualization. Diane Vaughan noted that deviance was normalized.
These are three different views of the same event.
Takeaway: When you write up your incident retrospectives or post mortems, tell it as a story. Use a "Narrative description" as a section that describes how the events of the incident unfolded over time (chronologically). You can also chunk events into "episodes" rather than a single big timeline.
If you can't use the incident review for it, you can have an incident storytelling session independent of it. They call it "Once Upon an Incident" and it's like a campfire storytelling session.
Like so many other things, storytelling is a skill. Most of us can receive stories, but not everyone can tell them well. You'll get better with practice. Fortunately, if you're involved in incidents you'll have lots of source material.
Go out and tell a good story.
Closing Remarks
[No slides | No video]
Conference chairs Patrick Cable and Laura Maguire closed the conference by once again thanking everyone involved, from coordinators to speakers to attendees.