Conference Report: 2025 SREcon Americas

The following document is intended as my general trip report for the 2025 SREcon Americas conference, held in person in Seattle, WA from March 24–26, 2025. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.


Monday, March 23

Travel day! As usual I managed to wake up before the alarm. I'd showered the night before so I packed up the toiletries, CPAP, laptop, and phone; set the thermostat to heat to only 60F; loaded the car; and drove off to Detroit Metropolitan Airport.

Traffic was moving at or above posted speeds. The bus to the McNamara terminal took a little while to arrive. Once I got to the right terminal I did self-serve bag tagging and dropped off the bag.

Badge pickup opened at 5:00 p.m. and mine printed after I scanned the QR code they'd sent. There was a welcome reception from then until 7:00 p.m. with nibblies (mostly crackers, cheese, and nuts).

After the reception I headed back to the room to write up some of the trip report and crash.

 


Tuesday, March 24

Opening Remarks

[No slides | No video]

Despite my challenge to the co-chairs to be the first in recorded history to begin the opening remarks on time at 8:45 a.m., the tradition of a late start continued; they didn't start until nearly 8:47 a.m. Nevertheless, the conference began with the usual remarks from the co-chairs, Patrick Cable from DraftKings and Laura Maguire from Trace Cognitive Engineering. They reviewed the code of conduct; we want to be a safe environment for everyone. The Slack channels are where changes are announced and where the talks' Q&A takes place (#26amer- was the channel prefix, and we used the :question: emoji (❓) as the question prefix; moderators asked the questions on our behalf).

Thanks to the program committee, track co-chairs, and room captains for putting together the program; the USENIX conference office for all the logistics; and the sponsors (the showcase will be open all three days, including a happy-hour this evening). Birds-of-a-Feather sessions (BOFs) are tonight and tomorrow.

We have 814 people in attendance (up significantly from the roughly 550–600 of the past two years), 81% of whom are new to SREcon, and only 33% of whom are here with one or more colleagues, representing 17 countries and 274 institutions.

Plenary: Taming the Unpredictable: Reliability in Chaos

[Direct slides link | Direct video link]

Michelle Brush, Engineering Director at Google, said that the increasing volume of code written by AI will only accelerate the complexity of our systems. We are moving beyond predictable systems that can be managed with traditional methods like thorough project plans, runbooks, and unit tests, into an era of truly complex systems that are vast and difficult to comprehend fully. These immensely complex systems will behave almost nondeterministically. We need new strategies.

Her presentation delved into why robust reliability practices are not just helpful but essential for navigating this explosion in complexity. It shared strategies for conquering unpredictability including building rigorous evaluations, implementing generic mitigations, recreating reproducibility, and developing software with a risk-first approach. Most importantly, it discussed why humans will always be critical to taming the ever-growing complexity.

Our job as SREs is to worry about how things will fail and recover. Any line of code can be load bearing, even if written by LLMs/AI, with all the assumptions the engineer made when they wrote it (which may have been valid at writing time but no longer valid months or years later).

Systems will get bigger and more complex faster than ever. That leads to having to deal with emergent behavior in systems and organizations, making it harder to understand cause and effect, or why something happened. This can feel nondeterministic or chaotic.

As systems fail we document undocumented assumptions and rewrite policy, causing the systems to grow, and eventually the inner "small" system is a black box.

We can ask agents to find and fix things. For example, a playbook may contain outdated commands, so we could run an agent across all our playbooks to fix them all at once.
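As a toy sketch of that idea (my own illustration, not anything from the talk), the non-agent core of "fix all the playbooks at once" is a rewrite pass over every playbook; the command table and function names here are invented:

```python
import re

# Hypothetical mapping of outdated commands to replacements; a real agent
# would infer these from context rather than use a fixed table.
REPLACEMENTS = {
    r"\bkubectl get pods --all-namespaces\b": "kubectl get pods -A",
    r"\bdocker-compose\b": "docker compose",
}

def refresh_playbook(text: str) -> tuple[str, int]:
    """Rewrite outdated commands in one playbook; return (new_text, fix_count)."""
    fixes = 0
    for pattern, replacement in REPLACEMENTS.items():
        text, n = re.subn(pattern, replacement, text)
        fixes += n
    return text, fixes

playbook = "Restart with docker-compose up; check kubectl get pods --all-namespaces."
updated, count = refresh_playbook(playbook)
print(updated)  # -> Restart with docker compose up; check kubectl get pods -A.
print(count)    # -> 2
```

Running this across a directory of playbooks is then a simple loop, which is exactly the kind of mechanical sweep an agent could do in bulk.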

Treat the complex work as hypotheses/experiments. Our jobs will move from being the human feedback loop to building feedback loops that scale with change.

Test your generic mitigations (e.g., turn it off and on again) as if they're fitness functions. Other fitness examples:

What isn't changing? We still need to be the people to show up, respond when things fail, and get things working again.

Plenary: Mean Time to WTF: Why Developer Experience Frameworks Belong in Your Incident Retrospectives

[Direct slides link | Direct video link]

Dr. Nicole Forsgren defined SREs as developers who happen to focus on reliability. You write code, build tools, manage complex workflows, and experience friction every single day. That friction has a measurable cost: slower incident response, more mistakes under pressure, higher on-call burden, and ultimately, degraded system reliability.

She also wanted to address the elephant: AI is generating way more code, leading to more deployments and bigger changes to your systems. If your operational friction is already high, AI is about to amplify it exponentially.

This talk applied developer experience frameworks (like SPACE and DORA) to operational work. She discussed how to measure SRE experience, connect it to reliability outcomes, and identify high-leverage changes to reduce friction. We could walk away with practical things to do tomorrow and a framework for making the business case that investing in SRE experience is a reliability strategy.

Friction (error under pressure) is a reliability risk. Failures — like cognitive overload, tool failures, and process bottlenecks — can affect every incident the SRE is working on. She used the 2012 $440M loss from Knight Capital Group as an example. Is observability built into your new systems?

Technology is easy; people are hard. Technology moves fast... but people don't.

AI multiplies friction: Deployment frequency is up, the incident rate goes up with it, and on-call pressure compounds. AI can make our friction more expensive (for example, deleting a database despite being told not to). It changes the incident types we see. AI-generated systems are opaque by default (as opposed to human-written systems, which are human-understandable). Runbooks accounted for known failure patterns, but as systems change there are more, and more novel, failure modes. In the past, diff output was readable enough to see what changed, but now it's hard to reason about 1,000+ line diffs under pressure.

The WTF has to be visible (if not measurable) to be fixable. "Mean Time to WTF" is a new metric: How long does it take to go from receiving an alert to "I understand what's happening." It can act like a leading indicator of how brittle a system is. Some low-cost high-signal measurement starting points are:
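As a concrete sketch of the metric itself (my own illustration, separate from her starting points; the timestamps and field names are invented), Mean Time to WTF is just the average gap between alert and understanding:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when the alert fired vs. when a responder
# first recorded "I understand what's happening" (e.g., in the war room).
incidents = [
    {"alerted": datetime(2025, 3, 1, 2, 10), "understood": datetime(2025, 3, 1, 2, 34)},
    {"alerted": datetime(2025, 3, 8, 14, 5), "understood": datetime(2025, 3, 8, 14, 17)},
    {"alerted": datetime(2025, 3, 20, 9, 0), "understood": datetime(2025, 3, 20, 9, 52)},
]

def mean_time_to_wtf(incidents) -> float:
    """Average minutes from alert to understanding; a brittleness indicator."""
    deltas = [(i["understood"] - i["alerted"]).total_seconds() / 60 for i in incidents]
    return mean(deltas)

print(f"MTTWTF: {mean_time_to_wtf(incidents):.1f} minutes")  # -> MTTWTF: 29.3 minutes
```

Tracking this over time (rather than any single value) is what would make it a leading indicator.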

Now that we have an understanding and an idea of our opportunities, we can make the case for funding (timing matters: do it near an incident so the problem isn't invisible). CISQ estimated annual technical debt in the US at $1.52T. Developers feel about 68% productive; the remaining 32% is about $300B in lost GDP. It's not about "making the developer happy"; it's about "staying competitive."

Productivity metrics can be gamed; experience metrics less so. It's not about measuring output but removing obstacles.

Stakeholders will often fund reliability if it's properly framed. They're less likely to fund happiness. To make the business case:

One example is Amazon's Dave Anderson. He had the mandate and data but no authority to change the VPs' roadmaps or priorities. He generated monthly S-Team reports that named the VP or director in charge wherever there were gaps. Directors and VPs wanted off the report and adjusted their roadmaps and priorities to do so.

Some tasks we can do right now:

Also:

Varieties of SRE

[No slides | No video]

Discussion track talks were ephemeral, with no slides or video recordings, and were run under the Chatham House Rule, which dictates that participants in a meeting are free to use information gathered but cannot reveal the identity or affiliation of the speakers or any other participants.

After the morning break, Kurt Andersen and Sebastian Vietz spoke about the varieties of SRE. Ask 10 people what SRE is, and you'll get 11 answers. There is no single "correct" way to do SRE. From embedded engineers to centralized consulting teams, every organization adapts the core concepts to fit its reality. This unconference discussion session invited us to peel back the label and examine the reality of the role: a facilitated discussion about the different flavors of SRE, how our organizations define it, and whether the title still fits the work we do today.

Goals: Make one connection with someone at the table and take one insight away.

We had eight 10-top tables in a mostly-full room, and ran through three exercises:

Executing Chaos Engineering in Production at a Critical Financial Institution

[Direct slides link | Direct video link]

After lunch I went to Leonardo Marques' talk about Chaos Engineering. He wanted us to discover how Chaos Engineering transformed a high-stakes financial ecosystem at Bradesco processing thousands of transactions per second. This real-world case study unveiled a reproducible framework for risk-averse organizations, blending fault injection, automation, and observability.

Key takeaways included safe experiment design with governance guardrails, automated chaos workflows, and multidisciplinary GameDays. Results: 73% reduction in MTTD, 10 hidden vulnerabilities exposed, five new metrics, and a shift to proactive reliability.

He provided a compliance-friendly methodology to turn failures into insights, bridging theory and measurable business impact in critical systems.

From Thundering Herd to Zero Outages: Building Reliable Inventory Sync

[Direct slides link | Direct video link]

Rushikesh Ghatpande from Broadcom (VMware before the buyout) said that managing accurate inventory across distributed infrastructure is critical for security policy enforcement and operational reliability. Enterprise datacenter software requires centralized policy management across thousands of servers and hundreds of thousands of VMs and containers, yet resources are distributed across multiple data centers.

This talk shared a battle-tested inventory synchronization protocol that evolved over six years of production experience, handling real-world challenges from thundering herd problems during full datacenter restarts to fairness in queue processing. The protocol uses a 5-stage finite state machine to ensure reliable, consistent inventory sync while preventing system overload.

He explained how they evolved over two years from a naive 3-step process to a robust 5-stage protocol, how they solved the thundering herd problem, ensured fairness in queue processing, and separated connection establishment from application readiness. He shared empirical analysis that led to specific timeout values and demonstrated how bidirectional communication patterns eliminated message ordering complexity.

This protocol has been validated at scale across 10,000+ servers with zero customer escalations over four years. The patterns are immediately applicable to any distributed state synchronization challenge — whether managing VMs, containers, network devices, or any distributed resources.
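The talk didn't share code, but one standard mitigation for the thundering-herd problem it describes is jittered exponential backoff on reconnect, so thousands of hosts don't all hammer the central service after a full datacenter restart. This sketch assumes full ("AWS-style") jitter with invented parameter values:

```python
import random

def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Full-jitter exponential backoff: each client picks a random delay in
    [0, min(cap, base * 2**attempt)], de-synchronizing the herd."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# After a restart, each host sleeps reconnect_delay(attempt) before retrying;
# with attempt=3 and base=1.0 the ceiling is 8 seconds.
delays = [reconnect_delay(attempt=3) for _ in range(5)]
assert all(0 <= d <= 8.0 for d in delays)
```

The cap keeps worst-case reconvergence bounded, which matters when the sync protocol's later stages depend on connection establishment finishing.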

From ESXi to Kubernetes at the Edge: Modernizing 1,400 Edge Locations with Open Source

[Direct slides link | Direct video link]

After the afternoon break, Jasjit Singh and Vishal Jethnani, both from Loblaw Companies Limited, Canada's largest food and pharmacy retailer, talked about their migration from ESXi to Kubernetes. When virtualization licensing costs increased tenfold, a large retail organization rethought how it ran infrastructure across 1,400 edge locations. This session shared how existing store hardware was transformed from a VMware-based setup into a fully open-source, Kubernetes-powered platform, without adding new hardware or disrupting business operations. They explained how ESXi was replaced with Kubernetes clusters, centrally managed using GitOps with FluxCD, and how open-source tools like Dex, Helm, Grafana, and KubeVirt were used to deploy and observe workloads at scale. The talk focused on practical SRE lessons around automation, resilience at the edge, and cost-efficient modernization that attendees could apply to their own distributed environments.

Before: A typical store had a self-contained ESXi environment of two nodes, supporting the pharmacy and POS applications and DB2, with shared storage. They were monitored by IBM's ITCAM. It was optimized for stability not flexibility. The migration required touching every store.

Constraints: No budget, didn't want proprietary software, and no tolerance for downtime. Also, the Pharmacy Management System was in-house legacy code, and they had no time to modernize apps to microservices. Edge realities meant limited compute and storage, constrained network connectivity, no on-site support, failures must be recoverable remotely, and every store had to operate independently.

Answer: Simplify, eliminate custom code, and go open source: openSUSE as the foundation OS, Kubernetes on top of it, KubeVirt for VM management, and the applications on top of that. Every store is now a small, declaratively-managed data center.

The migrations took place in four steps, each of which was reversible:

  1. Pre-cutover — Introduce a third node in the store (whose prior workload was migrated to the cloud), then set it up with Kubernetes. Move individual components over and modernize, validate workloads, and move some of the applications to microservices to reduce dependencies.

  2. Cutover night — After the store closed, consolidate ESXi workloads on one node, expand the Kubernetes cluster to a 2-node cluster, sync data, switch IP addresses. (The old VMs on the shut-down ESXi node remained.)

  3. Safe end state — Apps are on Kubernetes and the remaining ESXi node was intact.

  4. Day-2 state — Once the store was happy, convert the final ESXi node to a third Kubernetes node in the cluster.

How did they scale to 1,400? A small, focused five-person team: a TPM and four deployment specialists. Having repeatable automated processes was essential. They started slow with one store per night, refining runbooks and automation. Once they were happy they went to 10 stores per night. At peak they were doing 40 stores per week.

The impacts were measurable: no OS, virtualization, or monitoring costs, and a standard platform that supports modernization without another migration.

Key takeaways: Apply SRE principles at the edge, make small architectural bets for the large-scale impact, and have a framework for modernization readiness. Modernization isn't about new tools but reducing operational surface area. Execution is more important than the technology stack.

Beyond Blanket Freezes: Enabling Safe Innovation During Critical Events at Netflix

[Direct slides link | Direct video link]

Prachi Jain and Sandhya Narayan shared how Netflix is replacing blanket deployment freezes, a common safety lever during high-risk periods that often slows teams down and creates bottlenecks, with data-driven, service-specific risk management that enables safe, continuous delivery even during critical events. They discussed how to classify services by risk, integrate this into CI/CD, and empower teams to ship quickly without sacrificing reliability.

Not all systems play the same role. For example, customer-facing systems are riskier to update than internal-only systems. An all-or-nothing freeze can delay security updates and stack critical updates (bug fixes and feature requests) in a backlog that all deploys once the freeze lifts, making it harder to identify any breaking changes.

Rather than an all-or-nothing freeze they categorize based on risks:

Classify risk so not everything is treated as being as important as the play button.

The four risk signals are:

This combines into a more nuanced risk rating (low, medium, high) that supports a better decision.

Bypassing a freeze becomes a business decision: Look at event type, service tier, risk signals, and resilience tactics and compare it against the rubric. It's built into the deployment pipeline for consistent, data-driven decisions.
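A minimal sketch of what such a rubric-in-the-pipeline might look like; the signal names, weights, and thresholds here are invented for illustration, not Netflix's actual rubric:

```python
# Hypothetical weighted risk signals; a real rubric would also factor in
# event type, service tier, and resilience tactics.
WEIGHTS = {"recent_incidents": 3, "customer_facing": 3, "large_diff": 2, "no_canary": 2}

def risk_rating(signals: dict) -> str:
    """Collapse boolean risk signals into a low/medium/high rating."""
    score = sum(w for name, w in WEIGHTS.items() if signals.get(name))
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"

def may_deploy_during_event(signals: dict) -> bool:
    """During a critical event, only low-risk changes deploy automatically;
    anything else requires an explicit business decision to bypass."""
    return risk_rating(signals) == "low"

print(risk_rating({"customer_facing": True, "large_diff": True}))  # -> medium
print(may_deploy_during_event({"no_canary": True}))                # -> True
```

Encoding the rubric in the pipeline is what makes the bypass decision consistent and auditable rather than ad hoc.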

That leads to a feedback loop: After big events they review everything to see what worked, what broke, and what was unnecessarily stressful. Was it mis-tiered? Did the risk signals fire too often or not often enough? Wire any changes into the CI/CD workflow rubrics.

In summary: Make intelligent risk- and data-based decisions. Automate the risk signals (and maybe include a security flag). Adopt essential resilience tactics to detect early before a regional or global outage. Listen to your team. Start small and incrementally build.

Reliability in the Big Leagues: How SRE Powers America's Pastime

[Direct slides link | Direct video link]

Jessica Johnson and Chris Alexander said that in professional sports a "timeout" is strategic, but "downtime" is a disaster. At Major League Baseball, SRE covers the whole field — ensuring real-time data integrity for millions of fans and powering the critical on-field technology that impacts every play. This session went beyond the dugout to reveal how they built a Major League SRE practice from the ground up. They shared their journey of rebuilding trust through psychological safety, how they "score the game" using advanced SLOs, and how they are now moving to the offensive as Opening Day approaches. They gave us a look at the unique hurdles of sports tech and shared a gameplan for fielding a championship-caliber SRE team.

Building reliable and resilient systems is a shared goal; the SRE team is just one piece of the puzzle. They work closely with the platform team. The talk was about their reliability challenges and technologies.

MLB operates at the intersection of live entertainment and high-scale technology, servicing a massive interconnected ecosystem. Their technology innovations include:

Reliability hurdles included:

How did we get here?

What is (and isn't) SRE at MLB? They wear many hats: engineers, evangelists, educators, change agents, and consultants, but they are not relievers or other teams' on-call support. Their vision is to improve the fan and partner experience by making reliability and performance bedrock features of all their technologies. They have four core values: integrity, leadership, collaboration, and empathy.

They track progress with blameless incident metrics. They needed standardization and data quality. You need context to figure out the "why."

What's coming up?

Key takeaways:

 


Wednesday, March 25

Ghosts in the Interview Loop and Avoiding AI Taylorism

[Direct slides link | Direct video link]

Andrew Hatch from Cisco ThousandEyes said that AI tools are now living in our interview loops, reducing once-rigorous SRE coding interviews to something solved by an LLM in seconds. Remote interviewing only makes cheating easier and harder to detect, and system design interviews are fast-following as the tools grow in sophistication.

In 2025 ThousandEyes collided with this reality: AI-assisted cheating broke their coding rounds, requiring them to pivot to take-home challenges that test thinking, not just output. This talk discussed interviewing and the emerging threat these tools are leading us toward: a new dawn of Taylorism, albeit wrapped in cheerful agentic/LLM chatter (as opposed to checklists and a stopwatch), reducing skilled engineers to mindless prompt operators, eroding expertise, lowering morale, and leaving fragile systems and codebases drowning in a sea of AI slop.

He wove humor, critique, and lived experience into his talk to encourage debate on what we are hiring for in this new era of our industry.

Hiring used to involve some kind of coding, networking, hardware, distributed systems design, human factors, and incident management. Interviews were almost always in person, at least in part. You could gauge how candidates thought, how personable they were, how shifty they were, whether they were who they said they were, and so on. Often there would be some phone or Zoom screens; in his case, he then had to fly from Australia to California for six hours of back-to-back in-person interviews.

And then there was COVID. We kept doing the same process, but entirely online. And now we have AI, so the previous structured interviews don't work as well: when candidates got to the coding round, strange and unexpected things would happen. Interviewers would say "Please don't use AI tooling for the coding questions." Candidates would be idle for 10 minutes, but their eyes would move and they weren't typing on screen. That made the interviewers suspicious... and then the code came out perfectly, because the candidates had fed the problem into ChatGPT.

They changed their interview: Watched the movements, ran problems through multiple AI tools, and then sent them off with a take-home assignment.

This needed more thought:

The irony is that there's a lot of top-down pressure to use AI, and no one wants to be left behind. But if we tell candidates not to use it and then demand they do on day 1, how do we assess actual expertise? They interview hard for expertise and don't want to engineer it out of the role. Also, what happens to creativity, innovation, and critical thinking? They want those, not just technical skills.

This leads to the war on expertise. We've engineered expertise out of work for centuries. The printing press is a great example: Only a few could write calligraphy, but the printing press spread literacy to the world. The spinning jenny, arc welder, power mitre saw, and GPS navigation are all tools that made things more efficient. But our distributed complex-system expertise is different [dynamic, stochastic, emergent, unstable, non-linear, heterogeneous, evolving, context-sensitive, and so on].

Complexity is hard, so we simplify. The 5 Whys MTTR metric is garbage; it's a way to diagnose an issue... some of the time. It's good for simplifying. We've relied on simplified versions of complexity for a long time, and it does help... but someone (like SREs) still needs to understand the actual complexity.

What if something could reduce the reliance on expertise... like AI?

It's not really a silver bullet. AI can handle the simple case but not the actual complexity. See Taylor's 1911 The Principles of Scientific Management. A lot's been written about how separating thinking from doing isn't really workable in the long run. If we blindly accept what the technology says to do we may have problems.

Taylorism works for linear repeatable processes (if you don't care about worker engagement, happiness, or innovation). That doesn't describe modern software systems, and hasn't for decades, especially with the real life complexity.

AI is really good at generating plausible bullets. The cost: the difference between apparent certainty and actual understanding will continue to grow, leading to a true knowledge gap.

AI gives many workers the illusion of expertise. We're getting a lot of pull requests that are junk, turning infrastructure-as-code into a giant bucket of slop. We've seen cognitive offloading in schools: not building mental models and delegating cognition to a tool means you'll do worse on the exam a few months later. But learning requires real-world experience; it's messy and it breaks, but it's actual learning. Understanding comes through trial and error, adaptation at the edge, and testing. This is the fundamental mindset of good SREs.

When does AI stop being a tool and become the operator? And how can we recognize this? We need to know when the tool is helping versus just giving us the answer.

Who's responsible when the operator (now AI) is hallucinating? And who has the expertise to fix it (especially since we outsourced all the expertise to AI)?

Complex System control planes become as complex as the systems they represent (even if they look simple).

SREs still need complex system skills:

AI is a great tool... but it's only a tool. It must not be the operator lest it enshittify the production system you're accountable for. We have to adapt:

Humans are and have always been the most adaptable components of any complex socio-technical system. The ability to learn and build expertise is critical.

So You Want a New Incident Commander: Lessons from Building Incident Response Teams

[Direct slides link | Direct video link]

Vanessa Huerta Granda, Technology Manager for Resilience Engineering at Enova International, spoke on Incident Command (IC). It isn't a badge for the most senior engineer but rather a sociotechnical leadership skill that keeps teams aligned, reduces cognitive load, and builds trust during outages. This talk shared lessons from a decade of building IC programs across SRE organizations, including how to identify, train, and support effective Incident Commanders without burning out your best responders.

Vanessa was the only incident commander and the only one doing post-mortems at her company. It was fun at the time, but looking back not so much: She was a single point of failure. How do you dig your way out of that, especially given that incident commander is a sociotechnical role? Today their incident response system is different, with many trained commanders. Incidents are calmer, communications are clearer, and there are no single points of failure. It took time, because they had to rethink the role.

Incident command is a sociotechnical role for a sociotechnical problem: humans interacting with a complex system under pressure. Incidents often start as a technical failure, but they can quickly become an organizational event, with different people wanting updates, suggesting changes, or communicating different things to different people. You're coordinating a group of humans interacting with a complex system under pressure.

Why do we need them? To create the conditions for the technical experts to do their work effectively. They coordinate the people, information, and decisions so the engineers can focus on the remediation. They're more like an orchestra conductor: they keep the tempo, making the performance work as a whole, despite not playing the brass and the strings.

The incident commander role sits at the intersection of people (focus and avoid duplicate work, communication, and decision flow), systems (shared situational awareness), and the business (knowing what matters to the organization: financials, customer service, regulatory and compliance, and so on).

Anti-patterns: The commander doesn't need to be (and in fact likely shouldn't be) the strongest engineer (who should be working on the problem), anyone on-call (the incident commander is a skill not an operational responsibility), or the most senior person (who can be perceived as scary and who may have strong opinions about technical or organizational solutions). The incident commander should be facilitating the discussion.

So what should we look for? Skills in communication, sociotechnical leadership, and cognitive load management. And you need more than one person!

How do you convince leadership to do this? It's a business decision, so speak business: what's the cost of uncoordinated events (not just to the bottom line of revenue, but the engineering cost), evidence for what happened in events with and without an incident commander, and start small and prove your value (you don't need five new headcount on day one).

How do you build a sustainable incident response team?

The goal is to build sustainable capacity over time, not to be perfect.

Epistemology of Incidents and Problem Solving

[Direct slides link | Direct video link]

After the morning break SRE Jack Kingsman from Atlassian spoke about incidents. In high-pressure incident response, critical fundamentals of thought, action, and communication matter more than ever. Engineers need concrete criteria and examples of how to think and problem-solve during incidents, and answers to big questions: What fundamental decision-making loops should we orient ourselves in when disaster strikes? How do we reason about trapping the location and cause of unknown issues? What specific qualities make a hypothesis good or worthwhile, and how do we construct effective tests to prove or disprove them? How can we structure notes and progress updates to provide the most signal and the least noise in fast-paced situations?

Drawn from nearly a decade of experience in Site Reliability Engineering, from small startup to publicly traded SaaS firm, this talk helped attendees level up how we think and act when it matters, and equipped us with the concepts to teach those skills to others.

The core of good incident management is knowledge. This talk was about how we know things (epistemology is the study of knowledge):

Takeaway: Take away a new paradigm or script for yourself or your teams.

An incident loop gives a road map about what to do at a given stage.

Phase 0 is the detection and declaration of an incident. It boils down to two universal things:

Phase 1 is survival and triage. Keep the boat afloat (survive first) and buy time in reversible ways: rollbacks, scaleups, circuit breakers, access denial, load shifting, failover, etc.
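Of the buy-time mitigations listed, the circuit breaker is the easiest to sketch; this is a generic textbook illustration (not code from the talk), with invented thresholds:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures the
    circuit opens and calls fail fast until `cooldown` seconds have passed,
    at which point one trial call is allowed through (half-open)."""
    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at, self.failures = None, 0  # half-open: allow a retry
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Two consecutive failures open the circuit; further calls fail fast
# without touching the struggling backend.
breaker = CircuitBreaker(threshold=2, cooldown=60.0)
def flaky():
    raise ValueError("backend down")
for _ in range(2):
    try:
        breaker.call(flaky)
    except ValueError:
        pass
```

The key property for triage is reversibility: closing the circuit again is just letting the cooldown elapse, so the mitigation buys time without committing you to anything.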

Phase 2 is examination. Gather evidence, organize our thoughts and observations, laying down the right information in the right way to make the solution obvious:

Can AI help in a mechanical sense (do the collected facts imply something useful to the diagnose/hypothesize step)?

Phase 3 is diagnosis and hypotheses:

Can AI help in a mechanical sense (what sense can it make for diagnosis)?

Phase 4 is testing and treating. What makes a good test?

Track your results. Data is gold, even the failures. Put it in the Slack war room, your incident state document, your own timestamped notes. (You'll need it to bring new people up to speed and for the post-mortem.)
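A minimal sketch of the timestamped-notes idea (entirely my own illustration; the class and message formats are invented):

```python
from datetime import datetime, timezone

class IncidentLog:
    """Tiny timestamped note-taker for an incident state doc; every
    hypothesis and test result, including the failures, gets recorded."""
    def __init__(self):
        self.entries = []

    def note(self, kind: str, text: str) -> None:
        # UTC timestamps avoid confusion when responders span time zones.
        stamp = datetime.now(timezone.utc).strftime("%H:%M:%SZ")
        self.entries.append(f"[{stamp}] {kind.upper()}: {text}")

log = IncidentLog()
log.note("hypothesis", "Cache node c-12 is serving stale keys")
log.note("result", "disproved: keys on c-12 match primary")
print("\n".join(log.entries))
```

Even a log this simple pays for itself twice: once when bringing a new responder up to speed mid-incident, and again when writing the post-mortem.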

Be proactive. When you're waiting on a test result, don't just twiddle your thumbs. Stop and think through the possible outcomes and what you'll do next. What if the test succeeds? What if it fails? Who will you bring into the war room? What will you do next? Consider doing it now.

Can AI help in a mechanical sense (develop a test)?

Incident communication skills include:

Incidents are all about knowledge: What we know, how to get that knowledge, and possibly asymmetric knowledge.

Note: This is not prescriptive, but a good framework.

Human Factors in the Age of AI Ops: Re-Engineering Trust between Humans and Machines

[Direct slides link | Direct video link]

Eddie Redick from CTC Ops was up next. When everything fails at once... cascading service degradation, overlapping automations, and an over-eager AI auto-remediator — you don't rise to the level of your architecture; you fall to the level of your systems thinking.

As AI and automation become deeply woven into the fabric of reliability engineering, teams are learning that convergence isn't just technical — it's cultural, cognitive, and procedural. What happens when human intuition collides with machine logic in the middle of a P1?

He has witnessed countless incidents where precious outage minutes were wasted chasing false positives. AI is garbage-in, garbage-out, or, as he always says, "Only as smart as you feed it."

Engineers can spend countless Scrum hours planning and building the best next-gen tool, only to have it fall flat. Non-structured data is tricky; if the tool isn't architected effectively, or is starved of good data, its ROI wastes away.

Did the AI auto-remediator make it worse? Were you chasing false positives? Are there cascading degradations? What if there are overlapping (and possibly contradictory) automations?

There's a gap between adoption velocity and building trust. 68% of organizations want AI but only 16% trust it. 62% say that trust has increased over the past year. Fewer than 1/5 fully trust AI to act autonomously.

The trust triangle is between logic, empathy, and authenticity, but:

For example: AI softens an image, then changes the background (and the person), then revamps the whole image... badly, and all without telling you.

65% of employees think they're building on solid data, but are they?

75% of leaders think their teams need more training.

Tech debt is a silent killer; old once-accurate docs may no longer be. What if archives get pulled into the current production environment?

Alert fatigue is another silent killer: 73% of alerts are false positives (up a lot in the past year); more AI and more automation mean more noise. 30% of those alerts are not even investigated.

Urgency needs to be appropriately scoped; urgency x impact should inform the engagement model, so you don't run a P3 alert as if it were really a P1.
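As a rough illustration of that urgency-times-impact idea, a severity function might look like the sketch below. The scores, thresholds, and labels here are my own invention, not the speaker's:

```python
# Hypothetical mapping from urgency x impact to an engagement level.
# Thresholds and labels are invented for illustration only.

def priority(urgency: int, impact: int) -> str:
    """urgency and impact: 1 (low) to 3 (high)."""
    score = urgency * impact
    if score >= 6:
        return "P1"  # page broadly, open a war room
    if score >= 3:
        return "P2"  # page the owning team
    return "P3"      # file a ticket for business hours
```

Under this toy model a loud but low-impact alert still lands at P3, which is exactly the point: the engagement model follows the score, not the noise.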

Toil is increasing (in 2025 for the first time in 5 years):

Commanding the chaos (CTC) is his framework for resilience when humans and machines converge:

Anyone can run a DRP TTE (disaster recovery plan tabletop exercise)... but could you run one for real? Even as a drill?

We need to move from reactive mode — chasing each alert individually, letting the auto-remediator run unchecked, assuming the AI knows best, skipping correlation and going straight to the fix, making panic-driven decisions — to systems-thinking mode — mapping the blast radius first, correlating signals before acting, having a human validate the AI's hypothesis, having a single incident commander own the process, and understanding dependencies before acting.

80/20 rule: 20% is technology (tools, platforms, AI models, automation, and so on) and 80% is redesigning the work. You can't just drop a new tool on a broken process; the divergence can involve culture, process, people, and governance. Shiny and new is fun, but be intentional about it.

Building a trust bridge involves:

How is trust earned?

Playbook to reengineer trust:

  1. Audit your signal to noise ratio in alerts.
  2. Define your AI autonomy levels. What phase/s are you in?
  3. Build human override protocols.
  4. Invest in systems thinking training.
  5. Measure trust, not just MTTR. (This and MTWTF are new useful metrics.)
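Step 1 of that playbook can start very simply. A hedged sketch: treat the actionable fraction of fired alerts as your signal-to-noise measure (the outcome labels and data here are hypothetical):

```python
from collections import Counter

# Sketch of an alert signal-to-noise audit: precision is the share of
# fired alerts that were actually actionable. Labels are hypothetical.

def alert_precision(outcomes):
    """outcomes: iterable of 'actionable', 'false_positive', or 'ignored'."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return counts["actionable"] / total if total else 0.0
```

Tracking that ratio per alert rule over time shows which rules are earning their pages and which are feeding the 73%-false-positive problem.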

Takeaways:

From Chaos to Confidence: How SREs Can Leverage 50 (and Counting) Failure Scenarios to Test AI Readiness

[Direct slides link | Direct video link]

After lunch, Rohan R. Arora and Bhavya from IBM Research talked about how they used sandboxed Kubernetes environments to create over 50 production-inspired failure scenarios that put AI assistants to the test across the full SRE toolkit. The results? Current AI approaches like ReAct resolve only 13.8% of scenarios — a reality check for anyone evaluating these tools. This session introduced their evaluation framework and showed how they use it to benchmark AI assistants against real failure patterns, chaos-test their own applications with production-inspired scenarios, and assess whether AI-assisted approaches fit their operational needs. They're building a community-driven repository where SREs contribute real incidents and advance the field together. They wanted us to learn what AI can (and can't) do today — and to help shape what it could do tomorrow.

AI makes a lot of marketing promises... which don't come true. They claim it's 80-90% faster, but it's at best 42% accurate with 30% more operational toil. LLM context is key; throwing data at LLMs isn't useful, especially when observability data is noisy or irrelevant. The bottleneck is context quality, not model capacity. Garbage in gives confident garbage out. If you can't replay a fault, you can't measure progress. And synthetic tests don't capture what really happens in production.

When the Cure Is Worse than the Disease: Metastability in Recovery

[Direct slides link | Direct video link]

Todd Porter from Meta and Aleksey Charapko from the University of New Hampshire were up next. Dealing with failures is an inevitable part of operating large distributed systems. Luckily, such systems are designed to handle failures and recover from their effects. In this talk, they explored the unfortunate cases in which recovery actions intended to address problems, unbeknownst to the operators, become the cause of even larger failures. This process occurs through natural recovery cascades in large systems, in which the recovery of one system or component triggers recovery in the next. They showed that, via recovery cascades, systems may amplify the recovery cost at each step as the process crosses from one system to another. Moreover, these amplifications can propagate backwards into the systems that have already recovered, creating positive feedback loops that reintroduce and reinforce the failure.

They explained their findings, failure causes and contributing factors, and mitigation strategies using a global-scale message bus that experienced such problems as an example.

They have a process and requirements. There was an incident. They attempted to recover but it didn't work as expected. They issued backfills to correct the corrupted data, but in this case things got worse and it eventually escalated to a widespread incident.

So why weren't the backfills actually working and letting them recover? That was a metastable failure — a performance failure sustained by a positive feedback loop, where the initial trigger started things off and led to an overload or exhaustion of some resource. "Retry" is a common one, often baked into the protocols; retries piled up, slowing things down further as the load climbed. Eventual recovery may require literally turning it off and on again.

Given all that, we build fault-tolerant systems... but even their recovery mechanisms may be metastable. A retry is good for intermittent failures, but once it's overloaded the retries only make things worse and amplify the situation, leading to cascading failures.
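The retry feedback loop can be shown with a toy simulation (the model and numbers below are my own illustration, not the speakers'). Every failed request is retried on the next tick while organic traffic keeps arriving, so even a brief capacity dip leaves a backlog that outlives the trigger:

```python
# Toy retry-amplification model. All constants are invented for illustration.

CAPACITY_NORMAL = 100    # requests served per tick when healthy
CAPACITY_DEGRADED = 50   # capacity during the brief trigger
ORGANIC = 90             # new requests arriving every tick

def simulate(ticks, outage_start=2, outage_end=4):
    failed_prev = 0
    failures = []
    for t in range(1, ticks + 1):
        cap = CAPACITY_DEGRADED if outage_start <= t <= outage_end else CAPACITY_NORMAL
        load = ORGANIC + failed_prev      # retries pile on top of new traffic
        failed_prev = max(0, load - cap)  # everything over capacity fails
        failures.append(failed_prev)
    return failures
```

With these numbers a three-tick outage keeps requests failing for roughly a dozen ticks after capacity is restored; push the organic load above capacity and the loop never drains on its own — the metastable state.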

Aleksey notes that the dependee provides work to a dependent system, and that dependent relies on work or data supplied by the dependee. When the dependee fails or slows down, it collects a deficit (work expected by the dependent but not delivered). Once the dependee recovers the dependent has to do its own recovery. In their case, the data has to be recovered all at once (granularity problem). The older data is prioritized over newer data for recovery purposes (thermal mismatch)... and there was more than expected (expectation mismatch).

They had to add capacity, manually pace and prioritize recovery traffic, and shed recovery load (notifying customers about it).
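Paced, prioritized recovery can be sketched as draining the backfill backlog under a fixed per-tick budget, newest data first to counter the thermal mismatch. The function name, policy, and budget below are my assumptions, not their implementation:

```python
import heapq

# Hedged sketch: drain a backfill backlog at a fixed budget per tick,
# recovering the newest (hottest) data first. Details are illustrative.

def paced_drain(backlog, budget_per_tick):
    """backlog: list of (timestamp, item). Returns per-tick batches."""
    heap = [(-ts, i, item) for i, (ts, item) in enumerate(backlog)]
    heapq.heapify(heap)  # max-heap on timestamp via negation
    batches = []
    while heap:
        batch = [heapq.heappop(heap)[2]
                 for _ in range(min(budget_per_tick, len(heap)))]
        batches.append(batch)
    return batches
```

The fixed budget is what keeps the recovery itself from becoming the next overload, which is the whole point of pacing.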

What helps?

Longer term, look into the operational and design considerations to prevent the problem from happening in the first place.

Wellbeing and Burnout

[No slides | No video]

Discussion track talks were ephemeral, with no slides or video recordings, and were run under the Chatham House Rule, which dictates that participants in a meeting are free to use information gathered but cannot reveal the identity or affiliation of the speakers or any other participants.

Is your on-call rotation sustainable, or is it just "fine for now"? What happens after the incident is resolved and the adrenaline wears off? How do we avoid praising heroics but ignoring the toll it takes on the humans behind the screens? After the afternoon break, facilitators Beth Adele Long from Adaptive Capacity Labs and Sarah Butt from Salesforce led an open and honest discussion about wellness, burnout, and caring for the humans in our systems.

We had four discussion topics:

Each table discussed one topic and then reported out to the group; then we shuffled and had another discussion with different people, but without a second report-out.

 


Thursday, March 26

The Gashlycrumb Tinies of AI Networking You Must Know (or Languish!)

[Direct slides link | Direct video link]

As the worlds of AI workloads and SRE practices converge, more and more teams are expected to ramp up on a whole new vocabulary. Terms like "NCCL", "PXN", "MoE" and "queue pairs" are thrown around, yet these concepts can remain elusive, buried in dense academic papers or scattered vendor documentation.

Inspired by Edward Gorey's whimsical illustrated alphabets, Lerna Ekmekcioglu's talk provided a structured, approachable tour through essential AI networking concepts. She demystified the specialized terminology and building blocks of AI networking, explaining the concepts in clear language with practical context and a touch of darkly playful illustration to help them stick.

AI in SRE

[No slides | No video]

Like the other discussion-track talks, this session was ephemeral (no slides or video) and run under the Chatham House Rule.

After the morning break I went to the discussion session on AI in SRE facilitated by Courtney Nash of The VOID and Robbie Ostrow of OpenAI. From writing scripts to analyzing incidents, AI promises to change how SREs work. But how much of that is practical, and how much is vaporware? Attendees engaged with other conference participants to talk through what is actually working (real tools and use cases), what is failing (where it creates complexity and noise), and the messy reality of interacting with non-deterministic systems while supporting production environments.

Infrastructure Management

[No slides | No video]

Like the other discussion-track talks, this session was ephemeral (no slides or video) and run under the Chatham House Rule.

After lunch I stayed in the discussion track for the Infrastructure Management session.

Whether you're wrestling with the eternal "buy vs. build" debate, fighting configuration drift, or figuring out how to scale without setting money on fire, there was a seat for you in this loose, open gathering with facilitators Clint Byrum from HashiCorp and Chris Jones from Google. Conversation ranged from wrangling Kubernetes to the depths of networking and anything in between — an engaging community conversation across multiple tables, with space to cover the full spectrum of infrastructure challenges.

We had two discussions. My first group discussed build versus buy, with the consensus being "It depends." What are your requirements to purchase or lease, and to support it long-term? Does something exist that meets them? If you build it, how will you support it as times change?

My second group discussed community and how to create, develop, maintain, and grow one. The general consensus is that communities, especially online communities, move as needed (for example, from Usenet and IRC to email to social media like Facebook or X FKA Twitter or even to Discord and Slack).

Plenary: Reliability Equilibrium: The Hidden Playbook behind SRE Influence

[Direct slides link | Direct video link]

Reliability and velocity often feel like opposing forces — but what if we treat them as strategic games? Microsoft Azure's Daria Barteneva's talk reframed sociotechnical trade-offs through a game theory lens, using Nash equilibria, Stag Hunt, Public Goods, and Shapley value to model real-world SRE dynamics.
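The Stag Hunt framing can be made concrete with a toy payoff matrix (the payoff numbers below are invented): "stag" is investing together in reliability, "hare" is each team shipping fast alone.

```python
# Toy Stag Hunt for reliability work; payoffs are invented for illustration.
# Mutual cooperation beats mutual defection, but cooperating alone is worst:
# that is why "everyone defects" is also a stable (fragile) equilibrium.

PAYOFFS = {  # (my_move, their_move) -> my payoff
    ("stag", "stag"): 4,
    ("stag", "hare"): 0,
    ("hare", "stag"): 3,
    ("hare", "hare"): 2,
}

def best_response(their_move):
    """Return the move that maximizes my payoff given the other team's move."""
    return max(("stag", "hare"), key=lambda m: PAYOFFS[(m, their_move)])
```

Both all-stag and all-hare are Nash equilibria here — each side's best response matches the other's move — which is why teams can get stuck in the low-payoff one.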

We explored why, without shared decision models, teams default to fragile equilibria like "freeze all changes," and how mechanism design — error budgets, canary deployments, and progressive rollouts — can shift incentives toward safer, higher-utility outcomes.
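The error-budget mechanism in that list reduces to a simple gate. A minimal sketch, with made-up SLO and request counts:

```python
# Minimal error-budget gate: the SLO implies a budget of allowed failures,
# and releases proceed only while budget remains. Numbers are illustrative.

def budget_remaining(slo, total_requests, failed_requests):
    allowed = total_requests * (1 - slo)
    return allowed - failed_requests

def may_deploy(slo, total_requests, failed_requests):
    return budget_remaining(slo, total_requests, failed_requests) > 0
```

With a 99.9% SLO over a million requests the budget is about 1,000 failures; at 500 failures deploys proceed, at 1,500 the team slows down — a targeted incentive instead of a blanket change freeze.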

Grounded in SRE practice and backed by DevOps research, this session was intended to equip us to diagnose bad equilibria, design guardrails, and influence system-level behavior — not just symptoms — and to learn how to apply cooperative and non-cooperative game theory to reliability engineering and craft strategies that scale across teams, products, and platforms.

Plenary: The Power of Stories

[Direct slides link | Direct video link]

We humans love stories, so much so that we both tell them and listen to them for fun. As SREs, our community has a particular fondness for incident stories. In his talk, Lorin Hochstein, Staff Software Engineer for Reliability at Airbnb, discussed how these kinds of incident stories make for a more effective tool for learning from incidents than bullet points or metric trends. We saw how stories provide us with glimpses into the complexity of our system that we'd otherwise never see, and enable us to learn from the experiences of others. We explored what makes for an effective story, as well as the dangers of stories that oversimplify the nature of complex system failure. And we looked at how to foster an internal incident storytelling culture within an organization.

Lorin is a regimented order muppet type, not a fluid chaos muppet type.

Asking "How could this have happened?" is a common human response to (e.g.) an incident. Stories are how humans make sense of events like those. They're often memorable, more than a short bullet-point list.

The biggest problem in software engineering is how to get the right information into the heads of the people who need it. (This ties into Daria's note about game theory too.)

Stories are useful in the context of incidents. "Only a fool learns from his own mistakes. The wise man learns from the mistakes of others." (Attributed to Otto von Bismarck)

Direct experience is the best way to learn. The next best way is to learn through someone else and their experience: Watch someone else's expertise in action (e.g., shoulder-surf the troubleshooting back in the Before Times). Can't watch in real time? Tell a story.

A good story for social science has two criteria: It has to be anomalous or unexpected ("something's wrong") and it has to be immutable (important details are preserved). As an example, look at the Therac-25 story from the 1980s. It involved race conditions but also a terrible UI that let the operator make mistakes too easily.

There are different kinds or styles of incident stories you can tell. His favorite is the horror story ("And then after we failed over... the problem followed us into the other region!"). Another common one is the mystery ("But nothing has changed!"). His least favorite is the morality tale ("And because the failing test was ignored, the bad code made it to production!"), in part because they can be dangerously oversimplified. The solution or antidote to bad initial stories is second stories (followups).

Stories are never the whole truth; the same story can be told from various perspectives. For example, Richard Feynman believed that engineers and managers had different views on the risks involved in the Challenger explosion. Edward Tufte thought it was bad data visualization. Diane Vaughan noted that deviance was normalized.

These are three different views of the same event.

Takeaway: When you write up your incident retrospectives or postmortems, tell them as a story. Include a "Narrative description" section that describes how the events of the incident unfolded over time (chronologically). You can also chunk events into "episodes" rather than a single big timeline.

If you can't use the incident review for it, you can have an incident storytelling session independent of it. They call it "Once Upon an Incident" and it's like a campfire storytelling session.

Like so many other things, storytelling is a skill. Most of us can receive stories, but not everyone can tell them. You'll get better with practice. Fortunately, if you're involved in incidents, you'll have lots of source material.

Go out and tell a good story.

Closing Remarks

[No slides | No video]

Conference chairs Patrick Cable and Laura Maguire closed the conference by once again thanking everyone involved, from coordinators to speakers to attendees.

 


Friday, March 27

 



Back to my conference reports page
Back to my professional organizations page
Back to my work page
Back to my home page

Last update Apr14/26 by Josh Simon (<jss@clock.org>).