The following document is my general trip report from the 33rd Systems Administration Conference (LISA 2019) in Portland, OR, from October 28–30, 2019. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.
Today's a 27-hour travel day! I was up early — the alarm was set for 4am and I was up by 3am anyhow. I showered, finished packing, made the bed, unloaded the dishwasher, set the thermostat lower, and hit the road around 3:45am. There was virtually no traffic on the road to Metro airport (though the heavy rain squalls were not helpful on the darkened I-94), I found a parking space in my usual zone (3E4/3E5 of the Big Blue Deck) quickly, got to the terminal shuttle as one was waiting, and got to the terminal without any trouble. Curbside check-in wasn't open yet so I went to the Wheelchair Assistance line where there was actually an attendant waiting. We got my bag dropped off, got through security — though this time I could take off the aircast (while seated) to send it through the x-ray machine instead of having it swabbed down — and got to my gate way early. I read the magazine backlog and then listened to some music on the iPod.
Boarding began on time, and I took advantage of the preboarding. Got myself settled, swapped out the aircast (stowed under the seat in front of me) for the shoe with the carbon-fiber plate insert, and had the first of my eventually-four kalimotxos. We departed on time, had a 15-minute taxi-to-takeoff, and were told we had 4h20 in the air. Other than turbulence on takeoff (the rainstorms had mostly moved through but the choppiness remained), and another 4 bouts of turbulence in flight, it was a pretty uneventful flight.
In Portland my wheelchair assistant met me planeside — and was one of nine meeting the flight. We got to baggage claim and she pulled my bag off the conveyor for me. I stuffed the CPAP into the formerly-checked bag and headed out to catch my 11am shuttle... which, thanks to them having only one driver working (the others all no-showed or canceled, I gather), was more of an 11:45am shuttle. There was only one other passenger, so the three of us had a discussion about pets, kids, and post offices (where the other passenger had worked for some 30 years) until we dropped her off at her hotel, then the driver and I talked business and crazy drivers until she dropped me at my hotel. My room wasn't ready so I bell-checked the bag, wandered a bit, and wound up eating a burger at the hotel restaurant. Jon and Lisa joined me, and we briefly chatted with Doug as he passed through. After lunch we went down to harass the folks doing Build and basically hallway-tracked it until a bit past 2:30pm when I got the call that my room was ready.
I got my keys, got to the room, set the phone on the charger, unpacked, and generally got things ready for crashing at my eventual bedtime. Went back down to the Build area to schmooze more until badge pickup opened at 5pm, was third in line, and got my badge and bag-o'-stuff pretty quickly. A quick confab about dinner and then off to drop the bag in the room before heading out with Brian, Dan, and Matt to Mika Sushi a bit before 5:30pm.
The food, once it got to us, was very good. The fact that the chef — who was busting his ass filling orders — took over an hour to get us our food, was not. The staff — and I couldn't tell if it was front- or back-of-the-house — apparently prioritized the to-go/delivery orders, as well as the orders of at least one if not two couples and a mom-and-kid joining an already-finished four-top table, over ours. We got two free appetizers out of it with their apologies (pork spring rolls at the 0:45 point and shrimp tempura at the 1:00 point). We managed to finish and pay our bills by 7:15pm, when we headed back to the hotel. I swung by the beer and ice cream social pseudo-BOF for an ice cream sandwich of salted caramel ice cream between two chocolate cookies, some chatting, and then headed off to bed by 7:45pm since that was coming up on 20 hours awake.
Of course, between the room temperature (70F was too cold so I bumped it to 72), the new-bed syndrome, and the unexpected train whistles outside, I couldn't settle in enough to sleep until 10:30pm or so.
Opening Remarks
The conference proper began with the usual opening remarks: Thanks and notes (the build team, sponsors, program committee, liaisons, staff, and board). They stressed the code of conduct this year; we want to be a safe environment for everyone. We were reminded to go to the Birds of a Feather (BOF) sessions in the evenings, to go to the Lightning Talks session Tuesday night after the reception, and to go to LISA 2020 in Boston next year (December 7–9) which will be co-located with SREcon Americas East.
Keynote: The Container Operator's Manual
The first (and very dynamic) keynote was Alice Goldfuss' "The Container Operator's Manual." Like car ads, which report numbers based on lab results but not the real world, container talks are all based on lab testing. Scalability doesn't happen as one might expect. She wants fewer Docker 101s and more Docker WTFs.
What's a container, really? Most talks say that it's smaller boxes on a daemon on an OS on a host... which is fine as a mental model for humans but isn't what's actually happening. As far as the computer is concerned, a container is just your Ruby or Python process. Containers are really just processes born from tarballs, anchored to namespaces, and controlled by cgroups. So the Dockerfile (or whatever your builder uses) is a recipe that creates a tarball, and cgroups define resource limits. But how does this work at scale?
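A quick way to see the "just processes" point on any Linux box, sketched here in Python and assuming only a Linux host with /proc mounted: every process already has namespace handles and a cgroup membership, and a container is simply a process whose values differ from the host defaults.

```python
#!/usr/bin/env python3
# Sketch: on Linux, "container-ness" is just attributes of an ordinary process.
# Every process has namespace handles and a cgroup membership; a container is a
# process whose namespaces and cgroups differ from the host defaults.
import os

pid = os.getpid()  # inspect ourselves; point this at any container PID instead

print("Namespaces for pid", pid)
for ns in sorted(os.listdir(f"/proc/{pid}/ns")):
    # each entry (mnt, pid, net, uts, ipc, user, cgroup, ...) is a namespace handle
    print(" ", ns, "->", os.readlink(f"/proc/{pid}/ns/{ns}"))

print("cgroup membership:")
with open(f"/proc/{pid}/cgroup") as f:
    print(f.read())
```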
She covered four lessons:
- Containers have strengths, mainly in stateless applications — an application that takes data in, changes it, and sends it out. No state is saved, no copy is local; it's stateless. This scales well because the service instances are ephemeral and interchangeable. They're portable and easy to upgrade and iterate. It also makes disaster recovery easier. They're good for testing environments.
- Containers have weaknesses. They're not good for things requiring state, including databases (unless you're at Google scale). Use a cloud provider, tooling, and multiple regions instead. Some try to containerize the database but have one storage copy (for example, Ceph)... which eventually gets network bound. Another way is to keep the data local via a mounted volume... which made testing and upgrades easier but required lots of custom tooling... and eventually gets network bound. If you have to containerize your database, keep it small, or use them temporarily while the hardware is on order. If you still want to do this, there's Vitess (MySQL on Kubernetes).
- Containers need friends. It's never "just" containers; they're ecosystems. Which is best for the problem we're trying to solve? How will you build the container tarballs (Docker)? How will you schedule resources (orchestration)? How will you manage the clusters (management, maintenance, upgrades, traffic flow, spin up/down, and so on)? How will you handle routing, access control, and service discovery (on different hosts, in different data centers, who should/n't have access, and so on)?
- Deployment — Same way or different? Who gets support first if both break?
- Monitoring — What new metrics do we need?
- Provisioning — How are the hosts provisioned? Are base images changing? Which takes precedence (old or new)? What about procurement?
- Debugging — How will you troubleshoot it? What about the developers running their apps? How can they know things are or aren't healthy?
This all requires a rolling upgrade to get up and running and stable... with all of one application. You'll probably wind up with a hybrid environment, where stateful is on bare metal and stateless is containerized.
- Containers need headcount. "Just give it to Ops" is not a valid response. You're basically building a new DC in a city nobody's heard of on hardware you've never used. You need a new team... who knows your existing environment, deployments, technologies, and so on. They have to be able to write tooling to interact with your existing infrastructure. They have to understand monitoring, metrics, and observability. You need a kernel engineer because containers and the underlying host can (and often do) run with scissors. They need to understand networking and be able to figure out what's happening in the container space. They need to understand security, especially as to what containers provide out of the box and what they don't. They need to have good relationships with other teams, to have the social currency to burn to build out adoption, and to handle the crash-and-burn as things get fixed. And they need a project manager to keep everyone else on track. You need multiple people to do all this, probably 6–8 people and no fewer than 4, because they'll eventually be oncall and you can't have a rotation with fewer than that. They need empowerment to succeed.
So with all that, should we use containers in production? "Maybe." Do we have stateless services; a large, heterogeneous platform; and time, money, people, and organizational support? Sure. If you have a monolith and few services, a small team with no support, or databases with nicknames? No, it's probably not right for you. One litmus test is "Do you want containers or a blog post?" If a reason is "It'll be rad" then absolutely don't don't don't do it.
Containers are just another tool in your toolset.
Keynote: In Search of Security Shangri-la
The second keynote, "In Search of Security Shangri-la," was from Rich Smith of Duo Security, and when he says "security" he means "and privacy" but doesn't want to say it the whole time. Alice Goldfuss summarized his talk on Twitter: "Security professionals need to grow up past the hacker identity and be adults that are handling life-impacting data."
His hypothesis is that the security industry generates fear, uncertainty, and doubt (FUD) in order to sell hope. Hope is not a strategy but it makes security a lot of money. Let's extend that to us as sysadmins: We let them get away with it by not holding them to account.
How can we make progress? Be better partners with the end users. Blaming the end users for breaches isn't particularly helpful. Use the lessons from the past decade in DevOps to help Security: Put people first. Humans aren't the problem, unusable or problematic security technology is the problem.
Information Security (IS) is the overlap of people and technology. As people become more dependent on technology for daily life, IS is more important and necessary. IS has been focused on the technology half, which won't lead to a full solution.
"Usable security" is a term nowadays. That implies that a lot of security is unusable, like a gun that shoots the holder.
We need to merge accountability and responsibility better. We need to get DevOps and Sec working closer together. But Security has an image problem: The black hat, hide things, being secret squirrels... is not needed in general company security.
He shared his first hack, breaking into a password-protected gaming system he shared with his brother. And then changing the password. The reward cycle here led to the immature mindset that infects the security industry.
He shared some antipatterns or truisms:
- Security is not an absolute. It'll always be incomplete in some aspect as it depends what you're trying to secure. Think about what's "secure enough."
- Security is not an on/off binary. Secure against whom, with what budget over what time period? Context is everything.
- Security is a vector to be traveled not a point to be reached. There's a direction and magnitude but not a reachable destination.
- Security is not static. Needs change over time. You need to reassess based on the context (including time).
- Security is not zero risk. There's always residual risk, and that's fine: Mindful acceptance of risk is perfectly appropriate.
If you're going to be measuring, measure the right thing. "Click through a phish" isn't necessarily what you want to measure, since click-through might be to see if it's a phish before reporting. Measure the report rate.
We need to be roadies not rock stars. So what's a well-functioning, mature, progressive security team like?
- Enabling. Measure not what you block, but what you enable for the rest of the business.
- Transparent. Be open about why you're doing something, not just what you're doing. Spread understanding through context and work towards a solution ("Yes and").
- Blameless. Failures are going to happen. You can only understand true causes if there's no blame, and you can't learn and improve without that understanding.
So how do you build a positive security culture? Most importantly, don't hire assholes. It sounds obvious and is harder than it seems. We, as a community of technologists, need to understand that it's more than just hard techie skills. Horrible people to be around are toxic and undermine everything.
In summary:
- The security industry is failing to serve those it's trying to protect.
- Security is, by definition, people centric. Move away from a security focus and make people the center.
- If the solution isn't usable, it's not a solution.
- Security is a shared responsibility and we need to make the marriage, with users and the business, work.
- Hold your security partners to account.
- Take the DevOps lessons and apply them to security.
- "Enabling, transparent, and blameless."
Fuzzy Lines: Aligning Teams to Monitor Your Application Ecosystem
After the morning break I went to Kim Schlesinger and Sarah Zelechoski's talk "Fuzzy Lines: Aligning Teams to Monitor Your Application Ecosystem." DevOps is the dream but you have to collaborate across teams and sometimes companies in real life. Strong interpersonal relationships are required.
People: Dev and Ops need to be strong participants in a partnership, with shared goals and direction, and teams that are thoughtful and accountable. Their three recommendations were:
- Develop a group narrative: Who are we?
- Start with a history lesson.
- Create a relatable vision.
- Everyone should be able to tell the story.
- Commit to a set of shared values. How do we work together? (And include all the relevant teams!)
- Ask yourselves challenging questions.
- Focus on human interactions.
- Keep it short and make it memorable (acronyms help).
- Require commitment and active participation.
- Self-regulation. We're all in this together.
- Create simple prompts that allow individuals to enforce shared values. (They use their bee emoji for yay and "We don't do that here" for a re-evaluation.)
- Set the expectation that feedback is welcome and necessary (introspection not conflict).
Process: Both teams understand and maintain their individual responsibilities, while also helping in the gray area. Their three recommendations were:
- Define shared responsibilities and expectations:
- Define areas of responsibility, not tasks. Example: Don't limit "deploy" to just one team.
- Where lines are fuzzy, help each other. If one team's paged for another's problem, can they fix it anyhow or do they need to hand it off?
- Shared Slack (or other chat function) channel between teams:
- Targeted discourse and conversation. (Limit access to a single shared channel with that customer, not the whole instance.)
- Being open and public. Helps with postmortems since everything is timestamped.
- Weekly syncs (facetime, even if not in person):
- Increase transparency and understanding.
- Sync priorities across teams.
- Share the impact and value of work. Appreciate each other.
Tools: Monitor your infrastructure system and workloads in both ops and dev, increase confidence in monitoring, and decrease the time to resolution. They had two recommendations:
- Shared Monitoring Platform. Ops monitors send high-priority stuff to PagerDuty and everything else to Slack. Ops doesn't track dev's monitors, but since they share the platform everyone has access to everything. They use Datadog; others exist, including New Relic (which LSA TS WADS uses). (Story: The tale of the phantom scaling.)
- Monitors as code. The slides have specific monitor lists. Monitors as code are repeatable, familiar, and transparent; they share work with others, allow collaboration via PRs and code reviews, and are more accessible for people using screen readers.
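My notes don't include their actual monitor definitions; as a rough illustration of the monitors-as-code idea (assuming the legacy datadog Python client, with a made-up metric name and placeholder keys), a monitor kept in a repo and applied from CI might look something like this:

```python
# Sketch only: one monitor definition kept in version control and applied by a
# CI job.  Assumes the legacy `datadog` Python client (pip install datadog);
# the metric, thresholds, and keys are placeholders.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

MONITOR = {
    "type": "metric alert",
    "name": "Web tier latency",
    "query": "avg(last_5m):avg:myapp.request.latency{env:prod} > 0.75",
    "message": "Request latency above 750ms. @slack-team-ops",
    "options": {"thresholds": {"critical": 0.75, "warning": 0.5}},
}

if __name__ == "__main__":
    # Creating on every run is oversimplified; real tooling would diff against
    # existing monitors first and update rather than duplicate.
    result = api.Monitor.create(**MONITOR)
    print("created monitor", result.get("id"))
```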
Build relationships between people, use processes to increase communication, and share tools.
How Math, Science, and Star Trek Help Us Understand the Value of Team Diversity
The second talk in this session was "How Math, Science, and Star Trek Help Us Understand the Value of Team Diversity" by Fredric Mitchell.
Using the past to understand the future is bias, which isn't necessarily positive or negative. He discussed heuristics, bias, and diversity. We need to define diversity. There are generally three types: Demographic (gender, race, orientation, and so on), experiential (affinities, hobbies, abilities, skills, and so on), and cognitive (how we approach problems and think about things). Diversity in general is all three of these.
He used Star Trek: Voyager as an allegory: They merged two different teams (Star Fleet and Maquis).
- Kate Mulgrew as Kathryn Janeway was the first female captain to carry a series, and very out-of-the-box, impulsive, and dogmatic. She was driven by curiosity to keep her crew alive. "Keep your shirt tucked in, go down with the ship, and never abandon a member of your crew."
In a team-building context, am I open to unorthodox thinking?
So how can we measure emotional intelligence of groups? Teams with higher EQ tend to perform better. Measure by trust, group identity, and group efficacy.
- Intelligence Quotient (IQ) — ability to learn, not what you know; doesn't change over time
- Emotional Quotient (EQ) — ability to recognize others' emotions; changes over time.
- Tim Russ as Tuvok was one of the first black actors in a role where race wasn't relevant. "On the contrary, the demands on a Vulcan's character are extraordinary. You cannot mistake my composure for ease."
Women are key to smart teams. There's a high correlation between the number of women in the group and high collective intelligence.
Why do diverse teams outperform? Diverse teams focus on facts and process them more carefully. There's no such thing as common sense.
- Roxann Dawson as B'Elanna Torres. The actress was half-Hispanic, the character half-Klingon. Determined to prove herself all the time.
So how are we evaluating talent? He referenced Scott Page's The Difference (2007). "Hire the most qualified" is all well and good, but how do you define "most qualified"? What's your perspective and heuristic? Therefore you must change your perspective to arrive at a new heuristic, which is what leads to innovation.
- Jeri Ryan as Seven of Nine, a formerly-assimilated Borg. Survival and winning are the only things that matter and everything else is irrelevant.
So why should we care about team diversity? Vanity ("great company on my resume"). Virtue. Fear. Profit.
Looking at fear and profit: The diversity-trumps-ability theorem. The conditions are that the problem is hard, each solver has a local optimum for the problem, an improvement can be made to any non-optimal solution, and there's a large pool of solvers. (The math is in the slides.) Basically the model says to think of people as collections of tools, and ability is a reflection of the applicability of those tools to a given set of problems.
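The formal statement was in the slides; purely as a feel for the setup, here's a toy sketch of my own (not Page's actual model, and the winner on any given random seed isn't guaranteed) where each solver is literally a small bag of tools, in this case step sizes on a rugged landscape:

```python
import random

# Toy illustration in the spirit of the theorem (NOT Scott Page's exact model):
# solvers are "collections of tools" (step sizes they can try on a rugged
# landscape), and a team relays its best-so-far solution around until stuck.
random.seed(7)
N = 2000
landscape = [random.random() for _ in range(N)]   # value at each position on a ring

def climb(start, tools):
    """Greedy local search: apply any tool (step size) that improves the value."""
    pos, improved = start, True
    while improved:
        improved = False
        for step in tools:
            if landscape[(pos + step) % N] > landscape[pos]:
                pos, improved = (pos + step) % N, True
    return pos

def team_score(team, starts):
    """Average value reached when the team relays each starting point until stuck."""
    total = 0.0
    for start in starts:
        pos, improved = start, True
        while improved:
            improved = False
            for tools in team:
                new = climb(pos, tools)
                if landscape[new] > landscape[pos]:
                    pos, improved = new, True
        total += landscape[pos]
    return total / len(starts)

pool = [tuple(random.sample(range(1, 20), 3)) for _ in range(100)]   # 100 solvers
starts = [random.randrange(N) for _ in range(50)]

# Individual "ability" = average value a solver reaches working alone.
ability = lambda tools: sum(landscape[climb(s, tools)] for s in starts) / len(starts)
ranked = sorted(pool, key=ability, reverse=True)

print("team of the 10 most able: ", round(team_score(ranked[:10], starts), 4))
print("team of 10 random solvers:", round(team_score(random.sample(pool, 10), starts), 4))
```

The intuition the theorem formalizes is that the individually "most able" solvers tend to carry overlapping toolkits, so teaming them up adds little, while a diverse team collectively covers more of the tool space.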
The pipeline problem of talent is actually an incentive problem of industry. There are people out there but business is mis-incented for diversity.
In summary:
- Diversity is a fact.
- Inclusion is a practice.
- Equity is a goal.
How can we grow?
- Connect various recruiting metrics to compensation.
- Document and send a monthly top-3 challenges and tech stack used to an NSBE and SWE chapter president as part of "Free code Fridays."
- Follow at least 3 devs of color, women, or nonbinary folks and meet for lunch.
- Get out of "prove me wrong" mentality for "what can I be missing."
- Focus on being successful not right; pay attention to the goal.
- Subscribe to the eqfordevs.com newsletter.
Lunch
Lunch was on the expo floor in the exhibit hall. Today's selections were ham, turkey, curried vegetables, or roasted vegetable sandwiches and wraps, house-made chips, green salad, a roasted vegetable-and-brown rice salad, kale salad, and cookies.
Earthquakes, Forest Fires, and Your Next Production Incident
After the lunch break I went to Alex Hidalgo's "Earthquakes, Forest Fires, and Your Next Production Incident." He started with an anecdote about the Laguna Fire in September 1970, then the third-largest recorded fire in California, which burned over 175,000 acres of woodland and 382 residences and killed 16 people. That fire season burned over 576,000 acres and 722 buildings, with over $1.5B in assessed damages. It was really bad.
Changes emit more changes.
Research led to four recommendations or conclusions:
- Formalized communications
- Formalized hierarchies
- Formalized response
- No more freelancing
Incident Command System (ICS) was used in the 1978 fire season and was considered a success. It was then adopted for other forest fires, HAZMAT situations, all natural disasters, and urban search and rescue. In Nov 2002 DHS made this process mandatory for all responding organizations in the US.
So why should I care? The tech industry is hurtling towards adopting known processes instead of needing to invent or innovate its own. None of what we're doing is really new; engineers have been focused on reliability for as long as humans have been building things, statisticians have been analyzing data for centuries, and the ICS has now been around for decades.
What's most relevant for us in technology? It addresses these problems:
- Lack of insight
- Poor communications
- No established hierarchy
- Too much freelancing
The ICS itself is very complicated, since hurricanes, forest fires, and other disasters need far more people than our incidents do. We can use a slimmer or less complex version of ICS; we should pay attention to the philosophies more than the specifics.
No matter how you choose to implement it, you must have an incident commander (IC); they're in charge, hold all the high-level state about the incident, are the only role that must always exist, and can delegate other roles to other people if needed. Unless delegated, assume the IC holds all roles. Other roles include:
- Operations lead (OL) is the one responsible for making changes to the system(s). All changes must be documented (what and when — including pasting individual commands with date stamps). This role must be delegated by the IC. You can offer to help but you can't be OL until the IC says so.
- You need a command post, discoverable, for all stakeholders to see the current state. It can be a Slack channel or a conference room with a bridge phone line; whatever works for your environment. However, text is better than voice. Making a Slack channel read-only might be helpful.
- You might need a communications lead (CL), both internally and externally. They should be the only one updating things like your status page. They should probably be keeping up the Incident State Document (ISD). Like the ops lead, this role must be delegated by the IC. You can offer to help but you can't be CL until the IC says so.
- The ISD consolidates the state of the world, and documents which roles have been defined and who currently has them. (Templates and tools are good.) The ISD can be built into the tooling (some vendors have this).
- A planning lead (PL) is in charge of supporting the other leads as needed, making sure they have the resources they need. The PL can consider the future (proactive following of tickets for followups, finding people to hand off to if it's long-lasting, ordering pizza for everyone, etc.). Like the ops lead, this role must be delegated by the IC. You can offer to help but you can't be PL until the IC says so.
ICS is by design very flexible. Think about the concepts and philosophy more than the specifics.
The OL might delegate to subleads for database operations, networking operations, or system operations — each of whom has expertise in a specific knowledge domain.
You might consider scheduling handoff times in advance. It's psychologically affirming to know there's an end state. Handoff doesn't have to be complex. A good ISD helps a lot.
Someone getting paged can be the start of an incident. If you're asking if it's an incident it probably is. It's easier to declare an incident for something that turns out to be small than it is to apply the framework to an incident after a significant amount of time.
Build your culture to understand that things break. Declaring an incident might make senior or executive leadership aware, and that has to be okay. Don't try to hide an incident.
It works... if you test, anticipate how things can fail, consider all the possibilities, and make sure you're ready and things are documented. Have trainings and workshops, have tooling and templates, automate what you can, and conduct meaningful incident retrospectives or postmortems after. And test! Blackhole a data center, use chaos engineering, whatever it takes. People forget how to fix things if they haven't had to fix things in a while.
Let Your Software Supply Chain Ride with Kubernetes CI/CD
The second talk in this session was "Let Your Software Supply Chain Ride with Kubernetes CI/CD" from Ricardo Aravena. Since we're not actually doing a lot of CI/CD work in our immediate environment I didn't take a lot of notes. (He mentioned a lot of tools. Check his "Takeaways" slides.)
Storytelling for Engineers
Following the afternoon break I went to "Storytelling for Engineers" by Brad Shively from Uber. "If you don't _______ then ________" is a common start of emails. If you don't communicate with other humans then this talk isn't for you. But much of it is about email: Are you getting enough? Is what you get actionable and usable? "Don't make me WTF" is Brad's rule of thumb.
Thesis: Communication is an essential engineering skill, and we need to communicate better with non-experts — in email, presentations, and meetings. We do okay within our teams since we have shared context, but we hit snags going outside those teams. By and large engineering communications is broken and storytelling is the fix. Everything is a story and everyone is a storyteller (mostly). Most of us are bad at it, since concise communications is hard. Dumping facts and data into email is easy.
Why tell stories? "Give them the facts and they'll figure it out" doesn't work without the context you have and they don't. Storytelling causes brain chemistry changes. They result in better understanding of key points, increase voluntary compliance, and improve memory. Want to learn more? Go watch the TED talks. But for engineers:
- Identify your audience first. Everything else is built on this. How would you update your CEO versus your tech lead about the same project? The former is more business speak in the long-term; the latter more short-term technical activity.
We all assume and believe that there's a big chunk of what everyone knows, and my customer knows a bit more, and I know a lot more than that. In reality, though, everyone knows a little bit, my audience knows a lot of things... and more than I know. There's some overlap between the little I know and the lots they know, and we need to build the bridge. As a storyteller I need to map my brain onto my audience's brain. There are few global variables; everything is a local variable for internal state.
- Once you know the audience, figure out what story you're telling. Who's your character? In this context, it could be a system, software, or release. It's what's changing or different. It could be a person or team. Then, what's your character arc? How does the specific change affect my audience?
- Then tell the story with context. That context bridges the gap between what I know and what they know.
Example:
Subject: Remote Testing
Body: Remote testing is now live.
Is that good or bad? It depends: To your team at the end of a long-term project it might be good ("The Eagle has landed"), but to the entire organization it's probably terrible. So does this mean that tests are better or worse? Do I want the recipients to take some action? "X and Y therefore Z."
Just data isn't useful. How about this?
Subject: Remote testing results
Body: Remote testing results. Attached are the results.
Data is useful with context, analysis, and insights.
Wisdom hierarchy: Wisdom → Knowledge → Information → Data (pyramid). Give me an analysis — what changed, what does that mean?
Body: TLDR: A one-line summary.
If you { use | care about | ... } ${foo}, then keep reading.
${foo} is now better/different in aspect Y due to change Z, with optional graph or image, and "If you're really interested see the details below."
So our example becomes:
Subject: Remote CI testing is now faster
Body: TLDR: Faster test results of queue submissions by 100s of seconds!
If you don't build in test repo ${name}, you can stop reading.
The repo ${name} test suite now supports yadda yadda, reducing the test times. No action is required by you to see these benefits.
Those interested can see the mechanics of the change in ${reference}.
Tactical summary:
- Think about who you're writing for.
- Context is ridiculously important.
- There are very few globals.
- Ask. Be direct and clear. Subtlety is not your friend.
- Tell. Be direct and explicit: What's the story and conclusion.
- Use the formula as a last resort.
Q: What about receiving an email with insufficient storytelling?
A: If this is something you care about or want to invest in, coach them. Be sensitive in any criticism. "Hey, I think these key points could be clearer and it might be worth a follow-up" can work.
Q: What about real-time communications like in Slack?
A: Real-time communication solves a slightly different problem. They had a "problem" customer in email who was great in real time. Moving to mostly real-time communications and away from email, they could smooth out inconsistencies. But having a thesis and narrative arc can help regardless of the medium.
Q: How do you handle communications when you're trying to impart a message to different audiences?
A: Factor out the commonalities to send one to all and then specifics to the subgroups. Consider what can be abstracted up/out. "Here are the highlights" for everyone and links to docs for group-specific details.
End of the Sessions
There was nothing in the second half of the final blocks I wanted to see, and I didn't need to traipse around the expo floor again (lunch today was and tomorrow would be more than sufficient), so I went upstairs to grab a power nap. (In retrospect I should've gone to the zsh workshop instead of the storytelling talk.)
Evening Events
After the sessions and happy hour ended I wound up going out to the 0xdeadbeef dinner at Morton's with Branson. We eschewed appetizers and salads and went right for the mains: He had the 24-oz. porterhouse and I had the 22-oz. bone-in ribeye, and we shared a bottle of malbec and two sides — the bacon and smoked gouda au gratin potatoes and grilled asparagus with a balsamic reduction. I had the chocolate mousse for dessert with a Croft 10-year ruby port; he had the creme brulée and limoncello. (And with a good tip — the service was good and the wait staff was chatty — I still came in under my expected budget, so yay.)
We got back to the hotel around 9pm and I took advantage of the side effects of the wine and port with the heavy meal to go immediately to bed.
I slept in until about 5:30am today. By the time I have to go home from the Pacific time zone I might actually be closer to being on Mountain or even Central time. (Home, of course, is on Eastern.)
After my morning ablutions I headed down for the continental breakfast (mostly croissants). They were set up early again, and Philip brought homemade ginger snap cookies made with bacon grease. After a wandering discussion with an ever-growing group I wound up heading into Salon F for my morning sessions.
Pulling the Puppet Strings with Ansible
My first session today was Brian J. Atkisson's "Pulling the Puppet Strings with Ansible." Red Hat switched from cfengine to Puppet in 2007 to solve both technical and people challenges, and they moved their internal IT systems to it. They noted there was a gap with orchestration (for example, "turn off monitoring, drain connections, upgrade node, bring back up, lather rinse repeat, turn on monitoring") so they rolled out Func in 2009. They moved to Ansible in 2014.
Shortly after acquiring Ansible, Red Hat open-sourced Tower as AWX. It gives centralized playbook execution, RBAC credential management, remote (API-based) playbook execution, autohealing systems (for example, Nagios checks can trip an NRPE event handler to tell Tower to run a playbook and only send a page if the autohealing didn't succeed), and metrics.
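The talk didn't show the event-handler code itself; a minimal sketch of that autohealing hook (an NRPE/Nagios event handler asking Tower/AWX to run a remediation playbook over its REST API, with the Tower URL, template ID, and token as placeholders) might look like this:

```python
#!/usr/bin/env python3
# Sketch: a Nagios event handler that asks AWX/Tower to run a remediation
# playbook, so a human only gets paged if the automated fix doesn't stick.
# The Tower URL, job-template ID, and token are placeholders; "limit" assumes
# the template is set to prompt for a limit on launch.
import sys
import requests

TOWER = "https://tower.example.com"
JOB_TEMPLATE_ID = 42                      # the remediation playbook's template
TOKEN = "<oauth-token-with-execute-rights>"

def launch(hostname):
    """Launch the job template limited to the failing host; return the job id."""
    resp = requests.post(
        f"{TOWER}/api/v2/job_templates/{JOB_TEMPLATE_ID}/launch/",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"limit": hostname},         # only touch the host Nagios flagged
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["job"]

if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    print("launched Tower job", launch(host))
```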
They eventually went all-in to use Ansible for their AWS and OpenStack infrastructure, VM provisioning, traditional data center provisioning, OpenShift, and CI/CD.
App teams own and manage their own VMs, top to bottom, and Infrastructure provides and publishes suggested Ansible roles (analogous to Puppet modules or Chef cookbooks), which are managed centrally with Galaxy.
They're migrating such that RHEL 6 is Puppet only, RHEL 7 is either Puppet or Ansible, and RHEL 8 et seq. is only Ansible.
Remember that writing a script can be fine for the simple cases; Ansible may not always be the right answer for you. But you can write playbooks and contribute them upstream. Use service auto-discovery, keep user permissions and authorizations in a tool designed for it, and remember that Ansible is not great for managing secrets (use something like HashiCorp's Vault for that).
Lessons learned:
- Only manage what you need to manage; configuration management is not auditing.
- Use authoritative data sources where possible, like LDAP, DNS SRV records, service meshes, and dynamic inventories (a small example follows after this list).
- Don't be too clever.
- Tools don't fix people problems.
- Let unmanaged things remain unmanaged. No manual changes!
- Use Tower (Red Hat-supported) or AWX (free, upstream). It does a lot of work for you.
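As a small illustration of the authoritative-data-sources point above, here's a sketch of an Ansible dynamic inventory built from a DNS SRV record (assuming dnspython and a made-up record name); Ansible only cares about the JSON structure printed to stdout:

```python
#!/usr/bin/env python3
# Sketch: build an Ansible dynamic inventory from a DNS SRV record instead of a
# hand-maintained host list.  Assumes `pip install dnspython` and a SRV record
# such as _web._tcp.example.com (placeholder).  A full inventory script would
# also handle the --host argument; --list is the common path shown here.
import json
import dns.resolver

SRV_NAME = "_web._tcp.example.com"   # authoritative source of web hosts

def hosts_from_srv(name):
    answers = dns.resolver.resolve(name, "SRV")
    return sorted(str(r.target).rstrip(".") for r in answers)

if __name__ == "__main__":
    inventory = {
        "web": {"hosts": hosts_from_srv(SRV_NAME)},
        "_meta": {"hostvars": {}},     # Ansible expects this key to exist
    }
    print(json.dumps(inventory))       # e.g. ansible-inventory -i this_script.py --list
```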
[Ed. note: This included more Red Hat history than I was expecting.]
Q: What tools are you looking for with secrets management?
A: Ansible Vault locks the secrets away so apps need to go through Ansible to get to them. They're looking at OpenStack's Barbican and HashiCorp's Vault (both the free and paid-enterprise versions).
Q: "Only manage what you need to manage" and "Make no manual changes" seem contradictory. Can you explain?
A: If you're going to need to make manual changes then the thing you're changing should be managed. AWX/Tower is great for centralizing the consistency. But don't over-manage things that will never need to change.
Enabling Invisible Infrastructure Upgrades with Automated Canary Analysis
The next session was "Enabling Invisible Infrastructure Upgrades with Automated Canary Analysis" from Adam McKenna. Scale-wise they have 30M monthly active users, ~80,000 hosts under management, most of which run dockerized microservices. The talk is aimed to DevOps, SRS. SA, or RelEng who doesn't like having prod issues and wants to improve deploys' reliability. Takeaways: What is and isn't canary analysis, where's it useful, and some practical considerations.
Before you can deploy canary analysis, you have to already be using CI/CD since it's a step in that process. Critical service metrics need to be in a time series database. And you need internal political buy-in from the highest possible level.
For our purposes, "invisible" means no significant time investment is required from the service owner, and "infrastructure" means the type, CPU generation, storage technology, networking stack) OS, and language (runtimes, library dependencies, and so on).
Given that, "canary analysis" is comparing two things: Your production and canary environments. Canary analysis does deploy code to production. There's a risk in rolling out bad code to some users, but it's better than rolling out bad code to everyone at once.
Infrastructure has an expiration date (examples include Ubuntu and Win7 Eols, Python 2-to-3 migration, and Oracle Java to OpenJDK). Therefore upgrading is not optional. Maintaining compliance, supportability, and new features all come from upgrades, and developers don't want to have to write dependencies. Doing quick migrations is a goal. Unfortunately upgrading is hard: It's complex, service owners don't like downtime, and migration work is generally not a preferred task (cleaning up is boring).
Canary analysis helps automate and normalize the mundane migration or upgrade tasks. Automation helps eliminate the toil in comparing metrics; normalization applies it to all of the [similar] systems. However, it's not a replacement for unit or integration tests or (accurate) service health checks. It's not mysterious or magical. It's not off the shelf (there's a Spinnaker plugin) so custom code is required. It's not instantaneous; you need to schedule time in your deployment to spin up the canary clusters.
Components include:
- CI/CD pipeline, workflow orchestration
- Time series metrics database
- Execution environment for custom code
- Canary Judge software
The canary needs a minimum of 50 data points, with similar load to production, and useful health metrics (latency, traffic, errors, and saturation are common).
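Kayenta does the actual judging for them; as a rough sketch of the underlying idea (not their code; the numbers are made up and scipy is assumed), the core of a canary judge is a statistical comparison of the same metric from the baseline and canary clusters:

```python
# Sketch: compare one metric (latency) from the baseline and canary clusters.
# Numbers are made up; a real system pulls >= 50 points per metric from the
# time series database.
from scipy.stats import mannwhitneyu

baseline_latency_ms = [112, 108, 115, 110, 109, 113, 111, 114, 110, 112,
                       109, 113, 108, 111, 115, 110, 112, 109, 114, 111]
canary_latency_ms   = [118, 121, 117, 119, 123, 120, 118, 122, 119, 121,
                       120, 117, 124, 119, 121, 118, 120, 122, 119, 123]

# Mann-Whitney U is non-parametric, so there's no assumption that latency is
# normally distributed.  A small p-value means the canary's latencies are
# worse than chance would explain.
stat, p_value = mannwhitneyu(canary_latency_ms, baseline_latency_ms,
                             alternative="greater")
print(f"U={stat:.1f} p={p_value:.4f}")
if p_value < 0.01:
    print("FAIL: canary latency is significantly worse than baseline")
else:
    print("PASS: no significant latency regression detected")
```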
It would have been about one year of effort for 2–3 FTEs, writing a 700-line custom Python script, and writing a Kayenta OpenTSDB plugin.
Lessons learned:
- UX is important. Good UX will affect how service owners adopt this.
- Choose your pilot customer/app wisely.
- Save good and bad builds of your app to test with.
Q: Can you talk about how this was received and are there metrics on automated versus manual analysis?
A: No metrics, but having a pipeline sends everything to prod. They kept track for a while; before the canary judge made automated decisions it caught more than 50% of the problems before they went to prod, which helped people adopt it.
Q: What kind of UX do you mean?
A: Overall experience. How easy is it for the service owner? The easier it is for them the more likely they are to adopt it. In their case, the SREs had to determine sane defaults for the metrics, since the service owners might not know that. Kayenta provides decent graphs and standard deviations.
Q: What sort of statistical analysis happens?
A: The canary judge uses the Mann-Whitney U test, comparing two sets of data (without assuming any particular distribution) to decide whether one set differs significantly from the other. There are multiple algorithms to use. You have to write a lot of glue code. The service owners wanted a threshold for a metric or set of metrics not to be exceeded, which isn't supported, so you need to have that in your code.
Q: You have a stateless app to compare metrics. How do you solve this for clustered database or other stateful systems?
A: We don't; we're only using stateless systems so far.
Q: We have a high-level idea of what we want to provide customers. Can you replicate what you see in production in a test environment, or does this have to be in production?
A: If I had to, I'd mirror production traffic to an isolated set of services that don't actually send responses to customers, but (a) it's complex and (b) depending on the app you might need those responses.
Q: How do you know how long to wait between Canary and Full Release?
A: 50 data points, which could be a minute or several hours, depending on the application. Best practices say 30–120 minutes in general.
Q: But what if your environment has time-of-day spikes?
A: We do many deploys a day, and we're all about commit-to-production.
Q: Do you have any instances of false positives or negatives?
A: Yes, a lot, especially at first. False positives happen because you haven't determined which metrics are important and they may be misweighted. You may add rules (75% ok, 50% marginal, less than that requires human intervention).
GitOps, an Elegant Tool for Hybrid Cloud Kubernetes
Next up was Ryan Cook's talk "GitOps, an Elegant Tool for Hybrid Cloud Kubernetes." He went through their experience thus far, followed by a live demo (of ArgoCD managing multiple clusters).
GitOps is fundamentally YAML objects stored in git repositories, with version control and templating, and something running while 1 do { kubectl apply -f $path } with no object type limitations (a naive sketch of that loop follows the list below). Best practices are typical:
- Run with least privileges necessary.
- Store code and Kubernetes objects in different repositories.
- Kubernetes objects should be in a private repo (secrets and routing information).
- Document the process to (re)create GitOps' artifacts.
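The naive loop mentioned above really is the essence of it; a sketch (nothing like ArgoCD's implementation, just the reconcile idea, with placeholder paths) could be as simple as:

```python
#!/usr/bin/env python3
# Sketch of the naive GitOps reconcile loop described above: keep pulling the
# repo of YAML objects and applying it to the cluster.  ArgoCD/Flux add
# diffing, pruning, RBAC, and a UI on top of this basic idea.  Paths are
# placeholders.
import subprocess
import time

REPO_DIR = "/srv/gitops/cluster-objects"   # clone of the (private) objects repo
MANIFEST_PATH = "manifests/"               # directory of Kubernetes YAML

while True:
    subprocess.run(["git", "-C", REPO_DIR, "pull", "--ff-only"], check=True)
    subprocess.run(
        ["kubectl", "apply", "-f", MANIFEST_PATH],
        cwd=REPO_DIR,
        check=True,                        # fail loudly if an object is rejected
    )
    time.sleep(60)                         # reconcile roughly once a minute
```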
They evaluated ArgoCD, Weaveworks Flux, Kohl's Eunomia, and kube-applier. When choosing one, look at what functions you need, what's local versus remote, what's pull versus push, and so on. ArgoCD has a great web UI, links nicely with Helm or Kustomize, and has multiple sync options (including pruning) and multiple cluster configurations.
Takeaways: This example used OpenShift but applies to plain Kubernetes, GKE, Azure, and so on. It'll work for any cloud provider.
Ops on the Edge of Democracy
The following session was Chris Alfano and Julia Schaumburg's "Ops on the Edge of Democracy." He started with how he got into education and technology. So to reframe: What should be in the cloud and what on the ground, when optimizing for local innovation instead of global scale? What about growth (cloud) versus evolution (ground)? Cloud is frictionless (no humans) but ground needs human interaction (education, human services, and so on).
This applies to more than just education.
Thesis: Civic hacking is the engine of change. Code For America believes that government can work for and by the people in the digital age. They've got over 76 brigades around the country (and beyond as Code For All) doing more than just government. What about accessibility in public transit?
Imagine if you volunteer your technical skills at a community-organized hackathon. You're there to help. Someone comes to you to help them with someone else's code... when they don't have the skills or the environment (and dependencies). "It's easy... if you learn these other dozen things first" isn't helpful, useful, or even necessary. We need to build a new toolkit where you can replicate an environment with a single click without needing to know the whole stack first: A common community infrastructure without proprietary or licensed technologies that'll bite you in the ass later.
Idling must be free. You need to keep projects around for people to be able to work on them, without having an insane AWS bill.
Julia came up next to talk about building with not for. Experts hack things and create tools. Her goal is to help experts use tools to make them effective. So operations on the edge of democracy.... Software is democracy, government, and the mechanism of government. Everything they do is implemented with technology. We're basically at an inflection point. Rescue the person drowning in front of you but also go upstream to find out why everyone's falling in. There's too big a focus on profitability and time-to-market and not enough on research and working with the actual end users.
So... "[Screw] the easy win." We need to make software for the people in the back of the ambulance (nurses, emergency services, social workers, and teachers). UX is everyone's responsibility.
So what can I do? Get involved:
- Civic hacking groups (worldwide), either working on an existing project or as someone who can help map technical needs to new or existing code or vendors.
- Think about pushing power down to the ground, not isolating things in the cloud, as we push forward.
There's a worldwide revolution at federal, state, and local-level digital service teams. Talk to the help desk about how our environments are broken and pass that on to product design. They need engineers.
(Example, the healthcare.gov initial launch debacle and how we-the-industry started making process changes.)
Lunch
Once again lunch was on the expo floor. Today it was roast beef sandwiches, chicken caesar wraps, and the same salads and chips as yesterday, with a slightly different cookie selection for dessert.
Block 3
After lunch there was nothing I really wanted to see so I wound up getting off my feet for a while back in the room.
Workshop: Running Excellent Retrospectives: Talking with People
After the afternoon break, I went to the "Running Excellent Retrospectives: Talking with People" workshop by Courtney Eckhardt. The subtitle could also have been "Words mean things." Remember that English isn't just limited to North America — or even one region of North America.
A lot of this boils down to facilitation skills. Consider starting with ground rules, safe space, and what might be learning experiences.
Running a retrospective means facilitating, running a productive meeting, and not making bad jokes.
JOB ONE: FACILITATION (see also Incident Commander)
English is a blamey language. We don't want blame in the retrospective. Consider avoiding second-person; starting something with "you" creates an oppositional statement. Consider avoiding "why" because it puts the addressee on the defensive. Other words to avoid are extremes (always, never, every time, should, just, and only).
Consider the difference between "Why didn't you just fix this the last time this happened" and using how, what, what if, could we, what do you think about, what would you have wanted to know. They all get to more complex answers. You want long complex answers. Retrospectives are creative — in putting together the question and in delivering the answer.
People don't do things for no reason (even if they don't understand what that might have been).
JOB TWO: RUNNING THE MEETING
- Select a notetaker:
- With explicit responsibility (not the facilitator).
- Ideally not the same person all the time. (Be ready to draft someone, or decline the offer from the regular person.)
- Ideally not a person who was involved.
- If it's not written down in the official record it didn't happen.
- Stay on time and on topic:
- Don't filibuster.
- Don't bring up your sacred cow.
- Does it need to be brought up now?
- Watch (keep an eye on) the agenda.
- Active listening:
- Eye contact, good body language (lean in).
- Don't interrupt.
- Rephrase or restate when relevant.
- Make sure everyone has a chance to speak even if they have nothing to say. Are they shy? Unwilling to interrupt? Facial expressions and body language. Ask the silent people by name (if you know them well). Try it at least twice. ("Has anyone who hasn't spoken yet got anything to add?")
JOB THREE: DON'T TRY TO BE FUNNY
- Kind, caring, warm, welcoming, thankful.
- "Please don't make jokes like that here."
WHAT ABOUT MISTAKES?
- Apologize.
- Correct yourself.
- Move on.
Facilitators may need to interrupt (for a purpose: achieving the meeting or retrospective goals). Think about what happened when you didn't interrupt at some point when you should have. Be gentle but firm about it.
Q: Is there a benefit if the facilitator isn't part of the team?
A: Yes, especially in big multi-team retrospectives. Sometimes a senior manager from somewhere uninvolved, sometimes a good facilitator from elsewhere.
Q: Should the facilitator take notes?
A: No, never.
Q: How do you balance keeping things on time and on topic with making sure everyone is heard?
A: Tooling and templating can help, especially with timestamps. The facilitator may need to be, or delegate to, a timekeeper for that. Come back later (either add to the document themselves or leave a section at the end of the meeting).
Q: What about devil's advocacy?
A: In a retrospective that's generally not appropriate. "We're not here to talk about things that didn't or might happen, but what did happen and what we can change."
Reception
After the workshop I wound up chatting with a few attendees and our facilitator until it was time to head to the conference reception. (It's still odd writing that; for over 20 years the conference was longer and the reception, while still on the penultimate day, was Thursday.) The food this year was four stations: Vegetables (which included some fresh fruit), vegan (poke bowls with tofu as the only protein), salmon (over asparagus risotto and a pepper jam), and beef short ribs (over a sweet potato hash with onion jam). What I had was tasty. They also did an on-demand t-shirt printing station again (though unlike last year, this year we just stood and watched and did not get to participate).
After the reception I swung back to the room to drop off my laptop and get some of the stuff consolidated (all the swag for the coworkers into the conference bag, tomorrow's clothing laid out, today's worn clothing into the laundry bag, and the kilt swapped out for the sweatpants) before heading down to the lobby bar for schmoozing until I got too tired to continue.
I somehow only managed to sleep in until 5am today, so I caught up on my social media and work email, and composed and scheduled some replies to go out when I'm officially back on Monday the 4th. After ablutions I went down for the conference continental breakfast (today it was bagels).
Containerizing Your Monolith
The first talk was "Containerizing Your Monolith" by Jano González. As many companies started to move from a monolithic application to microservices, but what happens when the environments start to diverge? They moved their monolithic application from their old environment to the new (Kubernetes) infrastructure. The main takeaway is the balance: How much do you invest in the traditional environment as opposed to your new microservices environment.
The project took 1.5 engineer-years from start to finish. It's not as simple as "Move it to Kubernetes." The milestones:
- Create dockerized dev container with tests on GoCD (and got rid of Jenkins).
- Proof of concept: Deploy first staging component. (And first problem: Init script → Nginx → Passenger → Passenger process -?-> Rails app, where variables were lost in the last step. Solution: Put the dynamic variables in Perl.)
- Productionizing the app: Deploy and monitor, handle logs, and so on. Extract metrics from the Kubernetes pod. Problem: Can't run blind. Solution: Rotate the log aggregator.
- Productionize the public API, which orchestrates calls to other services. Problem: DNS latency and excessive requests. Solution: Move Kubernetes to CoreDNS and then hack it to reduce the search space.
- Productionize the internal API. Problem: High latency. (Surprise, containers are less efficient than bare metal.) Solution: Optimize GC and make cheaper SQL queries. Next problem: Deployment leads to error spikes, because the kill -9 happens too eagerly. Solution: Use the pre-stop trick (kill with SIGHUP before SIGKILL). Then errors on start, so they needed a pre-start (built-in timing delay) to have processes ready by the time the health monitors run.
- The rest isn't as interesting (workers, cronjobs, shell, migration hosts, and cleanup).
Currently they have about 1,000 on-prem pods (handling 25K RPS) and ~140 cloud pods (handling about 3K RPS).
Conclusions: They moved from two to one infrastructure, and from two to one delivery process. They did it with a project plan, controlled rollouts, and managing expectations. Benefits include improved utilization and improved confidence, and enabling new initiatives through the engineering environment. Why should we do this? Assess progress in moving (or not), and the costs and benefits.
Q: Monolithic apps means monolithic database. How do you solve for that?
A: Every time we extract a service we create a database local to it. They publish life-cycle events across databases to keep things in sync. They also can export data sets to consolidate in the data warehouse.
Jupyter Notebooks for Ops
The second talk was "Jupyter Notebooks for Ops" by Derek Arnold. First, what are they, how can we as operations people use them, how can they be maintained to facilitate them for non-ops people, and finally what's on the horizon for them.
What are Jupyter Notebooks? They're an open-source web application that allows you to create and share documents with live code, equations, visualizations, and explanatory text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and much more. They're based on existing tools and protocols, including iPython, and flexible (for example, you can use PowerShell). You could drop in other Python modules if you're so inclined. (Recommendation: Use virtual environments because using the operating system's Python is a bad idea, especially if you're on an older OS version.)
What do you do in one? Anything you can do in Python itself, plus more. You get runbooks with better documentation (whys and hows and history with the commands themselves) and experimentation and consumption of web data via REST APIs. Note that you can use languages other than Python (C#, F#, Mathematica, MATLAB, and R)... though doing so may add dependencies (for example, Mathematica requires Wolfram software).
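As a tiny example of the runbooks-with-live-code idea (my own sketch, not from the talk; the URL and field names are placeholders), an ops runbook cell might pair the prose explanation in a markdown cell with the command that gathers the data:

```python
# A runbook-style notebook cell: the markdown cell above it explains *why* we
# check this; this cell does the check.  URL and field names are placeholders.
import requests

STATUS_URL = "https://api.example.internal/healthz"

resp = requests.get(STATUS_URL, timeout=5)
resp.raise_for_status()
health = resp.json()

# Render a quick summary right in the notebook output.
for component, state in sorted(health.items()):
    flag = "OK " if state == "healthy" else "!! "
    print(f"{flag}{component}: {state}")
```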
Even if you don't run it, someone in your organization may (or may want to). There are hubs to allow multiple notebooks in one environment, such as Zero to Jupyter Hub in Kubernetes.
What's next for Jupyter Notebooks? Jupyter Lab is the next iteration. Instead of the current format it's more of an IDE look and feel.
He used a Notebook to run the presentation and showed us the source. He had connectivity issues with the cloud-based version so he fell back to the version on his laptop.
There are lots of shortcuts that let you do various things, like running command line code and pulling Python source into Python notebooks. There is a notebook server, which is very robust; it allows you to convert notebooks to other formats (like a PDF for slides).
Q: I used them long ago. Does Iron Python have anything to do with iPython?
A: No; iPython is an open source project and an interactive Python shell, while Iron Python is based on .NET. iPython is as if bash were written for the Python space.
Q: The Littlest JupyterHub docs say "Don't run in production." What are the security issues it warns about?
A: It's mainly because it's beta software. In more-secure environments consider Microsoft's Azure offering. It's not like it's "insecure" in a hackability sense.
Q: Related to security, is there a way to run Python in a more sandboxed environment? http2python is a concern.
A: Mitigate the risk by running it on localhost or requiring an SSL certificate.
Testing for the Terrified: How to Write Tests, Conquer Guilt, and Level Up
The third talk was "Testing for the Terrified: How to Write Tests, Conquer Guilt, and Level Up" by Frances Hocutt. Basically this is how to get started with automating unit tests, for non-developers who have to write the occasional code. The talk is coming from a harm-reduction approach.
Testing is cool. Automated unit testing promotes helpful code habits (modularity, reusability, and so on), reduces both debugging time and duplicated work, and can be "free" documentation. So if it's cool why don't we do it all the time? Pressure, both on ourselves (what framework do you use, how do you test the network for an API, and so on) and from others (patches need tests, untested code is legacy code, "What do you mean you've never done TDD" as a barrier to entry, and so on), and lack of resources (such as skills and time).
In public health, there's a concept of harm reduction. People have reasons for what they do (even if you don't agree with or like them). It's more effective to meet them where they are and offer resources for making the changes they want to make. That works better than draconian or shame-based approaches.
Unit tests' basic idea is to find a unit of code (like a function), give it controlled input, and see what output you get... and if that result is expected ("2+2=4") or not ("2+2=5"). But writing tests isn't quite that simple: You need to have these skills:
- Write code that isolates side effects or network interactions.
- Find the parts of your code that are easier to work with.
- Figure out useful inputs to test with.
- Write code that tests what you want to test.
Where should you get started?
- Write tests that document what your code does now.
- Start somewhere. 10% coverage is better than 0%.
- Start with pure functions, where a given input always produces the same output. (No I/O, disk writes, API calls, changes to global variables, or other side effects.)
- Give the functions some inputs, find the outputs, and write it down in a test.
Given that? Let's test! (This was followed by a live demo using pytest to test Python code: write a test with a known failure first to see what success should look like, then edit the test to expect the correct output. If the code is misbehaving, write a test that passes with that "bad" behavior, then change the code; if that test now fails, you've fixed it.)
You want all the tests in your suite to pass, even if the code is expected to throw an error. (You're testing that the code works.) Some test frameworks allow you to specify expected errors (like the wrong number of arguments, or a string when a boolean or number is expected).
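As a concrete sketch of that workflow in pytest (the add() function and file name are hypothetical, not from the talk), with one test for normal behavior and one that expects an error:

    import pytest

    def add(a, b):
        # A pure function: no I/O or side effects, same input -> same output.
        return a + b

    def test_add():
        assert add(2, 2) == 4            # the expected "2+2=4" case

    def test_add_rejects_none():
        with pytest.raises(TypeError):   # the framework lets you expect an error
            add(2, None)

Save it as something like test_add.py and run "pytest" in that directory; both tests should pass.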
What next?
- Practice the easy stuff.
- Practice refactoring your code so there's more easy stuff.
- Gear up for the harder stuff:
- Read code.
- Pair (but make sure your goals line up).
- Look at more advanced testing resources once you have a sense of how this works.
- Use a code coverage tool to put some numbers on it (e.g., coverage.py).
What Connections Can Teach Us about Postmortems
The fourth talk this morning was "What Connections Can Teach Us about Postmortems" by Chastity Blackwell. What do we mean by Connections? It was a 10-part science series that the BBC aired in 1978 (and PBS in 1979). The production took two years, cost roughly $200M (in 2019 dollars), and spanned 22 countries; the series traced the creation of modern technologies, including mass communications and computers.
(The series is available on the Internet Archive. The first and last episodes are different from the rest and may be the most worthwhile.)
History is not a series of linear events but intertwined events with feedback loops. It's mostly storytelling. The way they're presented, and the language used, affects the perception. A postmortem or retrospective — that is, the conclusions or results document — is also a story.
How do we present a complex and nuanced subject such that people both understand it and understand its complexity? Burke in Connections shows the surprising relationships, such as the path from Roman aqueducts through looms to paper. Highlighting those odd relationships can be the most memorable.
Another technique he used is zooming in on a specific individual. If Alice figured out what happened, showing us how she did that can indicate how things can be surfaced before there's a problem.
Also, you have to look at the events of the moment without the foreknowledge we have now. That creates empathy with those involved: how they made their decisions and why they took the actions they did.
Most importantly, Burke reminds us that it's more complex than we think. He's not providing a definitive view but more encouraging people to think about the complexity.
Writing a postmortem we need a through-line, but we need to make sure people understand there's complexity under the simple explanation. Include links to the data and logs.
Burke made the show entertaining and engaging, using the tools of cinema to make moments memorable — through the presentation if not necessarily the facts. We need our postmortems to be accessible. Engineers tend not to write well, but:
- Use normal, plain, even informal, language. You want the artifact to be readable.
- Using GIFs is okay for conveying information, but don't overdo it. Make it relevant to the subject at hand.
- Don't just give a litany of events without the connective tissue between them. A timeline is all well and good, but context is necessary.
Avoiding the pitfalls: Despite an attempt to be international or global, the series is still very western-centric. Nobody is bias-free, so don't try to tell the story until after the investigation is done, and include others' perspectives. Start with events (what and when), not reasons (how and why — where biases creep in). Looking at the primary sources helps. Real stories don't follow the three-act narrative you learned in school, so avoid templates that force a narrative structure: some incidents have clear beginnings and endings, but not all do.
Q: Do you have opinions about postmortems being presented, perhaps with conversations, not just "read this document."
A: It depends on the environment. How many postmortems are there, who could attend them regularly, and so on? Even so, at the end you're going to want the written artifact. That said, for serious outages it might be better to also have a presentation.
Lunch
Lunch was again downstairs in the exhibit hall, even though the exhibitors had closed up shop at 2pm Tuesday. Today's selections were caprese and Caesar salads, penne with shallot Alfredo or Italian meat sauce, and lemon-herb chicken, with tiramisu and chocolate swirl cheesecake for dessert.
Workshop: BPF Performance Tools
After lunch I went to Brendan Gregg's "BPF Performance Tools" workshop. There was a presumption of familiarity with the tools that did not hold in real life for me. That said, https://github.com/iovisor/bcc/blob/master/docs/tutorial.md#0-before-bcc is a good start. Built-in OS commands include:
- uptime
- dmesg | tail
- vmstat 1
- mpstat -P ALL 1
- pidstat 1
- iostat -xz 1
- free -m
- sar -n DEV 1
- sar -n TCP,ETCP 1
- top
With the BCC tools:
- execsnoop
- opensnoop
- ext4slower (or btrfs*, xfs*, zfs*)
- biolatency
- biosnoop
- cachestat
- tcpconnect
- tcpaccept
- tcpretrans
- runqlat
- profile
Caution
Some of these are OS-specific and break on Ubuntu with the 5.3 kernel.
Note
Many of the tools need a "-bpfcc" suffix on Ubuntu; the packaging changed the naming convention.
I'm not proud of it, but I gave up in the middle of lab 2, after just over an elapsed hour.
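For the record (and as a starting point for next time), here's roughly what driving BPF from Python looks like with the bcc bindings: a minimal sketch, not from the workshop materials. It assumes the bindings are installed (e.g. the python3-bpfcc package on Ubuntu), needs root, and the exact syscall symbol varies by kernel.

    from bcc import BPF

    # Attach a tiny BPF program to the clone() syscall and print a line per call.
    prog = """
    int hello(void *ctx) {
        bpf_trace_printk("clone() called\\n");
        return 0;
    }
    """
    b = BPF(text=prog)
    b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
    b.trace_print()   # Ctrl-C to stop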
Keynote: When /bin/sh Attacks: Revisiting "Automate All the Things."
After the afternoon break, the first of the two closing keynotes was J. Paul Reed's "When /bin/sh Attacks: Revisiting 'Automate All the Things.'" It could also have been called "Engineering Resilience into Modern IT Operations: A Play in Three Acts."
How do you know what to do when an incident is occurring? There are some heuristics to check:
- What's changed since the system was in a known-good state?
- Go wide, looking for other potential contributors.
- Convergent searching: Confirm and disqualify diagnoses by matching signals and symptoms. Engineers use a specific, past diagnosis (a painful incident memory) or a general, recent diagnosis (something still in their mental L1 cache).
So how do you get better at knowing what to do when an incident is occurring? Need to understand what expertise means; experts use their knowledge base to:
- Recognize typicality (what's normal).
- Make fine discriminations.
- Use mental simulation (to see how we got to this point).
- Know when to break the rules.
Experts can visualize the past, imagine how things will play out, and see what is not there.
Transforming experience into expertise:
- Personal experience (challenges; for example, on-call)
- Directed experiences (teaching, code review, pair programming, writing wikis and runbooks)
- Manufactured experiences (training/simulation; chaos engineering, game days, and table-tops)
- Vicarious experiences (painful or memorable events; "I remember this one time when it was DNS")
Incident Analysis (post-mortems) antipatterns include:
- We only investigate failures, not successes.
- We forget about bias: outcome bias (the plane landed okay, so the decisions must have been fine), hindsight bias (we only call behavior bad after the fact), correspondence bias (ascribing behavior to who someone is), and so on. Biases are built into the way our brains work; we can't do without them, so we have to recognize they're there and address them.
- Deprioritizing retrospectives/learning processes.
Resilience in incident remediation and prevention. Cf. Ironies of Automation (1983). Some of the ironies include:
- Manual skills deteriorate when they are not used.
- The generation of new/novel strategies requires an adequate knowledge of the system.
- Automation generally involves a speed-versus-correctness tradeoff. (Computers are fast, and using them to automate other computers happens so fast that we can't validate the actions in advance.)
- Automation can camouflage the current system state. When it reaches its limits, the system can be in a worse state than if we'd done it by hand.
- Tracing the decision trees made by algorithms can be difficult (or, with AI/ML, impossible).
- This leads to the inability to fully understand the current context of the system when you get paged.
Automated systems, as they increase in autonomy and authority, can be interpreted in two ways:
- As a deterministic machine (in foresight or hindsight).
- As an animate agent capable of activities independent of the operator (in context).
The difference between those is most critical in (during) an incident.
How do you deal with the ironies? Cultivate the ability to buy time (through simulation, like chaos engineering, or by widening system understanding as a daily practice, like brown-bags). Also, co-evolve the automation with the product(s) it supports.
Takeaways:
- Expertise takes time and space, so if you want an expert team you need to make the time and create the space for it to happen.
- Beware our brains' biases. Success and failure are two sides of the same coin. If you haven't done a retrospective within 72 hours, you won't learn anything, because the reconstructed context isn't necessarily the event context.
- Automation that truly participates in our joint cognitive systems is nascent.
Caveat implementor!
Keynote: Why Are Distributed Systems So Hard?
The final closing keynote was Denise Yu's "Why Are Distributed Systems So Hard?" Basically, there's a lot of unreliability we have to work around. We can monitor, use chaos engineering, and so on... but we can't know everything. We do, though, know that "Shit's gonna fail."
Humans' superpower is empathy. Computers can't do that yet.
Closing Remarks
The conference ended with closing remarks from our conference chairs. Thanks to everyone who helped put together this year's conference. Please fill out the survey.
Some statistics:
- Build — We used 92% of our bandwidth and had 435 unique devices connected.
- Captioning — CART services were provided by Kitty Baca and Associates (thanks!).
Next year we'll be co-located with SREcon Americas East in Boston (December 7–9). Avleen Vig and Cat Allman are our co-chairs.
Evening Activities
After the conference ended I headed up to the room to dump the laptop, put on a sweatshirt, and grab my jacket before going back to the lobby to meet Dan, Jim, and Kent to go to Fogo de Chao for dinner. Over the 2.5 hours or so I had lots of meat — chorizo, prosciutto, salami, smoked salmon, and soppressata from the salad bar (along with artichoke hearts, asparagus, caprese salad, manchego cheese, and parmesan cheese), then alcatra (top sirloin), cordeiro (lamb chops), costela (beef ribs), filet mignon, fraldinha (bottom sirloin), frango (chicken) com bacon, linguica, medalhoes (steak) com bacon, New York strip, and picanha, plus the mashed potatoes, fried bananas, polenta, and cheesy bread. I had a caipirinha (and a glass of the malbec Jim ordered) to drink and the lava cake for dessert.
After getting back to the hotel I schmoozed a bit in the lobby then headed upstairs to pre-pack the luggage before collapsing in bed.
Today was a 21-hour travel day. Thanks to my back being unhappy I popped some muscle relaxants at bedtime, which led to dry-mouth, which had me awake by 3am. I put together most of my expense report so it'd be faster to file Friday once I had the last of the actually-invoiced costs.
I started out with breakfast with Lee at the hotel restaurant at 7:30am. We were later joined by Peter and someone whose name, unfortunately, I can't remember. After my meat lover's omelette, breakfast potatoes, fruit salad, and extra meats (bacon, pork sausage, chicken sausage, salami, pepperoni, and lox), I went to my room to grab the bags and head down to the lobby. I let the auto-checkout take care of itself and caught my 9am shuttle to the airport at 8:55am.
Travel to the airport was uneventful, except for the right-wing talking heads on the radio. I tried to remain diplomatic and not shout "Bullshit" repeatedly. No line to drop off my bag, a very short wait for wheelchair assistance, through security (where they had me sit and remove the aircast so (a) it could go through the x-ray on its own and (b) they could pat down my leg), and to my gate in plenty of time for my flight. As it happens our inbound aircraft was coming from Detroit, the same flight I'd taken here on Sunday morning.
Boarding was uneventful. The flight itself was a bit bumpy. Lunch was tolerable; I didn't really like either of my choices so I went with the least-objectionable salmon salad: A salmon filet served chilled on a bed of mostly-arugula with shaved Brussels sprouts and purple potatoes. It was better than I was expecting, honestly. Landing was fine, the taxi to the gate typically long, and we off-boarded pretty quickly.
There were supposed to be six wheelchairs meeting the plane. None were planeside. When I got up to the gate area there were two or three, stacked on each other, with no staff there to assist. Remembering the 39-minute wait for just one wheelchair assistant in August, I elected not to bother waiting. I hiked from A30 to baggage claim, where despite the flight crew and the Delta app saying we'd be at claim 4 (to the left of the doors from the terminal), the posted signs said we'd be at claim 9 (to the far right of the doors). Of course, I didn't check the signs until after getting to claim 4 and not seeing our flight number on the carousel display.
After a hike over to claim 9 — broken foot and all — I got my bag (20th off the conveyor) and went over, up, over, and down to catch the shuttle to the parking deck, and went over, up, over, and down to level 3 to my car, and drove home through the rain as it became sleet then snow. I got home without incident, grabbed the accumulated postal mail, unpacked, and went to bed.