The following document is intended as the general trip report for me at the 32nd Systems Administration Conference (LISA 2018) in Nashville, TN, from October 29–31, 2018. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.
This year was the first in a new conference program layout: Three days of technical sessions with no more than four tracks (2 invited talks (in either 2x45 or 3x30) and 2 trainings (in 1x90)), plus Labs (for hands-on post-training followups and generic testing). I had said I was going into this with an open mind and not a knee-jerk "this sucks because it's different" reaction.
Some of the reasons for the change are that we're smaller than we used to be, we don't do full- and half-day training any more (limiting training to a 90-minute block), and our room requirements have changed accordingly. We've eliminated refereed papers (which ended in 2015, when we accepted 4 of 13 submissions) and workshops (which ended in 2016), and now with fewer tutorials our room needs are 2 ballrooms for the talks, 2 smaller (50- to 75-person) rooms for the trainings, and a third similar space for the NOC/Labs/Build, plus a concourse for registration, a ballroom or expo floor area for the vendors, a hidden staff office, and possibly storage space. We're smaller in part because of the existence of other focused regional conferences (like SREcon) and the perceived reduced relevance of our content.
In general the worst I can say about the change is that it was surreal. For the past quarter century — literally; this was my 25th consecutive LISA conference — my brain's been programmed that the technical sessions were Wednesday through Friday and the reception was Thursday night. This year it was Monday through Wednesday with the reception on the Tuesday, so "what day is it" was a more complex mental lookup than usual. If we continue down this path I'm sure I'll get used to it.
That all being said, I have some issues that aren't unique to the six-to-three change. I've observed that whether the talks are 3x30 or 2x45, the speaker often needs more time than they've been allocated. We had many sessions this year where there simply wasn't time for Q&A, plus a noticeable delay while speakers swapped out their laptops. I suspect it won't go over well to pre-connect all three speakers' laptops in advance (what are they going to use when they're not speaking?), but maybe the first speaker's Q&A time could be used to disconnect that laptop and connect the second speaker's. As it stands it's irritating and inefficient.
Travel day! I was up early — the alarm was set for 6am and I was up by 4:30am anyhow. I showered, finished packing, made the bed, unloaded the dishwasher, set the thermostat lower, and hit the road shortly after 5am. There was virtually no traffic on the road to Metro airport, I found a parking space in my usual zone (3E4/3E5 of the Big Blue Deck) quickly, got to the terminal shuttle as one was pulling up, and got to the terminal with only one red traffic light. There was no line to drop off my checked bag, only one person ahead of me in the TSA line, and no huge backup at the scanners, so I was in the terminal and at my gate in plenty of time to catch up on most of my backlog of magazines before boarding began for my flight.
We all got on board and seated quickly, and had an early departure. The flight itself was mostly uneventful. Our lead flight attendant was a bit of a comedian (using expressions like "I don't usually say this but the drinks are on me" to the first-class cabin, and "flight attendants will be coming through the cabin to pick up items you want to dispose of, but we cannot accept small children or in-laws," and "be careful opening the overhead bins but remember that shift happens"), but he kept my kalimotxos refilled. Other than a bit of turbulence on takeoff it was a smooth ride.
We arrived early in Nashville and my bag was somehow second from my plane onto the conveyor; there was no line for a taxi, and when we got to the hotel there was no line to check in. They even had my room ready so I could get upstairs and unpacked before heading back down to wander the conference space as they were setting up. Said Hi to the folks in Labs, snagged a donut (mm, breakfast — the banana and small bag of snack mix on the plane with the kalimotxos didn't count), then headed out to find lunch. I wound up having a really tasty carnitas burrito at Bajo Sexto Taco.
After lunch I took a much-needed power nap before heading back down to the conference space. I wound up heading across the street to Martin's BBQ for an early dinner with seven other people (only two of whom I knew beforehand). I had a pulled pork tray with baked beans and mac-n-cheese and everything was really tasty. We got back to the hotel shortly after the registration booths opened so I got my badge printed and headed to the Welcome Reception where I schmoozed, nibbled cheese (mostly the blue), and basically networked until about 9pm when I decided to head to bed and try to sleep. Of course, I didn't succeed in sleeping until nearly 11pm, but oh well.
I slept in (not counting a bio break or two) until 6am or so. Did my morning ablutions, got dressed, and headed down to the conference space for the continental breakfast (I had a croissant with strawberry preserves and a banana) before the opening remarks which were scheduled to begin at 8:45. Met some new people, caught up with some old friends (including Amy, Brian, Cory, Doug, Jennifer, and Pat), and wandered into the keynote space in time to get a reasonably-close table with a power strip.
Opening remarks: Thanks to all the folks who helped put this together (from program to office to sponsors). We had a record-high 329 total submissions, which got 1106 reviews and 913 feedback comments, and we accepted 81 talks and tutorials. We also had a total of 7 keynotes on the program (though we had to scratch one because the presenter was ill). We were reminded to go to LISA Labs for hands-on learning about new technologies, to go to the Birds of a Feather (BOF) sessions in the evenings, and to go to LISA 2019 in Portland Oregon next year (October 28–30).
Our first keynote speaker was Jon Masters, who talked about how the underlying hardware's attempts to be faster led to the Spectre and Meltdown attacks earlier this year. Part of the problem is that hardware and software people go out of their way to not talk to each other. We've spent decades making machines run faster but never asked what we were giving up in return. He spent too much time (in my and others' opinions) refreshing the audience's understanding of the underlying computer architecture, microarchitecture, firmware, branch prediction, virtual memory, and caches. Once that was done he started talking about side-channel attacks and gave a quick overview of how Meltdown and Spectre worked. His recommendation for what's next is to change how we design hardware, change how we design software, and actually have the hardware and software people talk to each other.
Our second keynote was Tameika Reid on "The Past, Present, and Future of SysAdmins." This was mostly a skill-mapping talk and didn't seem very keynotey to me (and to some others I talked to). Her thesis was that people keep saying "systems administration is dying" but they really refer more to the title "sysadmin," since the skills we have — problem solving and analytical skills; virtualization; cloud; AI, ML, blockchain, and big data; communication; scripting and programming languages; repos; networking, DNS, DHCP, and SDN; automation (such as Ansible); performance tuning (such as with perf); testing; and security (hardware, software, physical, and social) — will still be relevant. Some new titles we may see include Blockchain Engineer, Chaos (or Intuition) Engineer, Infrastructure/Automation Engineer, and System Architect. We'll also be seeing more with quantum computing, the merging of security into DevSecOps, more IoT-related administration, and automotive-grade Linux such as with autonomous vehicles.
After the morning break I went to one of the invited talk sessions. First up was "Introducing Reliability Toolkit," about how ING started with the goal of improving their services' reliability and realized they didn't monitor well. Without direct alerting they would lose 69 minutes on average before major incident resolution even began. They measured business functions and found that most of their 300 teams didn't know their uptime as perceived from the application's perspective. They built a toolkit based on Prometheus to create alerts from time-series data; an alert manager to route those alerts to email, SMS, or ChatOps as relevant; Grafana for graphing; and Model Builder to compare the desired metrics to the actual metrics. They provision it for those application teams with default configuration files, templates, dashboards, and some alerts. Their role is to maintain and update the binaries (test and release new versions, including both security and penetration testing). They deliver 5 machines (1 test, 2 acceptance, 2 production) so the customer unit can have nearly full responsibility; they also do OS-level patching. They also provide client libraries so metrics can be scraped from the servers; other teams work on exporters as well, and they're internally open-sourcing things to reuse others' work.
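(The talk didn't show code, but to give a flavor of what those client libraries do, here's a minimal sketch using the standard prometheus_client Python package; the metric names, labels, and port are my own illustrative assumptions, not ING's conventions.)

    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    # Illustrative metrics; Prometheus scrapes them from /metrics on port 8000.
    REQUESTS = Counter("app_requests_total", "Total requests handled", ["status"])
    LATENCY = Histogram("app_request_duration_seconds", "Request latency in seconds")

    @LATENCY.time()
    def handle_request():
        """Pretend to do some work and record whether it succeeded."""
        time.sleep(random.uniform(0.01, 0.2))
        status = "ok" if random.random() > 0.05 else "error"
        REQUESTS.labels(status=status).inc()

    if __name__ == "__main__":
        start_http_server(8000)  # exposes http://localhost:8000/metrics
        while True:
            handle_request()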
They learned that having cool technology doesn't mean people start using it. Having the client libraries in the engineering frameworks helps a lot, as does ensuring a good feedback loop with your customers (such as determining what's prohibiting or inhibiting them from using it). They educate during onboarding and workshops. Other teams may not understand a lot of the monitoring/metrics pieces (for example, ELK is not the same as metrics). Having dashboards accessible to all engineers has been helpful.
Their model builder knows that weekends and holidays have different expected traffic. Currently they only do averaging models. It's currently closed-source within their institution but they're looking to open-source it. Within their institution they effectively offer Prometheus-as-a-Service. Model Builder isn't available yet and they aren't sure how to release and announce it.
The second talk in this session block was "Incident Management at Netflix Velocity." Netflix sees hundreds of billions of events — like "click on the UI" or "play a movie," ignoring the actual video data — flow through them every day. They therefore have billions of time-series metrics updated every minute. Their goal is to provide "winning moments of truth": when someone chooses Netflix for their entertainment. For them, Seconds Matter.
When they moved to the cloud they wanted to make sure that the disappearance of an (AWS) instance doesn't cause an application outage. Chaos Monkey helps with that: It randomly clobbers production instances. But what else should they be testing? They're good at running software in the happy path (for example, A calls B, gets a response, yay) and the sad path (A calls a dead B), but what about the gray area (where B responds only partly or corruptly)? They created Latency Monkey to insert latency at the common RPC layer. They measured their customer-success metric against it to see how cranking up the impact and latency affected it.
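(For illustration only: this is my own toy sketch of the "inject latency or failure at a common call layer" idea, not Netflix's actual Latency Monkey; the knob names are made up.)

    import functools
    import random
    import time

    def inject_chaos(max_latency_s=0.0, failure_rate=0.0):
        """Wrap an RPC-ish call: add random latency and occasionally fail it."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                if max_latency_s:
                    time.sleep(random.uniform(0, max_latency_s))
                if random.random() < failure_rate:
                    raise RuntimeError("chaos: injected failure")
                return func(*args, **kwargs)
            return wrapper
        return decorator

    # Crank these knobs up and watch what the customer-success metric does.
    @inject_chaos(max_latency_s=0.5, failure_rate=0.1)
    def call_service_b(payload):
        return {"ok": True, "echo": payload}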
They learned things from their testing: Application behavior (during an application migration that used the same common layer), blast radius (affecting 100% of a service instead of some of it), and consistency (they went faster than they should've). They had to think differently about failure, knowing they'd scale bigger and higher, so future failures would be more complex.
So now they think about:
- Reasonable prevention — They can't stop testing, moving, and thinking. Don't overindex on past failures; consider what had to happen in a complex system for that failure to happen. Retries already happen; things are often in some level of failure to begin with. Don't overindex on future failures either; that's like premature optimization. Have a problem before you solve for it. Really, don't overindex in general.
- Invest in resilience — Choose explicitly to allow for resilience in new features. Codify good patterns; we sometimes find solutions to problems (e.g., a shared library that might be repeatedly clobbering a cache). Implement and invest more in chaos and get Engineering to buy off on it.
- Expect failures — They will happen, regardless of what measures you take. "100% uptime" simply isn't possible. If you expect and plan for failure then you're able to prioritize it. It's a matter of WHEN, not IF, the system fails. Recovery is a more important strategy than prevention. Consider what should happen when something isn't available (an AWS region issue, a microservice like personalization being down), and build in a recovery mechanism (e.g., evacuate a bad AWS region, have a fallback plan; see the sketch below).
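(A minimal sketch of that fallback idea, under my own assumptions; the names are invented and this is not Netflix's code: if the personalized call fails, serve a static default rather than an error.)

    def get_recommendations(user_id, personalization_client, timeout_s=0.2):
        """Prefer personalized rows; fall back to a canned list if the service is down."""
        default_rows = ["Trending Now", "New Releases", "Top Picks"]
        try:
            return personalization_client.recommend(user_id, timeout=timeout_s)
        except Exception:
            # A personalization outage shouldn't stop browsing or playback.
            return default_rows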
Given all that, what does their Incident Management look like?
- Have goals and expectations. Be ready when things break. Their goals and process include having short and shallow incidents, measuring so you can determine if a change helped, having unique failures (repeating a failure indicates a lost lesson), and making outages valuable, because they're expensive, and not just financially.
- Have experts. "You own it, you run it." But not all software people are experts in everything, especially in the edge cases we don't see much, so they created a team of Failure Experts to help the other teams during those problems. (Those people are the Core SRE Team.)
Their tactics are:
- Before incidents — Set expectations; what's obvious to ops (e.g., what you need for on-call) may not be to non-ops. Do education and outreach, both inside and outside the team.
- During incidents — The Core SRE team doesn't fix things, but they coordinate (when the team that caused the problem is also fixing it, that's easy; when multiple teams are involved — especially when one or more of the PR, legal, marketing, or executive teams need to be involved — it's more complex). Communication is also important. "No news is good news" is horribly wrong when things are broken. Communicating with engineers is different than with customer service.
- After incidents — Memorialization: going back and learning what happened and why, what mental context you had, what tools you were using, and how we got into this failure state. Blameless AARs.
Lunch was held on the vendor expo floor. The menu included crudités with pimento cheese; Music City salad (romaine, hard boiled egg, and some other stuff I forget); couscous with feta, tomato, red onion (add bacon, red wine/shallot vinaigrette); sliders, both cheeseburger and fried chicken; and cookies and chocolate tarts.
The first block after lunch had three talks. The first was Brandon Bercovich's "Designing for Failure: How to Manage Thousands of Hosts Through Automation." He believes there are three problem spaces (limiting the discussion to stateless services):
- Resource utilization — They started with no containerization, manual placement, and manual rebalancing. They then went with containerization (Docker) but didn't change the manual placement/balancing issues. Finally they went to Apache Mesos and Apache Aurora to auto-schedule and -balance services as resource requirements change. No more paging the admin for failures because there's no manual intervention needed. Mesos is the largest-scaling open source option right now; OpenStack and Kubernetes are other possibilities.
- Fleet management — They started with Puppet but it didn't do all they wanted it to (partly due to setup and partly due to product limitations); it wasn't sufficiently idempotent (e.g., upgrading the agent AND the sudoers). Then they built Platform Lifecycle Manager (PLM); having a Puppet Master would have fixed [some of?] their problems but they ran masterless. It lets them do canarying and control the rollout. They didn't pay attention to what was running where and took down all the hosts for a specific service. The team performing the work wasn't closely involved with the tool developers, so there were issues there. They next went to Cluster Lifecycle Manager (CLM) — the same thing only rebranded, and redesigned in collaboration with the tool users to solve the right problems the right way and to account for business rules and requirements.
- Host management — What if hosts are deferred, or if they're down when the thing happens? This is more about determining how a host gets to the goal state when it comes back up. CLM detects the differences between actual and goal states, then hands them off to the Cadence orchestration engine to send the remediation commands to the relevant host(s).
Upgrades are good, but what about host failures? The detector can look at other conditions (was it ssh-able? did a reboot fix it? if not, what's the escalation, an RMA?). They were able to get rid of hundreds of alerts in any given week by doing this. CLM's dispatcher is a combination of business rules and rate limiting. Its workers do the work as commanded by the dispatcher. It is bootstrapped with a simple foundational stack, manually placed. (It's their only manually-placed service.)
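(To make the detect-and-remediate idea concrete, here's a toy goal-state reconciliation loop of my own; it's a sketch of the pattern he described, not CLM or Cadence themselves, and all the names are invented.)

    GOAL_STATE = {"hostA": "v2.3", "hostB": "v2.3", "hostC": "v2.3"}

    def get_actual_state():
        # In real life this would come from inventory/telemetry; None means "down."
        return {"hostA": "v2.3", "hostB": "v2.1", "hostC": None}

    def plan_remediation(goal, actual):
        """Return (host, action) pairs needed to converge the actual state to the goal."""
        actions = []
        for host, wanted in goal.items():
            current = actual.get(host)
            if current is None:
                actions.append((host, "reprovision"))
            elif current != wanted:
                actions.append((host, "upgrade to " + wanted))
        return actions

    if __name__ == "__main__":
        # A real dispatcher would apply business rules and rate limits here,
        # then hand the commands to workers.
        for host, action in plan_remediation(GOAL_STATE, get_actual_state()):
            print(host, action)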
The second talk in this block was "Familiar Smells I've Detected in Your Systems Engineering Organization... and How to Fix Them" by Dave Mangot. Why "smells"? It's one of the strongest memory triggers we have. A lot of this was common sense — crawl before you walk before you run, keep staging like production, script or automate what you can (including failure resolutions), and so on. (I found the repeated musical riffs both distracting and annoying (and possibly unlicensed). His playing a sad two-tone on the harmonica didn't add value either.)
The third and final talk in this block was Michael Kehoe & Todd Palino on "Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way." Their thesis is that any team can declare a "code yellow" state when they're in a very bad spot (SRE toil, bad smells, etc.). Think of it as a yellow card in soccer. They talked about how to identify team anti-patterns (with blame-free, open, and honest communications), how to work through high toil, and how to create sustainable workloads. They walked through a couple of examples, and their key lessons learned were to measure (toil and overhead), prioritize (to remove the toil), and communicate (with partners and teams). Getting into a code yellow needs both a problem statement and exit criteria, and management (of both SRE and Development) has to buy in and be involved. A code yellow is basically a breaking point where current processes aren't sustainable. (Managers can sometimes see these coming, but not always.) Measure; look at trends. This is very much a cultural issue for the organization; SRE or operations needs to have the ear of executive management at the same level as product management does. The communication — about both what is and isn't going well — is essential.
After the afternoon break I went to the training session "30 Years of Making Lives Easier: Perl for System Administrators" by Ruth Holloway. She gave an excellent overview of the state of the Perl ecosystem, what tools are out there, and provided a repo with some examples we could view in real-time. She had a few general tips which made sense and are good practice.
After sessions ended a group of us (me, Branson, Duncan, Jen, Kathy, and Mark) headed out for dinner. Four of them hadn't yet gone to Martin's and the two of us who had didn't really care enough to object so we went there. Branson and I split the Big Momma platter: 4 spare ribs, 6 oz. each of brisket and pulled pork, and 4 sides (baked beans, coleslaw, fries, and mac-n-cheese). Then while the other 4 went back to the hotel Branson and I ran to the liquor store on 3rd Street to see what we could get for a small 2-bottle scotch BOF. We wound up with a bottle of Glenlivet 15 and a Basil Hayden's dark rye with port finish.
We got back to the hotel in time for the 5-minute Horror Stories BOF. I kept time — nobody went over 4 minutes so we could have everyone who wanted to contribute do so — and our winner was the gentleman with the "pigeon caused the data center fire" story. (His prize was the applause used to determine the winner.) I didn't contribute a story since nothing was of sufficient horror to warrant it, though many of the stories were more funny than horrific, so in retrospect I could've told the "Software wouldn't install because our hardware was too good" story from March 2009.
After that broke up around 9pm Branson pinged the Labs staff and a couple of others that we were starting and eventually six of us sat around in his suite drinking the booze and chatting from about 9pm to midnight. I left around 11:45pm to go to bed and was there by midnight.
Despite my late bedtime my body insisted on waking me up for good a bit after 5am. Managed to kill some time working on the trip report — writing and revising the trip report entries for yesterday and putting my likely schedule in as a template for the rest of the week — before shaving, showering, and getting ready for the day.
The continental breakfast in the conference space opened at 8am; I repeated my croissant-n-jam and banana. I caught up with Amy, Dan, and Doug, and made it to the keynote session in time to grab a seat with an available power strip.
The first block was another set of keynotes. There were supposed to be three, but the scheduled first speaker was sick so the remaining two speakers each had a 45-minute slot. And they needed that extra time.
The first was Dr. Sarah Lewis Cortes talking about "Anatomy of a Crime: Secure DevOps or Darknet Early Breach Detection." Starting with questions to gauge how much of the audience had heard of terms like darknet and Tor, she talked about the Roman Seleznev case as background before a technical deep-dive into a breach. We saw darknet credit card search results (number, CVV, and expiration date in cleartext) and a skimmer. She talked about a retail hack: Burp Suite to crack passwords — though it's even easier to buy credentials on the darknet. An event starts with an attacker getting credentials (such as from a supplier or a password crack). They then target large retailers because they have hundreds or thousands of point-of-sale (POS) systems on which the malware can be installed.
The breach timeline:
- Apr 2013, obtain the credentials.
- Jul 2013, release a zero-day to exploit the vulnerability and jump the barriers between the vendor and retailer systems.
- Apr 2014, they had over 2200 self checkout POS terminals. The malware reads the cleartext payment card data from the RAM on the POS terminal (because nobody would see RAM between entry and verification, right?), uses a regex to grab the PCI, and sends it to the attackers' servers.
- Jun–Nov 2014, the card information is on the dark markets.
- Sep 2014, the breach was publicly detected (and reported by a bank to Brian Krebs — banks are the big customers buying back the credit card numbers because it's cheaper than dealing with fraudulent transactions in bulk) by seeing it on the darknet.
- Nov 2014, phishing took place because the email addresses are also out there.
As of Oct 2017 the breached-customer PII is still on the darknet.
She gave several more examples of breaches before going into the fundamentals of the darknet: Tor, JonDonym, I2P, et al. Tor is a volunteer network that bounces traffic through (typically) 3 relays so no one knows the true origin or destination, maintaining anonymity and encryption (though it's in cleartext on the final hop). Legal uses of Tor include privacy. Empirically, however, most of its traffic is criminal in nature. So the darknet is effectively an overlay network on top of the Internet, using an alternative addressing scheme to DNS with untraceable IP addresses. It was originally funded by the US Navy, went public in 2004, and now has 2M sessions over 6300 nodes.
So what can we do about this? Segregate the corporate and POS networks, and keep credit card numbers (and other PII) off other spaces. Restrict access (which is hard with self-checkout), encrypt the applications and communications, and don't keep the stored data.
The second talk was "Do the Right Thing: Building Software in an Age of Responsibility" by Jeffrey Snover (incidentally the inventor of the now open-sourced PowerShell). He started by asking, "Why do we need to become better engineers?" Software affects everything, either the product itself (as in Amazon versus bookstores) or enhancing value (such as in a car). We're engineers. Where are we going to be in 40 years? We already have planetary-scale cloud infrastructure, 2.5 billion cellphone users, and there's an expectation of 6 connected devices per user in 2020 with smart homes and smart power grids, so the trends are that software is driving more and more systems.
If software eats the world, where does it come from? Engineers! So we need to ask ourselves, if we're building the fabric of the future, what kind of world do we want to build? You bear the moral responsibility for your actions and your code. (We — especially as software engineers — can always say No, or leave for another job. We should also consider whether a company's business model and moral standing align with our own before accepting an offer.)
What are the issues? We're effectively in a fourth industrial revolution with big data (after steam, electricity, and electronics). Technology disrupts everything; issues include job displacement, community safety (as opposed to surveillance), income inequality, and unequal access. AI can increase a country's GDP, but also replace either full jobs or major job tasks (depending on the job) via automation. The tension between technologists and society led to rules for design, training, safety, and so on.
How do we move forward? We need to step back and play the long game: Consider what's for the sake of the technology, of the business, and of society.
(As a sidebar, it is not a legal requirement for a company to maximize shareholder value. That idea dates from the 1970s. Maximizing shareholder value is just one of the important things, and shareholders are only one stakeholder. Companies should maximize for all constituencies — employees and customers should be considered as well. Maximizing the benefits for all constituencies is difficult but necessary.)
Are we building for all of society? Note that disabilities can be permanent or temporary. Some of his examples included:
- Touch — One arm, broken arm, or new parent carrying a baby
- Sight — Blind, cataracts, or a driver who shouldn't be distracted
- Hearing — Deaf, ear protection, or someone like a bartender in a noisy space
- Speech — Mute, laryngitis, or heavy accent
When in doubt, focus on solutions that amplify human dignity. At Digital the one rule was "Do the right thing." They trusted that each individual moral actor would do the right thing... and had both the permission and the responsibility to do that.
When building AI, is it inclusive? What about confirmation, dataset, association, automation, and interaction biases? In fact, "Artificial Intuition" might be a better term; it's not true intelligence. Bad data in means bad results out. AI has no moral compass or empathy and cannot be a moral actor. (Not all people have human empathy: psychopaths.) AI is really "psychopathic intelligence" because it has no empathy. We therefore need to consider boundaries around what we will let AI do.
Bottom line: AI will cause harm. That said, it'll be less harm than if we avoid it. We can get models that work (such as Underwriters Laboratories (UL) for electrical devices). This is functionally trust engineering: They've engineered trust into the system; you'll buy electronics in the US without worrying about electrocution. We may need something like that for AI.
We all have a role:
- Acknowledge the issue. We have an evolving dilemma and are responsible for our and our products' moral acts. We need to shift focus from shareholder to stakeholder.
- Lean in and educate yourself. Participate in the industry, social, and political discussions.
- Engineer to improve yourself, your team, and your product. Build systems to find and address problems.
DevOps is really the answer: Hypotheses, testing (true: go on; false: stop; or didn't matter: meh), iterating small batches quickly. That's process engineering by definition.
In the second block of talks we had another cancellation. What was to have been "The Divine and Felonious Nature of Cyber Security (A DevSecOps Tale)" became instead a DevSecOps panel of four people with backgrounds in one or more of development, general operations, security, and web operations. After a quick round of introductions they defined DevSecOps as integrating secure processes into DevOps. What does that mean to the panel? For one panelist it's integrated; they're all part of the same world. Another says that security should be embedded with the daily DevOps folks and get involved early in the project or process. For another it's an approach: What do I need, what fundamentals do I need to consider, to make this Thing more secure? The goal is to enable engineering to do what they need and want to do while considering security (and automation and upgrades) from the ground up. DevOps breaks the "Ops always says no to Dev" and "Sec always says no to Ops" models. There was consensus that using the "Yes, and..." approach from improv helps.
It was noted that there is a difference between those enforcing compliance per some standard or spec and those implementing it. That's one of the challenges between Sec and [Dev]Ops. It's shifting for the better as time goes by; you can't just be a security person with a CISSP and no longer do some level of the implementation. The security person needs to be able to build a Thing and make sure it's compliant, not just throw something over the fence.
It was also pointed out that "security comes in to do it once" and "security as compliance" are different. The latter is an iterative process... and iteration is very devopsy. Bringing empathy into the security world, like it is in DevOps, is a good thing. It takes a learning organization to do iteration to grow the business. Automation helps move away from the "secure an individual Thing" model.
At the end of the day Security is there to make sure that Bad Things don't happen. So given all that, how do we in practice improve things? "Good ops is good security" and "Dev has been trained to have other priorities than Security." Any implementation needs to be constant and tested.
Having Sec sit near Dev[Ops] helps... and having them talk to each other helps even more. What's a problem? Authenticating users. Build a library to do it in one common and secure way to let the developers focus on the "interesting" stuff. Enable transparency; let Security provide Stuff to Dev to make things better.
It's not malice; it's ignorance. Different specialities have different priorities and knowledge. Automation is better, communication is better, and education is better. Does Dev know about this Sec issue? Does Dev think about possible vulnerabilities or exploits at development time?
Now that we have standards... how do we test all of this together? Regardless of the CI/CD tool, how do you test it? Is chaos engineering the right answer? Try to break everything; it's a common way to expose faults in systems. There are a lot of unknown unknowns. Snapshot your 2-year-old AWS instance, then kill it and look at what breaks. Some people won't believe there's a problem until it truly doesn't work and impacts business.
How people interact with each other has changed. "We can work better together as humans," regardless of job function or tools. Where would this not work, environment-wise? For it to work in government there's a push for high-impact secure teams, including PMs and security people, to go in and restructure. (CDC and FBI are pushing this: Get a 10-to-12-person team to fix things in 6 months before moving to the next project.)
Strict separation of duties can impact this. You need to figure out the workflow for what they're trying to accomplish, and involve all parties to meet the spirit of the law. You want the benefit of multiple people without slowing things down unnecessarily.
You also really need management to buy in to giving [dev sec and ops] the cycles to implement it properly. Don't focus on the org structure necessarily; start with the low-hanging fruit to drive discussions and moving forwards.
The second talk in this block was Robert Allen's "How Houghton Mifflin Harcourt Went from Months to Minutes with Infrastructure Delivery." Houghton Mifflin Harcourt is more than publishing; his area looks at improving the K-12 learning experience for students and teachers. The problem is that it took 12–16 months to get infrastructure. They missed a business window; their competition was leaving them behind.
Their top five goals were:
- Focus on what's core to the business.
- Developers own it from concept to legacy/maintenance.
- Continuous delivery, many small change-sets. (Used to be 6 weeks at a minimum and 6 months was typical.)
- Stop trying to prevent failure.
- Decompose service responsibilities.
That led to the creation of Bedrock Technical Services, the foundation under everything they do. They spend the time making Bedrock safe for engineers, not making engineers safe for Bedrock. They isolate the developers' chaos from other influences (and have been generally successful). They use Mesos (and a little bit of containers) to do it. That gives them the flexibility for a large pool of resources for teams to function in with reasonable isolation. There have been some cross-over impacts (and they're working to reduce risk there).
It's really a philosophical change, not a technical one alone. By decomposing responsibilities and allowing mistakes and failures they're now flexible enough to do things more quickly. A lot is based on behavioral design: deliberately creating and delivering tools that make the RIGHT thing the EASY thing. When you can't make the right thing easy you must instead make the wrong thing the increasingly difficult course of action to take.
They set expectations... for applications, people, vendors, infrastructure, pets, and selves. They then inspect to make sure those expectations are being met.
Remember that all of these problems are truly people problems and you want to improve everyone's life experiences. They have on-call but are picky about what they're on-call for. Work-life balance is important.
After the lunch break — grilled sandwiches, both with and without meat; green, fruit, and potato salads; and a selection of desserts — I went to the training session "A Hacker's View of Your Network: Analyzing Your Network with Nmap." I didn't take copious notes, but it was good exposure to a tool that I knew existed but haven't ever had to do much with.
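(Since I didn't take notes, here's a rough idea of what a basic scan looks like, using the third-party python-nmap wrapper rather than the nmap CLI the training used; it's my own example and assumes the nmap binary is installed and that you're authorized to scan the target.)

    import nmap  # the python-nmap package, which drives the nmap binary

    nm = nmap.PortScanner()
    # Version-detect the common low ports on a host you're allowed to scan.
    nm.scan("scanme.nmap.org", ports="22-443", arguments="-sV")

    for host in nm.all_hosts():
        print(host, nm[host].state())
        for proto in nm[host].all_protocols():
            for port in sorted(nm[host][proto].keys()):
                info = nm[host][proto][port]
                print("  %s/%s: %s %s" % (proto, port, info["state"], info.get("name", "")))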
In the final block of the day they had three 30-minute talks. The first was Jonah Berquist and Gillian Gunson on "MySQL Infrastructure Testing Automation at GitHub." They test restores via Kubernetes now. What if a restore fails? Once, who cares; frequently, investigate in depth. By doing this they know how long a restore takes and can get a window into recovery times.
They also test online schema migrations... but they can't use trigger-based migrations because the tables were too busy and the lock requirements made Bad Things happen. Now they use gh-ost, which uses binlog-based migrations (and binlogs are on both the master and replicas, so there's less load on the master), and because they write to the ghost table there's no lock contention with the original table.
In summary, trust your infrastructure/backup/automation if you can test it. Automate the testing. Build tools that can be tested in production by robots.
The second talk in the block was Tiffany Longworth's "Change Management for Humans." Her goal was to build an argument to compile in people's brains and run on their meatware. The model used here is ADKAR:
- Awareness — Interview all the stakeholders — and ask, don't talk. It's problem identification, not requirements specification. Follow up afterwards with what we will and what we won't do. It helps build evangelists, not enemies. Give credit to their ideas (they get to feel more ownership). Remember to look out for new stakeholders, too.
- Desire — Identify the problems (tigers). Make your argument visceral. What's recent pain, future threats, changes in compliance (such as GDPR), competition, misalignment with self-identity, and so on? Also identify the good things (puppies). What do they care about, what makes their life easier, and so on? "Better data quality" might not be the goal but "going home early" is. (It's audience-specific: Remember to get the "what's important" out of those interviews.)
- Knowledge — How to do the change — needs to be in the middle. There are different learning styles (reading, logical, visual, auditory, verbal, hands-on, solitary, group).
- Ability — (1) Will management let you do it? (2) Do the people believe they can do it? Show a team or individual who's already successful doing this. The simpler you can make this the better. Subvert impostor syndrome; let them know it's not their fault when there's a problem.
- Reinforcement — Now that it's launched, you need to show improvement. Iteration announcements, metrics, follow-up checks, on-boarding, EOL support for what this replaced.
The third and final talk in this block was Tom Limoncelli's "Operations Reform: Tom Sawyer-ing Your Way to Operational Excellence." He told three stories:
- Big initiative — "If you ignore them they go away." Ops can say No. They're underfunded, overloaded, and have too much history and complexity. Look for alternatives to top-down edicts.
- Google's operational hygiene assessment — In 2009 Google realized that their ops hygiene was uneven: Backups, monitoring, and so on are understood to be needed. You can debate frequency but they need to happen. Top-down wouldn't work so they did an assessment where a team filled out a spreadsheet monthly with how well (green/4) or not (red/1) they were doing that month. That got simple and easy wins, created good culture, encouraged improvement, and helped people understand cross-team impact.
Note this includes non-monetary recognition of good work. Reward improvement, not state. Encourage your high-performers to leave high-assessed teams and join the low ones, because they have the skills and expertise; but they won't move if it means a pay cut (or lower bonus).
- Stack Overflow's operational hygiene assessment — He applied the Google technique from the previous story at his new job as a new manager to identify problem areas. But Stack Overflow is smaller (they had 1 SRE team, not hundreds), so he scaled it down. They used just one spreadsheet, with the SREs using their own rubrics, and used pass/fail instead of 1..5 — then they added A* for excellence and F* for disasters. They started slow with 4 of 9 desired categories and added the other 5 later. It worked because it was simple, blameless, let the users motivate themselves, and was visible and understandable to management.
They identified a lot (as in hundreds) of tech-debt projects and fought the temptation to ignore feature projects. Use projects that you believe will improve things and see if the color changes. Measure, measure, measure. Solutions included:
- Rate limit — 20% of project hours on tech debt.
- Theme month — January is backup month, February is monitoring month.
- Theory of constraints — identify where the constraints or bottlenecks are and automate or fix or otherwise improve at that bottleneck.
In summary, do it yourself! Go home and apply this in your own organizations, and tell him how it went. And maybe even put together a talk for LISA19. But do it grass-roots, one team at a time; evolve the rubric over time; provide resources and use a spreadsheet, not custom software; and be blameless and transparent.
Based on an audience question, his advice for those in toxic environments was:
- Start small but keep transparent within the team.
- Stay blameless (he suggested Dave Zwieback's Beyond Blame: Learning from Failure and Success).
- Send out your resume.
After the sessions ended I headed up to my room to drop off the laptop bag (and the swag I snagged from the expo floor for the cow-orkers) before heading down to the reception. Due to some poor planning on the hotel's part, the only level path from our area of the hotel to the Broadway ballroom where our reception was held was blocked by another group (a formal reception with a plated sit-down dinner for Gilda's Club) who didn't want several hundred geeks in schlubwear wandering through, so our instructions were to go down to the first floor lobby, walk down the hallway, and then go up the spiral staircase. Great for the able-bodied; not so great for those with mobility concerns (and no elevator on that end of the floor). As an able-bodied formerly-less-mobile person I raised a stink. Usenix staff quickly got the hotel to realize there was an issue and their fix was to have a badged hotel employee escort us through the GC space to our own reception without bothering with elevators, escalators, or stairs. (By the time we were leaving the other group was inside their own ballroom and not forced to interact with us plebes.)
The reception was a reception. The schtick was the print-your-own conference t-shirt area: You told them the size shirt you wanted, they handed you a blank gray t-shirt, you went to a silkscreening station, one of their staffers fitted the shirt onto the form, they added the white ink and instructed you how to press the squeegee down and run it across the screen, you did it (possibly with their help), they carefully folded it ink-side up, ran it through a tiny 360-degree oven to set the ink, rolled it up, and you walked away with a conference t-shirt. Cute. The lines were longer for the shirt than the food (not that they ran out of either).
The reception food was okay: A decent salad bar with reasonable lettuce and vegetable options; small-portion fried chicken-and-waffles skewered with either pickles, okra, or both, with bourbon maple syrup and hot sauce available (good, though the waffles tended to get cold quickly — duh — so I wound up asking for a few pieces of Just Chicken); and salted caramel chocolate tartlets (good, though the caramel was a bit runnier than I was expecting).
(A few of us left early to try to get in for a quick bite at the attached steakhouse; they were booked and couldn't get us in [and out again] in the time we needed since someone had to run a BOF at 8pm so we went back to the reception to eat dinner.)
After the reception ended and most folks went on to BOFs, I wound up back in my room briefly to change into a swimsuit before heading down around 8:30pm to the outdoor heated pool and the hot tub. Around 9pm or so Tom joined me there and we caught up on Life and Stuff and Things before bailing a bit before 10pm.
I got back to my room, showered off the chlorine, rinsed out the suit, and pretty much went to sleep.
I slept in without an additional bio-break until shortly after 5am. When taking care of biological necessities I managed to stub my big toe on the bathroom door enough to keep me awake, so I caught up on email and skimmed my social media before another brief shower. Trimmed the fingernails (since if I didn't they were just long enough that the odds were one would chip or tear before I got home, and with my luck that'd happen inside the airport security perimeter where I wouldn't have access to my clippers), got dressed, and headed down to the conference space for yet another continental breakfast; this time they had bagels so I had an everything bagel with schmear and a banana.
Day three (and last) of the technical sessions kicked off with a three-talk block. First up was Paul Carleton's "How Our Security Requirements Turned Us into Accidental Chaos Engineers." Their problem was that old (in terms of time since launch) cloud instances would have more problems over time. Breaking changes cause their replacements to have problems. Replacing them more frequently can reduce that risk. Security can also be an issue; compromised old hosts are more likely to wreak havoc on your environment, and post-patch instances are good but how can you be sure the older instances truly got patched?
Their solution was to create Lifespan Management (LM) to enforce a maximum lifespan and replace things when they reach that. There's an autoscaling group to add new nodes and make sure they're healthy, a terminator service to clean up old ones gracefully, and an enforcer that makes sure old nodes that were told to terminate themselves actually do. They went with terminate-first instead of build-new-first to enforce resilience. They came up with a rollout plan, labeling things as to their function: Stateless (safe to replicate), stateful automated (can be done), and requires operator (not safe for automation). They labeled things and rolled it out. What could possibly go wrong?
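(As a rough illustration of the "enforce a maximum lifespan" piece: this is not their actual service, just my own boto3 sketch with invented tag names; the real thing layers health checks, graceful shutdown, and rate limiting on top of something like this.)

    import datetime

    import boto3

    MAX_AGE = datetime.timedelta(days=30)

    def expired_instances(ec2):
        """Yield IDs of running, opted-in instances older than MAX_AGE."""
        now = datetime.datetime.now(datetime.timezone.utc)
        resp = ec2.describe_instances(
            Filters=[{"Name": "tag:lifespan-policy", "Values": ["stateless"]},
                     {"Name": "instance-state-name", "Values": ["running"]}])
        for reservation in resp["Reservations"]:
            for inst in reservation["Instances"]:
                if now - inst["LaunchTime"] > MAX_AGE:
                    yield inst["InstanceId"]

    if __name__ == "__main__":
        ec2 = boto3.client("ec2")
        doomed = list(expired_instances(ec2))
        print("Would terminate:", doomed)
        # A real terminator drains and health-checks first, then:
        # ec2.terminate_instances(InstanceIds=doomed)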
Well, they learned what could possibly go wrong. Specifically they learned 5 chaotic things:
- How not to health check. The manager terminated all the LDAP servers and locked out the QA environment. The nodes were actually up, just not running LDAP any more. The health check was "Are you healthy?" and initially got the response "I'm running LDAP and not in maintenance mode." After termination it got the response "I'm not in maintenance mode." So they learned that "zero unhealthy" is not the same as "healthy." Setting explicit expectations — in this case, "what's your LDAP health?" — is better (see the sketch after this list).
- RIP K8s workers. They were going down hard without the "wait for graceful shutdown" part. Why? There were two different terminate API calls. The one for laptops did the right thing, but the one being used in k8s effectively yanked the power plug. They had to track feature usage to make sure the shutdown completes fully.
- Blackhole scenario. The LM terminator can send a heartbeat ("wait") or a thumbs up ("okay to kill me"), but didn't allow for a "cancel" operation. Nonzero exits, timeouts, and rate limits (losing the heartbeat) all caused issues. Going to a 2-touch terminator — do it manually, then tell AWS to terminate — worked. Align the incentives so people will adopt it.
- Self-service meltdown. When Spectre/Meltdown happened they told people "If you don't want to deal with this in the future, consider Lifespan Management." They had issues with turning it on (here's a shelf-foot of docs), not knowing who has to do what parts (and asking got different answers), and false alarms (blaming LM for anything because it was the new thing).
Moving from a wall-o'-text to a flowchart covering 80% of cases took care of the first issue, with a "consult with us if you think that you're an edge case." They built a nice who-does-what chart with a timeline (which met most concerns) and the other people now knew what they had to do. For the third issue they put together a dashboard, aggregated from multiple sources, showing why/what clobbered something.
- Death by a thousand JIRA tickets. The LM would turn its warnings into tickets... without regard to the number of warnings. The support team didn't know how to close, route, or handle them. They now file tickets to themselves before automating to the production queues. The 1% case (of failures in the building) matters more with ten times as many terminations, and they needed to measure both the quantity and reliability of tickets (to not overburden the other teams).
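(Tying back to the first lesson above, here's a toy sketch, with my own invented names, of asking for service-specific health instead of generic liveness.)

    import socket

    def ldap_health(host="localhost", port=389, timeout=2.0):
        """Report healthy only if the LDAP port actually answers."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return {"service": "ldap", "status": "healthy"}
        except OSError:
            # "Not in maintenance mode" is NOT the same as "LDAP is up."
            return {"service": "ldap", "status": "unhealthy"}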
The big takeaway: What automation problems can you cause with a little chaos? Do you know how old your instances are?
The second session in the block was Pat Cable's "Securing a Security Company." If you learn nothing else: "secure" means nothing without context. In the security world this is threat modeling. Securing a safe is different than securing a donut shop. Constraints are real, so there's no such thing as 100% secure in reality, especially for small companies.
The conflict between "Gotta go fast" and "Secure all the things" led to their security program's evolution, including access controls, cryptography, compliance, and chaos events. To get started with doing this kind of stuff in our own environments, start small. Identify your highest risks and chip away at them. You won't reach your destination overnight. Bring your teams together: People, then process, then tools. Build tooling for automation, especially with UI improvements (security-made-easy is a quick win). Figure out your own context: Be aware of how it changes and how events may change it.
The third session was Sabice Arkenvirr's "We Already Have Nice Things, Use Them!" about the pitfalls of rolling your own tools (speaking as a consultant who has seen the insides of lots of companies). Examples are monitoring, config management, and deployment. Why build your own? Reasons ranged from "Nothing was available when we wrote this" to "We don't like reading manuals" and everything in between. Only the first is acceptable... and only until the tool exists Out There.
There are operational risks to rolling your own: What if the developer leaves? What're the docs like? What's the lost momentum for a new hire? What's the company's core reason for being (usually not "writing our own tools")? Using industry standard tools helps address all of these.
All that said, before rolling your own, consider:
- Polling the greater ops community for solutions
- Comparing the costs of proprietary tools to the estimated engineering time to write and maintain an in-house solution
- Identifying open source solutions, even those without desired features, and contributing to them
- Forking any open source tools that're well-written but unmaintained
If you do have to roll your own, apply the same testing and documentation standards to it as you do to your own operations tools. Keep it small and simple. Allow enough time in sprints for feature requests [and testing and documenting].
Acknowledge that in-house is a short-term solution up front. Be explicit that maintaining the in-house tool is not the desired goal. Allow time to research open source alternatives regularly (quarterly? annually? depends on your environment). Know that testing and implementing what you found in that research takes time, and estimating timeframes can be difficult (and inaccurate).
Now that you have your tool you can open source it. But you should have a community manager. (This isn't feasible for smaller places.)
After the morning break, for the second block of the day I went to Branson Matheson's two-topic training session, "How to Write Effective Training" and "How to Give Effective Training." His stated objective was to get more trainers for LISA conferences. For those who haven't done this, what are better tools for training? What troubles have people had when training that we can address?
Know who you're teaching to (and their learning styles). As a process:
- Start with an idea:
- Wordsmith and brainstorm off of it (wordcloud/mindmap/...) — refine later, write now!
- Develop it to an objective.
- Highlight key ideas and cross out the ones that aren't as applicable.
- Go back to it all a day or two later.
- Develop the core idea.
- Scope it, incl. time and audience constraints.
- How well do you know the material? Can you teach it successfully?
- How hard will it be to develop methods to teach the concept?
- How long will the class be (90m, 8h, 2d):
- How many students will you be able to support?
- Will there be co-instructors to help?
- Is there a known venue? Will it fit?
- What exercise formats (individual/team, cooperative versus competitive, open discussion versus partner review)?
- Build the outline.
- Create the presentation:
- Plan exercises around breaks and lunch if needed.
- Leave time for discussion (not just questions).
- Exercises — keep 'em short, after questions, make them completable with what you taught (including prerequisites), be self-contained, and be relevant to the students' workplace.
- TEST TEST TEST!
- Make it awesome. A slide should be a story and short; slides are data reinforcement, while the instructor's teaching provides the material.
- Write AND SELL the proposal. For students: What do they need to know, have, and take back/away. Tell why you should be teaching it as well. End with a good tag. Do your biography as well (but last; include who not just what).
That segued into the second part, which can be summed up with:
- Think — Giving a class you're the performer, on stage with an audience. Class should be engaging and entertaining... and keep the students' attention even when it's hard (tired, after lunch, brain full, ...). Leave a lasting impression. Teach the topic and encourage the student to explore afterwards.
- Prepare — Audience, location, standards, expectations, motivations, language (from you to them), cultural references, timing.
- Execute — Stage, engage, reward, resolve.
After lunch — with no expo floor today, we did a buffet line of barbecue (baked chicken and smoked brisket, with both black-eyed pea and green salads, cornbread, and mac-n-cheese) then got to sit down in the other ballroom — I went to the third block of three talks. First up was Ivan Ivanov with "Mastering Near-Real-Time Telemetry and Big Data: Invaluable Superpowers for Ordinary SREs." I didn't take any notes because his slides, which will be available on the conference website, were mostly links out to other material.
Next up was Nicolai Plum with "Datastore Axes: Choosing the Scalability Direction You Need." His theme was things to consider when providing a service for developers to build and support their own applications. Technologists should know what their business is doing and planning so they can plan and prepare better. Graphing data growth, schema growth, read queries, and write queries on a single graph lets them analyze the ways your needs are likely to grow and use CMM-type data modeling to figure out what's the right way to move forwards to futureproof Life and Stuff and Things.
Last up in this session was Andrea Barberio and David Hendricks with "Make Your System Firmware Faster, More Flexible and Reliable with LinuxBoot." Basically, firmware needs to do so much more that it effectively needs to be a full and secure OS stack... so why not use Linux? See "LinuxBoot." They start with Coreboot for silicon initialization, then LinuxBoot to deal with storage and networking. It can be extended beyond firmware, such as to be a bootloader or an OS installer.
The final block of the conference began with closing remarks from the chairs. As they talked the projectors displayed a single slide from each speaker summarizing their talk or training. They thanked us for attending and staying until the end, thanked the staff again, and asked people to stand if they were at their first LISA, or if they attended more than five or more than ten (but not "20 or more" or "25 or more," both of which I could've stood for).
The rest of the block was two 30-ish-minute keynotes. First up was Tom McLaughlin with "Serverless Ops: What to Do/This Is What We Do, When the Server Goes Away."
DevOps is more than pulling down the silo walls; it's forming cross-functional value-delivery teams aligned around a business feature. Historically Ops was generally left out of this kind of reorg. If a service is serverless by default (or considering it), then Ops isn't needed and all we can do is play tollbooth operator... or we can get out of the monolithic Ops organization and join the value-delivery teams. This is really what we should be aiming for: How do we deliver more efficiently and more quickly to our customers, users, or patrons?
The focus needs to be less on the server and more on the service, and on the user being happy or satisfied. To accomplish this we need to code (to understand the code triggering the failures that we need to remediate).
Doing DifferentOps takes up some time; what about the rest of our time? Game day drills, chaos engineering, and other disaster drills? It's new-to-you work that you haven't had the time to experience yet.
Serverless is more FinDev than DevOps because cost is becoming the first-class metric. Every component and change has a cost (positive or negative), and serverless billing lets it be more easily tracked. We also care about revenue, not just cost. This means understanding how the organization works and sustains itself, and what you're trying to accomplish. Focus on the organization's unique properties and problems, not explicitly on ops and administration (and be okay with what you do and why, referencing Jeffrey Snover's talk).
The mindset shift from "How do I..." to "Why do I..." is the necessary bit. Many of us got into SA to solve problems not to just administer servers.
Last up was Nora Jones' "Chaos Engineering: A Step Towards Resilience" keynote. Nora's view on chaos engineering (CE) is different. She's done CE and has seen the benefits of doing it.
What is it? Building confidence in understanding how systems work and fail, exploring the unknown unknowns.
Why do it? Not to find vulnerabilities or reduce outages (per se), since prioritization matters. We want to measure RISK not ERRORS or OUTAGES. You want to minimize the risks. The goal is to be more prepared for turbulent conditions when they inevitably occur. A goal is to be resilient: Adjust the system to maintain normal functionality — and not just the server as system but the people and process and culture as well.
Doing CE is not a silver bullet but it's part of the solution suite for reducing outages and risk and so on.
How can I increase confidence in resilience?
- Unit testing — Validate the individual component (given input A I expect output B).
- Integration testing — Validate the components work together.
- Chaos testing — Validate assumptions in production by injecting latency or failure between service calls.
When you find a vuln, find out what happened:
- Culturally — competing team priorities?
- Technically — how difficult was the change to implement and were shortcuts taken?
- Attitude-wise — What external factors might've changed things?
CE isn't to CAUSE problems but to REVEAL existing ones. And when you find something you need to actually fix the underlying problem.
At Netflix their focuses are:
- Safety
- Observability
- Automation
They use a tool called ChAP (Chaos Automation Platform). Consider customer impact when experimenting. Setup involves:
- Injection points (and they have many)
- Automated Canary Analysis Configurations (ACAs, and they provide canned ones to aid in setup)
- User-added stuff (requiring understanding of ChAP)
The goal is to chaos all the things, all the time (during business hours, anyhow).
Vulnerability remediation is all well and good, but the culture shift — from what happens IF to WHEN this fails — is excellent. They have higher confidence in dealing with disasters as they happen and they have greater trust in injecting purposeful failure techniques. The change in mental models helps as well.
Since we'd not managed the steakhouse Tuesday night I'd made reservations for Wednesday. Branson, Dan, and David joined me at Bob's Steak and Chop House. I wound up starting with what should've been a 3-oz. pour (but was probably closer to 4) of Blanton's bourbon. I had a wedge salad (mm, blue cheese), the 22-ounce bone-in ribeye (medium rare), and the associated sides. It came with a large slice of maple-caramel-bourbon glazed carrot and a choice of potato; I went with the loaded baked potato which was excellent. We added an order of grilled asparagus (and one of sautéed mushrooms I didn't touch). I managed to eat the whole thing, though we did take the remaining asparagus to go.
When the chairs got back from their dinner out Branson and I dragged the remaining bourbon and rye from the scotch BOF (and the asparagus from dinner) to the Dead Dog party in Brendan's room. That ran from about 9pm onwards though I tapped out around 11pm when I was starting to get sleepy. (Heck, a large and good meal with a heavy pour of bourbon followed by drinking more rye at the party at the end of a 3-day conference? I'm glad I wasn't sloshed.)
I got back to my room and did all the pre-packing I could in advance of tomorrow's travel before going to bed.
Travel day! The body woke up around 6am so I showered, finished packing, checked out, grabbed the hotel buffet breakfast (with Jen until she had to bail for her own flight), and caught a taxi off to the airport. There was virtually no line at the taxi stand, or at bag drop (I was behind one of the MSI staff checking his stuff), the security line moved quickly, and I was at my gate a mere 20 minutes after getting to the airport. (That'd never happen at Detroit Metro.)
The flight was fine. Thanks to storms stretching across our entire flight path we had moderate turbulence on takeoff and ascent, and mild turbulence at altitude (33,000') and during descent. As usual, we wound up taxiing for an absurdly long time. Despite the rainstorms there was no lightning when we arrived so the ramp was open and we could get to our gate. We actually arrived about 15 minutes early!
I got off the plane and over to baggage claim, where my bag was twelfth off the conveyor. Swapped out the sweatshirt I was wearing for the leather jacket I'd packed, shoved the CPAP into the luggage, and hiked back up to the terminal to go over and down again to the interterminal shuttle (which took a while to arrive and which filled up to capacity before leaving). Once we got to the Big Blue Deck I let everyone else off the shuttle before me, got out and reached the elevators just as one opened, so I took that up to the terminal level to cross the bridge to the parking deck (on 4), then went down one level (to 3) to get to my car and drive home.
I got home around 2pm, about an hour and a half after landing — literally 1h35 from wheels-down in the airplane to parking in my garage at home. Cranked the heat up from 60F, unpacked my bags, did a load of laundry, and started writing up the rest of this report... over the next week and change.