Conference Report: 2018 LISA

The following document is my general trip report for the 32nd Systems Administration Conference (LISA 2018) in Nashville, TN, from October 29–31, 2018. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.


General Commentary

This year was the first in a new conference program layout: Three days of technical sessions with no more than four tracks (2 invited talks (in either 2x45 or 3x30) and 2 trainings (in 1x90)), plus Labs (for hands-on post-training followups and generic testing). I had said I was going into this with an open mind and not a knee-jerk "this sucks because it's different" reaction.

Some of the reasons for the change are that we're smaller than we used to be, we don't do full- and half-day training any more (limiting training to a 90 minute block) and our room requirements have changed accordingly. We've eliminated refereed papers (which ended in 2015, when we accepted 4 of 13 submissions) and workshops (which ended in 2016), and now with fewer tutorials our room needs are 2 ballrooms for the talks, 2 smaller (50- to 75-person) rooms for the trainings, and a third similar space for the NOC/Labs/Build, plus a concourse for registration, a ballroom or expo floor area for the vendors, a hidden staff office, and possibly storage space. We're smaller in part because of the existence of other focused regional conferences (like SREcon) and the perceived reduced relevance in our content.

In general the worst I can say about the change is that it was surreal. For the past quarter century — literally; this was my 25th consecutive LISA conference — my brain's been programmed that the technical sessions were Wednesday through Friday and the reception was Thursday night. This year it was Monday through Wednesday with the reception on the Tuesday, so "what day is it" was a more complex mental lookup than usual. If we continue down this path I'm sure I'll get used to it.

That all being said, I have some issues that aren't unique to the six-to-three change. I've observed that whether the talks are 3x30 or 2x45, many times the speaker needs more time than they've been allocated. We had many sessions this year where there simply wasn't time for Q&A, plus a noticeable delay while speakers swapped out their laptops. I suspect it won't go over well to hook up all three speakers' laptops in advance (what would they use when they're not speaking?), but perhaps the first speaker's Q&A time could be used to disconnect that laptop and connect the next speaker's. It's irritating and inefficient.


Sunday, October 28

Travel day! I was up early — the alarm was set for 6am and I was up by 4:30am anyhow. I showered, finished packing, made the bed, unloaded the dishwasher, set the thermostat lower, and hit the road shortly after 5am. There was virtually no traffic on the road to Metro airport, I found a parking space in my usual zone (3E4/3E5 of the Big Blue Deck) quickly, got to the terminal shuttle as one was pulling up, and got to the terminal with only one red traffic light. There was no line to drop off my checked bag, only one person ahead of me in the TSA line, and no huge backup at the scanners, so I was in the terminal and at my gate in plenty of time to catch up on most of my backlog of magazines before boarding began for my flight.

We were actually all on-board, seated, and had an early departure. The flight itself was mostly uneventful. Our lead flight attendant was a bit of a comedian (using expressions like "I don't usually say this but the drinks are on me" to the first-class cabin, and "flight attendants will be coming through the cabin to pick up items you want to dispose of, but we cannot accept small children or in-laws," and "be careful opening the overhead bins but remember that shift happens"), but he kept my kalimotxos refilled. Other than a bit of turbulence on takeoff it was a smooth ride.

We arrived early in Nashville and my bag was somehow second from my plane onto the conveyor; there was no line for a taxi, and when we got to the hotel there was no line to check in. They even had my room ready so I could get upstairs and unpacked before heading back down to wander the conference space as they were setting up. Said Hi to the folks in Labs, snagged a donut (mm, breakfast — the banana and small bag of snack mix on the plane with the kalimotxos didn't count), then headed out to find lunch. I wound up having a really tasty carnitas burrito at Bajo Sexto Taco.

After lunch I took a much-needed power nap before heading back down to the conference space. I wound up heading across the street to Martin's BBQ for an early dinner with seven other people (only two of whom I knew beforehand). I had a pulled pork tray with baked beans and mac-n-cheese and everything was really tasty. We got back to the hotel shortly after the registration booths opened so I got my badge printed and headed to the Welcome Reception where I schmoozed, nibbled cheese (mostly the blue), and basically networked until about 9pm when I decided to head to bed and try to sleep. Of course, I didn't succeed in sleeping until nearly 11pm, but oh well.


Monday, October 29

I slept in (not counting a bio break or two) until 6am or so. Did my morning ablutions, got dressed, and headed down to the conference space for the continental breakfast (I had a croissant with strawberry preserves and a banana) before the opening remarks which were scheduled to begin at 8:45. Met some new people, caught up with some old friends (including Amy, Brian, Cory, Doug, Jennifer, and Pat), and wandered into the keynote space in time to get a reasonably-close table with a power strip.

Opening remarks: Thanks to all the folks who helped put this together (from program to office to sponsors). We had a record-high 329 total submissions, which got 1106 reviews and 913 feedback comments, and we accepted 81 talks and tutorials. We also had a total of 7 keynotes on the program (though we had to scratch one because the presenter was ill). We were reminded to go to LISA Labs for hands-on learning about new technologies, to go to the Birds of a Feather (BOF) sessions in the evenings, and to go to LISA 2019 in Portland Oregon next year (October 28–30).

Our first keynote speaker was Jon Masters who talked about how the underlying hardware's attempts to be faster led to the Spectre and Meltdown attacks earlier this year. Part of the problem is that hardware and software people go out of their way to not talk to each other. We've spent decades making machines run faster but never asked what we gave up in return. He spent too much time (in my and others' opinions) refreshing the audience's understanding of the underlying computer architecture, microarchitecture, firmware, branch prediction, virtual memory, and caches. Once that was done he started talking about side-channel attacks and gave a quick overview of how Meltdown and Spectre worked. His recommendation for what's next is to change how we design hardware, change how we design software, and actually have the hardware and software people talk to each other.

Our second keynote was Tameika Reid on "The Past, Present, and Future of SysAdmins." This was mostly a skill-mapping talk and didn't seem very keynotey to me (and to some others I talked to). Her thesis was that people keep saying "systems administration is dying" but they really refer more to the title "sysadmin" since the skills we have — problem solving and analytical skills; virtualization; cloud; AI, ML, blockchain, and big data; communication; scripting and programming languages; repos; networking, DNS, DHCP, and SDN; automation (such as Ansible); performance tuning (such as with perf); testing; and security (hardware, software, physical, and social) — will still be relevant. Some new titles we may see include Blockchain Engineer, Chaos (or Intuition) Engineer, Infrastructure/Automation Engineer, and System Architect. We'll also be seeing more with quantum computing, the merging of security into DevSecOps, more IOT-related administration, and automotive-grade Linux such as with autonomous vehicles.

After the morning break I went to one of the invited talk sessions. First up was "Introducing Reliability Toolkit" about how ING started with the goal of improving their services' reliability and realized they didn't monitor well. Without direct alerting they would lose 69 minutes on average before major incident resolution even began. They measured business functions and most of their 300 teams didn't know their application-perspective's uptime perception. They built a toolkit based on Prometheus to create alerts based on a time series from a database; an alert manager to route those alerts to email, SMS, or ChatOps as relevant; Grafana for graphing; and Model Builder to compare the desired metrics to the actual metrics. They provision it for those application teams with default configuration files, templates, dashboards, and some alerts. Their role is to maintain and update binaries (test and release new versions, including both security and penetration testing). They deliver 5 machines (1 test, 2 acceptance, 2 production) so the customer unit can have nearly full responsibility; they also do OS-level patching. They also provide client libraries to scrape metrics from the servers; other teams also work on exporters and they're internally open-sourcing things to reuse others' work.

They learned that having cool technology doesn't mean people start using it. Having the client libraries in the engineering frameworks helps a lot, as does ensuring a good feedback loop with your customers (such as determining what's prohibiting or inhibiting them from using it). They educate during onboarding and workshops. Other teams may not understand a lot of the monitoring/metrics pieces (for example, ELK is not the same as metrics). Having dashboards accessible to all engineers has been helpful.

Their model builder knows that weekends and holidays have different expected traffic. Currently they only do averaging models. It's currently closed-source within their institution but they're looking to open-source it. Within their institution they effectively offer Prometheus-as-a-Service. Model Builder isn't available yet and they aren't sure how to release and announce it.

The second talk in this session block was "Incident Management at Netflix Velocity." Netflix sees hundreds of billions of events — like "click on the UI" or "play a movie," ignoring the actual video data — flow through them every day. They therefore have billions of time-series metrics updated every minute. Their goal is to provide "winning moments of truth:" When someone chooses Netflix for their entertainment. For them, Seconds Matter.

When they moved to the cloud they wanted to make sure that the disappearance of an (AWS) instance didn't cause an application outage. Chaos Monkey helps with that: It's randomly clobbering production instances. But what else should they be testing? They're good at running software in the happy path (for example, A calls B, gets response, yay) and the sad path (A calls a dead B), but what about the gray area (where B responds only partly or corruptly)? They created Latency Monkey to insert latency at the common RPC layer. They measured the customer success metric against it to see how cranking up impact and latency affected it.
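The "insert latency at the common RPC layer" idea can be illustrated with a small decorator. This is a sketch of the concept, not Netflix's implementation; the probability and delay parameters, and the `fetch_profile` example call, are my own invention.

```python
import functools
import random
import time

def inject_latency(probability=0.1, min_delay=0.05, max_delay=2.0):
    """Wrap an RPC call site so some fraction of calls are artificially slow,
    exercising the gray area between a healthy and a dead dependency."""
    def decorator(rpc_call):
        @functools.wraps(rpc_call)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                # Sleep before the real call to simulate a slow dependency.
                time.sleep(random.uniform(min_delay, max_delay))
            return rpc_call(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical service call with latency injection always on (for demo).
@inject_latency(probability=1.0, min_delay=0.0, max_delay=0.01)
def fetch_profile(user_id):
    return {"user": user_id}
```

Putting the hook in the one shared RPC layer, as they did, means every service gets the fault injection without per-team code changes.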

They learned things from their testing: Application behavior (during an application migration that used the same common layer), blast radius (affecting 100% of a service instead of some of it), and consistency (they went faster than they should've). They had to think differently about failure, knowing they'd scale bigger and higher, so future failures would be more complex.

So now they think about:

Given all that, what does their Incident Management look like?

Their tactics are:

Lunch was held on the vendor expo floor. The menu included crudités with pimento cheese; music city salad (romaine, hard boiled egg, and some other stuff I forget); couscous with feta, tomato, red onion (add bacon, red wine/shallot vinaigrette); sliders, both cheeseburger and fried chicken; and cookies and chocolate tarts.

The first block after lunch had three talks. The first was Brandon Bercovich's "Designing for Failure: How to Manage Thousands of Hosts Through Automation." He believes there are three problem spaces (limiting the discussion to stateless services):

CLM's dispatcher is a combination of business rules and rate limiting. Its workers do the work as commanded by the dispatcher. It is bootstrapped with a simple foundational stack, manually placed. (It's their only manually-placed service.)

The second talk in this block was "Familiar Smells I've Detected in Your Systems Engineering Organization... and How to Fix Them" by Dave Mangot. Why "smells?" It's one of the strongest memory triggers we have. A lot of this was common sense — crawl before you walk before you run, keep staging like production, script or automate what you can (including failure resolutions), and so on. (I found the repeated musical riffs both distracting and annoying (and possibly unlicensed). His playing a sad two-tone on the harmonica didn't add value either.)

The third and final talk in this block was Michael Kehoe & Todd Palino on "Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way." Their thesis is that any team can declare a "code yellow" state when they're in a very bad spot (SRE toil, bad smells, etc.). Think of it as a yellow card in soccer. They talked about how to identify team anti-patterns (with blame-free, open, and honest communications), how to work through high toil, and how to create sustainable workloads. They walked through a couple of examples and their key lessons learned were to measure (toil and overhead), prioritize (to remove the toil), and communicate (with partners and teams). Getting into a code yellow needs to have both a problem statement and exit criteria, and management (of both SRE and Development) has to buy in and be involved. A code yellow is basically a breaking point where current processes aren't sustainable. (Managers can sometimes see these coming, but not always.) Measure; look at trends. This is very much a cultural issue for the organization; SRE or operations needs to have the ear of executive management at the same level as product management does. The communication — about both what is and isn't going well — is essential.

After the afternoon break I went to the training session "30 Years of Making Lives Easier: Perl for System Administrators" by Ruth Holloway. She gave an excellent overview of the state of the Perl ecosystem, what tools are out there, and provided a repo with some examples we could view in real-time. She had a few general tips which made sense and are good practice.

After sessions ended a group of us (me, Branson, Duncan, Jen, Kathy, and Mark) headed out for dinner. Four of them hadn't yet gone to Martin's and the two of us who had didn't really care enough to object so we went there. Branson and I split the Big Momma platter: 4 spare ribs, 6 oz. each of brisket and pulled pork, and 4 sides (baked beans, coleslaw, fries, and mac-n-cheese). Then while the other 4 went back to the hotel Branson and I ran to the liquor store on 3rd Street to see what we could get for a small 2-bottle scotch BOF. We wound up with a bottle of Glenlivet 15 and a Basil Hayden's dark rye with port finish.

We got back to the hotel in time for the 5-minute Horror Stories BOF. I kept time — nobody went over 4 minutes so we could have everyone who wanted to contribute do so — and our winner was the gentleman with the "pigeon caused the data center fire" story. (His prize was the applause used to determine the winner.) I didn't contribute a story since nothing was of sufficient horror to warrant it, though many of the stories were more funny than horrific, so in retrospect I could've told the "Software wouldn't install because our hardware was too good" story from March 2009.

After that broke up around 9pm Branson pinged the Labs staff and a couple of others that we were starting and eventually six of us sat around in his suite drinking the booze and chatting from about 9pm to midnight. I left around 11:45pm to go to bed and was there by midnight.


Tuesday, October 30

Despite my late bedtime my body insisted on waking me up for good a bit after 5am. Managed to kill some time working on the trip report — writing and revising the trip report entries for yesterday and putting my likely schedule in as a template for the rest of the week — before shaving, showering, and getting ready for the day.

The continental breakfast in the conference space opened at 8am; I repeated my croissant-n-jam and banana. I caught up with Amy, Dan, and Doug, and made it to the keynote session in time to grab a seat with an available power strip.

The first block was another set of keynotes. There were supposed to be three, but the scheduled first speaker was sick so the remaining two speakers each had a 45-minute slot. And they needed that extra time.

The first was Dr. Sarah Lewis Cortes talking about "Anatomy of a Crime: Secure DevOps or Darknet Early Breach Detection." Starting with questions to gauge how much of the audience had heard of terms like darknet and Tor, she talked about the Roman Seleznev case as background before a technical deep-dive into a breach. We saw darknet credit card search results (number, CVV, and expiration date in cleartext) and a skimmer. She talked about a retail hack: Burp Suite to crack passwords — though it's even easier to buy credentials on the darknet. An event starts with an attacker getting credentials (such as from a supplier or password crack). They then target large retailers because they have hundreds or thousands of point-of-sale (POS) systems on which the malware can be installed.

The breach timeline:

As of Oct 2017 the breached-customer PII is still on the darknet.

She gave several more examples of breaches before going into the fundamentals of the darknet: Tor, JonDonym, I2P, et al. Tor is a volunteer network to bounce traffic through (typically) 3 relays so no one knows the true origin or destination, maintaining anonymity and encryption (though it's in cleartext in the final hop). Legal uses of Tor include privacy. Empirically, however, most of its traffic is criminal in nature. So the darknet is effectively an overlay network on top of the Internet, using an addressing scheme alternative to DNS with untraceable IP addresses. It was originally funded by the US Navy, went public in 2004, and now has 2M sessions over 6300 nodes.

So what can we do about this? Segregate networks between corporate and POS networks, keep credit card numbers (and other PII) off other spaces. Restrict access (which is hard with self-checkout), encrypt the applications and communications, and don't keep the stored data.

The second talk was "Do the Right Thing: Building Software in an Age of Responsibility" by Jeffrey Snover (incidentally the inventor of the now open-sourced PowerShell). He started by asking, "Why do we need to become better engineers?" Software affects everything, either the product itself (as in Amazon versus bookstores) or enhancing value (such as in a car). We're engineers. Where are we going to be in 40 years? We already have planetary-scale cloud infrastructure, 2.5 billion cellphone users, and there's an expectation of 6 connected devices per user in 2020 with smart homes and smart power grids, so the trends are that software is driving more and more systems.

If software eats the world, where does it come from? Engineers! So we need to ask ourselves, if we're building the fabric of the future what kind of world do we want to build? You bear moral responsibility for your actions and your code. (We — especially as software engineers — can always say No, or leave for another job. We should also consider if a company's business model and moral standing align with our own before accepting an offer.)

What are the issues? We're effectively in a fourth industrial revolution with big data (after steam, electricity, and electronics). Technology disrupts everything; issues include job displacement, community safety (as opposed to surveillance), income inequality, and unequal access. AI can increase a country's GDP, but also replace either full jobs or major job tasks (depending on the job) via automation. The tension between technologists and society led to rules for design, training, safety, and so on.

How do we move forward? We need to step back and play the long game: Consider what's for the sake of the technology, of the business, and of society.

(As a sidebar, it is not a legal requirement for a company to maximize shareholder value. That idea dates from the 1970s. Maximizing shareholder value is just one of the important things, and shareholders are only one stakeholder. Companies should maximize for all constituencies — employees and customers should be considered as well. Maximizing the benefits for all constituencies is difficult but necessary.)

Are we building for all of society? Note that disabilities can be permanent or temporary. Some of his examples included:

When in doubt, focus on solutions that amplify human dignity. At Digital the one rule was "Do the right thing." They trusted that each individual moral actor would do the right thing... and had both the permission and the responsibility to do that.

When building AI, is it inclusive? What about confirmation, dataset, associations, automation, and interaction biases? In fact, "Artificial Intuition" might be a better term; it's not true intelligence. Bad data in means bad results out. AI has no moral compass or empathy and cannot be a moral actor. (Not all people have human empathy: psychopaths.) AI is really "psychopathic intelligence" because it has no empathy. We therefore need to consider boundaries around what we will let AI do.

Bottom line: AI will cause harm. That said, it'll be less harm than if we avoid it. We can get models that work (such as Underwriters Labs (UL) for all electrical devices). This is functionally trust engineering: They've engineered trust into the system; you'll buy electronics in the US without worrying about electrocution. We may need something like that for AI.

We all have a role:

DevOps is really the answer: Hypotheses, testing (true: go on; false: stop; or didn't matter: meh), iterating small batches quickly. That's process engineering by definition.

In the second block of talks we had another cancelation. What was to have been "The Divine and Felonious Nature of Cyber Security (A DevSecOps Tale)" became instead a DevSecOps panel of four people with backgrounds in one or more of development, general operations, security, and web operations. After a quick round of introductions they defined DevSecOps as integrating secure processes into DevOps. What does that mean to the panel? For one it's integrated; they're all part of the same world. Another says that security should be embedded with their daily DevOps folks and get involved early in the project or process. For another it's an approach: What do I need, what fundamentals do I need to consider to make this Thing more secure. The goal is to enable engineering to do what they need and want to do while considering security (and automation and upgrades) from the ground up. DevOps breaks the "Ops always says no to Dev" and "Sec always says no to Ops" models. There was consensus that using the "Yes and..." approach from improv helps.

It was noted that there is a difference between those enforcing compliance per some standard or spec and those implementing it. That's one of the challenges between Sec and [Dev]Ops. It's shifting to the better as time goes by; you can't just be a security person with a CISSP and no longer do some level of the implementation. The security person needs to be able to build a Thing and make sure it's compliant, not just throw something over the fence.

It was also pointed out that "security comes in to do it once" and "security as compliance" are different. The latter is an iterative process... and iteration is very devopsy. Bringing empathy into the security world, like it is in DevOps, is a good thing. It takes a learning organization to do iteration to grow the business. Automation helps move away from the "secure an individual Thing" model.

At the end of the day Security is there to make sure that Bad Things don't happen. So given all that, how do we in practice improve things? "Good ops is good security" and "Dev has been trained to have other priorities than Security." Any implementation needs to be constant and tested.

Having Sec sit near Dev[Ops] helps... and having them talk to each other helps even more. What's a problem? Authenticating users. Build a library to do it in one common and secure way to let the developers focus on the "interesting" stuff. Enable transparency; let Security provide Stuff to Dev to make things better.

It's not malice; it's ignorance. Different specialities have different priorities and knowledge. Automation is better, communication is better, and education is better. Does Dev know about this Sec issue? Does Dev think about possible vulnerabilities or exploits at development time?

Now that we have standards... how do we test all of this together? Regardless of the CI/CD tool, how do you test it? Is chaos engineering the right answer? Try to break everything; it's a common way to expose fault in systems. There are a lot of unknown unknowns. Snapshot your 2-year-old AWS instance then kill it and look at what breaks. Some people won't believe there's a problem until it truly doesn't work and impacts business.

How people interact with each other has changed. "We can work better together as humans," regardless of job function or tools. Where would this not work, environment wise? For it to work in government there's a push for high-impact secure teams including PMs and security people, to go in and restructure. (CDC and FBI are pushing this: Get a 10-to-12-person team to fix things in 6 months before moving to the next project.)

Strict separation of duties can impact this. You need to figure out the workflow for what they're trying to accomplish, and involve all parties to meet the spirit of the law. You want the benefit of multiple people without slowing things down unnecessarily.

You also really need management to buy in to giving [dev sec and ops] the cycles to implement it properly. Don't focus on the org structure necessarily; start with the low-hanging fruit to drive discussions and moving forwards.

The second talk in this block was Robert Allen's "How Houghton Mifflin Harcourt Went from Months to Minutes with Infrastructure Delivery." Houghton Mifflin Harcourt is more than publishing; his area looks at improving the K-12 learning experience for students and teachers. The problem is that it took 12–16 months to get infrastructure. They missed a business window; their competition was leaving them behind.

Their top five goals were:

That led to the creation of Bedrock Technical Services, the foundation under everything they do. They spend the time making Bedrock safe for engineers, not making engineers safe for Bedrock. They isolate the developers' chaos from other influences (and have been generally successful). They use Mesos (and a little bit of containers) to do it. That gives them the flexibility for a large pool of resources for teams to function in with reasonable isolation. There have been some cross-over impacts (and they're working to reduce risk there).

It's really a philosophical change, not a technical one alone. By decomposing responsibilities and allowing mistakes and failures they're now flexible enough to do things more quickly. A lot is based on behavioral design: deliberately creating and delivering tools that make the RIGHT thing the EASY thing. When you can't make the right thing easy you must instead make the wrong thing the increasingly difficult course of action to take.

They set expectations... for applications, people, vendors, infrastructure, pets, and selves. They then inspect to make sure those expectations are being met.

Remember that all of these problems are truly people problems and you want to improve everyone's life experiences. They have on-call but are picky about what they're on-call for. Work-life balance is important.

After the lunch break — grilled sandwiches, both with and without meat; green, fruit, and potato salads; and a selection of desserts — I went to the training session "A Hacker's View of Your Network: Analyzing Your Network with Nmap." I didn't take copious notes, but it was good exposure to a tool that I knew existed but haven't ever had to do much with.

In the final block of the day they had three 30-minute talks. The first was Jonah Berquist and Gillian Gunson on "MySQL Infrastructure Testing Automation at GitHub." They test restores via Kubernetes now. What if they fail? Once, who cares; a lot, investigate in depth. By doing this they know how long a restore takes and can get a window into recovery times.
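The "once, who cares; a lot, investigate" rule lends itself to a simple automation loop: run a test restore of each backup, time it, and only escalate on repeated failures. This is my own sketch of that idea, not GitHub's tooling; `run_restore`, the backup list, and the failure threshold of 3 are assumptions.

```python
import time

def check_restores(run_restore, backups, failure_threshold=3):
    """Test-restore each backup and time it.

    `run_restore` is a hypothetical callable that performs one restore and
    returns True on success. A single failure is noise; hitting the
    threshold means something systemic is wrong and warrants a deep dive.
    """
    durations, failures = [], []
    for backup in backups:
        start = time.monotonic()
        ok = run_restore(backup)
        elapsed = time.monotonic() - start
        if ok:
            durations.append((backup, elapsed))  # window into recovery time
        else:
            failures.append(backup)
    investigate = len(failures) >= failure_threshold
    return durations, failures, investigate
```

The timing data is the side benefit they called out: continuously testing restores tells you not just that recovery works, but how long it would take.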

They also test online schema migrations... but they can't use trigger-based migrations because the tables were too busy and the lock requirements made Bad Things happen. Now they use gh-ost which uses binlog-based migrations (and binlogs are on both master and replica, so there's less load on the master), and because they write on the ghost table there's no lock contention with the original table.

In summary, trust your infrastructure/backup/automation if you can test it. Automate the testing. Build tools that can be tested in production by robots.

The second talk in the block was Tiffany Longworth's "Change Management for Humans." Her goal was to build an argument to compile in people's brains and run on their meatware. The model used here is ADKAR:

The third and final talk in this block was Tom Limoncelli's "Operations Reform: Tom Sawyer-ing Your Way to Operational Excellence." He told three stories:

In summary, do it yourself! Go home and apply this in our own organizations, and tell him how it went. And maybe even put together a talk for LISA19. But do it grass-roots, one team at a time; evolve the rubric over time; provide resources and use a spreadsheet, not custom software; and be blameless and transparent.

Based on an audience question, his advice for those in toxic environments was:

After the sessions ended I headed up to my room to drop off the laptop bag (and the swag I snagged from the expo floor for the cow-orkers) before heading down to the reception. Due to some poor planning on the hotel's part, the only level path from our area of the hotel to the Broadway ballroom where our reception was held was blocked by another group (a formal reception with a plated sit-down dinner for Gilda's Club) who didn't want several hundred geeks in schlubwear wandering through, so our instructions were to go down to the first floor lobby, walk down the hallway, and then go up the spiral staircase. Great for the able-bodied; not so great for those with mobility concerns (and no elevator on that end of the floor). As an able-bodied formerly-less-mobile person I raised a stink. Usenix staff quickly got the hotel to realize there was an issue and their fix was to have a badged hotel employee escort us through the GC space to our own reception without bothering with elevators, escalators, or stairs. (By the time we were leaving the other group was inside their own ballroom and not forced to interact with us plebes.)

The reception was a reception. The schtick was the print-your-own conference t-shirt area: You told them the size shirt you wanted, they handed you a blank gray t-shirt, you went to a silkscreening station, one of their staffers fitted the shirt onto the form, they added the white ink and instructed you how to press the squeegee down and run it across the screen, you did it (possibly with their help), they carefully folded it ink-side up, ran it through a tiny 360-degree oven to set the ink, rolled it up, and you walked away with a conference t-shirt. Cute. The lines were longer for the shirt than the food (not that they ran out of either).

The reception food was okay: A decent salad bar with reasonable lettuce and vegetable options; small-portion fried chicken-and-waffles skewered with either pickles, okra, or both, with bourbon maple syrup and hot sauce available (good, though the waffles tended to get cold quickly — duh — so I wound up asking for a few pieces of Just Chicken); and salted caramel chocolate tartlets (good, though the caramel was a bit runnier than I was expecting).

(A few of us left early to try to get in for a quick bite at the attached steakhouse; they were booked and couldn't get us in [and out again] in the time we needed, since someone had to run a BOF at 8pm, so we went back to the reception to eat dinner.)

After the reception ended and most folks went on to BOFs, I wound up back in my room briefly to change into a swimsuit before heading down around 8:30pm to the outdoor heated pool and the hot tub. Around 9pm or so Tom joined me there and we caught up on Life and Stuff and Things before bailing a bit before 10pm.

I got back to my room, showered off the chlorine, rinsed out the suit, and pretty much went to sleep.


Wednesday, October 31

I slept in without an additional bio-break until shortly after 5am. When taking care of biological necessities I managed to stub my big toe on the bathroom door enough to keep me awake, so I caught up on email and skimmed my social media before another brief shower. Trimmed the fingernails (since if I didn't they were just long enough that the odds were one would chip or tear before I got home, and with my luck that'd happen inside the airport security perimeter where I wouldn't have access to my clippers), got dressed, and headed down to the conference space for yet another continental breakfast; this time they had bagels so I had an everything bagel with schmear and a banana.

Day three (and last) of the technical sessions kicked off with a three-talk block. First up was Paul Carleton's "How Our Security Requirements Turned Us into Accidental Chaos Engineers." Their problem was that cloud instances accumulated more problems the longer they'd been running since launch. Breaking changes caused their replacements to have problems, and replacing instances more frequently can reduce that risk. Security is also a concern: compromised old hosts are more likely to wreak havoc on your environment, and while freshly-patched instances are good, how can you be sure the older instances truly got patched?

Their solution was to create Lifespan Management (LM) to enforce a maximum lifespan and replace things when they reach that. There's an autoscaling group to add new nodes and make sure they're healthy, a terminator service to clean up old ones gracefully, and an enforcer that makes sure old nodes that were told to terminate themselves actually do. They went with terminate-first instead of build-new-first to enforce resilience. They came up with a rollout plan, labeling things as to their function: Stateless (safe to replicate), stateful automated (can be done), and requires operator (not safe for automation). They labeled things and rolled it out. What could possibly go wrong?
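The enforcer's core policy, terminating anything past its maximum lifespan that's labeled safe for automation, can be sketched roughly like this (a minimal illustration only; the names, labels, and 30-day cutoff are my assumptions, not the speaker's actual implementation):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed policy value for illustration; not the real cutoff.
MAX_LIFESPAN = timedelta(days=30)

@dataclass
class Instance:
    instance_id: str
    launched_at: datetime
    # One of the talk's labels: "stateless", "stateful-automated",
    # or "requires-operator".
    label: str

def expired(instance: Instance, now: datetime) -> bool:
    """An instance is past its allowed lifespan once its age exceeds MAX_LIFESPAN."""
    return now - instance.launched_at > MAX_LIFESPAN

def select_for_termination(instances: list[Instance], now: datetime) -> list[Instance]:
    """Terminate-first policy: pick expired instances that are safe to
    replace automatically; anything labeled requires-operator is skipped."""
    return [i for i in instances
            if expired(i, now) and i.label != "requires-operator"]
```

In a real deployment the selected instances would feed the terminator service, which drains and replaces each one gracefully while the autoscaling group backfills capacity.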

Well, they learned what could possibly go wrong. Specifically they learned 5 chaotic things:

The big takeaway: What automation problems can you cause with a little chaos? Do you know how old your instances are?

The second session in the block was Pat Cable's "Securing a Security Company." If you learn nothing else: "secure" means nothing without context. In the security world this is threat modeling. Securing a safe is different from securing a donut shop. Constraints are real, so there's no such thing as 100% secure in practice, especially for small companies.

The conflict between "Gotta go fast" and "Secure all the things" led to their security program's evolution, including access controls, cryptography, compliance, and chaos events. To get started with this kind of work in your own environment, start small. Identify your highest risks and chip away at them. You won't reach your destination overnight. Bring your teams together: People, then process, then tools. Build tooling for automation, especially with UI improvements (security-made-easy is a quick win). Figure out your own context: Be aware of how it changes and how events may change it.

The third session was Sabice Arkenvirr's "We Already Have Nice Things, Use Them!" about the pitfalls of rolling your own tools (speaking as a consultant who has seen lots of companies' environments). Examples are monitoring, config management, and deployment. Why build your own? Reasons ranged from "Nothing was available when we wrote this" to "We don't like reading manuals" and everything in between. Only the first is acceptable... and only until the tool exists Out There.

There are operational risks to rolling your own: What if the developer leaves? What're the docs like? What's the lost momentum for a new hire? What's the company's core reason for being (usually not "writing our own tools")? Using industry standard tools helps address all of these.

All that said, when rolling your own, consider:

If you do have to roll your own, apply the same testing and documentation standards to it as you do to your own operations tools. Keep it small and simple. Allow enough time in sprints for feature requests [and testing and documenting].

Acknowledge that in-house is a short-term solution up front. Be explicit that maintaining the in-house tool is not the desired goal. Allow time to research open source alternatives regularly (quarterly? annually? depends on your environment). Know that testing and implementing what you found in that research takes time, and estimating timeframes can be difficult (and inaccurate).

Now that you have your tool you can open source it. But you should have a community manager. (This isn't feasible for smaller places.)

After the morning break, for the second block of the day I went to Branson Matheson's two-topic training session, "How to Write Effective Training" and "How to Give Effective Training." His stated objective was to get more trainers for LISA conferences. For those who haven't done this, what are better tools for training? What troubles that people have had when training can we address?

Know who you're teaching (and their learning styles). As a process:

  1. Start with an idea:
    1. Wordsmith and brainstorm off of it (wordcloud/mindmap/...) — refine later, write now!
    2. Develop it to an objective.
    3. Highlight key ideas and cross out the ones that aren't as applicable.
    4. Go back to it all a day or two later.
    5. Develop the core idea.

  2. Scope it, incl. time and audience constraints.
    • How well do you know the material? Can you teach it successfully?
    • How hard will it be to develop methods to teach the concept?
    • How long will the class be (90m, 8h, 2d)?
    • How many students will you be able to support?
    • Will there be co-instructors to help?
    • Is there a known venue? Will it fit?
    • What exercise formats (individual/team, cooperative versus competitive, open discussion versus partner review)?

  3. Build the outline.

  4. Create the presentation:
    • Plan exercises around breaks and lunch if needed.
    • Leave time for discussion (not just questions).

  5. Exercises — keep 'em short, hold them after questions, make them completable with what you taught (including prerequisites), self-contained, and relevant to the students' workplace.

  6. TEST TEST TEST!

  7. Make it awesome. A slide should be a story and short; slides are data reinforcement, instructor teaching is providing the material.

  8. Write AND SELL the proposal. For students: What do they need to know, have, and take back/away. Tell why you should be teaching it as well. End with a good tag. Do your biography as well (but last; include who not just what).

That segued into the second part, which can be summed up with:

After lunch — with no expo floor today, we did a buffet line of barbecue (baked chicken and smoked brisket, with both black-eyed pea and green salads, cornbread, and mac-n-cheese) then got to sit down in the other ballroom — I went to the third block of three talks. First up was Ivan Ivanov with "Mastering Near-Real-Time Telemetry and Big Data: Invaluable Superpowers for Ordinary SREs." I didn't take any notes because his slides, which will be available on the conference website, were mostly links out to other material.

Next up was Nicolai Plum with "Datastore Axes: Choosing the Scalability Direction You Need." His theme was things to consider when providing a service for developers to build and support their own applications. Technologists should know what their business is doing and planning so they can plan and prepare better. Graphing data growth, schema growth, read queries, and write queries on a single graph lets you analyze the ways your needs are likely to grow, and use CMM-type data modeling to figure out the right way to move forward and futureproof Life and Stuff and Things.

Last up in this session was Andrea Barberio and David Hendricks with "Make Your System Firmware Faster, More Flexible and Reliable with LinuxBoot." Basically, firmware needs to do so much more it effectively needs to be a full and secure OS stack... so why not use Linux? See "LinuxBoot." They start with Coreboot for silicon initialization, then LinuxBoot to deal with storage and networking. Can be extended beyond firmware, such as to be a bootloader or OS installer.

The final block of the conference began with closing remarks from the chairs. As they talked the projectors displayed a single slide from each speaker summarizing their talk or training. They thanked us for attending and staying until the end, thanked the staff again, and asked people to stand if they were at their first LISA, or if they had attended more than five or more than ten (but not "20 or more" or "25 or more," both of which I could have stood for).

The rest of the block was two 30-ish-minute keynotes. First up was Tom McLaughlin with "Serverless Ops: What Do We Do When the Server Goes Away?"

DevOps is more than pulling down the silo walls; it's forming cross-functional value-delivery teams aligned around a business feature. Historically Ops was generally left out of this kind of reorg. If a service considers serverless the default then Ops isn't needed and all we can do is play tollbooth operator... or we can get out of the monolithic Ops organization and join the value-delivery teams. This is really what we should be aiming for: How do we deliver more efficiently and more quickly to our customers, users, or patrons?

The focus needs to be less on the server and more on the service, and on the user being happy or satisfied. To accomplish this we need to code (to understand the code triggering the failures we need to remediate).

Doing DifferentOps takes up some time; what about the rest of our time? Game day drills, chaos engineering, and other disaster drills? It's new-to-you work that you haven't had the time to experience yet.

Serverless is more FinDev than DevOps because cost is becoming the first-class metric. Every component and change has a cost (positive or negative), and serverless billing lets it be more easily tracked. We also care about revenue, not just cost. This means understanding how the organization works and sustains itself, and what you're trying to accomplish. Focus on the organization's unique properties and problems, not explicitly on ops and administration (and be okay with what you do and why, referencing Jeffrey Snover's talk).

The mindset shift from "How do I..." to "Why do I..." is the necessary bit. Many of us got into SA to solve problems, not just to administer servers.

Last up was Nora Jones' "Chaos Engineering: A Step Towards Resilience" keynote. Nora's view on chaos engineering (CE) is different: she's done CE and has seen its benefits firsthand.

What is it? Building confidence in understanding how systems work and fail, exploring the unknown unknowns.

Why do it? Not to find vulnerabilities or reduce outages (per se), since prioritization matters. We want to measure RISK not ERRORS or OUTAGES. You want to minimize the risks. The goal is to be more prepared for turbulent conditions when they inevitably occur. A goal is to be resilient: Adjust the system to maintain normal functionality — and not just the server as system but the people and process and culture as well.

Doing CE is not a silver bullet but it's part of the solution suite for reducing outages and risk and so on.

How can I increase confidence in resilience?

When you find a vuln, find out what happened:

CE isn't to CAUSE problems but to REVEAL existing ones. And when you find something you need to actually fix the underlying problem.

At Netflix their focuses are:

They use a tool called ChAP (Chaos Automation Platform). Consider customer impact when experimenting. Setup involves:

  1. Injection points (and they have many)
  2. Automated Canary Analysis Configurations (ACAs; they provide canned ones to aid in setup)
  3. User-added stuff (requiring understanding of ChAP)

Goal is to chaos all the things and all the time (during business hours, anyhow).

Vulnerability remediation is all well and good, but the culture shift — from what happens IF to WHEN this fails — is excellent. They have higher confidence in dealing with disasters as they happen and they have greater trust in injecting purposeful failure techniques. The change in mental models helps as well.

Since we'd not managed the steakhouse Tuesday night I'd made reservations for Wednesday. Branson, Dan, and David joined me at Bob's Steak and Chop House. I wound up starting with what should've been a 3-oz. pour (but was probably closer to 4) of Blanton's bourbon. I had a wedge salad (mm, blue cheese), the 22-ounce bone-in ribeye (medium rare), and the associated sides. It came with a large slice of maple-caramel-bourbon glazed carrot and a choice of potato; I went with the loaded baked potato which was excellent. We added an order of grilled asparagus (and one of sautéed mushrooms I didn't touch). I managed to eat the whole thing, though we did take the remaining asparagus to go.

When the chairs got back from their dinner out Branson and I dragged the remaining bourbon and rye from the scotch BOF (and the asparagus from dinner) to the Dead Dog party in Brendan's room. That ran from about 9pm onwards though I tapped out around 11pm when I was starting to get sleepy. (Heck, a large and good meal with a heavy pour of bourbon followed by drinking more rye at the party at the end of a 3-day conference? I'm glad I wasn't sloshed.)

I got back to my room and did all the pre-packing I could in advance of tomorrow's travel before going to bed.


Thursday, November 1

Travel day! The body woke up around 6am so I showered, finished packing, checked out, grabbed the hotel buffet breakfast (with Jen until she had to bail for her own flight), and caught a taxi off to the airport. There was virtually no line at the taxi stand, or at bag drop (I was behind one of the MSI staff checking his stuff), the security line moved quickly, and I was at my gate a mere 20 minutes after getting to the airport. (That'd never happen at Detroit Metro.)

The flight was fine. Thanks to storms stretching across our entire flight path we had moderate turbulence on takeoff and ascent, and mild turbulence at altitude (33,000') and during descent. As usual, we wound up taxiing for an absurdly long time. Despite the rainstorms there was no lightning when we arrived so the ramp was open and we could get to our gate. We actually arrived about 15 minutes early!

I got off the plane and over to baggage claim, where my bag was twelfth off the conveyor. Swapped the sweatshirt I was wearing for the leather jacket I'd packed, shoved the CPAP into the luggage, and hiked back up to the terminal to go over and down again to the interterminal shuttle (which took a while to arrive and filled to capacity before leaving). Once we got to the Big Blue Deck I let everyone else off the shuttle first, caught an elevator as one opened and took it up to the terminal level, crossed the bridge to the parking deck (on 4), then went down one level (to 3) to get to my car and drive home.

I got home around 2pm, about an hour and a half after landing — literally 1h35 from wheels-down in the airplane to parking in my garage at home. Cranked the heat up from 60F, unpacked my bags, did a load of laundry, and started writing up the rest of this report... over the next week and change.



Back to my conference reports page
Back to my professional organizations page
Back to my work page
Back to my home page

Last update Feb01/20 by Josh Simon (<jss@clock.org>).