The following document is my general trip report from the 30th Systems Administration Conference (LISA 2016) in Boston, MA, December 4–9, 2016. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.
Travel day! My body clock decided that 4am was time to wake up so I grabbed a quick shower, packed up the CPAP and toiletries, unloaded the clean dishwasher, took the recycling to the garage, shut down the laptop that was staying home, loaded up the car, and headed out to Metro. Traffic was moving at or above posted speeds all the way there, and I parked in my usual space. Caught the waiting interterminal shuttle, got my bag dropped off and through security with no delays to speak of, and got to the gate... 2 hours before boarding was to begin. No problem! I caught up on most of my 6-month backlog of magazines. (All but 3 were done before boarding; I finished those on the flight.)
We landed early (hooray for a good tail-wind I guess), and it took a little less than 90 minutes to get from my seat on the plane to the check-in line at the hotel (first off the plane, third bag off the baggage carousel, brief waits for the Silver Line bus, Red Line train, and Green Line train). Got checked in, got to the room (they upgraded me to a slightly larger space, though it's still just one large room), unpacked my stuff, and headed down to the conference area to meet up with Christopher (from Pittsburgh) and Patrick (from Sydney) for lunch. We tried 5 Napkin Burger (35-minute wait) and Legal Seafood (25-minute wait) before heading out the Pru doors and down to Haru for sushi. I did the sushi special with salad, 7 pieces of nigiri (2 salmon, 2 tuna, mackerel, and a couple of things I didn't recognize but which were tasty), and a 6-piece tuna roll.
After lunch we headed to the hotel bar, met up with Branson, Eric, and Casey of the Build team who were waiting on equipment to arrive, and schmoozed for a while. My knees were annoyed at all the sitting so I took a quick nap up in my room before heading back down to pick up my badge when they opened up at 5:30pm.
Picked up my registration materials (except for my Tuesday workshop ticket; when I brought that to the staff's attention we realized there WERE no tickets for that workshop, so they'll figure out what happened), grabbed my conference t-shirt, schmoozed with people, nibbled some cheese and fruit, and eventually went off to dinner at Whiskey with a bunch of the LISA Build team (who assemble the wireless network from scratch starting Sunday, since hotel networks can't stand up to 1100-odd (and some of us are pretty darned odd) sysadmins) and a few others. I wound up being part of a 5-person table (with John, Cat, Peter, and Diego) in the front of the restaurant across from a 4-person table and the 9 of us were all on one check; the other 5 people were at a table at the back of the restaurant. I wound up getting the full slab of ribs (since if I couldn't finish it I had a mini-fridge back in my hotel room, thanks to the upgrade), which came with a small ramekin of slaw, a serving of their peach baked beans, fries, and cornbread. Ate most of it, left about 3/4 of the slaw (never one of my favorites), and took 2 ribs and most of the cornbread to the hotel to have for breakfast Sunday.
After getting back to the hotel I hallway tracked in the lobby (since there was nobody from LISA in the bar), met Betsy's newbie roommate, caught up with Nicole, and eventually headed to bed for a hopefully-long rest.
I woke up early (thanks in large part to "first night in a new bed syndrome"), wrote up yesterday's trip report entry, and went through my conference swag bag. There's a (cute) coloring book which is pretty cool... but they didn't provide any colored pencils or crayons or markers to actually color in it. (d'Oh!)
Anywho, I finished off the last 2 ribs and the cornbread from last night for breakfast, then took a shower to get clean followed by a soak in a hot bath to help relax (and rehydrate the skin; the hotel is very dry). I headed down to the conference lobby and caught up with a few folks who got in late yesterday before sessions started, then I tried to finalize plans for the rest of the day.
I caught up with more folks, heckled the Build team (but only a little), and basically Hallway Tracked my way through the morning. My last-minute (ish) lunch plans fell through so I wound up mooching off the tutorial/workshop lunch instead. (Today's menu: Arugula salad with pears, walnuts, and feta; iceberg salad with julienned tomatoes, blue cheese, and bacon; cornbread; macaroni and cheese; braised brisket with caramelized onions; barbecue chicken; vegetarian chili with choices of cheese, scallion, and sour cream; apple cobbler; bread pudding; and vanilla ice cream.) I hung around after I finished to help someone with some mobility issues (at their request) and kept them company while they ate, and then hallway tracked the afternoon.
For dinner I wound up abandoning the conference and went out to Boston Burger Company with a former cow-orker; I wound up getting the Artery Clogger (deep fried patty, bacon, cheese, and barbecue sauce). After dinner we adjourned to the hotel bar (he had a beer and I had the Japanese whiskey which was surprisingly good). I checked out the board game night after that but it was pretty much three fixed-group games happening so I bailed and wound up Hallway Tracking in the lobby bar until 10:30pm or so.
Despite some tossing and turning I managed to sleep in until almost 6am. Caught up on work email (mainly by deleting the automated crap I haven't bothered to filter directly to the trash; I should get around to updating those filters some day). I grabbed a bagel and banana from the continental breakfast spread, chatted about job histories and technology changes with a couple of new-to-me attendees, and then headed off to find where I needed to be.
Today's sessions started with a workshop, "Going Beyond the Shrink Wrap: Inclusion, Diversity, and IT," facilitated by J. Goosby Smith (Ph.D). It was a little more intimate than I was expecting; there were only 5 people including the facilitator. Despite that, we had an educational and interesting workshop.
Diversity has many dimensions:
- Individual traits like personality
- Internal dimensions like age, gender, sexual orientation, mental and physical ability, ethnicity, and race
- External dimensions like parental status, marital status, geographic location, income, personal habits, recreational habits, religion or spirituality, educational background, work experience, and appearance
- Organizational dimensions, like functional level or classification; work content or field; division, department, unit, or group within the organization; seniority; work location; union or not; exempt or not; and management status
- Perceptual dimensions, such as how you see yourself versus how you think others see you
Diversity is distinct from inclusion. Using a garden metaphor, diversity is the type of plants in that garden (trees, bushes, flowers) and inclusion is the soil (moisture, shade, nutrients).
The lunch menu today was Caesar salad; frisee salad; white bean and escarole soup; bread and butter; orecchiette with basil, tomato, and mozzarella; mussels, scallops, and monkfish in Israeli couscous with parsley; lemon chicken; an olive oil/polenta cake with fruit; and limoncello shots.
After lunch I went back to the Build space and worked on this trip report more. I also handled a couple of work issues. (Funny how saying to someone "I'll be out from now until a week from Tuesday" was interpreted as "I'll be available online to help you out." Luckily the one issue was a no-brainer on my part and the other was something the user solved for himself — and I still don't know if he realizes I'm out of town.)
After the afternoon break ended I hallway tracked for a (long) while then headed up to change for the 0xdeadbeef dinner, the group of us who go to an upscale steakhouse for (hopefully) a really good steak dinner. I wore my gray suit with a white shirt, red patterned tie, and burgundy shoes but I brought the wrong (black not burgundy) belt to go with it; oops. We headed out by taxi just after 7pm for our 7:30pm reservations at Grille 23. I wound up getting a caesar salad with extra anchovies, 18-oz. 100-day aged ribeye (medium rare), and as a table we split the truffle tots, mashed potatoes, grilled asparagus, and sautéed mushrooms. Most of us had a 2015 Salentein Reserve malbec with the main course. I had a hot fudge sundae for dessert and had a glass of not-a-sauternes dessert wine to go with.
We got back to the hotel, said Hi to some folks in the bar, and I went to change out of the monkey suit. By the time I got that hung up and things put away it was coming up on 10:30pm, so I did a quick run down to the bar for some hallway track before heading to bed around 11pm.
The 5am bio-break was sufficient to wake me up, so I did a little more work on the trip report, caught up on work emails, processed a long-scheduled change, then showered and dressed and headed down for the conference continental breakfast (today was breakfast breads and danish; I had a raspberry danish and banana).
Today started with a workshop, "Managing System Managers," moderated by Cory Lueninghoener. It was much better-attended than yesterday's workshop, with 12 attendees (plus Cory), about evenly split between current and non-managers. (Amusingly, 2 of the 3 participants in yesterday's workshop were in today's as well.) The workshop started because Cory was a new manager and didn't have any experience or someone to talk to about it, so he selfishly proposed this workshop to share best practices and build a professional network more than to solve problems. (The slides are online.)
Our first discussion topic was hiring. There's difficulty in hiring senior people: As they move on to other things you can sometimes promote juniors but sometimes you need to bring in a new senior. Senior people are difficult to get because they're rarely looking and often have their pick of places to land, and some places (especially .gov and .edu) don't have the money to pay for senior people. You need to be very upfront in the job description about what skills you want, what the salary range is, what the benefits are, and whether relocation is provided. (Not everyone is necessarily interested in the intangibles, but some do chase the paycheck. Compensation is nice, but you have to sell the senior candidates on more than that: Interesting work, furthering the cause of science, that kind of thing.)
What do we look for in hiring? It depends on the job, of course. For an HPC SA you need some level of software development skill and a methodological approach to narrowing in on a problem, but also the judgment to know when to stop and switch tacks or ask for help. Do you only look for still-employed people or are resume gaps okay? Salary chasing can be a problem (especially in some non-US countries).
How do you find people interested in solving interesting problems, with the drive to zoom in and tackle problems in a smart way? How do you get that out of an interview if you can't give a "code" interview? Roleplay or walk through a scenario, feeding them information that requires them to think critically to hunt down the issue. (Use trouble tickets as the basis for scenario development, and have a "dungeon master" drive the scenario in the interview.)
Someone's team moved to a 5-phase process: a phone interview, looking at a real VM, and break/fix work on that VM (enable log rotation, debug a script, etc.), time-limited but with more tasks than most people can accomplish in that time (to test time management).
Because senior admins are in short supply some places will "build not buy" and develop their own juniors, some of whom may be brought on board via diversity initiatives.
In terms of developing a junior (especially a motivated one): Help desk tasks are repetitive. The junior can take on the task of helping the help desk people, and rather than solving their problems he can help them learn how to solve the problems, then have a senior available for when HE has questions. It may be that mentoring is what's really necessary in the "develop a junior" sense.
- Delegate responsibility, encourage to automate, know/learn the config management system, etc.
- The junior has 2 responsibilities: (1) Be available to the help desk to solve their problems and (2) shadow a senior on the project.
- Give him authority over the project he's working on.
- Let them take something they're psyched about (e.g., Go, Docker, linux) and give them a low-risk project to bring that into the new environment.
Cory, wearing his manager hat: He'll send meeting invites, accept meeting invites, and go to meetings. And consider bringing a junior along so they get exposure to the larger org AND visibility across the organization. You can use that as a basis to see who's likely to want to progress (up to senior, into management, etc.).
Consider delegating meeting-chairing to others on the team (in a rotation?).
So we have people... but some leave (sooner or later): How do you deal with turnover? Turnover is short-term pain but good in the long term (fresh ideas). Someone has a division of 140, and they expect 10% to retire over the next year. How do you build up the infrastructure to handle that level of off- and on-boarding?
Someone has churn across teams and runs a "hack-a-month": someone works on team B while reporting to manager A for a month, and then decides which team to keep working for, A or B, at the end of it.
That same someone has parallel tech/IC and management tracks up to VP-equivalent levels. At the higher levels it's architecture and then connecting people. It's like being a 4th tier in a 3-tier system.
How do you move up (into management or not) if the existing people don't go anywhere? There're fewer senior jobs so you can't promote everyone internally and may need to move elsewhere. Some places do a reorg and move people both up AND down in the org chart (with no change in compensation).
After the morning break, we talked about how to get OUT of managing. Someone was an IC (client tech support), and when his manager moved on he got promoted to the role. They offered him a global role which he declined, because he wants to be an IC not a manager. How do you stay technical for the year or 3 or 6 you're managing?
Some get acquired and move between IC and management during that process, or get laid off and switch companies. Can people (especially those in academia and government) do that?
- Ask for a 6-month try-out if $workplace forces you into it? (Give a chance to fail without fear of repercussions.)
- Do the "acting manager" tasks.
Sometimes those leaving management become (technical) project managers.
Our next topic was people skills (whether with personnel problems or with customers). Nobody teaches this. One attendee worries that when he talks to someone as a manager he'll screw up their life (a review that's too good or too bad, moving them off a project they love, etc.). Have manager mentoring; follow/shadow. Read management advice (some of which is good and some is bad); see askamanager.org, and look for management training. (Some now-managers... aren't interested in being managers, so training and reading won't work for them.)
Next we talked about incentives for retention: How do you keep people interested, especially if you can't give them more money? "Cool work" and "interesting stuff" only go so far. Benefits can help (e.g., government pension), but what else is there? What else do you like about the job; can you sell that? Ask the existing team members what they liked ("What appealed to you about joining the company") — "we offer a tangible product" or "I wanted to get in pre-IPO" or "you're solving interesting problems" or "you promote values I believe in" or "the stress level is much less" and so on.
Recognition: Honest and earnest thanks, even without a gift card. Managers should recognize what motivates people and use that to help out. You need to be consistent across the team — but everyone should get it when/if they deserve it.
The work itself can be an incentive; being involved in the cool work or interesting projects can be very motivating to keep someone around and involved.
Another company has "peer bonuses" where you can just give someone else a bonus (with constraints to keep people from giving them to each other). Peer recognition ("so-and-so sent kudos for this specific reason"). Those kudos can lead to cash payouts. Should managers at the weekly management meeting remind folks that certain spot bonuses exist? Teams can give other teams kudos.
Respect: Not just in the work but in the ancillaries, it can also work for both retention and recruitment. Are they nice and respectful? Do they promote good work/life balance? Do they expect you to read email outside 8-5? Do they send meeting invites for tomorrow morning off-hours?
Career development (esp. conferences and training) is another recruitment and retention incentive.
That segued into mis-fits, where a senior person joins a startup and is subsequently expected to do only junior tasks. Having things in writing doesn't always help. Lots of places don't have an HR boilerplate offer letter, but even those that do won't put "You can attend a conference" in it because it's a de facto contract. Email is "not in writing" so it's not a contract, but you can use it to frame the discussion. How can you flip the script?
We wrapped up with "who's hiring" for both managers and technical people with management experience. Many places are hiring technical leads and middle- to director-level management.
Lunch today was Caribbean-themed: a cucumber/avocado salad, another salad with cilantro, vegetarian empanadas, jerk chicken with plantains, steak with a pico salsa, and mini key lime cheesecake bites. It was all perfectly acceptable (so I'm not complaining) but it really wasn't to my taste. Luckily I was still mostly full from the 0xdeadbeef dinner last night so I could nibble some of the proteins and carbs and call it a meal.
After lunch I hallway tracked with some locals who were visiting without attending the conference and eventually took a power nap. I wound up napping through the afternoon break, so I didn't grab any Coke Zero today and won't have caffeine for tomorrow morning.
For dinner I went out to Legal Seafood with Lois, Trey, Connor, and Todd. I had way too much food, in large part because they screwed up and delivered a lazy lobster (with the tail and claw meat removed from the body-n-legs) instead of the baked stuffed lobster I ordered. So I ate that — it was delivered so it's either I eat it or they trash it — before they realized they goofed and brought the baked stuffed lobster I ordered (which still wasn't what I was expecting but at that point WTF). We finished up there just in time to hurry back to the conference hotel for the LGBTQA* (AKA Alphabet Soup) BOF. There were enough of us that we managed to get around the room once for introductions (name, location, how things are in general) and have some conversations... less about the LGB and more about the T and Q, and how in a lot of places even those aspects are becoming non-issues... at least if you're in Seattle or San Francisco, anyhow.
After the BOF I ran up to the room to put my leftovers in the fridge and I headed back to the bar for socializing with folks until it was time to crash.
I wound up eating the leftover baked stuffed lobster from dinner last night for breakfast in the room this morning before showering and heading down to the conference space where they had croissants and muffins so I just grabbed a banana. I introduced some people to each other to facilitate someone's job hunt, and managed to dodge having to chair a session during the conference.
I got to the announcements and keynote session a bit early to be able to get a seat with a table and electricity, but this year only the gold and silver passport holders got either. I wound up running my power strip across the aisle.
David Blank-Edelman and Carolyn Rowland from the USENIX Board noted this was the 30th LISA and introduced our co-chairs, John Arrasjid and Matt Simmons, who began with the usual announcements and thanks. The program committee did not report the submission or acceptance counts. Thank you to the volunteers, staff, committees, liaisons, coordinators, speakers, instructors, sponsors, and expo vendors. Remember the BOFs are every evening and you can schedule one of your own in any open slot.
LISA 2017 (the 31st) will be October 31-November 5 in San Francisco, and Caskey Dickson and Connie-Lynne Villani are the program chairs. (Feb 27 and Apr 24 are the early and hard submission deadlines, respectively. First-time submitters please use the early submission deadline.)
Nowadays the conference has 3 pillars: Architecture, Culture, and Engineering. Does this make sense as a categorization? Let the conference folks know.
No awards were presented during the announcements since there is no more papers track. LOPSA's Chuck Yerkes Award was not presented (it would be presented Thursday at the LOPSA Annual Meeting).
Today's keynote speaker was John Roese, Dell EMC EVP and CTO of Cross Product Operations. As the IT industry is evolving faster into the Digital Age, there are multiple facets being affected throughout the industry including people's roles in IT, the technology needed to support the newer applications and use cases, and even existing IT oriented processes. We will discuss the impact that this IT transformation is having on these aspects of the industry and you.
Change is happening everywhere, and the success of business is dependent on IT. New technologies enable new applications, and we need new IT skills and roles to manage that. We need to retool IT to emphasize continuous learning and mentoring.
After the morning break I went to the Invited Talk, "Modern Cryptography Concepts: Hype or Hope" by Radia Perlman (who incidentally developed the spanning-tree protocol). There are many topics that get a lot of press, some that are the focus of many academic papers but have not escaped into the popular press, and others that are covered in both. It is important to be able to approach these things with skepticism. Some of the ones in academic literature are so hard to read that even though they might be interesting, it would be hard for anyone outside of academia to understand. Some (like homomorphic encryption), have escaped into popular media, but without the true understanding of how inefficient they might be.
Some quick comments:
- Sharing a Secret — Break a secret S into shares (think horcruxes); you need at least k of the total n shares to recover S. She showed the trivial cases k=1 and k=n, as well as k=2. The math is in the slides. (A worked sketch follows this list.)
- Bit commitment — Alice and Bob are getting a divorce, can only talk on the phone, and have to flip a coin to decide who keeps the house. How do you flip it fairly so neither can cheat? Use a hash: Bob chooses a random number R whose bottom bit is 0 for tails or 1 for heads and sends the hash h(R); after Alice calls the flip, he reveals R. (A small sketch follows this list.)
- Circuit model — See this definition of circuit.
- Full homomorphic encryption — Do ops on the encrypted data and the answer is still encrypted. The data conversion grows by a factor of 10^6 before the circuit conversions — so it's horribly impractical. Partially homomorphic encryption is possible. Do via order-preserving....
- Blockchain — Just a word with no true common definition.
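To make the "Sharing a Secret" bullet concrete, here's a minimal sketch of the two constructions she alluded to, reconstructed from the standard textbook versions rather than from her slides: the k=n case (XOR splitting, every share required) and the k=2 case (the secret is hidden as f(0) of a random line, and any two points recover it). The modulus and the numbers are arbitrary choices of mine.

    import secrets

    P = 2**61 - 1  # prime modulus for the k=2 scheme (arbitrary choice)

    def share_all_needed(secret: int, n: int) -> list[int]:
        """k = n case: XOR-split the secret; every share is needed to recover it."""
        shares = [secrets.randbelow(2**61) for _ in range(n - 1)]
        last = secret
        for s in shares:
            last ^= s
        return shares + [last]

    def recover_all_needed(shares: list[int]) -> int:
        result = 0
        for s in shares:
            result ^= s
        return result

    def share_two_of_n(secret: int, n: int) -> list[tuple[int, int]]:
        """k = 2 case: the secret is f(0) of a random line f(x) = secret + a*x (mod P);
        each share is a point on that line, and any two points determine the line."""
        a = secrets.randbelow(P)
        return [(x, (secret + a * x) % P) for x in range(1, n + 1)]

    def recover_two_of_n(p1: tuple[int, int], p2: tuple[int, int]) -> int:
        (x1, y1), (x2, y2) = p1, p2
        slope = (y2 - y1) * pow(x2 - x1, -1, P) % P  # modular inverse of (x2 - x1)
        return (y1 - slope * x1) % P                 # f(0) is the secret

    S = 424242
    assert recover_all_needed(share_all_needed(S, 5)) == S
    two_of_five = share_two_of_n(S, 5)
    assert recover_two_of_n(two_of_five[0], two_of_five[3]) == S

A single share in the k=2 scheme is one point on a random line, which is consistent with every possible secret, so on its own it reveals nothing.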
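And a minimal sketch of the coin-flip-by-phone idea from the "Bit commitment" bullet, assuming SHA-256 as the hash (she didn't specify one); the function names are mine.

    import hashlib
    import secrets

    def commit() -> tuple[bytes, int]:
        """Bob picks a random R; its bottom bit is his flip (0 = tails, 1 = heads).
        He sends Alice only the hash h(R), so he can't change his mind later."""
        r = secrets.randbits(128)
        return hashlib.sha256(str(r).encode()).digest(), r

    def reveal_and_verify(commitment: bytes, r: int, alice_call: int) -> bool:
        """After Alice calls 0 or 1, Bob reveals R; Alice checks it matches the
        commitment and sees whether her call matched the committed bottom bit."""
        assert hashlib.sha256(str(r).encode()).digest() == commitment, "Bob cheated"
        return (r & 1) == alice_call

    commitment, r = commit()   # Bob -> Alice: h(R)
    alice_call = 1             # Alice -> Bob: "heads"
    print("Alice wins" if reveal_and_verify(commitment, r, alice_call) else "Bob wins")

Neither side can cheat: Bob is bound by the hash he already sent, and Alice can't see the bit before she calls (assuming R has enough randomness that she can't brute-force the preimage).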
Lunch today was on the expo floor. I did a quick tour through (no major swag jumped out at me) then ran through the buffet. Today it was minestrone soup; Sheraton Boston Signature Salad of watercress, peppery arugula, lemon poppy seed croutons, candied pecans, crumbled goat cheese, and apple vinaigrette; chickpea and quinoa salad; shaved pastrami, Boggy Meadow Swiss, and grain mustard aioli on pretzel roll; hickory smoked turkey, oven roasted tomatoes, baby spinach, and brie with pesto mayonnaise on ciabatta; toasted warm grilled cheese with Vermont cheddar, cured tomatoes, and caramelized onion on farmhouse pullman; half sour pickle spears; individual bags of pretzels, popcorn, potato chips; and flourless chocolate torte with raspberry reduction and fresh berries. I met a few new people and caught up with some folks I've known for years.
After lunch I went to the invited talk "How Should We Ops?" by Courtney Eckhardt of Heroku. "How should we handle operations?" is one of the major issues in our industry right now. We've mostly agreed that consigning people to ops roles with no chance to develop more skills is bad, but the range of responses to this is wide and confusing. The proliferation of terms like DevOps, NoOps, and SRE are frustrating when we try to tell a potential employer what we can do or even when we just try to talk shop together. What we've done before isn't working well, but what can we do instead, and how do we even talk about it?
Heroku uses a Total Ownership model to address operational work. She talked about what this means in practice (with examples), the benefits (clear relevance of your work, good feedback cycles, abolishing class hierarchies), the failure modes (pager fatigue, decentralization), and how we can make all jobs in our engineering organizations more humane and rewarding.
She talked about DevOps being about the care and feeding of dynamic computer systems, especially infrastructure as code and service delivery. In most cases the loop is from developers who push code to production, which pages ops, who email the developers. But if the developers go on call, the loop becomes more like developers who push code to production, which pages developers. Ops isn't in that figure, but we're the experts: We know all about infrastructure, monitoring, trend analysis, and burnout risks; we're the subject matter experts.
After that I switched rooms to the invited talk "How to Succeed in Ops without Really Trying" by Chastity Blackwell of Yelp. The last few years have been a time of immense change in the field of operations; not only are there new technologies popping up every day, but the entire way of doing operations seems to be changing. "System administration" jobs are becoming "devops engineers," "site reliability engineers," "production engineers," or something else that no one seems exactly sure how to define. The proliferation of cloud services is making it easier for some companies to avoid having any purely operations organization at all (or at least think they can). In this kind of climate, how does anyone, especially people without much experience, or who feel like they are years behind the curve, keep up with the pace of change? How do you make sure your skills are still keeping you viable in the job market? And how do you do all this without feeling like you're going to have to spend every waking moment keeping up? This talk discussed some of the secrets to be a successful operations engineer without sacrificing everything else.
She talked about 5 myths and the reality to counter them:
- Myth — I'm just a sysadmin; I'm not an SRE/DevOps.
- Reality — Job titles are new names for things sysadmins have done for ages.
- Myth — Everyone wants a coder; I don't code.
- Reality — All of your bash, Perl, Python, and PHP scripts? That's code. You might not know the most efficient or elegant algorithms, or testing, or performance tuning, but that's all stuff you can learn.
- Myth — "I run some webservers and MySQL database for a small company so I can't compete with someone working for Google or Facebook."
- Reality — Not everyone is Google or Facebook or has that kind of scope or scale; Google has 62K and Facebook 12.5K employees worldwide, but there were 6.7M tech employees in the US in 2015. There are also advantages to working at a small shop — you're less likely to get stuck in a bubble, you get more exposure to different parts of the application stack, and you can make a greater impact and get better feedback.
- Myth — I can't apply there. $famous_person works there so why would they hire me?
- Reality — Not everyone can be or needs to be a visionary or genius. The most important part of Ops is being able to get things done.
- Myth — I have to work 80 hours per week just to keep up and don't have time for anything else.
- Reality — You shouldn't have to sacrifice your life or sanity just to keep up in your field. Work/life balance is important too.
Remember to keep it simple: Don't make things more complicated than you have to. Comparing yourself to the bleeding edge doesn't make sense. You can't fix everything (either with the infrastructure or yourself) right away. Those big companies may have more people in their ops teams than you do in your whole company. It's okay to be boring.
An 80% solution now is better than a 99% solution in a month (or six months... or after the heat death of the universe). Worst case you'll have to go back and fix it later.
Learn.
Protect yourself from burnout.
At the afternoon break I grabbed a much-needed Diet Coke. I skipped the house-made granola-esque bars though.
In the last session slot of the day I went to more of the invited talks. First up was "Behind Closed Doors: Managing Passwords in a Dangerous World" by Noah Kantrowitz. Secrets come in many forms: passwords, keys, and tokens. All are crucial for the operation of an application, but each is dangerous in its own way. In the past, many of us have pasted those secrets into a text file and moved on, but in a world of config automation and ephemeral micro-services these patterns are leaving our data at greater risk than ever before.
New tools, products, and libraries are being released all the time to try to cope with this massive rise in threats, both new and old-but-ignored. This talk will cover the major types of secrets in a normal web application, how to model their security properties, what tools are best for each situation, and how to use them with major web frameworks. He spoke very quickly to a mostly-full room.
The second half of the invited talk session was "The Road to Mordor: Information Security Issues and Your Open Source Project" by Amye Scavarda of Red Hat. From time to time, communities will run across information security incidents. In the course of project expansion, it always seems like a good idea to wake up a new instance of Something_With_A_Database and not write down the credentials or think very clearly about what the permissions are on that new instance. If you're involved in open source for any length of time, you're going to discover a hack at some point in time. However, the Lord of the Rings is a great model for being able to deal with your information security issues. (Often? We're Frodo.)
She covered:
- The forging of the ring, or how this stuff happens in the first place
- How Gollum became corrupted: What happens when you don't work in a timely manner to resolve these things
- The cast of characters: Someone on your team is going to be Gandalf. You might not always have a ranger who comes out of the shadows and saves you
- The journey to Rivendell: What effective discovery on an information security incident looks like
- The council of Elrond: What to do after you've gone through discovery and now you need input
- The mines of Moria: What happens when you don't do a thorough discovery, and/or information comes to light that should not have been forgotten
- Getting waylaid on the road: Challenges within the team and balancing out different needs around disclosure and resolution
- Good grief, Boromir: Someone who has different ideas even after the Council of Elrond
- Actually getting the ring to Mordor: Resolution/launch, disclosure
- Going back and cleaning up the shire: Making sure you're in a better place at the end
Her intent was to provide a rant and set of stories, and asked the audience to share their stories as well.
After the session I headed to my room to drop off my bag and grab my sweatshirt and coat to head out to Fogo de Chao for dinner with Mike C., Lee, David N., and another gentleman whose name I forgot to write down. It was delicious and less expensive than I remembered though it still blew my per diem out of the water. Got back to the hotel late for the LISA 2017 Planning BOF, but did catch the last half of it. That was followed by planning for the Scotch BOF to be held Thursday, then hallway tracking at the bar until 11pm when I headed upstairs to bed.
Aside from a 2am bio-break I slept in until almost 6am today. Shaved, showered, popped the morning drugs, and went down to the continental breakfast. Grabbed a banana since there was only butter and jam, not clotted cream, to go with today's carbohydrate (scones).
Today's keynote was "Looking Forward: The Future of Tools and Techniques in Operations" by Mitchell Hashimoto of HashiCorp. We're currently undergoing major changes across development and operations that are pushing the boundaries of our comfort zone. While the change keeps coming, trends and practices have been emerging that show promise as the way we can tame this complexity. In this talk, I'll present the changes we're seeing, why we're seeing them, the ideas being introduced to manage this change, and the glorious future we're all heading towards.
His intention in the talk was to add signal to the noise so you know what's happening that's new and interesting.
The modern data center in his view tends to be increasingly complex, with more moving and different parts, from the monolithic mainframe all from one vendor, through multiple vendors and the move to client/server, then abstraction layers like virtualization, to containers on top of those, and finally outsourcing functionality to the clouds (including DNS, CDN, and databases). "Multidatacenter" used to be limited to those who could afford it. Now it's much easier thanks to the cloud (even though it's not as simple as "press a button and duplicate it"). The complexity continues with IaaS, PaaS, and SaaS (and most places are using all three). Finally, there are multiple operating systems in use even in mostly-homogeneous sites.
The tools need to acknowledge the reality of that complexity: Local or remote, VM or not, container or not, single or multiple OS, and so on. Why? We need to effectively deliver and maintain applications.
What are the emerging best practices in this "modern data center" world?
- Heterogeneity — The idea is that more technologies, not fewer, will be in use. Homogeneity is the idea of a single paradigm, a single tech stack, and a single cloud provider, focusing everyone's energy on making the thingy way better. But in reality that's very unlikely. Things will continue to change for the better (though multiple iterations may be required and changes aren't atomic). Because things will be different and changing we may as well embrace that: Choose tooling and hire people that align with that multiple-paradigm reality, and be mindful of change. There are three types of heterogeneity we care about:
- Multi-cloud — Motivations are regional availability with associated legal issues, the feature sets of the various cloud providers, pricing (straightforward though artificial), and lock-in avoidance.
- Multitech — Motivation is the right tool for the job, the job market, and forced requirements (like SDK, walled gardens, and so on).
- Multi-compute — The meaning here is "physical versus VM versus container." Motivation is requirements (performance, legacy, security, and so on), and existing knowledge and experience.
There are challenges: management of resources and of costs, and the speed of light means data can only move so fast. There are some solutions to these: for the former, workload-management tools like Terraform and Kubernetes; for the latter, using unrelated workflows on different technologies.
- Data center as computer — The idea is that a data center is a single shared resource, not a collection of devices. Say "We have a data center with a cumulative capacity of so many CPUs, so much memory, and so many IOPS." The good news is that today developers treat DCs as monoliths anyhow. The test is whether a rogue person in the data center unplugging servers makes someone in the organization panic. The reality of the situation is that the number of servers and applications we want to deploy is growing at a breakneck pace. The value per application is going down (it used to be one big monolithic application that made or saved a lot of money; now applications are smaller and each makes or saves less). The major choices for keeping up are:
- Growth will slow down. This isn't likely in the short term.
- Train and hire more people. This isn't really scalable.
- Lower the cost/complexity of managing the servers and apps.
- Data Center as Computer — The idea is to separate the application from the data it runs on. You give the application a requirements spec (like APM, CPU, disk, whether it has to be PCI compliant, maximum latency, and so on). An automated system takes the application and spec and places it in the right place, or alerts if there's insufficient capacity to meet the spec. That allows for choice in lower-level components. Challenges include that managing it is complex, not all applications fit this model easily, and there's a lot of trust assumed (because you're giving up control). Solutions to this include (cluster or application) schedulers like Kubernetes and Nomad, but these are complex and take years to mature; state tracking (ephemeral where state doesn't matter, versus sticky where data can move but it's painful, versus persistent where the state moves with the data, though other schedulers may use different terms); and more heterogeneity. Tools like Terraform can manage this. (A toy placement sketch follows this list.)
- Declarative — The idea is goal-driven tooling where you describe where you're going but not how. The reality is that change requires a burden of knowledge that's much too large. The choices are:
- Hire or train more experts to scale with the number of requesters.
- Attempt to enable non-experts to do self-service through higher level abstractions.
Declarative is one way to do those higher abstractions. You describe the goal, not the steps, and trust something else to get there. Benefits are flexible choices for how to do it; it's easier to educate people because knowing what you want is usually easier than knowing how; and complex tasks are easily modeled. (Terraform can do this.) The challenge here is trust: You have to trust the system to get you to the desired end state, you have to accept the loss of control, and it's a difficult problem with large scope. Solutions here include schedulers like Terraform and EC2 (specifically the "run_instances" API) and using "plan" as a fundamental primitive to show the no-op. (A minimal plan/apply sketch follows this list.)
- App-Level Security — The idea is to use security at the application level, in addition to the network and server levels, as a core primitive. Historically it's been very us-versus-them with outside bad guys, so we'd have a global firewall (external versus internal) and server firewalls (iptables and route tables), but these were hand-curated... and internal communications were unencrypted. The reality is three-fold:
- Clouds blur the us-versus-them line. From a security standpoint we can't blindly trust physical buildings as "us."
- Threats come from within. Assuming only-external actors is bad, and internal actors can be both malicious and accidental.
- We have orders of magnitude more applications being deployed.
Choices here are:
- Manually curate all the rules all the time, and hire or train more experts to do so.
- Enforce security at the application level to enable less-expert people to make more-isolated decisions.
There are three trends here: end-to-end encryption, using identity (authentication) for all connections, and checking all requests for permission (authorization). That enables efficient growth; you still need to do perimeter security, but application security is now at the application level, not at the server or organization firewalls. The challenges are that TLS is hard, identity is hard, and modifying apps to use this model is hard. Luckily there are some solutions: Have a "security server" (like HashiCorp's Vault) as a single point of trust, and schedulers can orchestrate setup and the secure introduction. ("Security is all turtles and we won't go through them all.")
- People — The idea is that Dev, Ops, and Sec must work together. All the preceding is about enabling less-expert people to do the tasks (like deploy and create) and make significant changes safely.
(Note that this list is not exhaustive. For example, he omitted monitoring and that's "probably important.")
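To make the "data center as computer" placement idea above concrete, here's a minimal sketch of what such an automated placer does: take an application's requirements spec, find a node with enough free capacity, or alert if there's none. The spec fields, node names, and first-fit strategy are illustrative assumptions of mine; real schedulers like Nomad and Kubernetes are far more sophisticated.

    from dataclasses import dataclass

    @dataclass
    class Node:
        name: str
        free_cpu: int        # cores
        free_mem_gb: int
        pci_compliant: bool

    @dataclass
    class AppSpec:
        name: str
        cpu: int
        mem_gb: int
        needs_pci: bool

    def place(app: AppSpec, nodes: list[Node]) -> str:
        """Place the app on the first node that satisfies its spec,
        or raise (i.e., alert) if there's insufficient capacity anywhere."""
        for node in nodes:
            if (node.free_cpu >= app.cpu and node.free_mem_gb >= app.mem_gb
                    and (node.pci_compliant or not app.needs_pci)):
                node.free_cpu -= app.cpu
                node.free_mem_gb -= app.mem_gb
                return node.name
        raise RuntimeError(f"insufficient capacity for {app.name}")

    datacenter = [Node("rack1-a", 16, 64, False), Node("rack2-c", 8, 32, True)]
    print(place(AppSpec("billing", cpu=4, mem_gb=16, needs_pci=True), datacenter))  # rack2-c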
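Similarly, here's a minimal sketch of the "declarative, with plan as a primitive" idea: describe the desired end state, diff it against the current state to produce a plan (an empty plan is the no-op), and only then apply. This is a toy model with my own naming, not Terraform's actual engine.

    def plan(current: dict[str, str], desired: dict[str, str]) -> list[tuple[str, str]]:
        """Compute the actions needed to move from current to desired state.
        An empty plan means the system already matches the description (a no-op)."""
        actions = []
        for name, spec in desired.items():
            if name not in current:
                actions.append(("create", name))
            elif current[name] != spec:
                actions.append(("update", name))
        for name in current:
            if name not in desired:
                actions.append(("destroy", name))
        return actions

    def apply(current: dict[str, str], desired: dict[str, str]) -> dict[str, str]:
        """Show the plan, then trust the tool to converge on the goal."""
        for action, name in plan(current, desired):
            print(f"{action}: {name}")
        return dict(desired)

    state = {"web-1": "t2.small"}
    state = apply(state, {"web-1": "t2.medium", "web-2": "t2.medium"})  # update + create
    print(plan(state, {"web-1": "t2.medium", "web-2": "t2.medium"}))    # [] -> no-op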
The glorious future is when all this works perfectly. Note that it's focused on workflows not specific technologies. A lot of the preceding is stuff that's happening today, both at the hobbyist and enterprise levels. What about the future future?
"Serverless" (a misnomer, but we won't care about servers for some value of "we"), constraining the unit-of-deployment from the entire application to something smaller. Why? Application development is now more time consuming that deploying the application, miroservices to reduce the overhead logic, and cost. Potential applications are microservices, business intelligence, and batch jobs.
What does this mean for us?
- Short term, probably nothing
- Too many unknowns still
- Heterogeneity means everything else will still exist
- Servers run underneath "serverless"
After the morning break I went to one of the invited talk tracks, in part because I'm friends with both speakers and in part because I helped the session chair write the introductions (and we advised him to introduce each one as the other). First up was Branson Matheson's interactive and entertaining "TTL of a Penetration." In the world of information security, it's not a matter of how anymore... it's a matter of when. With the advent of penetration tools such as Metasploit, AutoPwn, and so on, plus the day-to-day use of insecure operating systems, applications, and Web sites, reactive systems have become more important than proactive systems. Discovery of penetration by out-of-band processes and being able to determine the when and how to then mitigate the particular attack has become a stronger requirement than active defense. I will discuss the basic precepts of this idea and expand with various types of tools that help resolve the issue. Attendees should be able to walk away from this discussion and apply the knowledge immediately within their environment.
The second talk this session was "What I Learned from Science-ing Four Years of DevOps" by Nicole Forsgren. Four years, over 20,000 DevOps professionals, and some science... What did we find? Well, the headline is that IT does matter if you do it right.
We discussed ways to make our data better, some surprises their team has seen over the years, and some highlights from the research: With a mix of technology, processes, and a great culture, IT contributes to organizations' profitability, productivity, and market share. They also found that using continuous delivery and lean management practices not only makes IT better — giving you throughput and stability without tradeoffs — but it also makes your work feel better, making your organizational culture better and decreasing burnout.
Lunch was on the expo floor again today, and was tomato-basil soup; romaine salad with tomato, red onion, avocado, bacon, and blue cheese; orecchiette with dried cranberries (among other things); turkey and swiss sandwiches; smoked ham and cheddar sandwiches; sour pickle slices; and potato chips. No dessert this time. I wound up having an interesting conversation.
After lunch I went to Thomas Limoncelli and Christina Hogan's invited talk, "Stealing the Best Ideas from DevOps: A Guide for Sysadmins without Developers." DevOps is not a set of tools, nor is it just automating deployments. It is a set of principles that benefit anyone trying to improve a complex process. This talk will present the DevOps principles in terms that apply to all system administrators, and use case studies to explore their use in non-developer environments.
Just like with applying the principles of punk rock to music or poetry or literature, you can apply the principles of DevOps to software engineering or employee onboarding or failover to get good things like higher productivity or reliability. Those principles are:
- The 3 ways of DevOps — A way to improve complex processes. For example, the siloed "project manager -> dev -> qa -> ops" flow. The three ways are:
- Improve system thinking (improve the process)
- Amplify feedback loop (improve the communications)
- Develop a culture of continual experimentation and learning (make it possible to try new things)
In a non-SWEng environment, it might be "recruiting -> hr -> it -> team" instead.
- The small batches principle — Artisanal coffee or bourbon is "better." Here, doing the work in small batches is better than big batches; "work check work check work check" is better than "work work work check." In practice they force failovers to keep the "small batch" of changes between primary and DR environment manageable.
- Minimum viable product (MVP) — Do the absolute minimum for the delta every time, over and over, more quickly, and incorporate feedback. This can also be applied to things other than software releases, using smaller feature batches.
After that talk I hallway-tracked for a half-session since nothing jumped out at me. I wound up having a "future of LISA" meeting with a couple of board members (Carolyn, Cat, and David Blank-Edelman) and past chairs (David Blank-Edelman, Mark Burgess, and Tom Limoncelli). We came up with some thoughts that the board members will take back for discussion at the post-conference wrap-up meeting.
After the break I went to the invited talk "Implementing DevOps in a Regulated Traditionally Waterfall Environment" by Jason Victor and Peter Lega of Merck. (The talk was also known as "DevOps Culture in Life Sciences.") DevOps is adopted in so many places, and its benefits are well documented, but despite this, it is not getting the same traction in regulated environments like healthcare. Is it truly impossible to implement DevOps at a regulated company when someone else makes the rules? Or is it possible to both challenge the status quo and still adhere to essential compliance and risk requirements?
They described why regulated companies like Merck, a 125-year-old pharmaceutical company, are challenged to change course. They explained the complexities of some of these regulations to give a better understanding of the challenge, and how the "path of most resistance" becomes the default release-management strategy trap.
They're midway through their multi-year journey to augment their traditional waterfall methodology with DevOps/Agile culture and methodology. They talked about their approach, their tool chain, and how they changed people's minds from "that will never work" to "that's the new way to work."
Regulatory processes are to keep people healthy and safe, so changes are usually slow. There's a perfect storm now:
- There's exploding sources of data and methods: IoT, genomic, historical, personalized medicine, data-driven invention, and open data.
- The arsenal of technology is growing: Big Data tooling and infrastructure, mainstream platform-driven ecosystems and economies, collaborative-based tooling, quantum, and open source.
- Talent & culture: Agile career mindset, cross-functional organization, crowd sourcing, and DevOps.
Impact: Today it takes 14 years from discovery (10,000 possibilities) to launch (1) of a drug. People suffer during this time, so reducing that even by a year would have a huge impact.
They believe the 4 key factors are fostering a collaborative community, standing up a DevOps platform, improving and gathering insights, and overcommunicating. (They went into more details on each, though I didn't take detailed notes.)
My last talk of the day was Jon Kuroda's "Catch Fire and Halt: Fire in the Datacenter." What do you do when you have a fire in the datacenter that takes your entire organization down until you can recover? Well, we found out the hard way when, on Friday, September 18, 2015, one of our research group's servers caught fire at the UC Berkeley campus datacenter, thus activating the facility fire suppression and emergency power-off systems and causing the outage of nearly all campus-hosted online services with recovery efforts lasting through the weekend. We will detail the circumstances surrounding the incident itself, examine the post-mortem process that followed the incident, and compare our experiences with those of other engineering disciplines after the occurrence of a critical incident. (Slides at tinyurl.com/firetalklisa2016.)
After sessions ended we were at loose ends for an hour or so until the conference reception at 6:30pm. I basically went up to my room, dropped off my bag (and plugged in the laptop so the battery wouldn't drain too much), swapped out my shoes, and headed back downstairs for the reception. The food was good — they had an Italian selection (caprese salad, gnocetti sardi carbonara with lemon cream and shaved parmesan, meatballs, olive and tomato salad skewers, and risotto arancini stuffed with fresh mozzarella and tomato coulis), a New England selection (clam chowder, crab cakes, and lobster rolls), and some generic world-traveler stuff (fish and chips, miniature shepherd's pies, and spinach and artichoke wonton tarts). Desserts were mostly bite-sized: Assorted cannoli, flourless chocolate tortes, key lime pie, red velvet cake squares, and white and dark chocolate cheesecake lollipops. (I've been pretty impressed with the hotel catering all week, honestly. They've been really good.) There was no silly theme or games or overloud music or anything, so from my standpoint it was a success.
I schmoozed with the office and A/V staff, and got to participate in some fun. Hilary said that a member of the Sheraton catering staff was the best; he demurred. So she said "Second best," and I grabbed a Second Place ribbon for him. Later she was feeling up Rich from MSI (the A/V team) and commenting about his soft skin so I had to tweet about it. (She's right; his upper arms are very soft and he doesn't shave that part of his body, so....)
I blew off the LOPSA annual meeting, but wound up attending the LOPSA After Dark party in the lobby bar, since (a) the de facto scotch BOF had no actual venue and (b) despite being invited by both chairpeople to the organizers' party I was informed shortly before its start that it was indeed for organizers only and I was uninvited therefrom. I wound up having good conversations over a caipirinha and would catch up with the rest of the folks from the organizers' party at tomorrow's Dead Dog so it's not like it was a big loss.
Today's a happy/sad day as it's the last day of the conference program. I woke up (before the alarm again, thanks body clock!), showered, caught up on work email (and a meeting that had been on Tuesday got moved to Monday... by someone who can never make a Monday meeting?), filled out a couple of surveys, registered for an internal health-maintenance program, then got dressed and headed down to the conference space for the last continental breakfast... and realized that despite the web site saying it opened at 7:30am like the rest of the week it actually opened at 8am instead. (Today was breakfast pastries; I had a raspberry turnover and a banana.)
Today's sessions began with Jane Adams' "Identifying Emergent Behaviors in Complex Systems." Forager ants in the Arizona desert have a problem: after leaving the nest, they don't return until they've found food. On the hottest and driest days, this means many ants will die before finding food, let alone before bringing it back to the nest. Honeybees also have a problem: even small deviations from 35C in the brood nest can lead to brood death, malformed wings, susceptibility to pesticides, and suboptimal divisions of labor within the hive. All ants in the colony coordinate to minimize the number of forager ants lost while maximizing the amount of food foraged, and all bees in the hive coordinate to keep the brood nest temperature constant in changing environmental temperatures.
The solutions realized by each system are necessarily decentralized and abstract: no single ant or bee coordinates the others, and the solutions must withstand the loss of individual ants and bees and extend to new ants and bees. They focus on simple yet essential features and capabilities of each ant and bee, and use them to great effect. In this sense, they are incredibly elegant.
In her talk, she examined a handful of natural and computer systems that illustrated how to cast system-wide problems into solutions at the individual component level, yielding incredibly simple algorithms for incredibly complex collective behaviors.
Put a few ants on a table and they'll wander around until they die of exhaustion. With enough ants (500,000) they'll look to form a nest or find food or build bridges or whatever. That's the kind of behavior she means by "emergent." In fact, this is the canonical example of emergence in complex systems. She was considering cities or transit networks as richly-complex networks with deep data sets. Instead she became interested in the data itself and the challenges of working with it.
Complex systems and emergence — what they are and aren't — are the thrust of the talk today. And some of the examples are complex systems made up of other complex systems. (Philosophically, it's turtles, er, complex systems, all the way down.) However, you can take the lower levels for granted, so you can focus on (e.g.) the colony and not the individual ants therein.
Complexity describes the behavior and captures the ranges of actions and states. So what's emergence, then? Emergent phenomena are the actual behavior, and emergence is the frequency of that behavior in those complex systems.
There's also both organized and disorganized complexity. The distinction between them is how you can describe the behavior of the whole. The latter can be statistically described (e.g., a container of gas); the former can't (e.g., a marketplace). Both have a bunch of parts and a set of rules that govern behavior, but only the organized complex system can show emergence.
How does emergence happen? Per Bak, Chao Tang, and Kurt Wiesenfeld (BTW — yes, really) published a lot of research on this, including self-organized criticality.
Just like an ant colony, computational systems need to be simple, distributed, and scalable.
Forager ants: They wait to go out until one comes back, and based on the timing they decide whether to go. If there's food nearby there are more frequent interactions and they go out more quickly, and if there isn't, there aren't and they don't. It's the interaction (antennae-touching) that determines behavior. Basically? They're doing TCP (at least the congestion-handling parts): You don't send another ant (packet) until one comes back, and you slow down if they don't.
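A minimal sketch of that analogy, assuming a TCP-style additive-increase/multiplicative-decrease rule (the talk drew the parallel; the numbers and the exact rule here are illustrative, not from the ant research):

    def adjust_outgoing_rate(rate: float, returns_per_minute: float,
                             threshold: float = 5.0) -> float:
        """Colony-as-TCP: if foragers come back quickly (food is nearby), send
        more ants out; if returns are slow, back off sharply and hold ants back."""
        if returns_per_minute >= threshold:
            return rate + 1.0           # additive increase: more frequent departures
        return max(rate / 2.0, 0.1)     # multiplicative decrease: mostly stay home

    rate = 2.0
    for observed_returns in [6, 8, 7, 2, 1, 9]:   # antennae-touches at the nest entrance
        rate = adjust_outgoing_rate(rate, observed_returns)
        print(f"returns/min={observed_returns:>2}  ->  outgoing rate={rate:.1f}")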
Starlings are another example. (We watched the 4-minute video from keepturningleft.co.uk by David Winter. See also https://www.wired.com/2011/11/starling-flock/.) How do they interact? In murmurations, each monitors its 7 nearest neighbors for changing speed and direction. "7" maximizes robustness, which lets them reach consensus in the face of uncertainty.
The solutions — forager ants doing TCP congestion handling and starlings doing robustness — are evolutionary successes. We're seeing natural selection for collective behavior. Evolution isn't favoring "get the most food" but "colonies that hold back in unfavorable conditions." The behavior is inheritable; child colonies (far enough away to no longer interact) behave the same way as their parents.
We've seen many biological analogs for computational problems. The ideas of complexity and emergence aren't new ideas at this point. Another: Slime molds and network routing.
We don't really have biologically-inspired computational systems yet. Why? Biological systems yield the solutions over many cycles of evolution, but you can't look at solutions and apply them to computers without first looking at the conditions for why that solution was right. There's also a problem of representation, specifying ALL of the aspects and how they ALL interact. If it's not well specified the model falls apart.
Top-down feedback is another problem in applying behavior to computers. An example is early web interlinking: "There's content, but if you don't know the address how do you find it?" How do you know it's best or right? Google took the linking behavior and used it to rank pages, on the assumption that something most-linked was best or right or whatever. (Emergence: No individual user determines the top result, but every user does.) Top-down feedback isn't really used in computational systems.
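A minimal sketch of that link-based ranking idea, as a toy power-iteration PageRank (my own simplification, not Google's production algorithm): no single page decides the top result, but every page's linking behavior contributes to it.

    def rank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
        """Toy PageRank: a page is important if important pages link to it."""
        pages = list(links)
        score = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new = {p: (1 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                targets = outlinks or pages       # dangling pages share score everywhere
                for target in targets:
                    new[target] += damping * score[page] / len(targets)
            score = new
        return score

    web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    scores = rank(web)
    print(max(scores, key=scores.get))  # "c" emerges as the top result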
In summary:
- It's hard.
- There's a lot going on in complex systems.
- The concepts are simple and rely on limited information.
- Simplicity and abstraction give rise to an incredibly diverse set of behaviors... which is hard to apply to systems and software.
- Understand what the success criteria are and what the consequences are.
After the morning break I went to the invited talk "Lost Treasures of the Ancient World" by David Blank-Edelman. In the deep, not-so-dark recesses of a former employer's data center lives an ancient server. This server was central to their infrastructure for years before he arrived and was still in active use after he left, 19 years later. Scared yet?
With the permission of its owner, David began an archeological excavation with this server as his dig site. What could he learn by studying the contents of a machine that was the backbone of the environment for at least 25 years? How has system administration changed over that time period? How has it stayed the same? What mistakes were made? What have we learned since then and what have we forgotten? Could it help us understand the future of our current state-of-the-art practices? All of this and more.
Dead servers tell no tales, but this server isn't dead yet. What does the past want to tell us about our future? In short, not that much.
The talk was run as a choose-your-own-adventure... with Calvinball references!
There's an RCS-based text file that's a database of host information that can generate DNS among other things. He found the 1995-02-18 RCS entry (Sparc 10/30, SunOS 4.1.3, 32 MB RAM). (In 1998 it sold used for $1,300. Today a Pi with much much higher specs is $36.) It's now a SunFire 280.
The next revision was a Sparc 20 running SunOS 4.1.4 with 64 MB (1995-07-21).
Next was [hostname]-new, an E450 running Solaris 2.6 with 256 MB memory in 1998; on 2000-12-12 it was decommissioned as [hostname]-byebye. CNET had it for MSRP $13,950.
Next was an E280R with 2 GB memory on Solaris 8 (2006-09-26). It went live as an E280R with Solaris 9 on 2003-05-22, and that's still the one there as of 2016-12-09.
It kept user info (GCOS, UID, GID), certs, and privileged information (like user complaints); it was the NIS master and the trusted-by-everyone machine. It didn't have to be up for the environment to run, just to change it.
They once managed to zorch one of the files that's essential for root access, possibly /etc/shadow. "How long will it be before anyone notices my group can't do anything?" They realized that nightly scripts ran as root on every machine to check whether the disks were full. The GNU version of du wasn't built by Sun, but by third-party humans... like dNb, who owned the file. He replaced it with another program that would fix the problem on every machine and then run du.
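(My reconstruction of the trick as a Python sketch, not David's actual script; the backup path and the location of the renamed real du are invented for illustration.)

    #!/usr/bin/env python3
    # Wrapper installed in place of GNU du: repair the zorched file, then behave like du.
    import os
    import shutil
    import sys

    REAL_DU = "/opt/gnu/bin/du.real"                 # hypothetical path to the renamed real du

    def repair():
        # Stand-in for whatever the actual fix was: restore a known-good copy of /etc/shadow.
        if not os.path.exists("/etc/shadow") and os.path.exists("/etc/shadow.bak"):
            shutil.copy2("/etc/shadow.bak", "/etc/shadow")
            os.chmod("/etc/shadow", 0o400)

    if __name__ == "__main__":
        try:
            repair()                                 # runs as root, courtesy of the nightly disk check
        finally:
            os.execv(REAL_DU, [REAL_DU] + sys.argv[1:])   # then act exactly like du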
The machines ran "environment-ify" itself to make sure everything was locally correct (pre-configuration-management). They were also self-healing (remember the /etc/shadow incident?): they used "watchme" to fix the machine if there were problems.
There were very env-specific things on it, including /opt/csw, a community-driven package-based distribution system like SunFree (if you ran one of these you had to run ALL of them). They also had /priv for machine-local private stuff (admin tools, machine-local data, sensitive data, etc.) and /arch for architecture-specific things. They also kept a news program for motd-like things that tracked whether a user had seen each item on a per-machine basis.
(Aside: What did it actually do? It was the admin host, the source of truth, an NTP host, did the privileged processing, and basically was the basket holding all of the eggs. And it was also the harbinger of technical debt.)
/something/something/user-problems contained problem reports about the users. They now have 25 years of that. The biggest problems were IRC and users filling disks with porn.
What can we get out of this? They did a lot of cool stuff. It was insecure and unwise in a lot of ways. File systems can't tell you why but they can tell you what and when, kind of. It takes a community to build a machine and make it useful. And users haven't changed in 20 years.
The last conference lunch followed that talk; today's menu was beet and quinoa salad, radicchio and endive salad with candied nuts, lemon-butter cod, herb-roasted chicken breast with shiitake mushrooms, roasted bliss potatoes, Boston cream pie, and seasonal fruit crumble.
After lunch was the penultimate invited talk, "An Admin's Guide to Data Visualization" by (LISA conference co-chair) Caskey L. Dickson. Go beyond the line and bar chart. He provided the essentials of presenting complex numerical data in a meaningful and actionable manner. Don't just toss up a table of unintelligible numbers. Use that information to tell a story and do it in a way that is compelling, not confusing. Learn the techniques and pitfalls to convey real meaning with your valuable data.
This talk covered the basics of data presentation including common techniques and pitfalls. The goal is to move people beyond the wall of numbers and enable coherent visualization of large data sets.
If you've ever been in a presentation where a wall of numbers is thrown up, leaving it up to you to find the meanings and trends, then this talk showed you how to convert that data into a story. Commonly used graph types were covered as well as discussion as to when and how each type is most appropriately used. Examples were provided of both good and bad cases of data presentation, and attendees got both an understanding of how to present data effectively as well as the psychology of how people interpret visual data.
Topics covered included:
- Why pie charts should die
- Common graphs (line, bar, scatter)
- Advanced graphs (box plots, cycle plots, trellises, and nightingales)
- Aspect ratios and "banking to the 45"
- Scales and axes
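(Not part of the talk's materials, but a minimal matplotlib sketch of two of the "common graphs," with a wide, short panel as a crude nod to "banking to the 45"; all the data here is random.)

    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.arange(60)
    y = np.cumsum(rng.normal(size=60))               # fake time series

    # Wide-and-short axes approximate "banking to the 45":
    # choose an aspect ratio so typical line segments slope near 45 degrees.
    fig, (line_ax, scatter_ax) = plt.subplots(1, 2, figsize=(11, 3))

    line_ax.plot(x, y)
    line_ax.set_title("line: trend over time")

    scatter_ax.scatter(rng.random(80), rng.random(80), s=12)
    scatter_ax.set_title("scatter: relationship, not sequence")

    plt.tight_layout()
    plt.show()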
That was followed by "Interplanetary DevOps at NASA JPL" by Dan Isla. At the Jet Propulsion Laboratory, real-time analytics for data collected from the Mars Rover Curiosity is critical when millions of telemetry data points are received daily. Building portable containerized data systems and tools that can be continuously deployed enables our Systems Engineers and Data Scientists to quickly develop, analyze, and share their visualizations and algorithms. With the AWS GovCloud region, export-controlled data can be securely stored and processed using the familiar AWS services and APIs that scale on demand. Containers, DevOps, and high levels of automation are the most important concepts when building infrastructure at scale that can be robust and operated by just a few admins. DevOps is more than just automation and fancy tools and is really about culture change within the organization. At JPL and other government agencies, legacy is everywhere from the apps to the ops; with the Analytics Cloud Services, we have successfully demonstrated ways to modernize legacy systems using containers to make them more secure and operable on modern infrastructure. In this talk, Dan will share how his team revolutionized Interplanetary Mission Operations and created a new paradigm for software development and collaboration at JPL.
After the afternoon break the last session began with some announcements. Speakers were reminded to send their slide PDFs for publication on the website, and attendees were asked to provide feedback via the surveys and to email a specific address with questions. The winner of the Capture the Flag contest from the security class was announced. People wishing to volunteer to help out at LISA 2017 were asked to email the chairs. The Build team had no food poisoning this year and reported 1128 IPv4 clients and over 200 IPv6 clients.
The conference concluded with the plenary session, "SRE in the Small and in the Large" by Niall Murphy and Todd Underwood. SRE is often perceived as a useful but relatively narrow role, appropriate only for large-scale systems engineering in very large organizations and irrelevant to everyone else. They tried to convince us that at its core, SRE contains a set of principles that apply as easily to a single-person startup as they do to Google. Along the way, they also tried to produce some evidence that even our bosses might find compelling about why our organizations should adopt some standard SRE practices.
SRE doesn't have to live only in large companies like the Googles of the world. Some other places doing it are Amazon, Dropbox, Ericsson, Etsy, Facebook, LinkedIn, Microsoft, Netflix, and Pinterest. The concepts apply everywhere — especially that small, early changes now make for a better future later.
So we've heard stories. Now let's generalize to 7 principles, approaches, techniques, or suggestions that let you SRE at any size:
- Load tests considered essential — The objection is that this is unrepresentative toil. But you'll have one anyway, either run by you or by your customers. The advantage of running your own is that you can stop it if things go badly. And even unrepresentative tests can tell you useful things about subcomponents. This one can maybe wait until later; it's not required initially. (A minimal load-test sketch follows this list.)
- Export application state to some kind of monitoring and logging system — This adds implementation time (adding printf() to the code). Judgment about where to place that instrumentation is tricky; you don't want every if statement instrumented, and you don't want to overdo disk I/O. Libraries make things easier, but many failure modes are application-specific, not system-visible. High-signal places: anywhere doing input or output, and progress through the process. Clear win.
- Monitor now, with (possibly bad) metrics — Objections are that monitoring the wrong thing can send you down bad paths and that monitoring can be slow. A small collection of well-understood metrics is still valuable even if imperfect. Vet the time series being added to keep volume and rates sane and dashboards fresh. Try Prometheus. Monitoring pays consistent and often large enduring benefits, and you'll have to do it anyhow. Clear win. (See the metrics-export sketch after this list.)
- Know and track dependencies — A dependency can be a library, service, authentication system, storage environment, whatever. You have to track them, and ideally prevent or alert on new dependencies, either on your own service or on the things your service depends on. Reasonable efforts give reasonable results. The objective is not to SECURE the infrastructure but to HARDEN it. Prudent dependency auditing and prevention can be built into standard libraries or services and incrementally deployed. It may not be worth it and can wait. (A dependency-check sketch follows this list.)
- Shard (databases) early and often — Some dislike this one. Sharding is about data storage: if you expect the data to grow and want to scale, dividing it among multiple shards earlier is easier. (You can reshard later to increase the number.) Precious servers are dangerous, and future growth is difficult to predict. Early sharding is cheap, but later sharding is very risky and expensive. It's hard to guess what to shard on, but a reasonable guess is better than none. It's cool but hard to get right, so plan it carefully. (A toy sharding sketch follows this list.)
- Distributed applications require distributed consensus — Use a consensus protocol like Paxos or Raft, usually via a service built on one, since there's compelling evidence that it's hard to implement right yourself. If you have computation or state that needs to be distributed and agreed upon, something will go wrong, and this is your only hope; the failure modes otherwise are pretty awful. Distributed consensus is not optional; do it. (See the coordination-service sketch after this list.)
- Identifiers are cheap future isolation — Every thingy you send out should have a label or identifier so you can address it later. For example, you can separate batch or job-engine traffic from interactive traffic. Keep that label throughout the infrastructure. If the objection is "adding things makes the code more complex," it's probably worth it given the risk/reward. Identifiers also go a long way toward making the system self-debugging. Worth it and pays off. (A request-ID sketch follows this list.)
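(A minimal load-test sketch using only the Python standard library; the URL, request count, and concurrency are placeholders, and a real load generator would do better, but it shows the shape of "run one yourself so you can stop it.")

    import concurrent.futures
    import statistics
    import time
    import urllib.request

    URL = "http://localhost:8080/healthz"            # hypothetical endpoint

    def hit(_):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(URL, timeout=5) as resp:
                ok = resp.status == 200
        except Exception:
            ok = False
        return ok, time.monotonic() - start

    def load_test(requests=500, concurrency=20):
        with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
            results = list(pool.map(hit, range(requests)))
        latencies = sorted(t for ok, t in results if ok)
        errors = sum(1 for ok, _ in results if not ok)
        if latencies:
            p95 = latencies[int(0.95 * (len(latencies) - 1))]
            print(f"p50={statistics.median(latencies):.3f}s p95={p95:.3f}s errors={errors}")
        else:
            print(f"all {errors} requests failed")

    if __name__ == "__main__":
        load_test()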
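(Metrics-export sketch: since the talk suggested trying Prometheus, here's a minimal example with the prometheus_client Python library; the metric names and the fake workload are mine.)

    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    # Hypothetical metric names; instrument the high-signal spots: inputs, outputs, progress.
    REQUESTS = Counter("myapp_requests_total", "Requests handled", ["outcome"])
    LATENCY = Histogram("myapp_request_seconds", "Request handling time")

    @LATENCY.time()
    def handle_request():
        time.sleep(random.uniform(0.01, 0.1))        # stand-in for real work
        outcome = "error" if random.random() < 0.05 else "ok"
        REQUESTS.labels(outcome=outcome).inc()

    if __name__ == "__main__":
        start_http_server(8000)                      # exposes /metrics on :8000 for scraping
        while True:
            handle_request()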
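(Dependency-check sketch: one cheap way to alert on new dependencies, using pip's JSON listing; the allowlist is hypothetical, and in practice you'd run this in CI against a reviewed file.)

    import json
    import subprocess
    import sys

    ALLOWED = {"pip", "setuptools", "requests", "prometheus-client"}   # reviewed allowlist (made up)

    def current_dependencies():
        out = subprocess.run([sys.executable, "-m", "pip", "list", "--format=json"],
                             capture_output=True, text=True, check=True)
        return {pkg["name"].lower() for pkg in json.loads(out.stdout)}

    unexpected = current_dependencies() - ALLOWED
    if unexpected:
        print(f"ALERT: unreviewed dependencies appeared: {sorted(unexpected)}")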
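(Toy sharding sketch: a consistent-hash ring in Python, showing why picking shards early and resharding later is tractable; the shard names are invented.)

    import bisect
    import hashlib

    class ConsistentHashRing:
        """Adding a shard later only moves a fraction of the keys."""
        def __init__(self, shards, vnodes=64):
            self.ring = []                                   # sorted (hash, shard) points
            for shard in shards:
                for i in range(vnodes):
                    self.ring.append((self._hash(f"{shard}#{i}"), shard))
            self.ring.sort()

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def shard_for(self, key):
            idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
            return self.ring[idx][1]

    ring = ConsistentHashRing(["db-shard-0", "db-shard-1", "db-shard-2"])
    print(ring.shard_for("user:12345"))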
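(Coordination-service sketch: rather than implementing Paxos or Raft yourself, lean on a service that already does. Here's a minimal example using the kazoo client for ZooKeeper; the ensemble address, lock path, and identifier are placeholders.)

    from kazoo.client import KazooClient

    # Hypothetical ZooKeeper ensemble; the point is to use an existing
    # consensus-backed service rather than rolling your own protocol.
    zk = KazooClient(hosts="zk1.example.com:2181")
    zk.start()

    lock = zk.Lock("/myapp/leader", identifier="worker-42")
    with lock:                      # acquisition goes through ZooKeeper's quorum-replicated log
        print("I hold the lock; safe to act as the single writer")

    zk.stop()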
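(Request-ID sketch: a minimal example of attaching and propagating identifiers; the header names are common conventions, not something from the talk.)

    import uuid

    REQUEST_ID = "X-Request-ID"
    TRAFFIC_CLASS = "X-Traffic-Class"

    def ensure_request_id(headers):
        """Attach an identifier at the edge; every hop after that just forwards it."""
        headers.setdefault(REQUEST_ID, uuid.uuid4().hex)
        return headers[REQUEST_ID]

    def handle_interactive(headers):
        rid = ensure_request_id(headers)
        print(f"[{rid}] frontend: handling interactive request")     # log lines carry the ID
        call_downstream({REQUEST_ID: rid, TRAFFIC_CLASS: "interactive"})

    def call_downstream(headers):
        # Downstream services can isolate batch vs. interactive traffic by the label.
        print(f"[{headers[REQUEST_ID]}] backend: {headers[TRAFFIC_CLASS]} traffic")

    handle_interactive({})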
The general objection is "Keep it as simple as possible (but no simpler) and this is too complex for now." Much of the infrastructure required to do this stuff is smaller than expected and easy to put into place.
(Microservices considered useful.)
So, towards a general theory of automation (and comics): XKCD 1309 and 1205 rest on some untrue assumptions, which makes their conclusions untrue... though they're still funny.
Some takeaways:
- Monitor what you can. Now.
- Export application state wherever possible.
- Add identifiers to your call stack and RPCs.
- Use distributed consensus for distributed computation.
For dinner, I went to Summer Shack with Todd, Trey, John, Dan, and Lisa. I got the lobster bake, which was plenty of food, especially since dessert would be at the Dead Dog party later in the evening. I got to the party a bit before 9pm (one host told me to go on up since it was open, but the note on the door said 10pm, and then I ran into the other host in the hotel lobby... Ugh), drank some wine, drank some scotch (Balvenie 16 triple cask), ate some imported Swiss chocolate, schmoozed a lot, and bailed around 11pm to go to bed.
On my way across the lobby to the other tower's elevators I saw several attendees having a drink in the Sidebar lounge before they headed up to the party. I chatted with them for about 20 minutes before heading upstairs to crash.
Today was a "free" day. I slept in (though not as much as I'd've liked; thanks, diuretics).
For the first time all week I had a hot breakfast. Branson loosely organized a post-conference breakfast; I joined him and Lisa a little before 8:30am at Cafe Apropos in the hotel. I wound up getting the $25 breakfast buffet and had a ham, bacon, and onion omelette with cheese, breakfast potatoes, sides of bacon and sausage, and strawberries, pineapple, and a banana over the not-quite 2.5 hours we were there. (Lisa was there first. Branson and I grabbed a 4-top next to hers. Over the next couple of hours we turned it into a 10-top and had no fewer than 16 unique individuals there.) We got to sit there through the evacuation of the upper floors (since we were on the ground floor and weren't told to leave) while the fire alarm sounded.
I should point out we were mildly hysterical over something on the conference Slack channel this morning. Caskey Dickson, one of next year's co-chairs, had given a talk about data visualization and part of that is how much he hates pie charts and really hates donut charts — pie charts with a hollow center — even more. He then posted this:
"I'm somewhere over Ohio and a kind passenger had the attendant pass me a note about the rainbow out the window that looks like a donut chart." And just to confirm, there was no rainbow. It was masterful!
After breakfast broke up I spent the rest of the day hanging out with some local friends who aren't affiliated with the conference or in tech.
I had a slightly late lunch with one local friend (Kevin) at 5 Napkin Burger since (a) it was nearby and (b) I hadn't been there yet this trip. I had the bacon-cheddar burger and it was as tasty as I remembered.
After that and a power-nap, I pre-packed: Everything, including the things I picked up at the conference, fits in the suitcase and there's room for the things I can't pack yet (like "today's clothes").
Around 5:30pm I met up with another local friend (Tom) for dinner at No Name Restaurant on Fish Pier, since when you're in New England you should go to the seafood places and this is apparently one of the best in the country. I wound up getting a cup of fish chowder (delicious) and the broiled seafood platter (salmon, scrod, swordfish, shrimp, and scallops) with fries (also delicious though the swordfish was perhaps a little overdone). We adjourned over to Assembly Row in Somerville for J.P. Licks. I got 2 scoops, 1 each of salty caramel and extreme chocolate and they were fantastic.
(I should note that I completely blew my per diem to hell. Grille 23, Legal Seafood, Fogo de Chao, Summer Shack, and No Name Restaurant, plus one hotel breakfast buffet? As of this point I've spent over $83 more than my allowed per diem and still have tomorrow to go.)
I got back to the hotel around 9:30pm and there was nobody I recognized from our group in the bar so I headed upstairs to pop the evening drugs and start getting ready for bed. It turns out that Mike C. was still around but feeling ill, so at his request I took the last 2 packages of Tim Tams off his hands and will bring them home in my luggage so he doesn't have to.
Travel day! It's time to say farewell to Boston (until no later than 2020, when LISA returns). I managed to sleep in almost to 7am, finally, which means it's definitely time to return home and re-shift the body clock earlier again (even though the latter is... not particularly desirable). Showered, then packed up the CPAP and toiletries.
While hanging out until it was time to leave I saw that the worst of the snow in the Detroit area was supposed to be 3–11pm. With a scheduled arrival around 5pm, that sounded less than ideal, so I contacted the airline and they managed to rebook me on the 12:05pm flight; I left the hotel early and took the T to Logan. I got there uneventfully, dropped off my bag, cleared Security, grabbed a leisurely lunch, and got to my gate just before they called my boarding group. Boarded, settled in, and had an uneventful flight to Detroit, landing just about 25 minutes late. My bag came off the carousel pretty quickly and I had literally zero wait for the terminal shuttle. I'm really glad I switched to the earlier flight; the roads were already icy and icky. I saw at least 3 spinouts and off-roadies, a couple of fender-benders, lots of fire trucks and a few cop cars, and occasional bouts of traffic standing still. I-94 was moving (when not a parking lot) well below posted speeds; I think my average speed was in the 30–35 MPH range. US 23 wasn't bad, but I only went from one on-ramp to the next off-ramp. US 12 and the rest of the way in was very, very slick.
(Another reason I'm glad I switched: My original 2:36pm flight was delayed to a 7pm-ish takeoff and didn't land until almost 9:20pm.)
Cranked the thermostat back up, decided against getting the accumulated mail that should've been delivered Saturday, unpacked, grabbed a quick dinner, and went to sleep in my own bed.