The following document is my general conference report for the 34th Systems Administration Conference (LISA 2021), held virtually from June 1–3, 2021. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.
Opening Remarks
The conference proper began with the usual opening remarks: Thanks and notes (the sponsors, program committee, liaisons, staff, and board). They briefly mentioned the code of conduct. We were reminded to go to the vendor sessions after the conference sessions proper, though due to the time difference I declined. (The conference was held 11am–6pm in my time zone, and I still worked a few hours before it began, so I didn't have time; instead, I had dinner.)
The conference chairs did share some thoughts about future conferences.
Keynote: Beyond Firefighter vs. Safety Matches: Growing the DevSecOps Pipeline
Our first keynote was from Amelie Erin Koran from Splunk. She asked if we're putting too much emphasis on a given role or title, and if we are putting a bottleneck in place.
DevSecOps lets you spread the dev, ops, and sec roles, skills, and responsibilities across multiple people - there's still some specialization. (Q: Bringing dev, ops, sec, net and other disciplines back into the same fold seems to suggest we need to create more generalists in the future. Is this a fair assumption? How can those folks continue to grow their careers over time? A: Yes... however, don't over-rely on generalists... this is the caution. My original talk noted that for security, scaling works best for generalization for a lot of the block and tackling, but still reserving some resources to hire and utilize specialists... carefully.)
The government thinks there'll be a shortfall of security people (skills gap). Having dev and ops take over is possible, but what's the learning curve, and are they even interested in moving over to sec to begin with?
Q: Education is one of the things we've struggled with in this industry for decades. What are some of the effective ways you've found to improve this? Where should the industry be focusing energy?
A: It seems for security, at least, they are trying to strike at college and work down to elementary education to address the pipeline, but for coding skills, it's the other direction, getting coding skills early in life...
What has and hasn't worked?
- Accountant problem — Market thinks there's no need to pay for the skills.
- SEAL problem — AKA rockstar, ninja; professionalism of the space isn't advanced enough yet. People get credentials for skills they don't have. Need to look at the actual history of deliverables.
- Astronaut problem — [Data missing.]
- Firefighter problem — Best of intentions, work ourselves out of a job, burnout or pivot, pushed to automate... but the skills are still needed.
Gap analysis — identify what's missing and then identify (or design/develop) solutions. Do you need a generalist? Highly-specific skills needed? Document what you've done and train your successors or the staff remaining when you leave.
Balance work and talent: Where are you, where do you need to be, and how do you get there from here; tactics, techniques, and procedures.
Q: One argument for the "generalist" model is that folks develop multiple skillsets and then experience high degrees of career mobility. What are the long-term career benefits of this "specialist" role model?
A: For some people it fits what they want to do well (especially if you like a deep dive). For most people though it doesn't fit what they want to do; it breaks mobility if the specialty disappears.
Keynote: Lessons Learned from a Ransomware Attack
Our second keynote was by Ski Kacoroski of the Northshore School District. They got hit by a ransomware attack; such attacks are becoming more and more common due to small IT staffs, bug bounties, and security not being a priority for the district. In their case, Emotet was installed in Mar 2019 and the data auctioned off on the dark web, then Trickbot was installed in Jul 2019 and the data auctioned off again, and finally Ryuk was installed in Sep 2019. The attack itself started at 11:37pm Friday, September 20, to minimize reaction time since payroll was going to run Tuesday, September 24.
An additional complicating factor is that their normally-airgapped backups weren't actually airgapped because they were synching backups — so it was all corrupt. Their NAS was okay and the remote backups were too, but all of the Windows servers and AD domain controllers needed to be rebuilt from scratch.
Lessons learned include:
- Follow incident response process (see previous LISA talks for details).
- Relationships are critical.
- Airgap backups 100% of the time.
- Look for informal/accidental backups.
- Insurance will impact you (once involved you lose a lot of control). They consider cost to repair vs cost to pay ransom and if paying the ransom is cheaper they'll pay out... assuming you even have ransomware insurance.
- Forensics are time sinks.
- Need tracking processes (what happened where, what's on deck, what costs are covered by insurance, what're the interactions, and so on).
- Need a project manager to track what's happening and what's next, and let the tech staff focus on the tech stuff. Include daily/handoff sitreps.
- Understaffing hurts: Lack of crosstraining leads to bottlenecks, projects are only partly completed, docs aren't complete, decommissioning deprecated systems is deferred, patching never happens, stuff pushed off to contractors/vendors means local staff don't know things, and so on.
- Need more storage (can you duplicate everything? Including snapshots? Windows ACLs required new SIDs because new AD domain...).
- Lots of trial and error when rebuilding infrastructure from scratch.
- Rebuilds take time. Having templates ready in advance is helpful. May need to find media or licenses to rebuild things.
- Used the rebuild to clean up the environment (no legacy systems, patch/upgrade everything, only restore files asked for, and so on).
- Move services to SAAS, even if only temporarily.
- Many temporary things: Work areas on file servers, training people in accessing the temp spaces, temp servers for forensics and dbs and services, temp desk and office space for all the helpers; may need to steal spaces for quiet areas.
- Workstations had to be checked before being allowed back on the network, and all 1500 workstations needed to rejoin the domain and migrate profiles over.
- SCCM restores weren't possible because of the ACL changes.
- Surprise applications: You'll find things you didn't know were critical, or that weren't backed up.
- Time-of-year surprises: HVAC and intercom needed hands-on work. Door locks and school bells when DST changes. Football scoreboard being aired on ESPN.
- Keep end users informed. Being as transparent as possible was helpful.
- AD side effects: Could use Open Source LDAP tools to pull SIDs, but the new domain — and the O365 domain — meant a lot of extra work. Rebuild GPOs from scratch.
- Lock down admin accounts. Keep user and admin accounts separate (in their case they use different suffixes for -u[ser] and -su[peruser] accounts).
- Heterogeneous is good.
- Take care of your people. Take breaks, feed people, require overnight rests. Ring a cowbell when a system is recovered. Say No when you need to.
- Uncertainty impacts. What's left behind? How much time will you lose to false positives?
The root cause analysis identified several causes: Understaffing, security shortcuts, insufficient antivirus software, backups not airgapped, crufty systems, eggshell security using only firewalls, inadequate monitoring, and not enough game planning (earthquake yes, but ransomware no).
Some saving graces: 40% services and 95% workstations were okay, NAS snapshots were untouched, the databases were backed up to NAS, and they could recover SIDs and CNs.
Some advantages of being attacked: It gives them the ability to make changes (training, 2FA, upgrade OS/patching, get rid of very old files).
Q: What would you do differently?
A: Better AD backups (cloud-hosted and so on). Plan better up front to reduce repeated work. Working with contractors again would be faster as they're familiar with it already.
Q: The way the insurance company was involved (forensics and so on) was interesting. Were you ready for it?
A: No, it was amazing. Their demands were specific and they wanted it done sooner. The impact was surprising. Cyber-insurance — especially now that more are being hit — is now more strict if you want coverage and then payment after an incident.
Q: Was there anything more surprising?
A: The surprise systems. Ski had been there for 17 years, yet nobody knew the food POS systems had local storage or that they handled $30,000/day in sales; and some systems are very time-sensitive (door locks, school bells) or weather-sensitive.
Q: There's a lot of Men In Black neuralyzer here. Did the incident response folks help you retain the lessons? How involved were they?
A: No. They're responsible to the insurance company, they say what you need to do ASAP, and there's no written report afterwards (since it's discoverable); all you get at the end is an oral report. It may have recommendations — but some of those won't be possible ("hire six more staff"). They only care about plugging the old hole and preventing it from happening again.
Q: Has this changed budgets?
A: Yes. Security is now a top priority. They had two budgeted positions the district wasn't allowed to fill, but this let them fill them.
Q: Can you share requirements of the audit so people can plan?
A: Yes. You have to have anti-malware documented and on all machines. Backups must be airgapped. Ask any cyberinsurance company and get their form.
Q: Have procurement processes changed?
A: Not enough. Instructional or a department will find something they like and IT is only brought in at the end. Now IT can do push-back like SAML login required. It's still an ongoing negotiation... but IT will lose the battle if the department or Instructional is pushy.
Q: Would a table-top exercise have helped see what the critical points were?
A: Maybe. It's only good for what you know (e.g., the POS system that only Finance knew about). If other departments were involved it'd've been better.
Q: Bringing consultants onboard?
A: They got there the day after they called: Hit early Sat, called Sat afternoon, had people Sun morning.
"Disorganizing" Your SRE Organization
After the morning break I went to the invited talk by Leonid Belkind, CTO of StackPulse. What does it mean to disorganize the organization?
- Democratize responsibility to all engineers:
- Treat reliability like a feature and build it in for every task.
- Train developer teams on monitoring/alerting, observability, error budgets, SLOs, and incident response metrics like MTTD/MTTR, and so on.
- Make every developer (including leadership!) part of on-call work.
- Empower autonomous but consistent action:
- People handling incidents should feel empowered to have all the relevant data and to take relevant remediation steps.
- It is much easier to ask for help when you're in a room with people but it's not so easy to reach out remotely.
- Build playbooks for every workflow — never do the same thing manually twice.
- Turn on-call or incident response into deterministic code, and make that available as "modules" to developers. For them, this means:
- Declarative playbooks/workflows
- Encapsulated process steps
- Four parts to each (there was an example in the slides; they use YAML; see the sketch after this list):
- Enrich — Append environment and application context, assess customer impact, and assign severity
- Triage — Rule out possible causes, focus on suspicious signals
- Communicate — Open war rooms, create/update incidents, communicate with on-callers and stakeholders, and so on.
- Remediate — Bring the service environment back to operating state.
- Common language and format for all playbooks (no exceptions!).
- Educate and train team with playbook artifacts versus wiki articles.
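To make the playbook-as-code idea concrete, here is a minimal sketch in Python (StackPulse itself uses YAML; the playbook layout, step names, and runner below are purely hypothetical illustrations, not their format):

```python
# Hypothetical illustration of a declarative incident playbook: the playbook is data
# (a Python dict standing in for YAML), and a small runner executes the encapsulated
# steps phase by phase. All structure and step names here are invented.

PLAYBOOK = {
    "name": "service-latency-incident",
    "phases": {
        "enrich": ["fetch_environment_context", "assess_customer_impact", "assign_severity"],
        "triage": ["rule_out_recent_deploys", "focus_on_suspicious_signals"],
        "communicate": ["open_war_room", "update_incident_record"],
        "remediate": ["restart_unhealthy_instances", "verify_service_recovery"],
    },
}

def fetch_environment_context(ctx): ctx["region"] = "us-east-1"          # placeholder steps
def assess_customer_impact(ctx):    ctx["impact"] = "elevated p99"
def assign_severity(ctx):           ctx["severity"] = "SEV-2"
def rule_out_recent_deploys(ctx):   ctx["recent_deploy"] = False
def focus_on_suspicious_signals(ctx): ctx["suspect"] = "db connection pool"
def open_war_room(ctx):             print("war room opened for", PLAYBOOK["name"])
def update_incident_record(ctx):    print("incident updated:", ctx)
def restart_unhealthy_instances(ctx): print("restarting unhealthy instances...")
def verify_service_recovery(ctx):   ctx["resolved"] = True

def run(playbook):
    """Execute every step of every phase in order, threading a shared context dict."""
    ctx = {}
    for phase in ("enrich", "triage", "communicate", "remediate"):
        for step_name in playbook["phases"][phase]:
            globals()[step_name](ctx)
    return ctx

if __name__ == "__main__":
    print(run(PLAYBOOK))
```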
How'd they do it?
- Spend one month on-call and playbook writing spread across all developers.
- Spend 3 months reviewing incidents and resolution metrics weekly, outlining missing pieces and scheduling their development.
What did they learn throughout the process? Mostly about the human part of the process; the technology more or less just worked:
- Accept the new normal.
- Build to the individual. Trying to keep "everyone in the room" with Slack, Zoom, Discord, and so on doesn't work; it leads to increased fatigue, poor responsiveness, and low morale. Balance critical need with personal preference (introvert vs extrovert).
- Explicitly build culture. Directly share that you're working on culture as a project, define changes in day-to-day responsibilities, and build opportunities for informal interaction that use different formats.
- Terminate loops locally. Redivide responsibilities, empower with playbooks as documentation, 4-eyes verification only for critical issues, and measure performance and share with the team.
Six months in, enrichment and RCA were over 60% automated, MTTR was reduced by 35%, playbooks were used to manage incidents from creation to post-mortem, and over half the team has led incident response.
Over a year in, playbooks as code allow for uniform processes and don't fail to deliver; 85% of the steps are automated.
How would someone else get started?
- Be open with your teams. Explicitly explain that the organization is embarking on a journey to change its culture.
- Identify individuals that are passionate about it and involve them in leading the efforts.
- Let the teams drive choices of automation tools. Technologists enjoy solving problems with tools much more than they do with manual processes. Tools do matter.
- Don't assume that people will tell you how they feel or how confident they are. Constantly monitor the "soft" metrics.
Everything We Did Wrong to Do Accessibility Right at BuzzFeed
My next talk was by Plum Ertz and Jack Reid of BuzzFeed. When working to make the site fully accessible they learned a lot of lessons in communication, logistics, and bravery. Ideally they'd've built accessibility in from the beginning (2006) and not when they did (2018). They didn't set themselves up for a good partnership with their external auditor. There'd been engineers who tried locally to push it, but nothing holistic. Lawsuits (or threats thereof) meant they could pay for experienced external auditors.
The developers and the auditors weren't that near each other.
They got leadership mandate but not really buy-in. Business-critical priorities prevented staffing to get a head start before the results were ready. So they got the report (around 400 issues, from easy to systemic), but nobody to work on it. They chose to divide and conquer and put things on existing teams as much as they could... except a lot (most) of the work wound up in the catch-all one-person week-at-a-time team and not any of the other teams.
They manually imported everything into Engineering's JIRA, not the third party's system, so since the other teams were involved it had to be JIRA like the rest of the work. Keeping the two systems in sync was manual. (Should've required API access in the contract.)
They asked a lot of unprepared people to do a lot of stuff, adding things to their roadmaps.
They didn't build flexibility into the plan. BuzzFeed laid off 15% of their workforce in Jan 2019. The silver lining was that people were looking for things to do and Accessibility had a big backlog of tickets. But they were too casual about turning it into a formal project and getting it on the roadmap, waiting on the bigger picture from above.
They told the story in terms of dollars instead of sense. Accessibility is a long-term investment and not a quick win. They were perhaps so focused on specific KPIs that accessibility wasn't a priority. Leadership wasn't disabled and thus not that interested. They should've treated accessibility as foundational and not as a feature. (The engineering cost would be the same either way. Project management and auditors were overhead expenses. They were sending things to the auditors piecemeal, which burned through their hourly budget faster.)
Another mistake was asking for permission not forgiveness, especially for the legacy code (which may or may not have had owners). They wound up flipping the script and making the change (and nobody objected).
They saved the hard stuff until the end because of working piecemeal. (ALT text for images was the big problem.)
On July 10, 2020 they got their letter of compliance. Accessibility isn't one-and-done but a way of thinking.
What did they do right?
- Let the students become the teachers. Let the users guide content training.
- Took accessibility off the bargaining table: Do it right from the start or pay a lot to fix it later.
- Built a culture of empathy and expertise around accessibility.
Kind Engineering: How to Engineer Kindness
After the lunch break I went to the Kind Engineering talk by Evan Smith of Solvemate. He started by quoting Tanya Reilly: "Kind is about being invested in other people, figuring out how to help them, meeting them where they are."
His talk had four major areas.
Code Reviews
Tone matters. Understand the WHY, not just the WHAT and HOW. Assume positive intent and intelligence. Assume instead that you're missing something; ask open-ended clarifying questions, but not aggressively or challengingly. In the context of code review, consider preceding nit-picky comments with "nit:" or something. But is the nit-picking a sign of a larger problem, perhaps something (e.g., formatting or indentation) that can be fixed automatically?
Know when to switch from asynchronous communication like a code review to synchronous communication. The latter can be private, and public criticism is hard. Is there something one side or the other isn't seeing? Talking it out in real time may make it easier to clear up.
Honesty
Be more than professional. Care about people and bring your whole self. Include the positives when you challenge someone (even though people tend to distrust praise). Admit when you're wrong. White lies aren't evil but they don't help.
Note the difference between NICE ("Good job in the meeting") and KIND ("Your answer was rambly and you missed the opportunity to convince the team, but it's a good idea so practice your elevator pitch").
You're building rapport with people, fostering the connections between people to build trust.
Psychological safety
- Be the first to ask for (encourage) feedback, especially negative feedback: What should I stop doing? Keep doing? Start doing? Seeking out criticism makes you vulnerable and more open, builds rapport with people, and makes them more amenable to criticism from you.
- Be inclusive. Is everyone contributing in meetings or documents? Find a way to give the quiet ones a voice. Prompting might be sufficient, or reevaluate how people can contribute. Be open to different backgrounds and different life experiences. Not everyone is comfortable speaking out in a meeting; should there be an anonymous feedback form? Let them use the format they prefer (voice, email, document, and so on).
- No blame. Often an individual failure is actually a process, environment, or workflow failure. People feel safer calling out process/workflow/pipeline problems when blame isn't a thing.
- Turn failure into learning. It's important to recognise that failure is not absolute. It's not the end of the world, and we may not recognize it, but we (should) learn from it and do it better the next time around.
Feedback
Give and receive both positive and negative feedback.
- Ask for criticism first.
- Don't make it personal.
- Be specific about praise and criticism.
- Is there a solution?
Giving has three steps — emotion, credibility, and logic.
Receiving:
- Understand feedback preferences. Understand how you prefer to receive feedback and make sure people know or correct them.
- Listen, understand, and thank the person.
- Don't respond immediately. Don't react in the moment. Take some time to gather your thoughts and process it.
- Ask for clarifications or examples where you can.
BPF Internals
I switched between IT tracks and went to Brendan Gregg's "BPF Internals" talk next. It was mostly kernel internals and while I followed along I didn't take notes.
Lightning Talks
After the break I went to the Lightning Talks. There were four:
- Daniel Xu, below: An interactive resource monitor for modern Linux systems. It replaces atop, which has no cgroup awareness, compressed data that can be corrupted, and a poor UI. It's used at Facebook by their resource control team for post mortem analysis.
- Ben Cotton (RHAT), Stop writing your own infrastructure. Infrastructure as code is good (automation yay). Applications themselves? Don't write them from scratch. Open Source and/or COTS apps exist; is your time spent writing, testing, and debugging the code really adding value to your project or company?
- Caskey Dickson, Let's go to the colony [Mars]. Plan your OS updates: know the time available and the number of servers so you can figure out how many to do and how often, allowing for testing/verification before a mass rollout to the new OS.
- Alolita Sharma (AWS), Adding metrics support to OpenTelemetry. Adding metrics requires updating semantic conventions, the data model, the data protocol (aka OTLP), the collector, receivers, exporters, and language SDKs. Work is going on in parallel; see opentelemetry.io/status for more.
Plenary: Computing Performance: On the Horizon
Wednesday began with Brendan Gregg's plenary where he gave a performance engineer's views about industry-wide server performance and some predictions in several areas.
Processors
Clock rates increased... but practically we're maxed out at 3.5 GHz. There are exceptions but we're scaling horizontally (more cores, more threads, more server instances)... which puts pressure on the interconnect rate (over the past decade we've increased the bus rate 3.25x but core count 6x). Side effect: CPU utilization counts "busy" even if stalled/waiting. Lithography is getting tinier: A silicon atom is 0.1nm, and lithography is down to ~2–3nm... except that's a marketing term with no reference to reality. Expect to hit the limit by 2029.
(Chip shortage may last into 2023. Oi.)
Cloud chip race, too — Amazon ARM/Graviton 2 is in production already.
And now we have GPUs, FPGAs, and TPUs to accelerate processing depending on your workloads. Remember to monitor them too.
Predictions: Multisocket is doomed and will become an edge case. We're seeing 80–100 cores on a single socket plus cloud-based horizontal scaling, so why pay NUMA costs?
The future of simultaneous multithreading (hardware) is unclear: Performance variation, ARM cores are competitive, and after Meltdown/Spectre it's turned off a lot.
Core count limits — more general-purpose cores max out memory, kernel/app lock contention, power consumption, and so on, so there will be a de facto practical limit.
More vendors meaning more choice, but be careful about optimizing for the benchmark.
Cloud CPU will have an advantage; vendors have >100k workloads to analyze directly and can use that to aid processor design, possibly with machine learning to help.
FPGA won't be adopted widely (beyond cryptocurrencies) until there's major app support.
Memory
Many workloads are memory I/O bound. DDR5 (coming out this year) has a faster bus but needs processor support. Samsung has a 512GB DIMM. DDR latency hasn't changed in 20 years because they use the same 200MHz memory clock. Lower-latency DDR exists but it's not seeing widespread server use. Expect high-bandwidth memory (HBM) to grow: Used by GPUs now, can use 3D stacking, and can be provided on-package (on the CPU itself).
Expect another memory tier below DRAM but above SSD/HDD.
Predictions: DDR bandwidth will double every ten years. Don't expect single-access latency to drop in DDR6. DDR5 can get up to 2x wins based on workload. HBM-only servers may happen, especially in the cloud. The extra memory tier is probably too late to market.
Disks
Perpendicular magnetic recording and multi-actuator technology will help performance. Shingled magnetic recording gets 11–25% more storage but gives worse performance; good for archival workloads.
Flash memory disks have gone through a lot of technologies; affects the block erase cycle limits. SSDs have their own performance pathologies.
Storage interconnects are getting faster (and the specs include reliability, power management, virtualization support, and so on).
Predictions: Slower rotational disks for archival workloads. 3D XPoint will work as a rotational disk accelerator and as petabyte storage. More flash pathologies: Worse internal lifetime, more wear-leveling and logic, and more latency outliers.
Networking
Latest hardware uses 400 Gb/s now and 800 Gb/s is coming. Protocols and TCP congestion control algorithm changes. Things are getting more complex as more tunable performance features are added.
Predictions: BPF in FPGAs, massive I/O transceiver capabilities. Cheap BPF routers with commodity hardware. More demand for network performance because apps need more and more network.
Runtimes
Predictions: FPGA as compiler target. io_uring I/O libraries. Adaptive runtime internals. 100-core scalability support.
Kernels
io_uring uses shared ring buffers for faster syscalls (batched I/O). eBPF everywhere (incl. on Windows!). eBPF == BPF. BPF Future will include event-based applications to run in kernel space but sandboxed. Emerging BPF uses include observability and security agents.
Predictions: File system buffering and readahead, the CPU scheduler, and other policies can be kernel-based. Kernels will become automatically JITted. Kernel emulation will stay slow. OS performance: Linux will be more complex and have worse performance defaults, BSD has high performance for narrow uses, and Windows has community performance improvements. Unikernels will get one compelling use case.
Hypervisors
cgroup v2 rollout and scheduler adoption increasing. VM improvements plus lightweight VMs to boot superfast like a container with the kernel inside the guest.
Predictions: Should see containers everywhere, and longer term more containers than VMs but more lightweight VM cores than container cores.
Evolution: FaaS → container → lightweight VM → metal, for light → heavy workloads, respectively.
Cloud: Microservice IPC cost drives the need for container schedulers colocating _ and cloud-wide runtime schedulers colocating apps
Observability
BPF FTW. OpenTelemetry, Grafana.
Predictions: More front ends (bpftrace and libbpf-tools). Too many BPF tools. Expect GUIs to find/execute the tools you want. Flame Scope adoption to show deviations between multiple flame graphs/heat maps.
Plenary: Performance Analysis of XDP Programs
Next I attended the plenary talk by Zachary Jones of Verizon Media on XDP. At a high level, XDP lets them optimize the Linux kernel to improve server efficiencies for their media platform:
- Why is measuring XDP/BPF performance hard? "Time" is measured in different ways, and precision varies between hard and soft IRQ contexts. That ignores hardware time. The 1% utilization they saw in the lab was 60% in reality — be careful that if you want to measure A but measure B instead and conclude C, you're screwed.
- What's Verizon Media's approach? They focused on the BPF code inside of XDP, specifically on analytical budget time measurement, building block microbenchmarks, and instruction level sampling and annotations.
- What were the outcomes? The methodology works. Sometimes you need to break problems into smaller chunks. Active Benchmarking is key for analysis success. The introduction of BPF struct_ops and LSM hooks present opportunities for performance analysis beyond XDP programs.
- What's the ongoing work? BPF and the network stack are always changing. Measuring and understanding jitter to ensure consistent performance, visualizing instruction-level sampling and annotation, and the new consumer/producer-based test harness that Facebook added last year to kernel selftests.
Organizational Design for Technical Emergency Response in Distributed Computing Systems
After the break I went to Adrienne Walcer and Alexander Perry's talk about emergency response. Adrienne is on the Google SRE Disaster team. She started by discussing the June 2019 Maya disaster, in which a code rollout caused cascading failures leading to internal tool outages, increased alerts, and mass pages. But the bottleneck was the network on-call person, and there was a lot of confusion. The outage took down Gmail, Snapchat, and YouTube for up to three hours.
When the scope becomes sufficiently large, where component incident responders can't see the whole picture, they switch to system-of-systems (SoS) responders as a second tier.
- Component response — Single-system experts: they know the local service in depth, know about service-specific mitigation strategies, and can try to keep their component from affecting the whole stack. They have Infrastructure, Product Service, and Internal Service components.
The June 2019 failure was most of the Internal Service components going kerflooey. Because the multiple internal services were working in parallel, nobody had a holistic view.
- SoS response — Given a sufficiently-complex environment, you need an organized, coordinated, empowered response. These people are multi-system incident managers: holistically focused, skilled generalists who can diagnose systemic behaviors and identify root issues. At Google there are two types: Product-Focused incident response teams (IRTs) that look at incidents pervasive across broad swaths of a specific product or similar products, and the technical incident response team that responds to and helps coordinate, mitigate, and resolve major service outages across Google (often due to incidents with broad or indeterminate causes).
Hierarchy: Technical IRT → Product IRTs → Component responder. They use a common protocol (with clearly defined roles in their incident response process), trust (responders have the authority to handle the incident without needing to seek it), respect (everyone is comfortable escalating as needed; psychological safety is created and maintained), and transparency (everything is available company-wide).
Back to June 2019. The network component on-caller paged the tech IRT, who could formally assume incident command, assess the current state, organize people to coordinate the moving parts of the response, set priorities and delegate tasks, secure additional resources where needed, and remove administrative and communications burdens from the folks that can implement mitigations — and the network-savvy people were working on operations more than taking over IC.
Once service was restored and the incident closed, there was a very detailed postmortem. They could spin off engineering resources to address the root cause and trigger conditions and prevent recurrence. They also rewarded the people involved for their efforts.
Q: How are incident responders trained?
A: A couple of different ways. First they have a really robust onboarding program and exercises. "SRE EDU" is a week-long deep dive into Google tools, techniques, and systems, including a couple of practice emergency scenarios in a demo environment. That gives psychological comfort. There's also a more robust incident response training program. And then there's the disaster/resilience component; most people are used to the tools for their systems. The test scenarios are designed to be on something the person cares about, using the incident response protocol. They do it in the small scale as well as larger scale, where multiple teams are given an infrastructure failure to try to coordinate and solve.
Specifically to roles, most people know what roles they prefer to have. If you really hate the role you've got you can swap with others. The ICs tend to be those willing to do it and not those paged and forced to do it. They've generally had the whole Incident Management At Google (IMAG) process training. A lot of it is learn-by-doing, though, even given practice incidents. Mentoring and shadowing helps; the IC has command but has a hidden backchannel with people who can give advice without making the IC look bad.
Q: How do you implement this in smaller environments or with a smaller pool of people?
A: A lot of people will lean on the generalist side of things. The smallest way to implement is to have N+1: a second person on-call or available to come help, or some understood escalation path to someone more senior. Keep the architects on their own on-call rotation for escalation. You don't want people to overspecialize to the point where they can't help out.
It's tempting to be optimal and let the experts (not management) handle production. But if the company is at risk then management wants to get involved... and you need to keep them from interfering and slowing things down. They need to have exposure to the incident process before joining an incident. Rotate the roles through the entire company. (CEO can't be in charge if the IC is.)
Q: What information do you bring together for the post-mortem? Who sits in on that discussion?
A: There's not just one discussion. Those involved discuss in writing in detail all the aspects of the incident. They're large enough and global enough that getting everyone in a room is problematic. They collect all the potential data (log sources), recreate timelines (who did what when), and then investigate the success of mitigations. There's also system analysis and reflection work to find root causes or triggers. They also cover what went well (and what didn't) and where they got lucky. And then open bug reports to prevent things from recurring.
Groove with Ambiguity: The Robust, the Reliable, and the Resilient
Next up for me was Matt Davis' talk about how the words robust, reliable, and resilient can be ambiguous. Why are things considered one thing and not another? If we're building something to be robust is it also reliable?
Complexity is a cycle: Discovery, adaptation, emergence, and ambiguity. Emergence in software is a lot like music: Compare sheet music (staves, notes, dots, lines, but it's silent) to the emergent music as it's played on an instrument.
Core attributes of a complex system: "Diverse, interdependent, networked entities that can adapt" (Scott Page, Diversity and Complexity).
Rather than use a dictionary he'll use prepositional phrases:
- Robust to failure. Example: The groove of a 33rpm record is effectively the waveform of the encoded music that the record needle uses to play it. How do we make the record robust to failure and stay economical? Edison's wax cylinders (1877), Berliner's brittle shellac (1895), and eventually electromechanical recording on more durable materials. In the 1950s we standardized on PVC because it's robust to failure (the groove remains intact).
The software equivalent is high availability. Redundancy is designed so that if one part of the system fails another part covers its function. It decreases efficiency and increases cost, but it's complexity we accept to get the higher guarantees. Some systems are HA by design for scale and scheduling (Kubernetes) or have distributed data replicas (Kafka).
Fallbacks are a cousin to high availability. They're more granular forms of availability that sacrifice accuracy or consistency. In software this can be a down-for-maintenance page or static/older data.
- History in reliability. Phonographs used to be belt-driven. Magnetic direct drive increased possibilities (scratching, juggling, beat-matching to other records to mix multiple sources, and so on). These innovations required the reliability of the direct-drive turntable.
In software engineering terms, this is like [chaos engineering] game days. You validate configurations and verify outcomes, iterate for practice and introspect models, measure capacity to manipulate slack, operable tools and operational readiness, bridge teams and break assumptions, and embrace ambiguity to experiment with assumptions. (Consider a jazz group. There's practice and improv. Gigs can be scheduled. You may not know when the incident will be but you can schedule the game days.)
Once you've done this you can document it in runbooks. By practicing with game days you can help keep the documentation up to date.
- Sources of resilience. (Digression into post-WWII Russian music history. The viewership count dropped from ~30 to ~20.) The groove in the record is itself resilient by definition.
In software engineering, we build for adaptation. [At this point the video player corrupted hard and I missed the rest of the talk.]
Protecting System Integrity with Trusted Platform Module
After the lunch break I went to the TPM talk by Dmitrii Potoskuev and Marco Guerri of Facebook. Sadly, the streaming of this track broke before the session began, causing a substantial delay (about 35 minutes, since the already 23-minute-delayed livestream had to be restarted due to audio dropouts).
Every software and firmware component running on a system can be the vector for delivering an attack to the host itself and the wider infrastructure around it. People often focus on protecting the system from what runs in user space or kernel space, and we don't always include in our threat model the integrity of the lower layers in the stack. In this talk, they wanted to show the possible impact of compromising a host through a persistent implant in its system firmware. They focused specifically on UEFI, the industry-wide standard that defines how system firmware should operate. They demonstrated a "hello-world" system firmware malware from its development to its injection on the host. They then introduced the concept of the Trusted Platform Module, a secure cryptoprocessor that has become an industry standard on consumer and enterprise systems, and explained how the TPM can help protect the platform from the demonstrative malware. They assumed that the system requires secrets to be able to interface with the infrastructure around it, and they leveraged the TPM to give the host access to those secrets only if they could guarantee that all layers of the stack have not been compromised.
This was a demo of how they set up a shared secret and then used a malicious driver to compromise that secret. They showed how TPM protects the system. Secure boot could help in some cases but it doesn't scale well. The demo showed how sealing and signing can help.
Q: From the Quote we can only check that it comes from a valid TPM but not a specific TPM, leading to Cuckoo attacks. What is your take on that?
A: Quote is usually signed by Attestation Identity Key, which is specific for each TPM. In order to make sure that the quote comes from a particular host we should enroll its EK and AIK first.
It should be noted that TPMs are easily detachable from the system, which means the TPM is not so useful for protecting against presence-based (physical access) attacks. Microsoft has done a lot of work on integrating the TPM with the CPU to mitigate the issue, but that's not applicable for ordinary servers.
The Cornerstone for Cybersecurity: Cryptographic Standards
Next for me was Dr. Lily Chen's talk on cryptographic standards. This presentation introduced NIST Cryptographic Standards and their applications in cybersecurity. The presentation also discussed transitions and validations. It highlighted challenges and solutions for next generation cryptographic standards, including challenges to deal with quantum threats, new cryptography transition, and lightweight cryptography for constrained devices.
History: NIST developed the first encryption standard (DES, 1977) and has been involved ever since. Nearly every device now uses their standards, thanks to public-key cryptography. They also published key generation and management guidelines.
They've managed transitions — DES → 3DES → AES, and so on — as computing power and math techniques have become more powerful.
Next generation: Standards have to deal with extremes (powerful attacks, quantum computers, constrained environment) — but they need to have some degree of backward compatibility and interoperability (TLS version and cipher suites, for example).
Much is general-purpose now — what about special usage moving forwards? Synchronizing with industry best practice. What about international adoption? (And yet some countries have their own standards, there're some we can't export to, and so on.)
New initiatives:
- Post-quantum cryptography for public key algorithms. QC changed what we believe about the hardness of discrete log and factorization problems, meaning that well-deployed public-key cryptosystems (RSA, DH, ECDSA, and so on) will need to be replaced. It also affects symmetric key algorithms, but that can be mitigated by increasing the key size. Researched PQC categories are lattice-based, code-based, multivariate, hash/symmetric key-based signatures, and isogeny-based schemes.
- Lightweight cryptography. It's not weak but is for constrained environments that aren't well-served by existing NIST standards. They had workshops in 2015 and 2016 to get industry feedback, and that led to publishing NISTIR 8144 in 2017. By 2018 they'd finalized the scope and criteria to issue a call for contributions. (Think IoT.)
Summary: NIST cryptography standards have been the cornerstone for cybersecurity, are developed for non-national security applications, and the next generation cryptography standards will deal with quantum threats (PQC) and constrained environments' protection demands (lightweight crypto).
Popcorn Talks
After the break I went to the Popcorn Talks. Popcorn talks are informal, short, silly, and fun talks! Speakers were given a surprise set of slides and had five minutes max to ad lib a short talk based on their contents. There were lots of GIFs, memes, and extremely silly slides, which may or may not have been related to technology.
It was very silly, it showed off the speakers' improvisational talents, and I took no notes.
Why You Should Burn Down Your Datacenter
Despite the attention-grabbing title this talk wasn't actually about pyromania or destruction, at least not directly. Facebook's Mike Elkin is not a controls or mechanical engineer; this isn't about cloud computing but more about the industrial control systems. There are three components he talked about:
Datacenter 101
We care about power, cooling, and space for them. We'll be ignoring space since it's more about planning.
- Cooling — Outside air is mixed, cooled, fanned, and then exhausted or returned to the mixing loop. This requires both power and water. Psychrometric charts can be used to track this. It takes a while for equipment to cool or heat the air.
- Power — The power path goes through multiple systems (step-downs, PDUs, and so on). Metrics are important. We tend to hear about PUE and WUE (power or water usage effectiveness): PUE is total facility energy over IT load, and closer to 1.0 is better. WUE is annual water usage over energy consumed, and lower is better. (See the sketch after this list.)
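As a quick worked example of those two metrics (the numbers below are made up purely for illustration, not figures from the talk):

```python
# Hypothetical numbers to illustrate the PUE and WUE definitions above.
total_facility_energy_kwh = 1_300_000   # everything the building draws in a year
it_load_energy_kwh        = 1_000_000   # energy delivered to the IT equipment
annual_water_liters       = 1_800_000   # water consumed for cooling in a year

pue = total_facility_energy_kwh / it_load_energy_kwh   # closer to 1.0 is better
wue = annual_water_liters / it_load_energy_kwh         # liters per kWh; lower is better

print(f"PUE = {pue:.2f}")        # 1.30
print(f"WUE = {wue:.2f} L/kWh")  # 1.80
```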
So fault domains aren't just the network, but the power and water. He showed a chart mapping which rows are served by which power and air handlers.
How do the control systems work? Using the Purdue model:
- Level 0 (process): sensors, actuators, CTs, fans [data center]
- Level 1 (devices): PLC, controller, gateway [data center]
- Level 2 (control): SCADA, BMS, PMS, HMI [data center]
- Level 3 (operations): workstations, DC, time-series DB [enterprise]
- Level 4 (business): warehouse, DCIM, ERP [enterprise]
You don't want the power draw to exceed the breaker limit.
Smart Infrastructure
Requires sensors and their data (and storing the data). Given that, build or buy software to do it? Not many want to deal with it so most go with the "buy and integrate" decision. The building systems may be collecting what we want, but can the sensors handle the load and do they really give us what we want?
We need to know what the ICS devices are (both network and facilities information as well as anything protocol-specific and how the devices all interrelate; keeping this information current and correct, like with any inventory, is a problem), collection systems (with what endian and precision and scaling and so on), and data access (have tiers for different user types; many can view the data at aggregate but you want control limited and very very granular).
Burn it down and start over
- ICS Security... generally doesn't exist. No authentication, authorization, encryption, or anything else. Logging might not even exist. ICS tend to rely on hard exterior security.
- Time to change or replace: Software may last for hours or days, servers for years, but buildings for decades. And buildings try to be 100% available (even if data centers therein don't).
- Other problems:
- Network performance varies. Remember the ping of death — some devices are so specialized they can't even handle a regular ICMP ping. We should define network standards for incoming equipment.
- We also see microbursts causing problems, especially with broadcast and multicast. You may need to power-cycle a device in person to fix a problem.
- Latency is a problem: If it takes 60 seconds to get the data but need to reach a decision in 45 seconds the decision will be based on stale data.
- Some vendors limit TCP connections to prevent unauthorized connections, but that means new systems can't request data.
- Data caches may have stale data as well. Knowing the cache timing helps you know how accurate your data may be.
- Queries may return garbage data. Does the sensor return valid data? ("1 jiggawatt" doesn't make sense when the regular values are in kW.) See the sketch after this list.
- Equipment changes, but many ICS require manual configuration (which may require on-panel physical access)... which is problematic. Ask vendors what can be automated (is there an API?). Develop processes to setup and test.
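Here is a minimal sketch of the kind of sanity-checking those last few points suggest; the field names, plausibility limits, and staleness window are hypothetical, not anything from the talk or Facebook's tooling:

```python
import time

# Hypothetical plausibility limits for a power sensor that normally reports in kW.
MIN_KW, MAX_KW = 0.0, 500.0    # anything outside this range is treated as garbage
MAX_AGE_SECONDS = 60           # readings older than this are considered stale (cached)

def validate_reading(reading):
    """Return (ok, reason) for a reading like {"kw": 12.5, "ts": 1690000000}."""
    value = reading.get("kw")
    ts = reading.get("ts")
    if value is None or ts is None:
        return False, "missing fields"
    if not (MIN_KW <= value <= MAX_KW):
        return False, f"implausible value: {value} kW"   # e.g. the "1 jiggawatt" case
    if time.time() - ts > MAX_AGE_SECONDS:
        return False, "stale (possibly cached) data"
    return True, "ok"

if __name__ == "__main__":
    for r in [{"kw": 12.5, "ts": time.time()},
              {"kw": 1_000_000.0, "ts": time.time()},   # garbage value
              {"kw": 12.5, "ts": time.time() - 3600}]:  # stale cache
        print(validate_reading(r))
```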
In summary:
- DC data is valuable.
- ICS is frustrating.
- ICS needs to be modernized.
Q: In this process, did you develop a product validation checklist or method to qualify industrial control devices prior to widespread deployment?
A: They have added requirements (especially around network performance, detecting data caches, and so on). Checklist items exist not just for equipment's core purchases but also its network performance and how data can be collected.
Q: Concerning the lack of standards and most sensor data problems, what do you predict for the future of ICS equipment?
A: The ICS industry seems to be ~25–30 years in the past, compared to the IT, network, and security realms. If we don't bring them up to current standards there's certainly increased risk for failures. We need to convince the vendors that making these changes is actually important. The equipment is very specialized and we need to have a critical subset of them to make these changes to move forward safely.
Q: What other terrible life choices have you met?
A: Iterative development cycle. A significant problem is data modeling: What attributes do you want, and how do you query things and how often?
Selectively Sharing Multipath Routes in BGP
Next up, Trisha Biswas of Fastly talked BGP.
Overview of BGP
BGP is the external routing protocol between ISPs. Routers running it are called speakers. It is best suited for a network of networks, or a network of autonomous systems (ASes).
AS runs interior and exterior gateway protocols (IGP, EGP). Between ASs it's called interdomain routing. BGP neighbors are peers and must be configured statically. Peers in different ASes use external BGP (eBGP) for communication and those within the AS form internal or iBGP sessions.
A route is a list of ASes and other attributes on the path to the destination. BGP is a path-vector protocol: speakers advertise reachability to other networks (prefixes) with their peers. Advertising a prefix replaces previous announcements of that prefix (implicit withdrawal).
Best Path Selection
She stepped through several examples in the slides. BGP uses the number of hops as the first filter and then other attributes to tie-break to select the best path.
Additional Paths
Routers propagate only their best path, which is scalable but you lose path diversity. BGP Additional Paths (RFC7911) allows sharing of multiple paths for the same prefix WITHOUT the new paths implicitly replacing previous paths. This helps achieve faster reconvergence after a network failure.
Selective Add Paths
You can't share all paths from/for all ASes since there are thousands of prefixes thus thousands of paths and millions of routes. So Selective Add Paths is a feature extension to limit how many or what kind of routes do and don't get shared.
Experimental data shows that only a few prefixes serve most of the traffic, so sharing multiple paths for only those prefixes unlocks the potential of multipath BGP without compromising peer performance.
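As a back-of-the-envelope sketch of that selection idea (entirely hypothetical traffic numbers and coverage threshold; the real feature lives inside the routing stack rather than a script like this):

```python
# Hypothetical sketch: pick the small set of prefixes that carry most of the traffic
# and mark only those for add-paths advertisement; everything else gets best path only.

traffic_by_prefix = {          # bytes observed per destination prefix (made-up numbers)
    "203.0.113.0/24": 9_000_000_000,
    "198.51.100.0/24": 750_000_000,
    "192.0.2.0/24": 40_000_000,
    "203.0.113.128/25": 5_000_000,
}

def prefixes_worth_sharing(traffic, coverage=0.95):
    """Smallest set of prefixes whose combined traffic covers `coverage` of the total."""
    total = sum(traffic.values())
    selected, covered = [], 0
    for prefix, volume in sorted(traffic.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(prefix)
        covered += volume
        if covered / total >= coverage:
            break
    return selected

print(prefixes_worth_sharing(traffic_by_prefix))
# ['203.0.113.0/24', '198.51.100.0/24'] -> advertise additional paths only for these
```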
Policy Based Filtering of Add Paths
Routes can be filtered based on any of the BGP route attributes. Best or preferred path should always be advertised.
Demo
She stepped through a demo of the Add Paths functionality.
Conclusions
BGP add paths helps achieve faster reconvergence but can affect the overall performance of the peer due to the large number of routes advertised.
Although ASes have hundreds of thousands of routes, only a few hundred (experimentally 0.1% to 1%) serve most of the traffic and are likely worth sharing.
Selective add paths helps achieve the best of both worlds by leveraging BGP multipath without overloading the peers.
Q&A
Q: Is it really involved to determine the paths that are most used/popular, so that you can selectively share those paths?
A: It's somewhat involved but doesn't need to be in the control plane. They're working on a tool. It can be a wrapper outside of the routing stack.
Q: Adding millions more routes would be very expensive, but do you think network vendors should look at increasing the resources available so this could become a default feature rather than having everyone figure out their paths selectively?
A: Because it's a tradeoff — heuristically/cleverly versus more resources — it's hard to convince vendors to increase the resources. You'll need heuristics anyhow... especially since Internet routers aren't always cutting edge.
Q: I have a use for BGP within Kubernetes or between Kubernetes clusters, not exposed to the internet. Where would be a good place to learn about that?
A: Unsure; there are various open source BGP implementations you could use within the k8s cluster. She uses bird; there's also frr and openbgpd. Any of those should work. Start experimentally between two nodes.
Q: Have you considered other routing software like quagga?
A: They have. The issue is that once you build a network with one routing stack and configuration it's very hard to move away from it later.
Q: Are you using Selective Add Path in Production now or is this still being investigated?
A: It's still just in test. They're looking to opensource bird.
Year One: Transitioning From Application Engineer to Infrasec Engineer
After the break I went to Misty Hall's talk about how she transitioned from an application engineering role to an infrastructure security role. It was loosely a case study (with a fish theme) that can let us determine what, if anything, we should do.
The base assumption for the talk is that we want to improve hiring/mentoring, reduce attrition, or avoid hiring one type of person. Some questions to consider:
- Should we even fish for infrasec engineers in the application engineer pond?
- How complicated is the technical context?
- Hiring is fraught with difficulty, so do you hire one and transition them to another role, or do you hire the other directly?
- What's your budget? Senior hires are expensive. Cross-training a mid-level engineer may get you the specialty engineer for less.
What about "slot limits"? What if you hire the wrong person (hook the wrong fish)? You can't "throw back" engineers you hire because they're too small or the wrong fit.
(What if you try moving from app to infra and don't like it? Does your company give you retreat rights?)
(Are you inclined or disinclined towards or away from a tool or product based on its community?)
Once you hire or transition an infrasec engineer and they like the work and are ready and motivated, how do you set them up for success? In her case there were no specific expectations about growth or tooling, but also no organized mentoring structure. You really need some sort of guidelines for where someone should be after a month or three or six. Core competencies in a ROSE chart are useful.
She eventually got put on a project. There's a lot of hard skills she learned as part of it, especially moving from startup-land to government-land. Continued and frequent pairing helped.
Integration opportunities: Culture adaptation, conference talk playlist, mentorship with clear and specific expectations, and moving families.
Q: What did you feel helped you most understand your new world? (What should onboarding look like?)
A: You basically need some kind of guidewires. Their culture is to do what you're most interested in for the company to get the most out of your labor. It's hard at an agency since you need generalists above you to get an idea of where you want to migrate skills-wise. But managers and CTOs have shared resources. She was an early-enough hire that they could treat her as an experiment. She intentionally didn't talk to management or the CTO about this talk because she wanted it unfiltered. A lot is trial and error; look around, see what needs doing, and do it.
Q: Has anyone else there made the transition afterwards? Is there a way you'd guide them in app→infrasec?
A: Not really, but she's definitely willing to mentor. It's a lift to mentor, and she'd like to see mentoring be rewarded for seniors.
SkillOps: Real-World Approaches in Skilling and Building World-Class Security & Technology Teams for a Remote-First World
Next up for me was Abhay Bhargav's talk focusing on training and skilling up. His talk had three major sections.
Changing nature of IT and IT Security jobs
Looking at 2020 and 2021:
- The pandemic changed our workflow and outlook from 2019 and previous.
- Even conservative companies allowed employees to work from home; there're training and logistical needs that need to adapt.
- The cloud has become a massive change agent (and not just because the pandemic moved a lot of folks to IaaS and SaaS in the cloud).
- Security incidents increased.
A lot of this is going to continue as the new normal, even as we move from mostly-remote to more of a hybrid model.
In 2020 there were 3.12M jobs unfilled (ISC2). There may be reasons (like poor training) for this kind of skills shortage. Also, 70% of organizations are impacted by talent shortages (ISSA), and 52% require hands-on cybersecurity skills (ISSA). 32% of IT budgets will be on the cloud (Forbes). 99% of cloud security failures will be attributed to customers [misconfigurations or errors] (Gartner).
A commonly-mentioned statistic (unsourced): For every 100 developers you have 10 devops and 1 infosec people.
Before, security consulting was vulnerability assessment and penetration testing, threat modeling, and vulnerability management, at a single (possibly repeated) point in time. However, things have changed and places are overwhelmed: All of that's still happening, but add DevSecOps and feedback loops, bug bounties, threat hunting, red teaming, cloud security, Kubernetes security, and more. Security folks have a lot more to do and aren't being staffed up (due in part to a talent shortage or skills gap) to be able to do it.
The best security teams tend to decentralize, treat engineering as a customer, and set useful defaults (the default way should be the secure way).
SkillOps
Continuous microtraining (or small doses of content) accompanied with hands on labs to increase capabilities quickly, along with tailoring the security education to the organization:
- Continuous microtraining works well with a distributed workforce, encourages small wins, and builds on the compounding effect. Use 45- to 60-minute courses of focused content with hands-on labs and cyber ranges. Existing CBT is too long (especially for remote workers with multiple responsibilities). For instance, they have Authentication and Authorization with Kubernetes versus Kubernetes Security, or AWS Network Security vs AWS Security. People are much more engaged and enthused to learn more because it's fast.
- Hands-on labs and cyber ranges are more important. They're essential for building InfoSec skills, and they need to be easy to run and set up (leveraging the cloud to spin things up and down helps a lot). Having an element of challenge is always something that can go a long way ("Identify and fix the four security vulnerabilities" with a verification script; cf. gamification, and capture-vs-defend the flag events).
- Another problem is that training tends to focus on red team or exploit-focused education. When you're hiring someone you want them to fix or constructively address the problems, and thus they need to know the defensive side not just the offensive side. Tailoring the security education to your needs, especially defensive needs, will help develop a broader and more-adaptable skill set. (BlackHat has seen an increase in defensive training.) Note too that cloud-native technologies require a greater understanding of defensive configurations and parameters.
The recommendation is to apply both offensive (red team) and defensive (blue team) training, forming a de facto "purple team."
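As an illustration of the verification-script idea mentioned above, here is a minimal, hypothetical sketch in Python; the specific checks, file paths, and messages are invented for this example and aren't from the talk:

    # Hypothetical lab-verification script: each check returns True when the
    # corresponding vulnerability has been fixed. Checks and paths are examples only.
    import subprocess
    import sys

    def ssh_root_login_disabled():
        # Pass if sshd_config explicitly refuses root logins.
        with open("/etc/ssh/sshd_config") as f:
            return any(line.split() == ["PermitRootLogin", "no"] for line in f)

    def no_world_writable_home_dirs():
        # Pass if no directory directly under /home is world-writable.
        result = subprocess.run(
            ["find", "/home", "-maxdepth", "1", "-type", "d", "-perm", "-0002"],
            capture_output=True, text=True)
        return result.stdout.strip() == ""

    CHECKS = {
        "sshd refuses root logins": ssh_root_login_disabled,
        "no world-writable home directories": no_world_writable_home_dirs,
    }

    failures = [name for name, check in CHECKS.items() if not check()]
    for name in failures:
        print("FAIL:", name)
    print("All checks passed!" if not failures else "%d check(s) still failing" % len(failures))
    sys.exit(1 if failures else 0)

A lab would ship something like this alongside the challenge so learners get immediate, gamified feedback on whether their fixes actually took.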
[Some resources he's impressed by are in the slides.]
Conclusions
- Training is essential to build a successful security (or technology) team, and that training needs to be remote- and hybrid-friendly.
- Leverage continuous microtraining that is remote-friendly.
- Hands-on labs and cyber ranges are essential to demonstrate and work through complicated technology concepts.
- Both offensive and defensive security education are necessary; the latter is often overlooked.
Q: How large is the team he implemented this on?
A: It's a collection of experiences from various places, some startups and some multi-product teams (100–200 developers).
Q: Some of us come from smaller teams if not startups (e.g., a 30-person team in a multi-hundred person engineering org). How do we get started in changing the culture and/or implementing these sort of changes?
A: It's about finding the resources. Engaging activities (capture/defend the flag) to pique curiosity are a good starting point. Then circulate smaller security/technology bits that are specific to your environment, or share relevant research. Once you build interest it can start to be self-sustaining. Tech people tend to be curious, and hitting those buttons helps increase interest.
Service Mesh Up and Running with Linkerd
After the lunch break I attended Charles Pretzer's talk about linkerd. He started with an overview of service mesh concepts.
A service mesh uses the network layer of a distributed system to add observability, security, and reliability. Twitter started this in 2010 to break their monolithic application into a distributed system... which is by definition more complex. A service mesh provides insight into which pieces or parts may be experiencing issues, decreasing both MTTD[etection] and MTTR[esolution].
The data plane of a service mesh consists of many proxies that handle service traffic; the proxy is injected into each service's deployment (via YAML). The data plane lets the services all talk through proxies, opening up powerful options: anything handling the traffic between services can provide telemetry to capture latency. linkerd also provides mutual TLS (mTLS) for security. The developers don't need to make this part of their code; the proxies take care of it for them. A service doesn't even know how many instances of another service there are, since the proxies can do load balancing.
The control plane of a service mesh is a finite set of components: Identity service, destination service, proxy injectors, and controller, all of which talk to and configure a new proxy on the data plane.
Some other main concepts:
- Security — Mutual TLS (mTLS) allows for verification and encryption with a common trust source. The identity service signs the proxy's identity request so proxies can communicate over encrypted traffic. mTLS adds the verification (the proxy is who you think it is), which is helpful in a multitenant environment. Because this is built in to the proxies the service developers don't need to care.
- Observability — Proxies collect TCP and HTTP metrics. Monitoring tells you what's going on at a high level but observability provides actionable metrics to reduce MTTD and MTTR for issues. Proxies enable service topologies. (linkerd uses linkerd.)
- Reliability — The mesh allows for latency-driven load balancing, retries, and timeouts to ensure requests are processed (or rejected) properly. (A rough sketch of this behavior follows this list.)
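To make the reliability point concrete, here is a minimal, hypothetical Python sketch of the behavior a mesh proxy takes off the application's hands: latency-aware endpoint selection, per-request timeouts, and bounded retries. The endpoints and tuning values are invented for illustration; this is not linkerd code.

    # Hypothetical client-side sketch of what a sidecar proxy normally does for you.
    import time
    import urllib.request

    ENDPOINTS = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]  # same logical service
    observed_latency = {ep: 0.0 for ep in ENDPOINTS}  # moving average of recent latency

    def call_service(path, retries=3, timeout=2.0):
        last_error = None
        for _ in range(retries):
            # Prefer the endpoint with the lowest observed latency so far.
            endpoint = min(ENDPOINTS, key=observed_latency.get)
            start = time.monotonic()
            try:
                with urllib.request.urlopen(endpoint + path, timeout=timeout) as resp:
                    body = resp.read()
                elapsed = time.monotonic() - start
                # Exponentially weighted moving average of per-endpoint latency.
                observed_latency[endpoint] = 0.8 * observed_latency[endpoint] + 0.2 * elapsed
                return body
            except OSError as err:  # URLError and timeouts are subclasses of OSError
                # Penalize the failing endpoint so the next retry goes elsewhere.
                observed_latency[endpoint] += timeout
                last_error = err
        raise last_error

The point of the mesh is that none of this lives in the service's own code; the proxies do it uniformly for every service, and they emit the latency numbers as telemetry while they're at it.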
He ran through a CLI-based demo that, thanks to the lower-quality video, was hard to see. See https://linkerd.io for more information or to download it.
The What and Why of Documenting Your Infrastructure
Next I attended Kevin Metcalf's talk about documenting your infrastructure. He basically had three parts to the talk.
Part 1: What happened?
His employer offered early retirement, so his Linux admins and his supervisor took advantage of it and gave him a bunch of their responsibilities "temporarily." What did he get? A list of duties, responsibilities, and servers, plus any documentation he could find — all of five sentences on one page — but no deployment automation and no configuration management.
So what did he do?
- Assume it's a joke (denial).
- Swear (anger).
- Ask for a raise (bargaining).
- Cry (depression).
- Get crackin' (acceptance).
Part 2: The plan
He developed a plan:
- Get list of servers. They had a finite number of server rooms and physical access, so he could inspect what was there and then narrow that down to what he needs to care about.
- Get login credentials. Reset root from single user mode as needed.
- Get at least one user contact for each system to identify what services the server is running.
What should we do differently? Think about it:
- Have an up-to-date list of servers. (Device management. Security.) See the sketch after this list.
- Have a credential management plan. (IAM. Password management solution.)
- Periodically check in with the users to see whether they still need the service the system provides. (He could shut down about half a dozen machines that were no longer being used.)
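A minimal, hypothetical sketch of keeping that server list current: probe each known host's SSH port on a schedule and record the result. The hostnames and output file below are invented for illustration.

    # Hypothetical inventory freshness check: which known servers still answer on SSH?
    import csv
    import socket
    from datetime import date

    SERVERS = ["web01.example.com", "db01.example.com", "app01.example.com"]

    def reachable(host, port=22, timeout=3.0):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    with open("server-inventory.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["host", "ssh_reachable", "checked_on"])
        for host in SERVERS:
            writer.writerow([host, reachable(host), date.today().isoformat()])

Even something this small, run from cron, beats a five-sentence page of tribal knowledge; a real deployment would feed a proper asset-management or CMDB tool rather than a CSV.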
(Aside: There are two kinds of people: Those who think "I suffered through X so everyone else should too" and those who think "I suffered through X so nobody else should have to.")
Part 3: The work
- Document the duties of the position. The initial list was what management thought the position did, but that didn't match the actual ongoing duties. Ask coworkers and customers. Don't sweat the details; iterate. Categorize the duties: Infrastructure (DNS, printing, and so on), security (remove user access, SSL, patching/upgrades, and so on), and licensing and consulting.
- Document the device inventory.
- Go all Ansible on that $#!+. Use configuration management to automate and manage everything, or at least non-standard system services, key (access) management, patch automation, and gatekeeper tasks for developers. (A small sketch of driving such automation follows this list.)
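As a small illustration of that last item, here is a minimal, hypothetical Python wrapper that kicks off a routine patch run by shelling out to ansible-playbook. The inventory and playbook names are invented; in practice you would more likely run the playbook directly from cron, AWX, or CI.

    # Hypothetical patch-automation wrapper around ansible-playbook.
    import subprocess
    import sys

    result = subprocess.run(
        ["ansible-playbook", "-i", "inventory.ini", "patch-servers.yml"],
        capture_output=True, text=True)

    print(result.stdout)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
        sys.exit("patch run failed; see output above")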
Takeaways:
- Hoarding knowledge is not job security. Be respectful of whoever comes after you. Treat them as a human being who shouldn't suffer just because you did. (Ask about documentation in job interviews.)
- It's hard to secure what you don't know you have.
- Don't trash talk your coworkers, employer, or management. This works for people both in and outside your organization.
- Figure out what is (and isn't) important to document. What do you wish someone had told you (or your coworkers wish someone had told them) when you started the job? Document that.
- Documentation is an equity issue. Not everyone has the same experience or had the same opportunities when they start a new role. Access to information is an equity issue. If equity is a value your employer professes to have, spend the time to document things right.
Closing Remarks
After our last break Avleen and Carolyn had some closing remarks. Avleen started with a retrospective on LISA: The online archives date back to 1993 (though we started as a workshop in 1987 and became a conference proper in 1990). Some interesting points:
- LISA 9 (1995), a very high volume internet service got millions of connections a day.
- LISA 10 (1996), a new twist on teaching SA was making it web-based.
- Avleen's first was LISA 20 (2006) and he met Tom Limoncelli.
- LISA 21 (2007) was Avleen's first talk.
- LISA 25 (2011) was Carolyn's first talk.
- LISA 26 (2012) was Carolyn's first time as chair and Casey's first year as ED.
Thanks to the USENIX staff for making things work as smoothly as possible.
Favorite moments this year? Carolyn said all the keynotes and plenaries. Avleen said the variety of talks: All kinds of industries (not just tech), all kinds of talks (subjects), diversity of speakers (cultures and backgrounds), and so on.
USENIX is Open Access — all the talks will be online soon after the conference, without a paywall. Consider becoming a member or donating money.
There is an attendee survey; please fill it out.
Closing Plenary: Practical Kubernetes Security Learning using Kubernetes Goat
We closed down the technical portion of the conference with Madhu Akula's plenary session on Kubernetes. If you're new to Kubernetes he strongly suggests watching the Illustrated Children's Guide video.
The slides were his walking through the docs in a browser, so I didn't take specific notes. See https://github.com/madhuakula/kubernetes-goat for details.
Some thoughts and feedback on the virtual conference format:
Recorded videos are good. Live captioning of them is less so because of jargon. Can we get captions done in advance? Also, because the talks are prerecorded, a speaker who uses humor has no way to tell whether the jokes are landing, and thus no chance to adjust the talk accordingly (add more if they are, cut them if they aren't).
It's not clear how long if at all the session-specific chat text is kept. Is it like Zoom where it's archived to a text file on ending the call, or is it ephemeral and lost when the session ends?
It's not clear whether you want multiple registrations or none in the event of a split-track session. For example, if I were attending talk 1 in track I and talk 2 in track II, should I register for one, the other, both, or neither? How will that impact your metrics on the back end for judging success?
That said, should we switch the schedule from 90-on-15-off to 45-on-10-off so people can more-easily jump between talks in a given session?
Were you tracking the number of live viewers (max during session, average between 15-min-after-start and end, ...)?
Timezone issues. Having the schedule in one timezone means meal breaks for other timezones are odd. I recognize this is a problem and there's no good solution. Perhaps more 10- or 15-minute breaks between the 45-minute sessions (because "move the airwalls" and "reset the room" aren't issues now) to add more flexibility?
It's hard to provide applause-as-feedback when the speaker (video) is done or when the Q&A session ends. ":clap:" can only go so far, whether in Swapcard (plaintext) or Slack (emoji).
I recognize that multiple simultaneous tracks can be problematic. We had only two this year but have had four or more in the past (e.g., refereed papers, two invited talks, and Guru-Is-In). Are we considering going back to a conference with more content (especially since prerecorded video doesn't add a lot of overhead)?
Might be useful to recommend speakers don't use the bottom 10–15% of their slides since that's where the captioning appears, making both the captioning and the section of the slide unreadable.
The 9:45–11:15am PT (pre-lunch) talks in track II on Wed Jun 02 were a barely-mitigated disaster. The player kept failing for people at random. Sometimes a refresh would work; sometimes it would fail again quickly. For some, downgrading to 240p (which made slides illegible) helped. Refreshing the video player every 10–15 seconds is untenable.
The 12:00n–1:30pm PT (post-lunch) talks in track II on Wed Jun 02 were a less-mitigated disaster. We lost 22 minutes while USENIX and Swapcard troubleshot whatever was broken with live streaming. I feel sorry for the speakers who had to put up with the delay (and for the possibly-lost audience who switched to track I out of boredom or frustration).
It's often very hard to view demos, especially command-line demos. Even in the "use 60% of my browser window" mode it's very difficult to see the CLI commands used in demos... even when the speaker doesn't clear their screen quickly. Can we recommend they increase their typeface size?
They didn't announce a venue or chairs for a next LISA conference, be it in 2022 or beyond. I spoke with Executive Director Casey Henderson and the short answer is that it hasn't been decided yet whether (and if so, when) it will happen.
My thoughts are that I'm hopeful but not optimistic that one will happen. Over the last decade or so the program has changed substantially: We've axed the Guru-Is-In track, refereed papers, and the full- and half-day tutorials and workshops; we've gone from a five- or six-day event to a three-day one; and the international yet regional SREcons have substantial audience overlap with LISA. With all that, it may no longer make financial sense for the USENIX Association to host LISA.