The following document is intended as the general trip report for me at the 24th Systems Administration Conference (LISA 2010) in San Jose, CA from November 7-12, 2010. It is going to a variety of audiences, so feel free to skip the parts that don't concern or interest you.
This was my outbound travel day. It took an hour or less to drive to the airport, find parking in the cheaper lot (work reimburses up to $12/day), catch the free shuttle to the correct terminal, check the luggage, clear Security, and get to my gate. The first flight (to Salt Lake City) departed on time and went well enough, considering we had a lot of "rough air" and at least one unhappy baby. We landed early but had a 20-minute hold in the penalty box because the plane that was to have departed our gate before our early landing didn't, and ground control apparently kept saying "Oh, they're just about to push back." We finally got in only 10 minutes late, but there were some people with tight connections (15-minute layovers? what idiot approved that travel plan?). I had about 45 minutes, so I wasn't stressed at being in the back of the plane. Eventually I got off the plane, grabbed a sammich for lunch, and got to my next flight's gate just after they were supposed to have started boarding. The aircraft was delayed inbound, so we eventually departed about 5 minutes late, but we made up the time in the air. Landed and arrived on time in San Jose.
SuperShuttle at SJC is prepaid reservation only. I had forgotten to make reservations, so when I'd got my luggage and got out to the ground transportation area I tried to make one around 1pm. The online system wouldn't let me do so because it wasn't enough notice, and I managed to hang up the call to their reservations number on work's iPhone twice before switching to my personal phone, which was fine. They gave me a reservation for 2:30pm, then put me through to Dispatch to see if they could get me something earlier so I didn't have another hour-plus wait. They did and they could, so Dispatch told me to get in the van that was just arriving. The driver said he wasn't going to downtown San Jose so I should catch the van right behind him, and declined to let me board. I called Reservations back to connect me with Dispatch, who said "Oh, that other driver isn't going to San Jose and the first driver should've taken you, let me fix that." Eventually he said there'd be a van there in the next 7 minutes; 15 minutes later the same driver who'd declined to let me board came by to pick me up.
Anyhow, I managed to get to the hotel and checked in. They said the wireless was complimentary and to pretend I was paying but the fees wouldn't hit the folio (just like last year). They gave me a perfectly adequate room with a king size bed a very short walk from the elevator. I unpacked, swapped out t-shirts, and went to hang out at pre-registration with Amy, Brian, Derrick, Lee, Mark B, Narayan, Ski, Trond, and others as we all rotated through. Got through Registration (eventually, since at 5pm when they opened they only had the prereg boxes for last names beginning with A and B), got my stuff, and finished my scavenger hunt card in the first 30 minutes. Spun the prize wheel and got a Whammy, so Anne had me pick a prize of my choosing. I grabbed one of the SAGE Short Topics books I hadn't gotten for the library yet, and when I dragged it back to the table convinced others to do so as well.
Went to the Beginners' BOF for the free food (cheeses, crackers, Italian cold cuts, roasted veggies, insalata caprese with very little basil) before meeting up with Amy, Andy, Dan, Gabe, Lee, Ski, and Trond and heading out to dinner. The original plan was Italian but they had a 45-60 minute wait, so we walked a few blocks until we came to a newly-opened churrascaria, Maceio Brazilian Steakhouse. We got a table for 8 around 7:30pm on a Saturday night with no more delay than it took to get the table ready. We mostly skipped the salad bar (I stuck to some salami and roasted garlic and insalata caprese with actual basil) and devoured lots of meat. The garlic-embedded beef we started with was a little dry, but the rest was either good or excellent: beef (flank steak, sirloin, tri-tip), chicken, pork (loin, ribs, and sausage), turkey, and pineapple were subsequently scarfed.
We ran into the USENIX staff on the way back to the hotel but once we got there I pretty much decided that 21 hours awake was enough and headed off to bed.
Daylight Saving Time ended this morning, but my body insisted on waking up at 4am biological again (so 1am-the-first local). Processed some email for work — reviewing the 4am (Eastern) logwatch stuff mostly — before going back to sleep. Woke up around 6:30am local and finished the morning email and started writing up this trip report.
I hung out at conference registration for a while, Hallway Tracking with people (saw AEleen and Kyrre and Adam and Toni and Rik and Alva). Adjourned to the Marriott lobby where there was both power and network to get on the conference IRC backchannel. (Power wasn't available in the Hallway Track space in the convention center lobby area until the morning break.) I eventually moved back to the Hallway Track space to do more editing of the two internal books for work I brought with me.
For dinner, I went with Adam and Tom to Mark and Rachel's place. (Nice house. Delicious food.) Got back to the hotel, schmoozed a bit with folks in the lobby, went upstairs to deal with a work-related outage, spent a little time in the hot tub, then discussed university issues with a colleague late of U-Iowa before heading up to crash.
Today I tried to sleep in, but my body insisted on waking up at 6:30am anyhow. Did some work email then the morning ablutions before heading down to the Hallway Track to do the rest of the book editing. Did lunch with some folks (including Chris, David, Derek, and Jason), then more book editing, then a quick power-nap. Missed the afternoon break, but was back down to Hallway Track at 4pm until heading off to dinner at Original Joe's with Joe, Grant, Mark, and Tim. Having had large quantities of meat both Saturday (at Maceio) and Sunday (at Mark and Rachel's), I opted to go the Italian route with chicken parmesan and a side of ravioli, which was very tasty. Large portion, too, so no need for starter, salad, or dessert, which helped the per diem.
We got back to the hotel and there were no BOFs of interest to me at 7pm so I headed up to the room to clear some Nagios alerts (a couple of full disks) and handle some issues for work (a locked-out account and some monitoring tweaks). I did some additional internal paperwork (mostly "here's what I'll need to do to open tracking tickets for the resolved issues once I'm back on a usable network"), prepared for the ATW tomorrow, and turned in early.
Woke up early again around 6am, thanks in part to a kamikaze eyelash (ow). Went through the overnight email, answered some questions for a colleague in the School of Public Health's Biostatistics department, and caught up on Twitbook and LimeGerbil.
Tuesday's sessions began with the Advanced Topics Workshop; once again, Adam Moskowitz was our host, moderator, and referee. [... The rest of the ATW writeup has been redacted; please check my LJ and my web site for details if you care ...]
After the workshop, I went out for dinner at Paolo's with Dan R, Dave, and Lois. It's a high-end Italian place; I had the equivalent of a Caesar salad (whole crisp romaine leaves, parmesan-anchovy-garlic dressing, croutons, cracked black pepper) and osso buco with saffron-infused farro and a roasted tomato sauce. Yum. Got back to the hotel in time for the 8pm BOFs but came upstairs to deal with some work-related issues before hosting the Alphabet Soup BOF. It was small this year (only 7 of us), in part because it was a late addition to the board, in part because it was in the last time slot, and in part because several of the usual suspects were absent from the conference this year.
The conference technical (as opposed to tutorial) sessions began this morning. My day began with the keynote session, which started with the usual statistics and announcements. This was the 24th annual Large Installation System Administration (LISA) conference. We had a new category of paper this year, the Practice & Experience papers, and the numbers I was given (accepted 27 of 63 total) don't match the ones given during the announcements (accepted 18 of 64 plus 9 of 18 P&E). By the opening session we had 1202 registrants, which made the office staff and Board very happy.
Program Chair Rudi van Drunen gave the traditional thanks to the PC, coordinators (IT, tutorial, guru), staff, board, sponsors, and partners; those were followed by the speaker instructions, exhibition floor information, poster session information (15 accepted), and upcoming USENIX events. That segued into the announcement of Doug Hughes and Tom Limoncelli as LISA'11 co-chairs. He went on to discuss CHIMIT, ACM's Symposium on Computer-Human Interaction for Management of IT, which was being held Friday and Saturday downstairs in the convention center as well.
We then went on to present this year's Best Paper awards:
- Best Student Paper — "First Step Towards Automatic Correction of Firewall Policy Faults," Fei Chen and Alex X. Liu, Michigan State University, and JeeHyun Hwang and Tao Xie, North Carolina State University
- Best Paper — "Log Analysis and Event Correlation Using Variable Temporal Event Correlator (VTEC)," Paul Krizak, Advanced Micro Devices
- Best Practice and Experience Paper (new this year) — "Internet on the Edge," Andrew Mundy, National Institute of Standards and Technology
Finally, Philip Kizer presented the annual Chuck Yerkes Award for Mentoring, Participation, & Professionalism to Edward (Ned) Harvey, who was not able to attend LISA this year.
Next we had our keynote speaker, Tony Cass, the leader of the Database Services Group at CERN working on the LHC. The talk, "The LHC Computing Challenge: Preparation, Reality, and Future Outlook," was somewhat of a followup to their 2007 talk on the preparation before going live. Back then:
He gave an overview of the collider environment, which in any given experiment can generate a lot of data, on the order of several hundred good events per second. The raw and processed data comes up to around 15PB/yr once they go live. Everything they do is trivially and massively parallel. Challenges they're facing include capacity provisioning across multiple universities' computing grids, especially with 80% of the computing being done off-site; managing the hardware and software involved; data management and distribution, especially given the quantity of both raw and processed data and the speeds possible for disks, networks, and tapes; and knowing what's going on, both in terms of providing the service and in terms of whether the systems and the users understand what is happening and where the problem (if any) is, which requires new visualization tools as well. This is an immensely complex situation with many challenges, most of which have been addressed with considerable progress over the past [10] years.
Back in the present day, it's performing well but they still have some challenges: They can't quite achieve the level of vacuum they need. The operating temperature required to keep superconductivity is 1.9 Kelvin; the much-discussed problem in 2008 was when an interconnection got too hot and broke the superconductivity. When the protons collide, it gets hotter than the heart of the sun in a minuscule space. Now they are colliding (isotopically pure) lead atoms. They're getting up to 600 million proton collisions per second.
So why do all this? Mainly scientific curiosity to prove or disprove the standard model of particle physics, to determine why different particles have differing mass, why our universe has matter but no antimatter, and so on. They're also developing new technologies, such as the web, touch-screen technology, medical diagnosis and therapy tools, security-scanning techniques, and so on.
What's some of the data they're collecting? They're getting something like 40 million collisions per second. They computationally reduce that to a few hundred "good events" per second, and that gets something like 5GB/sec (not the expected-in-2007 1GB/sec) of data, so instead of 15PB/year they're now forecasting to generate 23-25PB/year or 0.1EB in 2014.
The processing is spread worldwide: 33% at CERN, 33% at tier-1 centers worldwide (with plans for long-term data storage), and 33% at 130 tier-2 sites at large universities (with no long-term storage, just simulation and analysis). The 2007 challenges included:
- Capacity provision — The large-scale computing grid, but they had reliability problems and issues with collaboration.
- Box management — They created configuration management and monitoring tools.
- Data management and distribution — The data rates were expected to need 700MB/sec shipped around to the farm and 1-2GB/sec to tape, plus data to the tier-1 places.
- Monitoring what's going on — The systems are understood but the services are not and there remain cross-site issues. For example, how do you identify where the problem is (this is still a problem in 2010)?
The energy of the 7TeV beams is like two colliding trains and could melt a ton of copper. They're running about half that, around 3.5TeV, now with fewer beams, so they could only melt about 100kg of copper. Today they're running around 40 megajoules.
They depend on Oracle databases in three ways. First, they store short-term settings and control configuration in them. These are considered mission-essential: No database means no beam, and if they lose the database they stop the beams. Second, they keep short-term (7 day) real-time measurement logs in them. Third, they keep long-term (20 year plus) archives of a subset of the logs, currently at 2 trillion rows and growing at about 4 billion rows per day.
So much for the preparation and general operations. What about the reality of the current challenges? Sysadmins generally treat silence as praise. That being said, the research director said "Brilliant performances of the [sysadmins] was a key factor in the spectacular startup [of the LHC]." The chairwoman of the board publicly said positive things about the data processing speed which wouldn't be possible without the sysadmins. They've stored around 5PB this year, and are writing up to 70TB/day to tape. The only unused resources are at CERN itself; both the tier-1 and tier-2 places are used heavily. They're around 98% reliability for the grid — compare to the reliability of an airline, not an airplane, since jobs can be rerun.
They called out some operational issues they've learned: Storage is complicated. Hardware failures are frequent and cause problems for both storage and database systems; given the size of the computing grid at CERN, tier-1, and tier-2, disks fail every hour worldwide on average. Infrastructure failures, like the loss of power or cooling, are a fact of life. Software and networks, however, tend to be reliable.
CERN uses AFS, but some tier-2 sites use NFS which can cause bottlenecks. They created CernVM-FS as a virtual software installation with an HTTP filesystem based on GROW-FS to reduce bottlenecks (because like AFS it caches). This reduces the distribution overhead since not all files in the cache need to change with software upgrades.
Collaboration, especially consensus rather than control, is still a problem in 2010. As regards the data issues, they're 2-3 times above the planned data inbound from the collectors, and they're challenged by issues with file size, file placement policies, and access patterns. The future outlook includes:
- Analysis data should be on a unique file system for low-latency, and the input data stored elsewhere.
- Addressing hardware reliability issues in software, switching from parity RAID, and considering Hadoop at the tier-2s are all areas for investigation.
- Only a small subset of the data distributed is actually used, so use automatic dynamic data replication to send the most-used data all around.
- Network capacity is available. Consider copying data from site-to-site instead of recalling it from tape.
- Batch virtualization has a cost for users (low for some jobs but high for large-I/O jobs) but efficiency advantages for sites. However, multiplication of entities is never a good thing. Consider cutting out local workload management and just instantiate VM images dynamically that connect directly to the pilot job framework. Consider sharing VM images between sites (which requires trust).
In summary, preparation has been long (since 2000) and challenging both technically (especially a decade ago) and sociologically (consensus- and trust-building). Despite all that it's been successful, and they're capable of making improvements based on their experience with real data. It's been an exciting adventure so far and looks to continue to be.
One question that came up was dealing with bit rot. Regular readers of these reports will remember we've seen research that disks and their controllers lie, and it's possible for data to become corrupt on disk with no warnings or errors. They checksum the data to detect the problem and they slowly scan all the tapes to identify corrupted data (and then copy it back from the remote sites as needed). Similarly, another question was about pipelining; experiment-specific software at the remote sites verifies that the data is correct and complete before processing.
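For the curious, here's a minimal sketch of that kind of checksum-based scan in Python; the manifest format and file layout are my own assumptions for illustration, not CERN's actual tooling:

    import hashlib
    import json
    import os

    def sha256_of(path, chunk_size=1 << 20):
        """Compute the SHA-256 digest of a file, reading in 1MB chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def scan_for_bit_rot(manifest_path):
        """Compare each file against the checksum recorded when it was written.
        The manifest is a hypothetical JSON map of {path: hex_digest}; any
        mismatch means the copy silently corrupted and should be re-fetched
        from a replica."""
        with open(manifest_path) as f:
            manifest = json.load(f)
        return [path for path, expected in manifest.items()
                if not os.path.exists(path) or sha256_of(path) != expected]

    for path in scan_for_bit_rot("checksums.json"):
        print("needs re-replication:", path)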
In the second session, nothing grabbed my interest so I fell into a hallway track discussion with Richard, Peter, Nico, and Alan on a variety of topics. I remember one of the discussions was about the Michigan Terminal System (MTS); Richard's an MTS person from way back and we continued that discussion over lunch at Gordon Biersch. He also gave me the name of someone at U-M who might be able to answer some of the licensing questions I occasionally get in email.
In the third session, Dinah McNutt (who was incidentally the program chair of LISA'94, which was my first LISA conference) of Google spoke about her 10 Commandments of Release Engineering. These are sysadmins' commandments — solutions to requirements — to release engineers. The software distribution model is now more of a deploy-the-web-application than shrinkwrapped software, either COTS or custom; about half of the audience supports those webapps and the servers they're on.
As background, software release processes are an afterthought at best. There's often a big disconnect between the developer writing the code and the sysadmin who's installing it. A general build and release process is roughly as follows (a sketch of automating it appears after the list):
- Check out the code from the repository
- Compile and/or process the code
- Package the results
- Analyze the results of each step and report accordingly
- Test post-build based on the analysis above
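To make that concrete, here's a minimal sketch of automating those steps; the repository URL, make targets, and log layout are invented for illustration and aren't anything from the talk:

    import subprocess
    import sys
    from datetime import datetime, timezone

    # Hypothetical build steps; each is (name, command).
    STEPS = [
        ("checkout", ["git", "clone", "--branch", "release-1.2",
                      "https://example.com/project.git", "src"]),
        ("compile", ["make", "-C", "src", "all"]),
        ("package", ["make", "-C", "src", "package"]),
        ("test", ["make", "-C", "src", "check"]),
    ]

    def run_build(log_path="build.log"):
        """Run each step in order, log its output, and stop on the first failure
        so the results of every step can be analyzed and reported."""
        with open(log_path, "a") as log:
            for name, cmd in STEPS:
                log.write("%s start %s: %s\n" % (
                    datetime.now(timezone.utc).isoformat(), name, " ".join(cmd)))
                result = subprocess.run(cmd, capture_output=True, text=True)
                log.write(result.stdout)
                log.write(result.stderr)
                if result.returncode != 0:
                    log.write("step %s FAILED (exit %d)\n" % (name, result.returncode))
                    return False
        return True

    sys.exit(0 if run_build() else 1)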
In general the process must have the following features:
- The process must be reproducible
- Provide a way to track changes and identify exactly what is in which product or component
- Uniquely identify the build, commonly known as a buildID
- Implement and enforce policy/procedures, or document overrides, or both
- Manage both upgrades and patch releases
The commandments themselves are:
- Thou shalt use a source code control system. You should use that system to track revisions to your source code, build files, build tools, documentation (both internal to the build/release teams and external to users and customers), build results (for reproducibility months later), and build artifacts. Remember that the operating system, compilers, and tools used to create this specific build must all also be recreatable.
- Thou shalt use the right tool(s) for the right job. For example, use make or one of its variants for C/C++ code, or Ant for Java. Sometimes scripting languages are correct (such as the ./configure often written in sh, or Perl or Python tools).
- Thou shalt write portable and low-maintenance build files. Planning up front to have more than one operating system, platform, and architecture will save you heartache later. Use centralized makefiles with compiler options. Provide templates to make the build/release engineers' lives easier.
- Thou shalt use a build process that is reproducible. It should also be automated, unattended, and reproducible (notice a theme?). You should adopt a continuous build policy, such as with Hudson for web applications.
- Thou shalt use a unique buildID. Again, the buildID should be some form of unique identifier, such as a string consisting of multiple version numbers (such as 1.2.3.4 for major release 1, minor release 2, patch level 3, build 4). As for the content, you can use anything that uniquely identifies the build for later rebuilding; it can include just versions (with an ever-incrementing build number in the fourth field in this example), or a date/time stamp, or the software repository revision levels, perhaps the user running the build, and so on. (See the sketch after this list.)
- Thou shalt use a package manager. Package managers help with auditing; provide leverage for install, update, and removal capabilities; provide summaries of who did what when; and provide a manifest. Examples of this are apt-get and yum.
- Thou shalt design an upgrade process before releasing version 1.0. Decisions made in packaging can affect the ability to upgrade. Consider how to roll back if the upgrade or patch fails. Is there a downgrade process if there's some other reason to roll back an otherwise-successful upgrade?
- Thou shalt provide a detailed log of what thou hath done to my machine. Every operation (install, patch, upgrade, and remove) needs to log exactly what it did or will do. Providing a "do nothing but tell me what you would do" option is often helpful.
- Thou shalt provide a complete process for install, upgrade, patch, and uninstall. This process must be automated and provide for both rollback and roll forward. Packages should be relocatable, so it can run in (e.g.) /opt or /usr/local or /home/whoever and possibly not require root to install.
- Thou shalt apply these laws to systems administration as well as release engineering.
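As a purely illustrative take on the buildID commandment, here's a minimal sketch that assembles an identifier from version fields plus the details needed to recreate the build later; the exact fields and their order are my assumptions, not a prescribed format:

    import getpass
    import subprocess
    from datetime import datetime, timezone

    def make_build_id(major, minor, patch, build_number):
        """Combine version numbers with the VCS revision, a timestamp, and the
        user running the build, so the build can be identified and reproduced."""
        revision = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True).stdout.strip()
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
        return "%d.%d.%d.%d+%s.%s.%s" % (
            major, minor, patch, build_number, revision, stamp, getpass.getuser())

    # Example output (values vary): 1.2.3.4+a1b2c3d.20101110181500.builder
    print(make_build_id(1, 2, 3, 4))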
During the break I did a quick run through the vendor floor. I didn't want any of the schwag they had, so since lunch wasn't sitting well and there was nothing that appealed to me talk-wise in this block I went upstairs for a power nap before dinner. For dinner, the 0xdeadbeef crowd — Brent, Janet, Dan, Bill, Adam, AEleen, Strata, Dinah, and I — went to Forbes Mill in Los Gatos. The food was good, but service was on the mediocre side. One oddity was that the mints were not available at the front desk (as one might expect) but in the men's (and I'm told the ladies') room(s).
Got back to the hotel late for the Scotch BOF. Hung out with the crowd there and had some yummy stuff (though of course I didn't write down what it was) until nearly midnight, then headed back to the room, wrote some of this report, and crashed.
I woke up with biological issues at 2, 4, 5, and 6, and finally gave up. Tried to reconnect to the hotel network but while I could get an IP address I couldn't get to the router or anything beyond it. I did manage to get network connectivity in the convention center lobby, so I'm confident it's something specific to the room; it turned out later that they'd disabled my connection for excessive use of bandwidth since the laptop downloaded patches overnight.
The first session this morning was Bill Cheswick's last-minute invited talk, "Rethinking Passwords." (Another speaker was to have given a talk but apparently was unable to get a visa to enter the country to do so.) As part of his credentials, he noted that he's number 98 on a list of the top 100 influential IT people, beating out Ben Bernanke (number 100).
We started with a list of various environments' password complexity requirements, from length (minima and maxima) to content (lowercase, uppercase, numeric, special, spaces or not). These rules supposedly increase complexity to reduce guessability. However, now that computers have multiple CPUs and we've got keystroke loggers and phishing attacks and password-database compromises, once the Bad Guys have the bits they can computationally break your password given enough time and CPU. So as a recommendation, use a password that (a) a friend or colleague can't guess and (b) someone shoulder-surfing can't figure out from watching your fingers. Another recommendation is to not allow the use of common passwords.
When locking accounts, you shouldn't count duplicate wrong guesses. For example, most users type a password quickly then retype the same password again but slowly; that should count as one, not two, failed attempts. You should allow a password hint that the user can provide (for example, "Your strong password with the trailing digit"), and allow a trusted party to vouch for you (for example, someone confirming that their spouse is indeed who they say they are). Finally, you should remind users of the rules so they can figure out which password they might have in use ("Oh, yeah, this is the site with ALL CAPS passwords"). This is an area for research but not much exists in the open literature.
We had a digression on entropy. 20 bits could be "pick 2 of the most common 1024 English words." Facebook and Twitter require 20 bits of entropy for their passwords; banks and other financial institutions want 30; governments and academic institutions want 40 or more. You can go to Bill's insult page for 42-bit passphrases.
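To make the entropy arithmetic concrete: the bits of entropy are just log base 2 of the number of equally likely choices. A quick sketch (the second example is my own added illustration, not one from the talk):

    import math

    def entropy_bits(num_choices, picks=1):
        """Bits of entropy when picking `picks` items independently and
        uniformly from `num_choices` possibilities."""
        return picks * math.log2(num_choices)

    # Two words chosen from the 1024 most common English words:
    print(entropy_bits(1024, picks=2))  # 20.0 bits -- the Facebook/Twitter tier

    # For comparison, a truly random 8-character password drawn from the
    # 95 printable ASCII characters:
    print(entropy_bits(95, picks=8))    # about 52.6 bits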
In general, his advice for users is:
- Have 3 levels of passwords: No importance (like online newspapers), inconvenient if stolen (like vendor sites), and major problem if abused (like banks and medical history).
- Write down password reminders, not the actual passwords. For example, "Use the tier-2 password with a trailing digit."
- Use variations to meet stronger requirements (for example, password, password4, password4%, and so on, but don't actually use "password").
- It's okay to save the passwords in your web browser, but you'll accept the risk that if your machine is stolen there's now an attack vector.
His advice for implementors is:
- Get out of the dictionary attack game. Count and manage authentication attempts, but don't count multiple identical wrong guesses as more than one, since people will fast-finger their probable password and then, when denied, type the same thing only slower. You can use pam_tally for this. You can slow down (a login delay that increases with each wrong guess) or block out addresses generating multiple failures. You can blacklist inquisitive IP addresses. (See the sketch after this list.)
- Use a centralized authentication server (but replication is dangerous).
- Use near-public authentication services such as OpenID or Openauth.
- You can make the account names harder to guess.
- Disable password logins (use DSA keys).
- Use client certificates to limit the attack surface, requiring valid SSL first.
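Here's a minimal sketch of the "don't double-count identical wrong guesses, and slow down with each distinct failure" logic from the first item above; it's purely illustrative (this is not pam_tally, just the policy), and it keeps only a salted hash of the guess, never the guess itself:

    import hashlib
    import os

    class FailureTracker:
        """Track failed logins per account: identical consecutive wrong guesses
        count once, and each distinct failure increases a login delay."""

        def __init__(self):
            # account -> (salt, hash of last failed guess, distinct failure count)
            self._state = {}

        def record_failure(self, account, attempted_password):
            salt, last_hash, count = self._state.get(
                account, (os.urandom(16), None, 0))
            attempt_hash = hashlib.sha256(salt + attempted_password.encode()).hexdigest()
            if attempt_hash != last_hash:
                count += 1  # a genuinely different wrong guess
            self._state[account] = (salt, attempt_hash, count)
            return count

        def delay_seconds(self, account):
            """Back off exponentially with the number of distinct failures."""
            return min(2 ** self._state.get(account, (None, None, 0))[2], 60)

        def clear(self, account):
            self._state.pop(account, None)

    tracker = FailureTracker()
    tracker.record_failure("ches", "wrong-password")
    tracker.record_failure("ches", "wrong-password")  # same guess, still counts as 1
    print(tracker.delay_seconds("ches"))              # 2 (seconds) after one distinct failure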
In conclusion, you want strong authentication, not necessarily strong passwords. We need to do better, since the bad guys are getting better.
The second session this morning was Evan Haber's invited talk, "System Administrators in the Wild." Since 2002 a group at IBM has been studying sysadmins in the wild to better understand how they work, both to inspire improvements in tools and practices and to explain the ever-growing human costs in enterprise IT. As outsiders they were fascinated by what they learned, so they've written a book on the subject to explain our work to the rest of the world.
He started with the premise that sysadmins are important, but not as if he was sucking up to the audience of sysadmins. "The entire IT infrastructure underlying modern civilization would fall apart without our work," he said. Unfortunately we're also expensive: We're about 70% (and increasing) of the total cost of ownership, in part because hardware and software get cheaper every year, in part because of the increased complexity.
Anthropologists like him use ethnography; they live among and participate with their subjects in order to understand their life and priorities. There are limitations to ethnography: It's time- and labor-expensive and has a narrow population and time sample. Ethnography is about the real stories of real people. He gave an example of a live move of a database (with unfortunately poorly-shot video and low-quality audio). The anthropologists realized the risks involved in such things.
The rest of his talk went through the first chapter of the book he and his team are writing:
- People — It's very collaborative.
- Technology — It's very complex (and getting more so).
- Methods of coping — Processes and risk management.
- Tools — How SAs build things (scripts, documentation, webapps, and so on).
- Organizations — Sales, support, training, and so on.
- Communities — Orthogonal to organizations.
- IT Work — Tie all of that together.
Increasing automation allows for increasing complexity (with the same staffing) over time. He wrapped up the talk with an exhortation to let him know what's missing from the book. In the Q&A afterwards, we did. It was noted that the example was of a junior sysadmin in a disaster situation with a senior sysadmin who wasn't mentoring the junior, but that showing a video of a more-senior administrator competently fixing something wasn't compelling.
For lunch, several of us — Ijaaz, Jay, Richard, Robert, Stephen, and I — went to Peggy Sue's for burgers. We had to walk around the Veterans Day parade (which seemed pretty small, but we were pretty near the beginning of it, it seemed). We got back to the convention center in time for the third session of the day, Adam Moskowitz's "The Path to Senior Sysadmin." Rather than summarize his talk, I'll link to his speaker notes (PDF).
In the last session block I didn't need to see the panel on legal issues in the cloud, the invited track on centralized logging, or the guru session on project management, so I hallway tracked this block. There was a bit of punch-drunkenness going on from the conference IRC backchannel; Joseph from Google had asked for help generating the title of his talk with the subtitle "Preparing for Incident and Rapid Event Response." For the next several hours, any time anyone said anything it was in the form of a title with that subtitle. This meme spread to Twitter and Facebook as well, much to the amusement of the very tired sysadmins around the two tables in the convention center lobby. (I suspect "you had to be there" to get it.)
The conference reception started at 6:30pm, continuing the "adventure" marketing theme of the conference. The food was Indian(-inspired) and not bad, though there was nothing vaguely vegetable-like I could eat (and the fact that I, an unabashed and unashamed carnivore, pointed this out is interesting) and the hotel catering neglected to pass around 6 or 7 trays of lamb chops; the wine was mediocre, the games were silly (and they lied: the so-called "poison dart toss" didn't poison the darts at all), and I remain unconvinced that they'd licensed the theme music (though they said they did when I asked). It was about typical for the conference receptions of recent memory and nowhere near as bad as the 2001 reception with the circus theme held in the combustion engine-smelling parking garage with cold and incorrectly-cooked food.
Most people left the reception for the Google "Free (as in Free Beer) Beer and Ice Cream" BOF (though "Vendor Hospitality Suite" would be a more appropriate name) around 8:15pm, though they didn't open until shortly after their start time of 8:30pm, causing a bit of congestion in the hallway. (I later found out that Someone had demanded certain conference programming personnel attend BOFs other than Google's so they wouldn't be empty.) Got my free ice cream, schmoozed with some folks, participated in the raffles (and didn't win anything), then ran upstairs to change for a quick soak in the hottub before going to the evening staff party. The unpassed lamb chop trays from the reception were all here — after all, we'd paid for them — so I chowed down on some of them (and some cheese and crackers and chips and salsa and cookies and samosas and beer and wine and scotch were also available) before heading back to the room to crash.
The first session I went to this morning was Sean Kamath's invited talk, "10,000,000,000 Files Available Anywhere: NFS at Dreamworks." Like the keynote, this was some version of a sequel; back in 2008 in the Q&A for their talk, someone asked about the NFS scalability given their data needs. This talk is a result of that question.
They really do have 10 billion files. NFS is pretty much the best available solution for their needs; other alternatives, like FTP, rcp/scp, and rdist, create multiple copies of the files, which leads to questions as to which copy is authoritative. At the time this was built, WebDAV didn't scale, and neither AFS nor DFS was as fast or as reliable; both are also still less well-supported.
They have render farms and use NFS caching for the servers; the current infrastructure includes:
- Global file system namespace — Caches are used for single source with local representation, though there are local-only file systems for site-specific applications and some location-specific versions for, e.g., config files.
- Data storage hierarchy — Not all data has the same requirements; for example:
- Active productions (many many many small files, many big files, and many oft-regenerated files)
- Semi-archived shows (sequels need access to the predecessor's data)
- Archived data/shows (though some have been restored)
- Application/development software
- Caching — Caching should:
- Provide scalability
- Provide geographic accessibility
The caching does introduce overhead; it's not complex but it needs good planning. It can add latency, but it's better than overloading a NetApp filer.
- Automounting — Automounting is heavily used. They identified a lot of bugs due to the scale. Automounting takes advantage of caching: They keep all maps in LDAP, use site-local variables to select the maps, use a global variable for the site and local variables for cache selection. In the USA alone, there are about 275 automounter maps with about 6,500 entries, all using 8 variables (like $LOCATION, $OSNAME, $ARCH, $OSREL, and $CACHE) now that there's a single LDAP domain. Automounting makes moving (sub)volumes around easier, but (a) they need to do it in all locations and (b) they used to need to take down the render farm to restart automount until autofs v5 was deployed.
One interesting side note is that everyone's workstation can see everyone else's desktop's /tmp via automount.
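As a rough sketch of the variable-driven map selection described above (the map entries, hostnames, and variable values are invented for illustration, not Dreamworks' actual maps):

    import string

    # Hypothetical site-local variables of the kind the automounter substitutes
    # into map entries ($LOCATION, $OSNAME, $ARCH, $OSREL, $CACHE, ...).
    SITE_VARS = {
        "LOCATION": "siteA",
        "OSNAME": "linux",
        "ARCH": "x86_64",
        "OSREL": "2.6",
        "CACHE": "cache01",
    }

    # Hypothetical map entries: key -> NFS source containing variables, so the
    # same map stored in LDAP works at every site.
    MAP_ENTRIES = {
        "tools": "${CACHE}.${LOCATION}.example.com:/vol/tools/${OSNAME}-${ARCH}",
        "shows": "${CACHE}.${LOCATION}.example.com:/vol/shows",
    }

    def resolve(key):
        """Expand the variables in a map entry the way the automounter would."""
        return string.Template(MAP_ENTRIES[key]).substitute(SITE_VARS)

    print(resolve("tools"))
    # cache01.siteA.example.com:/vol/tools/linux-x86_64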
The challenges in general are:
- Stuff breaks.
- Sometimes data needs to be at another site.
- Much of the process isn't automated yet.
- They need real-time disk utilization reports for departments, but it takes over 12 hours to scan the file system.
After the morning break (pastries and fruit), I next went to Cat Okita's invited talk, "Er, What? Requirements, Specifications, and Reality: Distilling Truth from Friction." She presented a humorous look at requirements and specifications: what they are and why you'd want them in the first place, with some ideas about how to create good requirements and specifications, recognize bad ones, and play nicely with others to end up with results that everyone can live with. One of the examples was "Build a Death Star from pumpkins." The desired end result was a roughly spherical pumpkin of uniform density, carved to look like a Death Star when lit from within, but the requirement specification was... less than ideal.
Lunch was with Cat, Mark, and two others whose names I neglected to write down before I forgot them. We went to a Cajun place only to find they were dinner-only, so we went to Original Joe's; I had a sausage sandwich which was good if a bit mild.
The third session of the day was the invited talk "Reliability at Massive Scale: Lessons Learned at Facebook." This was unfortunately a pretty light-on-content talk. With 500 million active users and no plateau in their growth, they've got massive systems and massive loads. They roll out small code pushes to a subset of their users, but even 0.1% is 500,000 people. They use memcached heavily.
They talked about their recent (September 23, 2010) outage, where they pushed two changes at once (a simple one and a complex one), which incidentally violated their "only change one thing at a time" policy, but it turns out it was the simple "trivial" JavaScript change that broke everything and not the complex, risky memcached change. To make a long story short, the database got a bad value, so the cache kept getting poisoned, so the application servers all hit the database directly. They now have a better process to sanity-check new values before replacing the old value in the cache.
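A minimal sketch of that "sanity-check new values before replacing the old one" idea, assuming a simple key/value cache and a per-key validator (none of this is Facebook's actual code):

    def update_cached_config(cache, key, new_value, validator, fallback):
        """Only replace the cached value if the new one passes validation;
        otherwise keep serving the last known-good value rather than letting a
        bad database row poison the cache and push every request to the DB."""
        if validator(new_value):
            cache[key] = new_value
        elif key not in cache:
            cache[key] = fallback  # never leave the key empty
        return cache[key]

    # Example: a hypothetical feature-flag value that must be a boolean.
    cache = {"new_profile_enabled": False}
    is_bool = lambda v: isinstance(v, bool)
    print(update_cached_config(cache, "new_profile_enabled", "garbage", is_bool, False))  # False
    print(update_cached_config(cache, "new_profile_enabled", True, is_bool, False))       # True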
Facebook is fundamentally different: Everything is a social graph with information about people and how they're connected to other people. Everything queries that graph. Unlike traditional web sites where one user has specific data, on Facebook all the data is interconnected. If you're not careful, you can cascade errors from one machine/cluster to another.
The closing session began with the usual thanks: The USENIX Board liaison thanked all of the attendees for coming and the chair for his work, and exhorted us to come to LISA 25 in Boston in December 2011.
David Blank-Edelman closed the conference with the plenary session "Look! Up in the Sky! It's a Bird! It's a Plane! It's a Sysadmin!" He started by thanking the USENIX staff and our AV team (MSI), continued with a listing of his past talks and credentials as a comic book geek, and then in his talk drew parallels between superheroes in comics and sysadmins in reality. Unfortunately (from my point of view), he took too long to get to the point, took too much time for audience participation, and basically tried to shoehorn 5 minutes of content into 90 minutes of talk. I was disappointed (and unable to keep my sotto voce comments sotto) and wound up leaving after 45 minutes and before he actually got to his thesis statement.
After the session, I went to dinner with a small group — David, Susan, Kalida, Mike C, and Alexey [sp] — to Fuji Sushi. The five adults more or less split a large boat (albacore, yellowfin, tuna, halibut, salmon, shrimp, clam, octopus, and uni, with lots of daikon; all nigiri sushi and sashimi, no rolls). With the soup and tea it was still just over $40 per adult with tax and tip. After walking back to the hotel I did a quick pack prep before heading to the Hilton's quieter bar, where I shared some mint Timtams with Lee Ann, Nicolai, and Peter while drinking some scotch courtesy of Mario's per diem. Headed back to the room to finish packing what I could before heading to bed.
Today was a travel day. I checked email, resolved some issues, sent some email, showered, and finished packing. It occurred to me that I never set the alarm clock at all this week and was still awake by 7am (if not 6am) every day. Anyhow, I did some more work on the trip report before grabbing a cab and heading to the airport. A little confusion aside ("Cash only? The hotel said I could pay with a credit card"), I checked in and cleared Security with no trouble, got to the gate (very near Security and at the end of the terminal), and hung out using the free WiFi and at-seat power jacks to write more of this trip report until it was time to board.
The SJC-to-LAX flight was uneventful. We were a bit late leaving SJC (25 minutes late to board but only 10 minutes late to depart) yet managed to land at LAX on time. Caught the shuttle from terminal 3 to terminal 5 and grabbed a quick lunch before writing more trip report. For the LAX-to-DTW flight they moved me from the back of the plane (42C) up to the rear exit row (27C, with seats that recline), and nobody in the rest of the row. Ah, comfy. The plane had individual seat-back video screens with the automated where-am-I flight tracking (cool!), and WiFi access (but I'm not so addicted to the 'net that I needed to pay $13 for the privilege of using it). I like living in the future. I'd like it more if it wasn't so expensive though.
Landed, got to baggage claim, got my bag, got to the curbside pickup where the interterminal van was loading, and got to my car with a minimum of fuss and muss. Managed to get home a bit after 10pm EST, and was fed and unpacked by 11pm, where I went to bed in a not-entirely-successful attempt at shifting back to an Eastern time zone-based schedule.