Tuesday's sessions began with the Advanced Topics Workshop; once again, Adam Moskowitz was our host, moderator, and referee. We started with an overview of the revised moderation software and general housekeeping announcements. (Well, we really started by picking on Adam. And Andrew Hume, who eked out 2005's most talkative participant award, and Trey Harris, who was 2004's most talkative by a factor of 2.)
We followed that with introductions around the room. For a variety of reasons, several of the Usual Suspects weren't at this year's workshop. In terms of representation, businesses (including consultants) outnumbered universities by about 3 to 1; over the course of the day, the room included 3 LISA program chairs (one each past, present, and future, down from 5 last year) and 5 past or present members of the USENIX, SAGE, or LOPSA Boards (down from 7 last year).
We went around the room to say how we believed system administration has changed in the past year. The general consensus included autonomic systems; challenges of new jobs; educating teachers to bring them up to speed with what students know; improvements in automation; life-event changes, with marriages, deaths, and births; lower budgets; metrics; more fallout from legislation (such as SOX); more reliance on external infrastructure, such as external mail, calendar/scheduling systems, and wikis; organizational restructuring and staff turnover; targeted offshore security attacks; telecommunications integration; and virtualization on the rise.
Our first subject of discussion was storage. Several people have larger and larger storage needs; one example is a site growing at 10TB a month or 2.5PB a year, where smaller (such as 16GB) drives just don't scale any more. Other places are more than doubling every year. We discussed some options, like the promise of iSCSI, ZFS (which those who've used it are pleased with for the most part), and the forthcoming open source GFS-like file systems. The comment about determining your real needs and taking metrics is important: some 1Gb and 10Gb switches can't really switch traffic between many users at once, and if you're not measuring end to end you won't know where your bottlenecks are.
In addition to the primary storage needs (how much disk, how it's attached, what bandwidth you need, and so on), there are other ancillary issues, like backing it up: Do you store snapshots on disk, or do you back up to tape or to other spinning disk? How much is really enough, given that nobody ever deletes data? One point was to use software compression before the data hits the tape drive: a drive doing 3:1 hardware compression may need to be fed at 90 MB/s just to keep streaming, which is hard to sustain, whereas compressing in software reduces the bandwidth you have to deliver to the drive.
Another point was that if you do the math on ECC corrections, we now have enough disks that one site in particular is seeing bit rot in untouched files on spinning disks at about 1 error per several terabyte-years (1TB spinning for a year, or 2TB for six months). Yes, the rate is very, very low, but it's definitely nonzero, so if you don't checksum everything, always, you run the risk of bit rot losing or corrupting your data on disk. (Which leads to the question of where you store your checksums and what happens if they get bit-rotted.)
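To make the checksumming idea concrete, here is a minimal sketch (not something presented at the workshop) of a scrubbing pass: it records SHA-256 digests for every file and flags files whose contents differ on a later pass. The digest-file location is hypothetical, and a real tool would also track modification times so legitimate edits aren't reported as rot.

```python
#!/usr/bin/env python
"""Minimal bit-rot scrub: record SHA-256 digests, compare them on later runs.
A sketch only; the digest-file location and handling are hypothetical."""
import hashlib
import json
import os
import sys

DIGEST_DB = "checksums.json"  # hypothetical; note the workshop's question --
                              # the digest file itself can rot and needs protection

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def scrub(top):
    old = {}
    if os.path.exists(DIGEST_DB):
        with open(DIGEST_DB) as f:
            old = json.load(f)
    new = {}
    for dirpath, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.abspath(path) == os.path.abspath(DIGEST_DB):
                continue  # don't scrub our own database
            new[path] = sha256_of(path)
            # A real tool would also check mtime so that legitimate edits
            # aren't reported as corruption; this sketch skips that.
            if path in old and old[path] != new[path]:
                print("possible bit rot:", path)
    with open(DIGEST_DB, "w") as f:
        json.dump(new, f, indent=1)

if __name__ == "__main__":
    scrub(sys.argv[1] if len(sys.argv) > 1 else ".")
```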
We digressed into a brief discussion of backups: Do you back up files just at the OS level or at the application level as well? Do you back up laptops? Backup frequency can also differ by data type: the OS tends to change less often than home directories, for example. Finally, consider not backing up what you don't need to keep (for legal, compliance, regulatory, and similar reasons). If you don't have a policy, the recommendation is to write one yourself and then get approval or rewrites from your legal or compliance folks afterwards.
Our next large topic area for discussion was monitoring. We went around the room: 29% are using some kind of home-grown monitoring software, 83% are using open source tools, and only 8% are using commercial software (the numbers don't add up to 100% because several places use combinations). The software packages explicitly mentioned include Big Brother, Cacti, cron jobs that send email on unexpected errors, home-grown syslog watchers, logcheck, MRTG, Nagios, NetCool, Net Vigil, OpenNMS, RRD, smokeping, and Spyglass. Most people tolerate their monitoring, and very few are "very happy" with it. Nagios in particular had the largest representation, and the consensus seemed to be "It's the best of the bad choices": while most of us use it, nobody was an evangelist for it. In general, the suggestions are:
- Monitor what does happen that shouldn't.
- Monitor what didn't happen that should've.
- Monitor what you care about, don't monitor what you don't care about.
- You need history: We may not care about a single failed ping, but we do care about multiple successive failures; and once we've alerted, we don't need to hear about it again until the service comes back and stays up (the success table).
One problem is that we need more detail than just "up/down." Nagios as written doesn't differentiate between several states: is the host there (ping), does it have a heartbeat, did it give a response, did it give a valid response, and did it give a valid and timely response? The phrase "service is up" isn't necessarily meaningful. We discussed what we'd want in an ideal monitoring system, including cooperative signaling, so that "if I take 10 minutes it's okay, but if it takes longer there's a problem" is a case it can express.
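As a rough sketch of these ideas (this is not Nagios code, nor anything presented at the workshop), the loop below distinguishes no response, an invalid response, and a late response, alerts only after several successive failures, and goes quiet again once the service comes back and stays up; the check function, thresholds, and alerting hook are hypothetical placeholders.

```python
"""Sketch of history- and deadline-aware service checking; all names and
thresholds here are hypothetical, and this is not Nagios code."""
import time

FAILURES_BEFORE_ALERT = 3   # one bad poll isn't interesting; a streak is
DEADLINE_SECONDS = 600      # "if I take 10 minutes it's okay; longer is a problem"

def response_is_valid(response):
    # Placeholder: the real test should be written by whoever owns the application.
    return bool(response)

def classify(check):
    """Run one check and return a state richer than just 'up' or 'down'."""
    start = time.time()
    try:
        response = check()              # e.g., an HTTP or RPC probe
    except Exception:
        return "no-response"
    if not response_is_valid(response):
        return "invalid-response"
    if time.time() - start > DEADLINE_SECONDS:
        return "late-response"          # valid but not timely
    return "ok"

def send_alert(state, failures):
    print("ALERT: %s after %d successive failures" % (state, failures))

def monitor(check, poll_interval=60):
    failures = 0
    alerted = False
    while True:
        state = classify(check)
        if state == "ok":
            failures = 0
            alerted = False             # quiet until the next distinct outage
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_ALERT and not alerted:
                send_alert(state, failures)   # once per outage, not per poll
                alerted = True
        time.sleep(poll_interval)
```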
Another issue we have with Nagios is that it often doesn't monitor the right things, or performs the wrong tests. Who writes your tests: Is it the person responsible for the application, or a monitoring group, or someone else? The actions taken also need to be aware of criticality: How urgent is the problem, how often should it be tested for, and so on.
This led to a discussion about machine learning (monitoring tools that build or configure themselves) and self-aware applications that can determine on their own whether they have a problem and send alerts themselves. Better application design can lead to better monitoring.
After our lunch break, we went through and mentioned tools new to us as individuals since last year's conference; the tools included Adobe Lightroom, Asterisk, Aware I Am, decoy MX servers to block spammers, DocBook SGML, Dragon Naturally Speaking, Drupal, Google Spreadsheets, hardware security monitors and crypto ("key roach motels"), IP KVMs, IP power, IPMI cards, isolation booths at work for private phone calls, LyX, Mind Manager for mind mapping, Mori, OpenID, Password Safe, photography management software (for births and weddings), Rails for admin interfaces, relationships with intellectual property lawyers, RSS feed-reading software, SQL Gray, Solaris 10, Solaris Zones and ZFS, Sparrow, USB-attached RFID readers, VoIP, wikis (because "they work now"), and x2vnc and x2x.
Next we talked in more detail about ZFS. Someone asked if it was as wonderful as the hype said it would be, and the answer boiled down to "Yes and no." For the most part, it's very, very well designed: it does what you want, and even though it sounds too good to be true it's pretty close to that. However, if you use it long enough you'll see the warts. It works well with zones, but not everyone in Sun support knows enough to troubleshoot problems yet; there's only one commercial backup product for it so far (Legato); there aren't any best practices yet; and there's no way to say "Evacuate this disk and give it back to me."
Next we discussed calendaring. As a group we use a lot of software, and at best we tolerate it; the big ones are Exchange's calendaring on the PC side and iCal on the Mac. We came up with a feature list for a good system, which included multi-OS support (specifically Mac, Windows, Linux, Solaris, HP-UX, and *BSD); integrating both home and work calendars, while keeping them separate so that other "home" users (such as spouse and kids) can only see the "work" entries as "busy" without details; and being able to see free/busy time on others' calendars and to schedule events with negotiation, which requires ACLs of some kind. There's still no good solution.
We next discussed cheap scientific clusters. Now that systems with four dual-core CPUs are available, someone built an inexpensive yet effective 4-node (soon growing to 10-node) cluster with InfiniBand InfiniPath for internode communication and gigabit TCP/IP for networking. They use RAID 5 on the head node of the cluster, and each node has 32GB of RAM. They can almost get rid of their decade-old Cray (a job on the cluster can use up to the 32GB of memory in one node, while the Cray has 40GB). It's doing better than they expected, but it's very noisy.
This led us to a discussion about power consumption and heat generation. One site recently got a supercomputer grant, and the hardware needs 300 tons of cooling where their entire data center has only 45 tons; their entire campus doesn't use as much power as this one supercomputer will once it's fully loaded. Consolidating several smaller servers as virtual machines on one larger machine reduces power and heat. Some articles say that DC power helps somewhat, since you avoid converting between DC and AC. There's not a huge market for better power consumption yet, mainly because few people in the purchasing process ask for it; but if you require low-power, low-voltage, slower-but-cooler hardware in the hardware selection process, possibly by specifying a total wattage instead of a number of systems, the market will correct itself and give you more options. Other suggestions for reducing power consumption and heat generation included replacing CRTs with LCD flat panels, using thin clients in conference rooms and at secretarial desks where you don't need an entire PC (which has positive side effects on security), replacing desktops with laptops (even permanently docked ones), and replacing incandescent lights with compact fluorescent lights (CFLs). Any and all of these can reduce your power costs, cooling costs, and fan noise.
After the afternoon break, we talked about support changes. As has been the case in recent years, more places are trying to do more — more services, more products, more projects, more hours of support — with fewer resources: fewer or the same number of people, fewer machines, and so on. In general, folks are accomplishing this with remote access (ssh into corporate environments, or VNC to client or customer machines, so you can support them from your own desk). There is also the issue of who supports home machines: they're used by both the home and the corporation, so they don't fit neatly into most places' support categories. It should be noted that supportability implies backups.
We next went around the room to discuss our most important or most difficult problems. This year, the big one was resource allocation: insufficient staff, in both quantity and training, and insufficient time. Finding people is hard, keeping people can be hard (they quit or are reorganized away from your team), and cross-team communications is often hard. There are often too many fires to put out, so prioritizing which fire gets fought first is necessary. The second most common problem is the learning curve: several of us are in new environments, and it's challenging first to learn what was done and why, and how things got into their current state, and then to improve things to follow best practices; many people resist change management, even at the level of "Tell someone when you change something." The third most common problem is career management: What can we do when we're getting bored with our current role, or if there's no growth path to "senior engineer"? Finally, compliance (with legal requirements such as HIPAA and SOX) is taking up more of our time; about 25% of us are doing more of it now than last year.
Finally, we discussed what's on our horizon, or what we expect the next year will be like for us. We predict that our challenges for the next 11 months will include application and OS upgrades back to the bleeding edge; clustering; compliance; exponential scaling; leading understaffed teams and dealing with staff retirement; making the infrastructure more reliable, more robust, and more reproducible; virtualization; and working with 10GigE.