Tuesday's sessions began with the Advanced Topics Workshop; once again, Adam Moskowitz was our host, moderator, and referee. We started with our usual administrative announcements and the overview of the moderation software for the five new folks. Then, we went around the room and did introductions.
For a variety of reasons, several of the Usual Suspects weren't at this year's workshop. Despite this, businesses (including consultants) again outnumbered universities in representation by about 4 to 1; over the course of the day, the room included 5 LISA program chairs (past, present, and future, up from 4 last year) and 9 past or present members of the USENIX, SAGE, or LOPSA Boards (down from 11 last year).
Our first topic was a round-the-room discussion of the biggest problem we'd each had over the past year. Common threads included career paths, such as whether to leave systems administration for management or development, how to stay in systems administration and motivate both yourself and your employees in a stagnating position, and how to find a job that's a better fit; reorganizations and lack of structure; doing more with less; and writing tools to automate tasks.
Our next topic was a brief comment about the effective expiration of RAID 5 as disk sizes increase. When you have a 5+1 RAID 5 array of terabyte drives, rebuilding a failed drive requires reading 5TB of data from the surviving disks to recompute parity; any unrecoverable read error during that rebuild means data loss. Using 2TB or larger disks means that the odds of hitting an unrecoverable error rise to 80% or higher. Andrew Hume explicitly said, "By the end of 2009, anyone still using RAID-5 storage on large drives is professionally negligent."
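As a back-of-the-envelope illustration of that arithmetic (my own sketch, not something presented at the workshop), the rebuild-failure odds can be estimated in a few lines of Python, assuming the commonly quoted unrecoverable-read-error rate of one per 10^14 bits; the actual numbers depend on the drives' specified error rate and the width of the array.

    import math

    URE_RATE = 1e-14     # assumed: one unrecoverable read error per 10^14 bits read
    TB_BITS = 1e12 * 8   # bits in one (decimal) terabyte

    def rebuild_failure_probability(surviving_disks, disk_tb):
        """Probability of hitting at least one URE while reading every
        surviving disk to rebuild one failed disk in a RAID 5 set."""
        bits_read = surviving_disks * disk_tb * TB_BITS
        return 1 - math.exp(-bits_read * URE_RATE)

    print(rebuild_failure_probability(5, 1))  # 5+1 array of 1TB disks: ~0.33
    print(rebuild_failure_probability(5, 2))  # 5+1 array of 2TB disks: ~0.55
    # Wider arrays or drives with worse error rates push these numbers
    # toward the 80%-and-higher figures quoted above.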
We next discussed storage. There was some question as to the best way to make very large data sets available to multiple machines at the same time. Some sites are stuck with NFS version 3 due to interoperability issues. The consensus is that NFS is like VHS: It's not the best technology, but it's what we've got; use it if you want, or don't use it and write your own. If you're doing high-performance computing (HPC), GFS, Lustre, and ZFS may be worth investigating, depending on your requirements. The consensus on using iSCSI heavily in production server environments is "Don't."
Our next discussion was automation. We started with automation of network configurations, since all the good solutions now in that space cost money. There should be a free tool, like cfengine or Puppet but explicitly for networking, that inspects your environment and all its configurations. The goal is managing racks, power, load balancers, VLAN configurations, ACLs on the switches, and NAT on the firewall, as well as updating the monitoring (Nagios) and trending (MRTG) tools. Other automation tools mentioned include Augeas and Presto.
The mention of integrating with monitoring led to a discussion as to when it's appropriate to add a new server or service to your monitoring system. One argument is to deploy the service and the monitoring update at the same time, with alerts silenced until the service goes into production, since this tightly couples deployment and monitoring. Another argument is to add the new service to monitoring only when it's ready to go into production, though this is harder to automate in a one-stop-shop mode, since there can be a long time between initial deployment and going into production. There is no single right answer, since the pros and cons tend to be environment-specific.
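As a small sketch of the "deploy with alerts silenced" approach (my illustration, not a tool shown at the workshop), a provisioning script could emit a Nagios host definition with notifications turned off until go-live; the host name and template here are hypothetical.

    def nagios_host_stanza(host_name, address, in_production=False):
        # Emit a Nagios host definition; notifications stay off until the
        # service is declared production-ready.
        return (
            "define host {\n"
            "    use                    generic-host\n"  # assumed local template name
            f"    host_name              {host_name}\n"
            f"    address                {address}\n"
            f"    notifications_enabled  {1 if in_production else 0}\n"
            "}\n"
        )

    # Monitoring is deployed alongside the server, silently at first;
    # flipping in_production=True at cutover enables paging.
    print(nagios_host_stanza("newweb01", "10.1.2.3"))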
After our morning break, we resumed with a round-robin list of the latest favorite tools. This year, the list included BLCR, C++, cfengine, git, IPMI, MacBook, pconsole, pester, pfSense, Roomba, Ruby, SVN, Slurm, TiddlyWiki, Tom Bihn backpacks, virtualization (including OpenVZ and Parallels), and ZFS.
Our next discussion was on cloud computing and virtualization. We're seeing it more and more and are wondering where the edge cases are. It tends to work well at the commodity level but not for certain services (such as file services). Managing virtual machines can be problematic as well. Some people are too optimistic about what they think they can gain, and some are over-allocating machines, which can lead to outages. Managing the loads is a hard problem, since the tools may not show accurate information. HPC shops don't see much gain from virtualization, since they tend to be very computation-intensive and the ability to relocate virtual machines to different physical machines may not be enough of a gain to be worthwhile. Consensus was that virtualization is more useful in smaller development and web service shops, especially in providing QA or test environments that look like their production counterparts. It's also useful for trying something new: Take a snapshot, work on it, and if you destroy it (or the software install wipes it out, or the patch blows up horribly) you can roll back to the snapshot you took. Finally, server consolidation (especially in data centers) and reducing power consumption are big drivers.
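A minimal sketch of that snapshot-and-rollback workflow, assuming a libvirt/KVM guest managed with the virsh command-line tool (other hypervisors have equivalent commands); the guest and snapshot names are hypothetical.

    import subprocess

    GUEST = "qa-test01"  # hypothetical libvirt guest name

    def snapshot(name):
        # Record the guest's current state before doing anything risky.
        subprocess.run(["virsh", "snapshot-create-as", GUEST, name], check=True)

    def rollback(name):
        # The experiment (patch, install, ...) blew up: revert to the snapshot.
        subprocess.run(["virsh", "snapshot-revert", GUEST, name], check=True)

    snapshot("before-patch")
    # ... apply the patch, test it, and if it wipes things out:
    rollback("before-patch")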
We next talked about career satisfaction and the lack thereof. Some senior folks (on both the engineering and operations sides of the shop) are writing policies instead of doing "real work." This is similar to the shift from technical to management career paths; it works for some and not for others, in part because the work itself is different and in part because the reward is different. There's some concern that as we age we may lose touch with technology, to which many of us have tied our self-identity or self-worth. This is more problematic for those without similarly inclined peers to discuss issues with. Part of the problem is also that we as a profession are moving from server- or service-focused roles to more of a business focus; we exist to keep the business running, not to play with the cool toys. Some people have come back to systems administration from management and feel that the experience on the business side has been a huge benefit and makes them better systems administrators. Additional satisfaction can come from mentoring.
This segued into a discussion about when it's appropriate to break policies that are preventing the work from getting done. The summary is that rewriting them to avoid the problem, or amending them to allow for IT-based exceptions, is the best course of action.
After our lunch break, we talked more about monitoring. Best practices seem to include using both black-box monitoring tools (external probes that pretend to be the user) and white-box tools (collecting internal statistics and analyzing them later). Also, keeping historical data around is required if you want to do any trending analysis or need to audit anything. One argument is to capture everything, since you don't necessarily know what you'll want next month or next year; the counter-argument, to capture only what you need for business purposes, has the advantages of using less disk space and leaving less data available for legal discovery later. It all depends on what you care about and what level of failure in data collection you can live with. Clusters pose interesting challenges: how much of your limited CPU (and cache, network bandwidth, and so on) are you willing to allocate to monitoring, given that it impacts your ability to process real data? Monitoring should not be an afterthought but an integrated part of any solution.
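As an illustration of the black-box side (a sketch of my own, with a hypothetical URL and log path), a probe can act like a user and keep timestamped results around for later trending or auditing.

    import time
    import urllib.request

    URL = "http://www.example.com/login"    # hypothetical service endpoint
    LOG = "/var/log/probe/www-latency.log"  # hypothetical history file for trending

    def black_box_check():
        """Fetch the page as a user would; log latency and status for later analysis."""
        start = time.time()
        try:
            status = urllib.request.urlopen(URL, timeout=10).getcode()
        except Exception:
            status = 0  # treat any failure to fetch as "down"
        latency = time.time() - start
        with open(LOG, "a") as log:
            log.write(f"{int(start)} {status} {latency:.3f}\n")
        return status == 200

    black_box_check()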
It should be noted that monitoring (checking the availability or function of a process, service, or server) is a different problem from alerting (telling someone or something the results of a monitoring check). This led to a discussion about not getting alerted unnecessarily. In one environment, the person who adds the monitoring rule is responsible for documenting how the Help Desk escalates issues, with the last-resort rule of "Contact the developer." This becomes more complicated in multi-tier environments (are you monitoring in development and QA as well as production?) and in environments with no 24x7 support access.
Maybe 5 of the 30 attendees were satisfied with the state of the monitoring in their environments.
Our next discussion topic arose out of the earlier satisfaction issues involving bad policies. The right answer in most cases is that the policies need to be fixed, and that requires escalating through your management chain. Most policies in this context boil down to risk management for the business or enterprise. Security, as it affects risk management, needs to be functional instead of frustrating; there needs to be an understanding of the business needs, the risks, and how to work both within and around the policies as needed. We need to move away from the us-vs-them mentality with Security for things like this. Even getting written exceptions to the policy can help (for example, "No downloading of any software is allowed, except for the IT group whose job it is to do so"). Note also that some suppliers are more trustworthy than others, so some things can be fast-tracked; document that in the policy as well. Policies should have owners to contact for review or explanation.
Next we did another round-robin on the big things on our plates for the coming year. Collectively, these include automating manageability, backing up terabytes per day, building out new and consolidating existing data centers, centralizing authentication, dealing with globalization and reorganizations, designing a system correctly now to deploy in 3 years, doing more for and with less, excising encroaching bad managers from our projects, finding a new satisfying job, mentoring junior administrators, moving back to technology from management, remotely managing a supercomputing center, rolling out new services (hardware, OS, and software) the right way, scaling software to a hundred thousand nodes, transitioning from a server/host-based to a service-based model, virtualizing infrastructure services, and writing policies.
After the afternoon break we had a brief discussion about IPv6. A little less than half of us are doing anything with it, mostly on the client side. The consensus is that there's no good transition documentation explaining what providers need to do to move from v4 to v6. It was noted that if you can say, "Here's the specific thing we need IPv6 to accomplish," you'll be able to move forward instead of being thought of as the crazy one.
Next we discussed chargebacks; people seem to find them mostly useful. Some places have problems with their internal auditors. It was noted that chargeback encourages perverse behavior, such as measuring what's easiest to measure rather than what's actually useful or desired. Also, some departments tried to charge the time it took to convert to a new system back to the department that rolled that system out. Some want to use chargebacks to provide accounting and force departments to forecast quantities such as CPU time or disk space utilization. Charging for the technology resource at some rate (such as per CPU-hour or per gigabyte per month) tends to work well, but that rate needs to include the human cost, yet not be so high that it discourages users from using your service and encourages them to do it themselves.
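A toy example of that rate calculation, with entirely made-up numbers just to show the shape of it: fold the amortized hardware cost and the human cost into a per-gigabyte-per-month rate, then sanity-check it against what it would cost users to do it themselves.

    # All figures below are illustrative assumptions, not real costs.
    hardware_per_month = 4000.0  # amortized storage hardware and maintenance, $/month
    admin_hours = 20.0           # sysadmin time spent on the service per month
    admin_rate = 75.0            # loaded labor cost, $/hour
    provisioned_gb = 50_000.0    # storage actually charged out

    rate_per_gb_month = (hardware_per_month + admin_hours * admin_rate) / provisioned_gb
    print(f"${rate_per_gb_month:.3f} per GB per month")  # ~$0.11 with these numbers

    # If this rate climbs well above what a department would pay to buy and run
    # its own disk, the chargeback starts encouraging do-it-yourself storage.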
Our next discussion was on professionalism and mentoring. How do we attract new blood into systems administration? There's no good answer: In some environments, clearances are needed, and in universities many of the technically interested people go into development rather than systems administration. Hiring student interns who want to be systems administrators can help (if you're in a position to hire students), as can going to local user groups, but good people are hard to find.
It may be that market forces will help; the demand for systems administrators should drive up salaries in the long run, and recent articles claim that system administrator, network administrator, and database administrator are recession-proof jobs. Money talks, but it's hard to get quality if people are interested more in the money than the work. There's also conflation of the term "systems administrator": Is it working with big, cool machines, or supporting users, or fixing printers, or being the computer janitor? People are starting to recognize that professionalism is important. Expectations for IT staff behavior are higher than in the past: Use full sentences, be polite, answer questions, and help solve problems.
This boils down to how we get people into the profession. Newcomers are already maintaining their own desktops, so they're not seeing any of the cool side. People come in through the help desk and programming, but what other vectors are there (and how common are they)? It used to be hard to provide certain services that are now trivially easy; mirroring, for example, is easy now (rsync and cheap disk) where it wasn't before.
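For instance, a mirror that once took real engineering can now be a nightly one-liner; a minimal sketch, with hypothetical source and destination paths, wrapping rsync from Python (the same command could simply run from cron).

    import subprocess

    # Hypothetical paths; -a preserves permissions and times, --delete keeps
    # the local copy an exact mirror of the source.
    subprocess.run(
        ["rsync", "-a", "--delete",
         "ftp.example.org::pub/",  # hypothetical rsync daemon module
         "/srv/mirror/pub/"],
        check=True,
    )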
Our last discussion was on power consumption and "green" computing. Many places are running out of space, power, or both, and need to increase efficiency. Most non-HPC places are just starting to look at the whole issue, though there's general consensus that it makes sense, both in terms of environmental issues (reduce, reuse, recycle) and economic issues (lower power bills, more computing per kilowatt). Suggestions included defaulting to duplex printing, powering off desktops and monitors overnight, raising the cold-aisle temperature a degree in your data centers, running 3-phase 208V power, virtualizing the services that can be virtualized, and not allowing "throw hardware at it" as a solution. Low-power CPUs and variable-speed disk drives may help as well.
This year's Talkies Award goes to DJ Gregor. Last year's winner, David Williamson, was not present; Andrew Hume, a former Talkies Award winner, was in the bottom 5 this year. (And on a personal note, I actually managed to get my name in the speakers queue on a relevant issue, surprising our moderator.)