Tuesday was the Advanced Topics Workshop, ably hosted and moderated once again by Adam Moskowitz and co-piloted by Rob Kolstad. The 26 of us went around the room doing introductions (who we are, what we do, what our last project was, and what the most important lesson from 2001 was). The introductions generated interesting questions and topics for discussion: random opinions, the "undo" command for sysadmins, hot tools, and surprises from the past year.
We went through some of the topics of interest from last year and gave our opinions. People are indeed using SANs and NAS, since they're well suited to specific problems (such as archiving, Fortune 1000 companies, and so on). However, they're not being used for general file services, mainly because the Fibre Channel implementation is too expensive for general use.
We also discussed the centralization/decentralization pendulum, which seems to be moving back towards centralization. Perhaps condensing is a better term, since places are condensing locations for their hardware and personnel but still keeping some geographic distance between them. Centralizing administrative functions is different from actual physical centralization, since (to use the SAN/NAS model) users don't care if the disk is local or across the continent as long as the performance is unaffected.
We're moving towards more of an ASP model within a given environment, be it company or infrastructure. The ASP model works well between divisions within an organization but not as well between different organizations, primarily due to trust issues.
The events of September 11th caused a shift in the thinking of some of the tight-fisted financial staff. They now realize how integral computing is to business, so colocation and backups are now more important.
The next major topic was mobility. Without mobility, today's commonplace high-speed network infrastructures and reliable file servers make a lot of system administration fairly easy: workstations can be built from images or automated installation processes, and all mutable data lives on centralized file servers where it's easy to back up and manage.
But mobility changes all that. Mutable data has to be local to the end-point (laptop, etc.), we can't expect network connectivity to be high-speed, and we have to be able to deal with connections over insecure networks. We have to deal with a host of security issues, find new ways of ensuring data availability, and be able to provide the needed services at various levels of network quality.
Mobility is becoming increasingly important; there are now many organizations where most end-points are mobile platforms. But IT infrastructures have not yet caught up to this changing reality. To deal with this, we will have to abandon our traditional (and previously successful) modes of thinking and use technologies that involve disconnected operation, mobile IP, synchronization, transparent data encryption, and so on.
Wireless computing has changed our behavior; 70% of us in the ATW are on laptops. Our expectations seem to be that we're approaching ubiquitous computing; of those using laptops, about 2/3 use them to access remote services (mail, web, files) and 1/3 use them as the centralized storage point. This leads to the intrusion of mini-environments into your own macro-environment: managing laptops, which can move from administrative domain to administrative domain (and pick up and distribute viruses and whatnot), and keeping them from screwing up your environment, is a hard problem.
Recovery-Oriented Computing (or ROC) is targeted to services. A PowerPoint presentation is available, along with information at http://www.cs.berkeley.edu/~pattrsn/.
The goals are ACME (Availability, Change, Maintainability, and Evolutionary growth) instead of performance (which is what we've looked for in the last 15 years). And we're not doing that well.
One of the aspects is not just to get real data to improve reliability but to measure reliability and availability. Giving system administration tasks an Undo function may be helpful. Think about the three Rs: Rewind (go back in time), Repair (fix the error), and Redo (move forward again). We're looking to recover at the service level, not just at the server (hardware or component) level.
- Predictability: Having predictable recovery aspects would be a huge improvement even now. Most recovery plans (or even risk mitigation) are pure guesswork now, based on experience and trial and error. Change control and change management need to be more formal and to involve actual prediction or detailed determination.
- Avoidability: Can you avoid the problem to reduce the recovery time? If you can avoid the problem then the need for recovery is less. This is reasonably important and very hard.
- Repeatability: Making tasks easily repeatable will help reduce complexity and can lead to increased avoidability and thus increased reliability.
- Risk Mitigation: Often a single change that we make at one time affects multiple machines (such as servers, routers, switches, firewalls, and so on). Rollback within any one system is good, but we need to have rollback in all of them. The problem becomes system-specific; is it a GUI or a CLI?
- Tools: The ROC project is trying to reduce the MTTR in the MTTR/MTTF equation. This project is more about building recovery-from-something-that-has-happened than making-the-problem-less-likely-to-occur.
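The MTTR/MTTF relationship mentioned in the last item is commonly written as steady-state availability = MTTF / (MTTF + MTTR). The following is a minimal sketch of that arithmetic, not anything presented at the workshop; the function name and the figures are made up for illustration. It shows why cutting repair time raises availability even when failures remain just as frequent.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures, chosen only to illustrate the effect of faster repair.
MTTF = 1000.0  # mean time to failure, in hours

slow_repair = availability(MTTF, mttr_hours=10.0)
fast_repair = availability(MTTF, mttr_hours=1.0)

print(f"MTTR 10 h: {slow_repair:.4%}")
print(f"MTTR  1 h: {fast_repair:.4%}")
```

Holding MTTF fixed and varying only MTTR mirrors the project's emphasis on recovering faster rather than on making failures rarer.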
Right now the thought is to build a sample (prototype) email system as a starting point.
What about security breaches (intrusion detection)? Something similar can be done; this kind of technology would be good. You could roll back to before the intrusion, install the filter or preventative mechanism or whatever, then roll the good stuff back in again.
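The rewind, repair, and redo sequence described above can be sketched in a few lines. This is an illustration only, not the ROC project's design: the names `Operation`, `UndoableStore`, and `rewind_repair_redo` are hypothetical, and the sketch assumes that every change is logged with a timestamp and that the state can be rebuilt from empty by replaying the log.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Operation:
    """One logged change to the service (hypothetical record format)."""
    timestamp: int
    key: str
    value: str


@dataclass
class UndoableStore:
    """Toy key/value service that keeps a full log of applied operations."""
    state: Dict[str, str] = field(default_factory=dict)
    log: List[Operation] = field(default_factory=list)

    def apply(self, op: Operation) -> None:
        self.state[op.key] = op.value
        self.log.append(op)

    def rewind_repair_redo(
        self, rewind_to: int, is_allowed: Callable[[Operation], bool]
    ) -> List[Operation]:
        """Return the operations that the repair filter rejected."""
        before = [op for op in self.log if op.timestamp < rewind_to]
        after = [op for op in self.log if op.timestamp >= rewind_to]

        # Rewind: rebuild the state as it was just before `rewind_to`.
        self.state, self.log = {}, []
        for op in before:
            self.apply(op)

        # Repair: `is_allowed` is the newly installed filter.
        # Redo: replay the later operations, keeping only those that pass it.
        rejected = []
        for op in after:
            if is_allowed(op):
                self.apply(op)
            else:
                rejected.append(op)
        return rejected


store = UndoableStore()
store.apply(Operation(1, "inbox/1", "hello"))
store.apply(Operation(2, "inbox/2", "MALICIOUS attachment"))
store.apply(Operation(3, "inbox/3", "quarterly report"))

# Roll back to just before the intrusion, install a filter, replay the rest.
rejected = store.rewind_repair_redo(
    rewind_to=2,
    is_allowed=lambda op: "MALICIOUS" not in op.value,
)
print(sorted(store.state))
print([op.key for op in rejected])
```

A real service would need snapshots rather than replay from empty, and would have to handle operations whose effects have already been seen by users; the sketch ignores both.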
Simply changing (fixing, simplifying, etc.) the interface is insufficient. Work does need to be done on SA recovery interfaces, but this is beyond the scope of the ROC project.
Next we discussed the new tools, technologies, ideas, or paradigms we're investigating or using. The list included new IP telephony products; tricks for ssh and CVS; wireless networking; integration and aggregation of alarm, monitoring, and administrative functionality with automation; reducing information replication; load balancing; anomaly detection; miniaturization; mirroring network storage for high-speed failover; VMware; MacOS X; Java; and Perl 6. The list also included business problems as opposed to technology problems.
One side discussion was about programming languages. Some people like Java, others like C#. Java is the new COBOL in that it's the new business language but not a system language. Some debate ensued, with no conclusion, about whether to teach C, C++, Java, or even Scheme first.
Several people mentioned surprises they'd had in the past year. This list includes Cygwin, the PC Weasel, the dearth of middle-men in the DSL/POP/ISP markets, and the number of people running wireless networks without any security.