Fri, Jul 01, 2005
Blueprints for High Availability
Well, this isn't your average QA job. I work in the deployment group. Our job is to make sure that products work when they're... well, deployed. As in, sure, all of your software tests passed in the lab, but what happens when we try to deploy it in the Real World™?
See now, that's a sysadmin's dream -- getting to play with lots of cool technology (firewalls, load balancers, clusters, SANs, app servers, database servers) in an environment where breaking something not only isn't a disaster, it's actually part of your job.
Of course, if you've never built a cluster before, there's a bit of a learning curve. So I decided to start where I always do when confronted with a new technology: the bookstore.
Since much of our work is focused on High Availability and Disaster Recovery configurations, I thought I'd start there. The first book I picked up was Blueprints for High Availability, Second Edition by Evan Marcus and Hal Stern. The authors ought to know what they're talking about: Marcus is a principal engineer at VERITAS, and Stern is the CTO of Sun.
The book is structured around a list of technologies and practices arranged into an "Availability Index." Think of it as the OSI model for HA. At the bottom are the fairly straightforward things that everyone should be doing to ensure availability such as buying reliable hardware and making regular backups. Each layer works toward increasing levels of availability (and cost) with technologies such as clustering, replication, and failover. And they're right, they really are layers -- there's no point wasting lots of money building a global cluster to fail over between geographically separate sites if you haven't invested in fault-tolerant storage and redundant network connectivity.
Having described the Availability Index, the authors provide a general introduction to the field, including its jargon (e.g., MTBF, MTTR, sigmas, and nines). Especially helpful here is a chapter on "The Politics of Availability," describing how technical personnel can get management buy-in. This is important, considering that (a) sysadmins aren't always particularly good at communication, and (b) HA technology tends to be expensive. If you haven't built the case for availability, expect resistance.
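For anyone who hasn't met the jargon before, here is a minimal sketch of how those numbers relate. It uses the standard steady-state formula, availability = MTBF / (MTBF + MTTR), and converts "N nines" into allowed downtime per year. The function names and the sample figures are my own, not taken from the book.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def availability(mtbf_hours, mttr_hours):
    """Fraction of time a system is up, given mean time between
    failures (MTBF) and mean time to repair (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(nines):
    """Allowed downtime per year for N nines (3 means 99.9%)."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    # Hypothetical box: fails every 1000 hours, takes 1 hour to fix.
    print(f"availability: {availability(1000, 1):.4%}")
    for n in range(2, 6):
        print(f"{n} nines: {downtime_minutes_per_year(n):.1f} minutes/year")
```

Three nines works out to roughly 525 minutes of downtime a year, and five nines to a little over five minutes, which goes a long way toward explaining why each extra nine costs so much more than the last.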
Assuming that you've gotten the green light, we begin working our way up the Availability Index, starting with (my favorite) good system administration practices such as change control and consistent system configuration. The next three chapters cover storage management issues, including backup and restore, volume management, and RAID. The final chapter on storage is a good introduction to SANs, NAS, and storage virtualization (especially helpful to me as a relative newbie to "enterprise storage" -- I didn't know you could do all that).
Having taken care of local storage, the next chapter takes on networking, discussing the different ways in which networks fail and options for building redundant networks. Redundant network connectivity leads naturally to a discussion of Data Centers and environmental issues such as racks, redundant power, and cooling.
The chapter on environmental issues ends by discussing something completely different: system-naming conventions. This happens to be one of my pet peeves, and I think the authors are dead-on. Too many people try to name their systems using some sort of code: if you're working in the Orange County office and you have three machines running AIX, please don't name them oc-aix01, oc-aix02, and oc-aix03. That kind of thing may work well for network equipment such as switches and routers (after all, what's important about a network device if not its location?) but it's a horrible idea for systems: it's hard to remember, and it's hard to communicate. So you're on the phone, in the middle of a noisy data center, and you're under pressure to get things back up and running immediately. Now which one were you supposed to reboot -- was it oc-aix02 or oc-aix03? I can't remember. Damn...
On the other hand, if all of your machines have "real" names (cartoon characters, say), are you really going to forget whether it was linus or snoopy that you were supposed to be working on? And if you were planning to tell me that you're encoding important information in the names (e.g., the OS they're running), I simply counter that you lack imagination: name the AIX boxes after cartoon characters, the Linux boxes after characters in Lord of the Rings (good Lord, people are using these for baby names?), and the Oracle servers after Navy ships.
Whew. Ok, back to the book. The next chapter discusses people and processes for availability, including maintenance plans and vendor management. The discussion reminds me of another of my favorite books, Limoncelli and Hogan's The Practice of System and Network Administration. Actually, come to think of it, so did the previous chapter. Do yourself a favor and get both.
The next couple of chapters cover issues with applications, including the special requirements of NFS servers, web servers, and database servers. The authors describe the different kinds of things that can go wrong with applications (memory leaks, network connectivity issues, buffer overflows, hung processes), as well as techniques for sharing state among multiple instances of an application and checkpointing in case of a failure.
Finally we reach the heart of the matter, at least as far as I'm concerned: there are four chapters devoted to clustering, failover, and replication. You won't learn everything you need to know (and in fact, if what you need to know are technical details of particular products, you won't learn anything), but it's a good introduction to the components of an HA system (virtual IP addresses, shared disks, heartbeats), the options for configuring clusters (active-passive, active-active, service groups, N-to-1 vs. N-plus-1), and issues such as fail-back and split-brain.
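To make heartbeats and split-brain a little more concrete, here is a rough sketch of the idea, and emphatically not how any particular product does it. Each node records when it last heard from its peers and suspects any peer that has been silent for longer than a timeout. The majority-quorum check is one common guard against split-brain; it is my choice for illustration, not something taken from the book. The class, function, and node names are all hypothetical.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer node."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock      # injectable so the logic can be tested
        self.last_seen = {}     # node name -> time of last heartbeat

    def record(self, node):
        """Call whenever a heartbeat arrives from a peer."""
        self.last_seen[node] = self.clock()

    def suspected_dead(self):
        """Peers that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


def has_quorum(reachable_nodes, cluster_size):
    """True if this partition (counting the local node) holds a strict
    majority. Only a majority partition should take over services, so
    two halves of a partitioned cluster cannot both go active."""
    return reachable_nodes > cluster_size // 2
```

Note that in a two-node cluster neither side can ever hold a strict majority on its own, so a setup like this needs some kind of tie-breaker, which is exactly the sort of design question these chapters get you thinking about.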
Next comes a short chapter on "Virtual Machines and Resource Management," which seems out of place. Perhaps its material should have been relocated to the chapter called "A Brief Look Ahead" on future trends such as iSCSI, InfiniBand, and grid and blade computing. The book was published in 2003, but much of this technology seems to still be in its infancy.
The flow picks back up with the final major chapter, on Disaster Recovery. This chapter is much less about technology than about planning and logistics. It's important, I don't deny it, but personally I was looking for a discussion of global clustering.
Finally, the Second Edition was written after the attacks on September 11, 2001, when businesses really started to think about not just HA, but Disaster Recovery and Business Continuity. This edition includes a chapter called "The Resilient Enterprise," describing how, despite losing their offices and even the trading floor itself on the morning of September 11, the New York Board of Trade was able to recover and be ready for business by 8pm that evening. Now that was a Disaster Recovery Plan.
This isn't a terribly technical book, but it's a good introduction. If you're just getting started, start here. If you need to configure a cluster, start elsewhere. The next book on my stack is Shared Data Clusters, by another engineer at VERITAS, and it appears to contain more technical details.