Five ways to bulletproof your infrastructure
There’s no such thing as 100 percent uptime, but following these tips can get you a lot closer to the goal.
I’ve been on a troubleshooting kick the past few weeks, both in my blog and in the real world. Simply put, I’ve been inundated with a plague of IT problems.
Well, these things happen. No technology or process in the world can eliminate all future outages, defective code, or random human foolishness, but you can hedge your bets. Sure, you can spend beaucoup bucks on a fully redundant infrastructure, but short of that budget-busting scenario, a few small steps can greatly simplify recovery from all sorts of problems.
Bulletproof your infrastructure tip No. 1: Keep cold spares of everything
Ideally, you’ve already standardized on network and server components. Sure, there may be a few odd parts here and there, but your closet switches should all be the same brand, if not the same model. Your servers are homogeneous, or at least homogeneous within their purpose (such as HP ProLiant DL360s for one major infrastructure component and Dell PowerEdge R415s for another). These servers aren’t that expensive, especially if they’re purchased in their minimum configuration. In a pinch, you can replace a failed server with the cold spare, moving the functional parts over to the spare in a jiffy. In some cases, you’ll even be able to simply swap the disks and have the new box up in no time.
For routers and switches, the same is true. With tools like RANCID to automatically download and archive switch and router configurations, you can dump the configuration of a failed router or switch to the cold spare and save the day. Firewalls work the same way. In many cases, you can even pull your cold spares from eBay auctions and get them cheap: You don’t care about support on these units, so you can forgo that expense and still cover your needs. Even if you’re running Cisco ASAs, you can probably find an end-of-life Cisco PIX with a similar configuration for a few hundred dollars that can at least bring critical services back up if you experience a failure.
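To make the "dump the configuration to the cold spare" step concrete, here is a minimal sketch of locating a device's archived configuration on the RANCID host. It assumes the archive is laid out as <base_dir>/<group>/configs/<device>; the function name, the group name, and the device name are hypothetical, so adjust them to match your own install.

```python
from pathlib import Path
from typing import Optional


def find_saved_config(base_dir: Path, device: str) -> Optional[Path]:
    """Return the archived config file for `device`, or None if not found.

    Assumption: configs live at <base_dir>/<group>/configs/<device>.
    """
    if not base_dir.is_dir():
        return None
    # Walk every group directory and look for a file named after the device.
    for group in sorted(base_dir.iterdir()):
        candidate = group / "configs" / device
        if candidate.is_file():
            return candidate
    return None


def load_config_text(path: Path) -> str:
    """Read the archived config so it can be pasted or pushed to the spare."""
    return path.read_text()
```

With the text in hand, you can paste it over the spare's console or push it with whatever method you normally use. The sketch only finds and reads the file; it does not talk to any device.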
Naturally, you don’t want to buy cold spares of big-ticket items like core switches, but if you do a little legwork, you can cover the rest without putting a major dent in your budget.
Bulletproof your infrastructure tip No. 2: Go wiki, baby
What was the serial number of that remote-office switch anyway? What version of IOS was that router running before the power supply blew? I find that the easiest way to keep this data where it can be found quickly is a wiki. Toss CentOS on a virtual machine, install MediaWiki, and start compiling data on your infrastructure. I paste the output of sh ver (short for show version) on a Cisco device straight to a wiki page as well as write up synopses of the switches’ functions and responsibilities; in the event that something goes awry, I can quickly dig up those ever-so-necessary bits of information that can turn a three-hour recovery into 30 minutes.
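If you would rather have the key facts at the top of the page than buried in raw output, a small script can pull them out before you paste. This is a rough sketch, not a complete parser: the sample text is made up for illustration (the hostname and serial number are placeholders), and the exact wording of show version output varies by platform and IOS release, so the patterns may need tuning.

```python
import re
from typing import Dict, Optional

# Illustrative sample only; real output is longer and varies by platform.
SAMPLE = """\
Cisco IOS Software, C2960 Software (C2960-LANBASEK9-M), Version 12.2(55)SE7, RELEASE SOFTWARE (fc1)
sw-remote-1 uptime is 41 weeks, 2 days, 3 hours, 12 minutes
System serial number            : FOC0000X0XX
"""


def summarize_show_version(text: str) -> Dict[str, Optional[str]]:
    """Extract hostname, IOS version, and serial number when present."""
    version = re.search(r"Version\s+([^,\s]+)", text)
    serial = re.search(r"[Ss]ystem serial number\s*:\s*(\S+)", text)
    hostname = re.search(r"^(\S+) uptime is", text, re.MULTILINE)
    return {
        "hostname": hostname.group(1) if hostname else None,
        "version": version.group(1) if version else None,
        "serial": serial.group(1) if serial else None,
    }
```

Any field the patterns cannot find comes back as None, which is a handy cue to fill it in by hand.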
I don’t go so far as to put passwords in wiki documents, but anything short of that is fair game: lists of serial console server ports and what they’re connected to, switchport assignments and VLAN blocks for DMZ and public switches, as well as each server, its brand, model, serial number, role, storage, and RAM configuration, and so forth. If it exists in your infrastructure, it should have an entry in the wiki.
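One way to keep the server entries consistent is to generate the wiki table from a simple record. The sketch below renders standard MediaWiki table markup; the Server class, its field names, and the example values are hypothetical and only mirror the attributes listed above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class Server:
    # Hypothetical record; extend with whatever your wiki pages track.
    name: str
    brand: str
    model: str
    serial: str
    role: str
    storage: str
    ram_gb: int


def to_wikitable(servers: List[Server]) -> str:
    """Render the records as a MediaWiki table, one row per server."""
    headers = [f.name for f in fields(Server)]
    lines = ['{| class="wikitable"', "! " + " !! ".join(headers)]
    for server in servers:
        lines.append("|-")  # row separator
        cells = (str(getattr(server, h)) for h in headers)
        lines.append("| " + " || ".join(cells))
    lines.append("|}")
    return "\n".join(lines)


example = Server("web01", "HP", "ProLiant DL360", "EXAMPLE123",
                 "web front end", "2 x 300 GB RAID 1", 16)
wiki_text = to_wikitable([example])
```

Paste wiki_text into a page and MediaWiki will draw the table. Note that passwords are deliberately not a field here.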
Starting this project from scratch is a real pain, but maintaining the information on an ongoing basis is easy. The next time you have an immediate need to know the serial number of a failed remote switch, you’ll have it right at your fingertips.
Bulletproof your infrastructure tip No. 3: Establish backup links wherever and whenever possible
If at all possible, there should be multiple paths to every data center and remote office. Back in the day, this was very expensive, but now you can probably get a business-class DSL or cable connection to most of your locations. For less than $100 a month in many cases, you have an alternate access method to that site for use in emergencies — or for sensitive remote configurations of the production routers and firewalls. It might even be feasible to split your traffic in those sites, pushing business traffic over leased lines and Internet browsing traffic over the DSL or cable circuit.
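The split-and-fail-over idea boils down to a small decision rule, sketched here in Python. In practice this logic lives in your router or firewall configuration rather than in a script; the link names and traffic classes below are hypothetical labels used only to show the rule.

```python
from typing import Dict

# Hypothetical labels: which circuit each kind of traffic prefers.
PREFERRED = {"business": "leased_line", "browsing": "broadband"}


def choose_link(traffic_class: str, link_up: Dict[str, bool]) -> str:
    """Pick the preferred circuit if it is up, otherwise any working one."""
    primary = PREFERRED[traffic_class]
    if link_up.get(primary, False):
        return primary
    # Preferred circuit is down: fall back to whichever other link works.
    for name, up in link_up.items():
        if up and name != primary:
            return name
    raise RuntimeError("no working path to this site")
```

For example, choose_link("business", {"leased_line": False, "broadband": True}) returns "broadband", which is the emergency case the cheap circuit exists for.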
If cost is the ultimate issue, you can take a page from the first item in this list and procure a used firewall from eBay for this circuit. Because it’s not production, you have less concern over the reliability of the device, so a used piece of gear is a good fit for a tight budget.
Bulletproof your infrastructure tip No. 4: Bet on a big box
This one really applies to virtualized infrastructures only. Say you have a virtualization farm of a dozen 1U servers running a few hundred virtual machines. If something goes wrong with the production system, you can probably get away with running some subset of those VMs to maintain critical line-of-business applications. If that’s the case, you don’t need to maintain a duplicate virtualization farm. Instead, you can invest in a single four-CPU server with a bunch of RAM that can take the production load for some length of time.
This server wouldn’t necessarily play in the farm itself (though it could), but would instead be installed and ready to handle a load if the situation calls for it. In some cases, you may even be able to game the virtualization vendor’s evaluation period to avoid paying for licenses on a dormant server, but your mileage may vary.
The size of this emergency server should correspond to your infrastructure needs and the number and weight of the virtual machines you expect it to run. Generally speaking, you can get an awful lot of emergency processing power in a virtualized environment for under $10,000. Is that too much for peace of mind?
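A quick back-of-the-envelope check can tell you whether a candidate box would hold your critical subset. The sketch below is only arithmetic: the VM fields, the 4:1 vCPU-to-core ratio, and the 90 percent RAM ceiling are assumptions for illustration, not vendor guidance, so substitute numbers you trust.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class VM:
    name: str
    vcpus: int
    ram_gb: int
    critical: bool


def fits_on_spare(vms: Iterable[VM], host_cores: int, host_ram_gb: int,
                  vcpus_per_core: float = 4.0,
                  ram_ceiling: float = 0.9) -> bool:
    """Return True if the critical VMs fit within the assumed limits."""
    critical = [vm for vm in vms if vm.critical]
    need_vcpus = sum(vm.vcpus for vm in critical)
    need_ram = sum(vm.ram_gb for vm in critical)
    # Assumed limits: CPU may be overcommitted, RAM may not.
    cpu_ok = need_vcpus <= host_cores * vcpus_per_core
    ram_ok = need_ram <= host_ram_gb * ram_ceiling
    return cpu_ok and ram_ok
```

RAM is usually the first limit you hit, which is why "a bunch of RAM" matters more than raw core count for this kind of standby server.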
Bulletproof your infrastructure tip No. 5: Learn Linux
Even if you’re a Windows shop, learning enough about Linux can open up a huge number of valuable, low-cost options. You may not feel comfortable running critical business applications on Linux for whatever reason, but the plethora of open source network and systems monitoring and maintenance tools available on Linux or Unix is incredible. There are Windows versions of many of these tools, but they are natively Unix-based.
I’ve been accused of being overbearing in my advocacy of full-scale monitoring and maintenance packages like Nagios, Zenoss, Cacti, RANCID, and so forth, but the truth is that these tools make an enormous difference in both day-to-day IT operation and in times of trouble. The benefit of learning Linux and running these tools is twofold: You gain Linux skills, and you enrich your network with a raft of supporting players that makes everyone’s life simpler.
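To give a feel for how approachable these tools are, here is a minimal sketch of a disk-space check written in the style of a Nagios plugin. It follows the standard plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the function names and the 80 and 90 percent thresholds are my own assumptions.

```python
import shutil
from typing import Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct: float, warn_pct: float, crit_pct: float) -> int:
    """Map a usage percentage onto a plugin return code."""
    if used_pct >= crit_pct:
        return CRITICAL
    if used_pct >= warn_pct:
        return WARNING
    return OK


def check_disk(path: str, warn_pct: float = 80.0,
               crit_pct: float = 90.0) -> Tuple[int, str]:
    """Return (code, message) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - filesystem reports zero size"
    used_pct = 100.0 * usage.used / usage.total
    code = classify(used_pct, warn_pct, crit_pct)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[code]
    return code, f"DISK {label} - {used_pct:.1f}% used on {path}"
```

A real plugin would print the message and exit with the code, which is all the monitoring server needs in order to raise an alert.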
It’s easier to preach about being proactive than to actually make these measures happen in the topsy-turvy, break-fix world of IT. But to paraphrase an old saying, if you’re too busy mopping the floor to turn off the faucet, you probably need to rethink your approach.
By Paul Venezia