But in our discussion we agreed that bad experience might be as valuable as good experience in some ways. It might not be as fun, and you can certainly hit diminishing returns on it. But the intense memories of those bad experiences are the basis for many of our most valuable instincts and judgements.
Well, at the time, it felt kinda nice to be able to help him put a happy face on his obnoxious job.
And then just this week I had the tables turned on me: a server I depended on failed when a battery swelled, warped, and took out the motherboard. Even better, the backups only appeared to be working - they really weren't. And that's not all: some code on that server wasn't yet in version control because of chaos on the team at the time - after existing in limbo for two years, it was slated to be added this week. We were down for days replacing that server. And I suppose we were lucky - it could have been worse.
All this got me thinking about my glib advice to my friend about the value of crappy experience. Was there a silver lining for me in all this? Did this week add to my mental catalogue of situations to avoid? Well, just as an exercise I decided to write down what the catalogue might look like. At least just the part that deals with backups & recoveries. Here it is:
- Monitor your backup logs: I've seen an operations team temporarily redirect backups to /dev/null during a change - and then forget to reset them after the change was complete. When the inevitable disk failure occurred there were no backups. If they had been reviewing backup logs, they would have noticed their backups completing in a few seconds rather than an hour.
- Test your backups: I've seen a recovery fail because the backup client was configured with the same 'node name' that was being used by another server. The logs reported that the backups were successful, but the backup service was actually dropping them. This could have been caught if separate backup reports had been reviewed, or if the backups had been periodically test-restored.
- Outsource tasks but not responsibility: I've seen a team with outsourced database administration discover that their data wasn't being backed up - after they lost it all during a critical storage failure. They should have been getting weekly status reports and auditing their environment.
- Don't confuse RAID with backups: I've seen data lost because the team thought backups weren't necessary since they had RAID storage. They didn't understand that RAID doesn't protect against human error, malicious intent, software defects, and some hardware errors.
- You can lose 2 drives in a RAID array in a single day: I've seen an array lose two drives in the same day - and others lose two in the same week. So don't wait to replace a bad drive. Definitely don't wait a week.
- Recoveries need to be done by the senior guy: A critical system recovery is the wrong time to test the knowledge of your junior staff - or to train them. I've seen a critical Fortune-50 business system with a high-availability SLA go down for 96 hours when its recovery was handled by junior staff. Oh yeah, and this is when you will be sweating bullets over whether or not outsourcing to the lowest-cost provider was such a great idea.
- Consider backups in system design: I've seen developers design and build very complex software only to fail to consider the concurrency, performance, capacity and costs of backups. I've actually seen this a lot. And they often discovered that their system couldn't work once they started backing up the data.
- Recoveries may not restore a consistent state: I've seen recoveries fail because the developers didn't realize their backup was not an instantaneous snapshot of their file system. When recovery restored the file system in an inconsistent state their application couldn't run.
- Lost in translation: I've seen recoveries fail because the backup instructions provided to a separate backup team had to be entered into a separate system that the developers couldn't see. And the backup team got it wrong.
- Be careful with shared backup servers: I've seen databases bog down in performance because they were using an under-sized shared backup system. I've seen these performance issues make a recovery take days because it would keep failing when it was 16 or 24 hours into the recovery. I've also seen databases refuse to accept writes because they ran out of log space - because the backup server was too slow to accept the logs.
- HA may be LA: I've seen HA systems fail because of human error. I've spoken with HA vendor experts who confided to me that their HA systems had lower availability than their non-HA systems - due to the human errors that come with their more complex administration.
- Application-level backups aren't as reliable as database backups: some DBAs will configure a database backup to export specific tables rather than rely on the database backup utility. There can be good reasons for this - like non-logged writes to some tables. However, I've seen these backups fail when additional tables were added to the database but not to the backup process.
- Prune backups by age AND versions: years ago I worked in a fifty-developer shop that lost ten years of source code when its version control storage failed - and only then discovered that its backups had been failing. They couldn't fall back to older backups because those had been pruned based on age. Now, whenever I implement backup pruning, I make sure it never touches the X most recent versions.
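A couple of these lessons lend themselves to small automated checks. For the /dev/null story, one crude guard is to compare each backup run's duration (or byte count) against a baseline of recent known-good runs and flag anything that finishes suspiciously fast. A minimal sketch - the function name and the 10% threshold are my own illustrative choices, not any particular backup product's API:

```python
def backup_run_looks_suspicious(recent_durations, latest_duration, min_ratio=0.1):
    """Flag a backup run whose duration is a tiny fraction of the recent norm.

    A full backup that normally takes an hour but 'completes' in seconds
    (say, because it was redirected to /dev/null) fails this check.
    recent_durations: durations in seconds of recent known-good runs.
    """
    if not recent_durations:
        return False  # no baseline yet - nothing to compare against
    baseline = sum(recent_durations) / len(recent_durations)
    return latest_duration < baseline * min_ratio

# An hour-long backup that suddenly finishes in 20 seconds gets flagged:
print(backup_run_looks_suspicious([3600, 3500, 3700], 20))    # True
print(backup_run_looks_suspicious([3600, 3500, 3700], 3400))  # False
```

The same idea works on backup sizes. Either way, the point is that "the job reported success" is not the same as "the data is there."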
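And the pruning lesson can be made concrete: prune by age, but always protect the most recent versions no matter how old they are, so that a silently failing backup job can't let age-based pruning erase your only good copies. A minimal sketch, assuming backups are simple records with a timestamp (the function and field names are mine, for illustration):

```python
from datetime import datetime, timedelta

def select_backups_to_prune(backups, max_age_days, keep_latest):
    """Return the backups eligible for deletion: older than max_age_days,
    but never any of the keep_latest most recent ones - even if they are
    past the age cutoff."""
    ordered = sorted(backups, key=lambda b: b["timestamp"], reverse=True)
    cutoff = datetime.now() - timedelta(days=max_age_days)
    # ordered[:keep_latest] is always protected, regardless of age
    return [b for b in ordered[keep_latest:] if b["timestamp"] < cutoff]

# Six monthly backups; prune anything over 45 days old, but keep the 3 newest.
now = datetime.now()
backups = [{"name": f"b{i}", "timestamp": now - timedelta(days=30 * i)}
           for i in range(6)]
doomed = select_backups_to_prune(backups, max_age_days=45, keep_latest=3)
print([b["name"] for b in doomed])  # ['b3', 'b4', 'b5']
```

Note that b2 (60 days old) is past the age cutoff but survives because it is among the three newest - exactly the property that would have saved that version control history.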
Well, hopefully I won't be adding to this list any time soon.