ExoticSilicon.com - Jay discusses backup strategies

There's a good chance that the machine you are currently using to read this webpage contains a lot of your own valuable work. Whether that's photos or videos or even just plain old boring work stuff, you'd probably be lost if suddenly it wasn't there.

Have you backed it up? If not, shame on you! But for those who smugly answered, "yes!", just how sure are you of your backup strategy? Have you ever needed to restore anything? Have you tested it? How quickly can you get back up and running after disaster inevitably strikes? Have you covered all bases? What about slow, silent data corruption over the long term?

Hmmm, see? It's not as simple and straightforward as it might seem. So let's dive right in and discuss the do's and don'ts of backup strategies.

The myth...

When thinking about data backup, many people have tended to fixate on the possibility of a crashed hard disk, and in modern times, a totally dead SSD. It's been the classic disaster scenario for decades, assuming that your office doesn't burn down overnight. You sit down in front of your desktop in the morning, and it won't boot. As you reach in to fiddle with SATA cables and clean connections, you realise that the disk isn't even spinning up.

Maybe you knew enough to try a couple of short, sharp, ninety degree twists in the plane of the platters, in case it was caused by stiction. But sooner or later, reality dawns, and it becomes clear that the disk will never spin again. It, along with your data, is gone forever.

So a couple of full back-ups at regular intervals should suffice, right?

Except that isn't how it usually happens - most likely you'll be calling on your backups for some other reason.

The reality...

Aside from the fact that when modern SSDs fail they often remain readable, i.e. they become read-only, your data is much more likely to be at risk from silent corruption over time, or from being overwritten due to operator error.

Silent corruption can happen for reasons ranging from bad SATA cables and buggy SSD firmware, to malware and more. Operator error might go genuinely unnoticed, or be covered up.

Both of these scenarios can be protected against with an adequate backup strategy, but the simple approach of a regular, full backup, (which also often goes untested), in many cases just won't suffice.

Pesky backup details:

The amount of work that is not backed up at any one time, i.e. data that has been created since the last backup.
How far back our backups go. A lot of data from current work-in-progress is constantly in flux and changing, but plenty more usually remains little used and unchanged over a long period of time once its immediate use is over. If this data is later found to be corrupted, including on recent backups, then a backup from a year ago might be perfectly useful. The question then becomes, will you have kept one?
Speed of recovery. Downtime at home is inconvenient. Downtime at work is lost productivity, and lost income. Just how much time can you afford to lose getting back up and running?
Security of backups. Confidential business information needs to be kept away from prying eyes, whether it's on a server's hard disk, or a tape in a safe deposit box. Without sufficient consideration, your backups might be an easy target for an attacker.

We need to address all of these criteria, otherwise our backups could end up being useless.

Why RAID won't save you, (and might just make things worse)

RAID is designed to protect against a very specific failure mode - a single dead hard disk. It should already be obvious by now that this is not the main scenario that concerns us. RAID won't help against data loss through operator error - the data will simply be dutifully overwritten on all of the disks in the array instead of just one. Likewise with the vast majority of data corruption cases. Although RAID can in theory be set up to verify data integrity by reading multiple copies across different disks in a mirror, and therefore potentially spot data corruption coming from an individual drive, in practice this is very rarely done.

In fact, we can achieve a similar level of integrity checking at the application level, anyway, without using RAID. Additionally, if a drive is spewing out bad data, surely we want to know about it and fix the fault, rather than have it mitigated by a RAID controller with a blinking light that is easily missed or ignored?

What is perhaps much less obvious, is that installing a RAID in place of a single disk actually introduces new failure modes, and always increases the overall possibility of a fault occurring.

As well as the very real possibility of bugs in the firmware of the RAID card, simple maths tells us that with two disks in a mirror, there is, nominally, twice as much risk of any one of the drives failing at any particular moment. Although we might hope that a failed disk would simply be detected as such, and that the system would continue running, there is usually no guarantee of exactly how a dead disk will behave. It's entirely possible for it to hang the RAID controller and crash the system, for example. If the fault is power-supply related, all sorts of bad things could happen - including the loss of every single drive in the system.
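To put rough numbers on that 'simple maths', here is a minimal Python sketch. It assumes that the disks fail independently of each other, and the 2% failure probability is a purely hypothetical figure chosen for illustration, not a measured one. The function name is our own invention.

```python
def prob_any_failure(p_single, n_disks):
    """Probability that at least one of n_disks independent disks fails,
    given that each one fails with probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_disks


# Hypothetical figure: a 2% chance that any given disk fails in some period.
p = 0.02
for n in (1, 2, 4):
    print(f"{n} disk(s): {prob_any_failure(p, n):.4f}")
```

With these assumed figures, a two-disk mirror has just under a 4% chance of seeing at least one drive fail in the same period, which is nominally double the single-disk risk. Real drives in the same enclosure share a power supply and environment, so the independence assumption is, if anything, optimistic.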

What you are really doing with RAID, is trading the risk of a single disk failing, (which used to be much more of an issue in the distant past), for higher complexity and more potential points of failure overall. There is no free lunch - every gain comes with a corresponding loss.

RAID mirrors and backups serve very different purposes. One does not substitute for the other.

An insurmountable task?

Let's review the first consideration in the list above:

The amount of work that is not backed up at any one time, i.e. data that has been created since the last backup.

In case it's not already clear, let's take an example. Imagine that we back up our entire server every Friday evening. If disaster strikes on Friday morning, then any data from the last six days won't be on the most recent backup.

Clearly, to minimise this concern, we want to make backups as frequently as possible. But this comes at a cost:

Operator time taken to actually perform the backup
Reduced server performance during the backup
The need to physically store more backups to cover the same time-period

Whilst we can reduce the burden somewhat by doing incremental backups, (i.e. only copying new data, or data that has changed), this adds extra administrative work, and complicates the restoration process. Again, everything is a trade-off.

Reducing the backup burden by archiving

What can reduce that burden is to identify data that is long-term non-changing, isolate it, and deal with it separately. In other words, take it right out of the regular backup loop. This is the concept of archiving.

Or to look at it another way, we stop treating our, (main), hard disk as an endless cesspit continuously filling up with more and more unsorted data. Instead, it's the home to work in progress. Once that work has reached its conclusion and won't change anymore, it's moved elsewhere for future quick access, backed up once, and then mostly just left alone.

In practice, this doesn't have to be a grandiose arrangement - 'moved elsewhere' might just mean a different partition on the same disk, and the archival backup can be to a regular tape or optical storage.

In some cases, where an old collection of data is unlikely to be needed at short notice - or even at all - and disk space is tight, the archival backup, (or preferably several), might be sufficient with no need to keep a copy on a local hard disk at all.

This concept of archiving unchanging data and taking it out of the regular backup loop also has two more advantages, one for data security and another for long-term data integrity:

Considering data security, the archived data can be encrypted with a different key from the one used for the main hard disks. This key will then only be required on occasions when the old data needs to be accessed.

With regard to long-term data integrity - if the data shouldn't be changing, then by archiving it at a fixed point in time and keeping that copy, rather than continually re-writing it to each successive backup, we afford ourselves a degree of protection against silent data corruption. This could be in the form of malware, (typically ransomware), which is secretly encrypting our files, or more mundanely, the result of deteriorating memory cells in an SSD, or even buggy SSD firmware corrupting data as it moves it around for wear-leveling purposes. Whatever the cause, our long-time-ago copy isn't vulnerable to having had its good data unknowingly overwritten with bad.

Tiered backups to the rescue

The upshot of all this is that to all practical intents and purposes we can't rely solely on a single backup schedule for all of our data.

The solution is tiered backups. Copies of the most important and most changeable data can be kept to hand on local storage for quick access in case of a fat-fingered deletion, whilst our more comprehensive backup is there to guard against possible tragic occurrences such as fire or physical theft.

Speed of restoration is often overlooked as a factor, but its importance shouldn't be underestimated. If you accidentally break a configuration file or lose an important email that's just arrived, you'll probably want to be back up and running as quickly as possible. That's going to be much easier from a local disk than it would be restoring from a tape or other off-line backup.

A typical comprehensive backup strategy:

Tier	Category of data	Backup media	Location	Speed of restoration	Protects against	Considerations
Tier 1	Finished projects; older archive material; old log files; previous years' financial documents; old company reports	One or more of: optical WORM; tape; HDD off-line in cold storage	Preferably off-site. On-site only if off-site is not an option, or as additional copies.	Slow if off-site; intermediate if on-site	Complete melt-down: fire; flood; successful large-scale network intrusion (*); large-scale damage-causing electrical system failure; silent data corruption due to malware or bad hardware	Beware of format obsolescence. Non-archival copies of the same data should be kept locally for quick access.
Tier 2	All current data. Everything that is not archived in the category above.	One or more of: external HDD; tape; optical WORM or re-writable	Preferably off-site and on-site copies	Intermediate	General system failures: dead hard disks and SSDs; operator error; silent data corruption that is caught early; smaller-scale, less sophisticated network intrusion (*)	These are the traditional 'once-a-week' full backups, although obviously the interval can be anything you like.
Tier 3	Current work in progress. Excludes system files and anything in the next category below.	One of: flash drive; external HDD; optical re-writable	On-site. If security policy permits, taken home.	Fast	Individual workstation issues: failure of a single personal workstation; operator error (deletion or overwriting)	This is basically a personal 'my home directory at the end of the day' backup. Much faster and easier to restore than a 'weekly', but may not provide as much protection against unnoticed operator error or silent data corruption.
Tier 4	Fast-changing or new and irreplaceable data: mailspools; phone call recordings; documents about to undergo a large change; collections of data about to be re-organised; data from sensors or data collection equipment	One of: local hard disk; local network server	On-site, local	Very fast	Individual workstation issues: operator error; sudden HDD or SSD failure; power outage	This data is hard to protect in any other way due to its changing nature. Backup should be made quick and easy to perform - a single command to be run as often as seen fit. The system should keep the last three or four such backups and then overwrite the oldest automatically. Backup can be made to a different partition on the same disk, if there is no other option. RAID also offers some protection against sudden disk failure.
Tier 5	System data: configuration files; important log files	One or more of: flash drive; HDD or SSD on another local machine	On-site, local	Very fast	Individual server failures: operator error; sudden HDD or SSD failure	Copies should be made whenever configuration files are changed, and whenever log files are deemed to have a particular importance. Recommended to keep old versions of configuration files until they have been copied to the regular 'weekly' backups.

(*) Note that in the case of a successful network intrusion, ("hack"), the availability of backups only protects against data loss. Exposure of confidential company data to third parties is a separate concern which is outside the scope of this article.

Where's the cloud?

You'll have noticed that we haven't mentioned 'the cloud' as a backup medium at any point so far. This is for very good reason.

Whilst various internet service providers would like to convince you otherwise, there is nothing magical about storing data 'in the cloud'. All you are doing is uploading it to somebody else's computer. As such, cloud storage can indeed serve for any of the backup categories described above, and could - in theory - be your only backup medium. But is that what you really want?

Firstly, you completely lose physical control over how and where your backups are stored. You place yourself at the mercy of whoever is keeping your data, if and when you ever need to get it back again. Everything will be done on their schedule, and at their agreed price.

Secondly, do you trust your chosen provider to follow appropriate security policies, and keep your data away from third parties?

Thirdly, can you be absolutely sure that a copy of your data will really be available when you need it? Other people have system failures too, and if it's not their data at risk, will they even care?

In any case, cloud backup is not a category in and of itself. It's merely one tool at our disposal, and not necessarily the best one. And if you're in the unfortunate situation of not only considering placing your backups in the cloud, but also having the main and only copy of your data stored on somebody else's disks, then you should be concerned. Very concerned.

Identifying the material

As we've already discussed above, reducing the total volume of data in our regular backup pool is key to making the whole task as efficient as possible. But just how do we identify which data is a good candidate for archiving?

Depending on your particular business workflow, this can range from trivial to moderately time consuming. If you're lucky enough to generally work on a single project until its completion and then move on to the next, then you essentially already have your answer. Once the project has been handed to the client, or reached its 'time of transmission', then it should be fixed in stone at that moment. If an updated version ever needs to be produced, either to add new material or correct errors, the process to do that would be to create a new project using the files from the original 'finished version' as a base. This in turn would eventually be archived once no further changes were made to it for an extended period of time.

On the other hand, if like most organisations you have multiple projects on the go at any one time then some diligence is required to remember to archive each one as it reaches its completion.

Of course, in reality it might be desirable to wait for a set period of time after the apparent closure of one workflow before finally committing it to a permanent tier 1 backup, as in the real world unexpected issues and requests for changes do occur.

An alternative method is to simply do a search once every year for everything that hasn't changed in the previous 18 months. For example, you might do this during the first week of January, in which case your Tier 1 backups can be conveniently indexed by year.
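As a rough illustration of such a search, the following Python sketch walks a directory tree and lists regular files whose modification time is older than a cut-off. The function name, the 30.44-day average month, and the reliance on modification time alone are all assumptions on our part. A file that was copied or restored recently may carry a misleading timestamp, so treat the output as a list of candidates to review rather than a definitive answer.

```python
import os
import stat
import time

SECONDS_PER_MONTH = 30.44 * 24 * 3600  # average month length; an approximation


def find_unchanged(root, months=18, now=None):
    """Yield paths of regular files under root whose modification time
    is more than `months` months before `now` (default: current time)."""
    now = time.time() if now is None else now
    cutoff = now - months * SECONDS_PER_MONTH
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode) and info.st_mtime < cutoff:
                yield path
```

Calling `sorted(find_unchanged("/home/projects"))`, where the path is just a placeholder, would give a candidate list to check by hand before anything is moved to the archive.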

Performing the backup

Essentially, this process is a fairly straightforward one-off job for each dataset. By definition, we're not, (or no longer), backing up this data as part of a regular schedule.

Considering optical?

If you haven't already read Crystal's article about using optical storage on OpenBSD, now might be a good time.

There you can learn a simple technique for doing encrypted backups to blu-ray disc, with no additional software required over the base system.

What does require some thought and planning, though, is organising the data before committing it to its, (hopefully), final resting place. Although just copying the files and nothing else should in theory be sufficient, there are some extra steps well worth considering:

We strongly suggest that you checksum each file and record this index of checksums alongside the files it references on the same media. This should ensure that any errors that occur when reading the data back in the future don't go unnoticed. Alternatively, you could digitally sign the files. Don't assume that any possible read errors will be caught at the drive firmware level. We have seen bad data read back as good from various media more times than we care to remember.
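One possible way to build and later check such an index is sketched below in Python, using SHA-256 from the standard library. The manifest file name, the function names, and the line layout of hash, two spaces, then relative path are our own assumptions rather than a prescribed format, and the sketch does not cope with file names that contain newlines.

```python
import hashlib
import os

MANIFEST = "SHA256SUMS"  # hypothetical name for the index file


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def iter_files(root):
    """Yield paths relative to root for every file except the manifest."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk in a stable, repeatable order
        for name in sorted(filenames):
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            if rel != MANIFEST:
                yield rel


def write_manifest(root):
    """Record a checksum for every file under root in root/MANIFEST."""
    out_path = os.path.join(root, MANIFEST)
    with open(out_path, "w", encoding="utf-8", errors="surrogateescape") as out:
        for rel in iter_files(root):
            out.write(f"{sha256_of(os.path.join(root, rel))}  {rel}\n")


def verify_manifest(root):
    """Return the relative paths that are missing or fail their checksum."""
    bad = []
    in_path = os.path.join(root, MANIFEST)
    with open(in_path, encoding="utf-8", errors="surrogateescape") as fh:
        for line in fh:
            expected, rel = line.rstrip("\n").split("  ", 1)
            path = os.path.join(root, rel)
            if not os.path.isfile(path) or sha256_of(path) != expected:
                bad.append(rel)
    return bad
```

The idea would be to run `write_manifest()` on the staging directory just before writing the media, and `verify_manifest()` on the mounted disc or restored tape whenever it is read back. An empty list means every file matched its recorded checksum.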

Additionally, place a text file containing notes explaining the contents of each disc, including who prepared it, when, and on what equipment. Recording this information along with the data itself ensures that it's always available to anyone looking at it at a later date.

As this data is hopefully being written once, for access at any time into the distant future, we need to consider file format obsolescence. Basically this means avoiding the scenario where the data can be read back perfectly without errors, but is useless as there is no current software capable of interpreting its contents. Wherever possible, use open and well documented file formats, and for less common formats, include the format specification or source code for software capable of reading it. Whilst the source code may no longer compile automatically on a later system, somebody, (such as Exotic Silicon's commercial division!), will almost certainly be able to make it work or re-implement it.

You might also consider using different brands of media for multiple copies of the same data. New products are continuously entering the market, and old products can have their formulations changed and updated, (but not necessarily improved). Certain batches of media might turn out to be sub-standard.

The only thing you can be sure of is that unless you've already been testing the archival qualities of a particular brand, or have some trustworthy data about them, you'll likely only find out how good your choice was in many years to come when you need to read your data back. At that point, it's probably too late.

Long term media storage

Hopefully you'll rarely - if ever - need to access the archived copies of tier one data. As mentioned above, unless the data is truly historic and unlikely to ever be needed for reference, it's recommended to keep a copy of it on a separate local hard disk or on a local server for easy reference. Obviously this disk doesn't ever need to be backed up, as its contents can be replaced from the tier one archives.

However, this doesn't mean we should just dump our neatly labelled set of optical discs in a damp and dusty basement and never look at them again. If you did ever need them, it would be rather disappointing to say the least if they turned out to be unreadable.

The first step is to make sure that the set of media, be it discs or tapes, is well labelled. This not only helps to ensure that you can find any particular dataset quickly, but also reduces handling and potential damage of the media. Catalogue each individual tape or disc with a unique serial number, and note its location. This is especially important if you are keeping backups off-site.

The final storage location for your archives should be appropriate for the type of media you're using. Where possible, aim to keep temperature and humidity constant and within the specifications stated by the media manufacturer. Magnetic disks and tapes should obviously be kept away from magnetic fields. Optical media should be kept away from exposure to direct sunlight.

Infrequent but regular maintenance might also be required depending on the nature of the media, as we will discuss in the next section.

Regular maintenance

Check your archived data is still readable every few years! At the same time, perform any regular maintenance that is appropriate for the type of media:

Optical discs	Usually optical media doesn't require any special attention, but be careful not to write with regular marker pens on the data surface of low-quality media that lacks a protective plastic layer. If in doubt, write just a serial number in the hub area and put any further details on a paper insert.
Tapes	Tape should be re-tensioned from time to time, perhaps bi-annually. Traditionally, this was done to minimise print-through of the magnetic recording from one layer to adjacent layers and also to avoid mechanical stress on any one part of the tape. These factors are much less of a concern with modern tape formats, however we still recommend re-tensioning any such media periodically as the tape itself can deteriorate and shed chemical binders, (basically glue), to adjacent tape layers. Re-tensioning won't prevent this, but might help to reduce its effect on any one part of the tape. Older cartridge formats may contain plastic guides, pinch rollers and other mechanical elements that can become stuck to the tape or cause it to deform. Always allow tape-based media to acclimatise to the temperature and humidity of the environment before trying to read it.
Magnetic disks	Hard disks in cold storage should be powered up and allowed to spin at infrequent intervals to avoid problems with static friction, (stiction). As with tapes, if you're literally moving them from a low temperature environment into a warm office, allow them to acclimatise for several hours before applying power. It's usually enough just to connect such disks to a power supply with no need for a data connection, unless you actually want to verify the content, (which can possibly be done less frequently). Before concluding that any disks which don't spin up are faulty, check to see if they have been configured to power up in standby, and therefore require a start unit command to be sent from the host.

If your routine inspections reveal signs of deteriorating media, format obsolescence or storage shortcomings, then address these issues without delay.

Performing the backup

How you actually go about physically doing the backup will depend to a large extent on your work environment. If you're a small business with half a dozen or so workstations in a single office that store their data locally, it might be practical to walk around with an external hard disk. Otherwise you'll probably be backing up from a central on-site LAN server.

Far less preparation of the data is required here than it was at tier one. This is because these backups will mostly be obsolete within a few months, so the considerations about file format obsolescence don't apply, and nor is it as important to keep detailed records of the contents and who prepared them beyond a simple time and date stamp. We still recommend checksumming all of the files, though, as reading bad data as good is still a risk here.

Backup media rotation

The chances are that you will be using re-writable media for tier two backups. This helps to reduce costs as well as avoid wastage and the bureaucracy of disposal of WORM media. But at the same time it creates two questions:

How many times should we re-use the same media?
How do we rotate the media we're using after each backup?

It goes without saying that any media, be it tape or disc, even solid state, has a finite number of possible re-write cycles. External hard disks probably fare the best here, although beware of the ridiculously short life span of micro-USB connectors. Putting a regular SATA drive in a USB enclosure yourself, rather than buying a ready-made external HDD, mitigates this problem, as the enclosure can always be swapped for another one or the drive connected directly internally.

In general, we wouldn't advise intentionally using re-writable optical discs for more than ten to fifteen passes, possibly fewer. Although the media may be rated for many more cycles than this, in reality even careful handling is likely to cause dust and scratches to accumulate on the recording surface. By planning to retire discs from the backup cycle at this point, if you do need to continue using them unexpectedly because the new batch was faulty or never arrived, then they should still have some life left in them.

For tape, despite manufacturers' claims of re-usability, we would be reluctant to trust any particular piece of media beyond twenty or so passes. In reality, they probably will last much longer, but at the same time they are probably not being written to by a tape drive that is in perfect alignment, with perfectly clean and demagnetised heads. Nor will they be stored in ideal conditions and free from mis-handling. In short, there are a lot of environmental factors just waiting to reduce that claimed longevity by an order of magnitude, so considering the current price of media, why take chances? Just like with optical discs, if we plan to retire tapes whilst they are still within their usable lifespan we have some time in reserve if new stock fails to materialise.

In any case, be sure to mark the number of times each piece of media has been written to as you go along, and obviously discard media that shows physical signs of damage or excess wear, or that produces errors upon verification.

Speaking of verification, as always, be sure to verify these backups after writing them.

Regarding media rotation, the simplest strategy is a plain round robin approach. So you use tape A the first week, tape B the second week, tape C the third week, and then back to tape A for the fourth week.

This works, and in some cases where it's deemed sufficient it might be the best approach due to its simplicity. However, other more comprehensive options do exist.

One possible alternative using the same three tapes would be to use just two of them alternately for all of the regularly scheduled backups, except the first backup of every other month which is always done to the C tape. This has the advantage of keeping older backups to hand - potentially useful protection against silent data corruption - at the cost of a potentially larger interval if the A and B tapes ever both fail.

As an extension to this scheme, the C tape can be retired after six passes - one year of use - and kept indefinitely. This preserves a snapshot of your work in progress that might be useful if you ever need to look back over a longer timescale than anticipated at data that has not yet been committed to a tier one archival backup.
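The A/B/C scheme described above can be sketched in a few lines of Python. This is only one reading of it: we assume that 'every other month' means the odd-numbered months, (January, March and so on), and the function name is hypothetical.

```python
from datetime import date, timedelta


def assign_tapes(backup_dates):
    """Map each backup date to tape 'A', 'B' or 'C'.

    The first backup in each odd-numbered month goes to tape C
    (an assumption; any fixed choice of alternate months would do).
    All other backups alternate between tapes A and B.
    """
    schedule = []
    seen_months = set()
    regular_count = 0  # number of A/B backups made so far
    for day in sorted(backup_dates):
        month_key = (day.year, day.month)
        first_in_month = month_key not in seen_months
        seen_months.add(month_key)
        if first_in_month and day.month % 2 == 1:
            tape = "C"
        else:
            tape = "AB"[regular_count % 2]
            regular_count += 1
        schedule.append((day, tape))
    return schedule


# Example: weekly Friday backups, starting from Friday 5 January 2024.
fridays = [date(2024, 1, 5) + timedelta(weeks=i) for i in range(8)]
for day, tape in assign_tapes(fridays):
    print(day.isoformat(), tape)
```

Over a full year of weekly backups this puts six passes on the C tape, which matches the one-year retirement point suggested above.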

If you're using optical disc rather than tape, you can simply make each sixth monthly disc a WORM disc, and keep that independently of the regular C disc that is in rotation.

Check your restoration procedure actually works

It might seem like a stupid question, but do you actually know how to restore the data from one of your backups if you ever need it?

Almost certainly, you'll run the backup procedure an order of magnitude more often than the restoration procedure. That means the restore process won't be as familiar to you. Mistakes can happen - you don't want to accidentally overwrite the exact tape that you wanted to restore from!

Operator error aside, it's fair to say that if you verified your last backup when you made it then your data almost certainly exists there. So this question really becomes, how quickly can we get back up and running again?

In business, time is money. Having your system down for an hour is an inconvenience. Having it down for a day could be more serious.

At the very least, consider doing a partial test restoration of your data every time you change your backup procedure, or upgrade the hardware or software.

To back up the entire OS, or just your own data?

Here at Exotic Silicon, we almost always suggest that you concentrate on backing up your own data and any essential configuration files, whilst leaving the operating system to be installed afresh.

Why? Backing up a running OS is awkward enough, but the real difficulty tends to come with the restoration. You'll usually need to install a minimal system anyway to even access your backups, at which point you've then got to overwrite parts of that running system whilst performing the restore. Any difference in versions between the new minimal installation and what you are restoring could lead to problems, but might be inevitable if you're restoring to different hardware - for example if a power surge took out not only your main system disk but also the motherboard.

Installing a BSD system from scratch is usually very quick and easy. The time consuming part is the configuration, but as long as you have all of your old configuration files to hand, re-installing any software packages and merging the old configuration with the defaults for whatever new version of the system you've installed should be fairly straightforward.

Obviously it pays to make notes about any non-standard or obscure local modifications precisely for such future reference.

Additionally, whilst restoring a complete and working set of system binaries from backup might be fine in the case of a disk failure, for example, it's far less desirable when restoring from backup due to malware or network intrusion. In that case, it's necessary to be absolutely sure that either the backup pre-dates the attack, or alternatively to verify every single binary against a known good source, to avoid re-introducing unauthorised software to the system.

Disk image backups

What? Nobody has used whole-disk image backups for two decades!

True. For younger readers who were not around in the 1980s, a quick explanation might be in order. In that era, when backing up a hard disk we had a choice between making an image backup where the entire disk was copied sector by sector with no consideration of the underlying filesystem, or a file by file backup, which copied each file in the traditional way. Image backup has long fallen out of fashion and is rarely used nowadays. This is in most cases a good thing, because although the backup procedure was fairly simple and fast, it was inefficient, (copying even unused areas of the disk), and anything but a full restoration to the exact same make and model of disk was usually something of a challenge.

So why are we even talking about image backups, then? Because BSD systems do make this kind of backup particularly easy. You can boot into single user mode with all of the filesystems mounted read-only, (or some even unmounted), and copy the entire disk to tape or another raw device. Restoration is also trivial, as you can boot into a very minimal installation from a USB flash drive or CD, and restore with no additional software required.

There may be cases where this is a quick and easy way of backing up an embedded system. For example, the single board computers we used in our SBC bootcamp article have a removable eMMC chip as their main local storage. That can easily be removed, placed in a USB reader, and imaged to a file on a desktop machine in a matter of minutes.
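For illustration, imaging a device to a file amounts to little more than a block-by-block copy. The Python sketch below does this and computes a SHA-256 checksum of the image along the way. The function name is our own, and any device path you pass in, (for example a raw disk device node), is an assumption that depends entirely on your system. Reading a raw device normally requires root privileges, and the device should not be mounted read-write whilst it is being copied.

```python
import hashlib


def image_device(source, destination, block_size=1 << 20):
    """Copy source to destination block by block.

    Returns (bytes_copied, sha256_hex_digest). The source can be a
    device node or an ordinary file. block_size is kept at a multiple
    of common sector sizes, which raw devices generally require.
    """
    digest = hashlib.sha256()
    total = 0
    # buffering=0 gives unbuffered reads straight from the device
    with open(source, "rb", buffering=0) as src, open(destination, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:  # empty read signals end of device or file
                break
            dst.write(block)
            digest.update(block)
            total += len(block)
    return total, digest.hexdigest()
```

Recording the returned checksum alongside the image file makes it possible to confirm later that the image hasn't changed since it was taken.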

Scope of this tier

Unlike the preceding tiers for which the system administrator is responsible, backups at this level become the responsibility of the individual workstation user.

Tier three is essentially a casual copy of the user's home directory at the end of the working day. In many cases this can be achieved with a simple shell script copying to a USB flash drive.

These backups are mainly intended to allow for quick recovery of a fairly recent version of any files that are accidentally deleted or overwritten. However, in the event of a more serious system failure requiring the restoration of tier two or even tier one backups, the tier three backups may provide a newer version of files that were recently modified.

Additionally, in the event of a serious disaster such as fire or flood where your normal IT facilities cannot be immediately restored, tier three backups may allow users to do at least some useful work remotely from personal laptops.
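The end-of-day copy described above could be as simple as the following Python sketch, which is an equivalent of the shell script mentioned rather than a script we actually use. The function name backup_home, the archive naming scheme, and the mount point in the usage note are all assumptions made for illustration.

```python
import tarfile
import time
from pathlib import Path

def backup_home(home_dir, dest_dir):
    """Archive home_dir into a timestamped .tar.gz inside dest_dir.

    Returns the path of the archive that was written. dest_dir is
    assumed to be outside home_dir, for example a mounted flash drive.
    """
    home = Path(home_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = Path(dest_dir) / f"home-{home.name}-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the home directory name
        tar.add(str(home), arcname=home.name)
    return archive
```

Something like backup_home("/home/jay", "/mnt/usb") at the end of the day would then leave one dated archive per run on the flash drive, with both paths adjusted to suit the machine in question.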

Scope of this tier

This is really intended as a convenience to the system administrator. It's effectively the equivalent of tiers three and four, but for important data specific to any servers they are responsible for.

Server configuration files should be backed up on the regular tier two backups anyway, but since they are usually very small, it's trivial to keep copies of them on a flash drive or another local machine. The backup could even be made to an archive in the administrator's personal home directory.

Having such a readily available copy to hand avoids the delay and inconvenience of fetching it from the tier two backups if a server needs to be re-installed, or if a configuration change is found to break something a few hours or days after it was made.

A shell script that archives all of the configuration files from all of the local servers into a single timestamped tar file in the administrator's home directory, and automatically deletes the oldest copy when they exceed a certain number, is a reasonable way to implement this tier.
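A minimal Python sketch of such a script follows. It assumes that the configuration files have already been made reachable as local paths (for instance by copying them from each server beforehand), and the function name archive_configs, the default of ten retained copies, and the file naming scheme are our own choices for this example.

```python
import tarfile
import time
from pathlib import Path

def archive_configs(config_paths, archive_dir, keep=10, prefix="configs"):
    """Write config_paths to a timestamped tar file and prune old archives.

    keep is the number of archives to retain (assumed to be at least 1).
    Returns the path of the newly written archive.
    """
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = archive_dir / f"{prefix}-{stamp}.tar"
    with tarfile.open(str(archive), "w") as tar:
        for path in config_paths:
            tar.add(str(path))
    # The timestamp format sorts chronologically, so a plain sort puts the oldest first
    existing = sorted(archive_dir.glob(f"{prefix}-*.tar"))
    for old in existing[:-keep]:
        old.unlink()
    return archive
```

Because the timestamp only has a resolution of one second, two runs within the same second would overwrite each other, which is unlikely to matter for a job that runs once or twice a day.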

At the end of the day, only you can answer this question, based on the value you put on your company's digital assets.

The comprehensive backup strategy we've outlined above is very similar to what we use internally at Exotic Silicon, and our archives still have plenty of data that we've preserved since the 1980s and 1990s. We've been keeping data safe since long before cloud storage was a thing, and along the way we've dealt with many a faulty tape drive, broken pins on IDC connectors, fried disk controllers, and deteriorating optical discs, as well as other challenges. Yet our old data is still intact, because we make a point of adopting a belt-and-braces attitude towards its safe storage.

If you simply don't have the resources to do all of this, it's possible to get reasonable protection by just doing what we described above as the tier two backups. This should be considered as the minimum necessary to be effective.

Take note!

Expensive hardware not required!

You absolutely don't need special 'encrypted' USB flash drives, or a high-end tape streamer to keep your data safe.

A regular Blu-ray disc recorder available for under $100 or 100 euros will allow you to copy almost 50 GB of data to inexpensive but reliable write-once media in about an hour.

Whilst very cheap, low performance flash drives are best avoided for the tier three through five backups, moderately priced units are widely available that can sustain 20 MB/second when writing. Any encryption required can be trivially done on the host, so an ordinary flash drive is all that is required.

Of course, more advanced hardware makes a complex backup schedule faster and more convenient. However, in modern times, cost really shouldn't be a barrier to entry for any small business, or even a serious home user.
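To put the figures quoted above into perspective, the short Python sketch below works through the arithmetic. It assumes that the quantities are 50 gigabytes and 20 megabytes per second, using decimal units (1 GB = 1000 MB), and the two function names are simply labels for this example.

```python
def implied_rate_mb_per_s(size_gb, minutes):
    """Average write rate in MB/second needed to move size_gb in the given time."""
    return size_gb * 1000 / (minutes * 60)

def transfer_minutes(size_gb, rate_mb_per_s):
    """Minutes needed to write size_gb at a sustained rate_mb_per_s."""
    return size_gb * 1000 / rate_mb_per_s / 60

# Assumed figures taken from the text: 50 GB, one hour, 20 MB/second
disc_rate = implied_rate_mb_per_s(50, 60)
flash_time = transfer_minutes(50, 20)
print(f"50 GB in an hour needs about {disc_rate:.1f} MB/second")
print(f"50 GB at 20 MB/second takes about {flash_time:.0f} minutes")
```

On those assumptions, filling a disc in an hour works out at roughly 14 MB/second, and the same volume of data written to a flash drive at a sustained 20 MB/second would take a little under 42 minutes.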

Restoration procedures

We've spoken in detail about performing backups, and how the procedures described in this article should offer strong protection against many types of data loss. But what exactly do we do if the worst happens? And how does this complicated multi-tier schedule actually protect us from the dreaded encrypting malware or a subtly faulty disk controller that's been corrupting our data over a long period of time?

Much depends on the exact situation that has caused you to need to revert to your backups in the first place. If you are recovering from a sudden data loss due to hardware failure, such as one or more disks that have just died, your considerations will be somewhat different compared to being in a scenario of having just discovered long term data corruption. In the former, you can usually expect all of your backups to contain valid data and will be looking to restore the most recent copies. In the latter, you'll need to identify which files need to be pulled from older backups, made before those files were damaged.

In any case, expect to be busy - the system administrator's work begins whilst the smoke is still rising. Take your time, and don't rush. Mistakes here will be costly, and the time required to physically read back the data is likely to far outweigh any interactive operator time you could save.

Before doing anything with the existing system, check who is responsible for assessing the damage, and ensure that all policy requirements regarding investigation of the incident and possible insurance claims have been met.

Restoration after disk failure, or significant loss and damage to hardware

This section is only intended as a broad outline of the steps you would need to take.

If only a single disk has failed:

Identify exactly what data has been lost and determine whether copies exist on tier 1 backups, tier 2 backups, or a mixture of both.
Retrieve the most recent copies of your tier two backups, and tier one backups as needed.
If there is any doubt about the equipment being used to read the backups, test it first with blank or expendable media.
If the failed disk was a system or boot disk, first re-install the system software. You should have up to the minute copies of all necessary configuration files to hand on your tier 5 backups, but if not they should also be replicated as tier 2 copies.

If there is widespread hardware damage requiring replacement:

Secure access to your (hopefully off-site) tier one backups, and locate the most recent copies of your tier two backups.
Source equipment to read them. This shouldn't be difficult unless you've been using a format that has long passed into obsolescence.
Test the above equipment before loading backup tapes or connecting hard disks containing your data!
Once you have access to all of the necessary replacement equipment, begin by re-installing the system software. You should have up to the minute copies of all necessary configuration files to hand on your tier 5 backups, but if not they should also be replicated as tier 2 copies.
Note: If you cannot start with this step due to a delay in receiving all of the necessary replacement equipment, check whether you can at least make use of the otherwise wasted time to copy your tier 1 backups from their backup media which is likely slow to access, to new hard disks or SSDs ready for installation in a new server. Otherwise, leave this step for later.

In both cases, continue with:

Restore user data from the most recent tier two backups.
Supplement this data with any more recent versions that are available on individual tier 3 backups.
Test the system.
At this point, most normal operations that don't depend on tier 1 archived data should be able to begin.
Restore tier 1 data if not already done.
Back up the newly installed system to a new set of tier 2 media. This is especially important if any configuration changes or upgrades were done at the same time as the restoration.
Retain any tier 2 backups that you restored from for a period of time (such as six months), in case you later discover that errors were made during the restoration process, or the system fails again soon after the restoration.

The exact procedure followed will always, of course, depend on the individual situation.
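The advice above to test unfamiliar equipment with blank or expendable media before trusting it with real backups can be sketched as a simple write and read-back check. This is only one possible approach, written in Python under our own assumptions: the function names are invented for this example, and the test file is assumed to live on the expendable medium mounted as an ordinary filesystem.

```python
import hashlib
import os

def write_test_pattern(path, size_bytes=16 * 1024 * 1024, chunk=1024 * 1024):
    """Write size_bytes of random data to path and return its SHA-256 digest."""
    digest = hashlib.sha256()
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            block = os.urandom(min(chunk, remaining))
            f.write(block)
            digest.update(block)
            remaining -= len(block)
        f.flush()
        os.fsync(f.fileno())   # ask the OS to push the data out to the device
    return digest.hexdigest()

def verify_test_pattern(path, expected_digest, chunk=1024 * 1024):
    """Read path back and report whether it still matches expected_digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest() == expected_digest
```

The two steps are deliberately separate. Unmounting and re-inserting the medium between writing and verifying makes it far more likely that the data is really read back through the drive, rather than being served from the operating system's cache.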

Restoration after discovering long-term data corruption:

This section is only intended as a broad outline of the steps you would need to take.

This scenario requires a somewhat different approach, as we can't automatically guarantee the integrity of the backed up data. Naturally, if the data was already bad before being backed up then no error will be thrown during the restoration - the data coming off the backup medium is intact as far as the drive and software are concerned; it just doesn't contain what you thought it would.

If the corruption was caused by hardware failure:

Ensure that the faulty hardware has been physically replaced, and the new hardware tested.
If system or boot disks have been affected, then consider re-installing the system software.
Although it might be possible to fix a damaged installation, an incomplete repair carries the risk of problems surfacing in the future that might be difficult to resolve. You should have up to the minute copies of all necessary configuration files to hand on your tier 5 backups, but check that these are not corrupted and if they are then retrieve the same from tier 2 copies, or re-create them.

If the corruption was caused by malware or network intrusion:

Ensure that the source of corruption has been identified and measures put in place to prevent a re-occurrence.
Re-install the system software from scratch.
This should be considered a mandatory step. Never try to 'clean-up' such an infected system, as it is extremely difficult to guarantee that you have removed all traces of unauthorised software.
Do not restore any binaries from your backups unless you have checked them against known good checksums or signatures.
Any source code that is restored from your backups should also be checked against known good checksums or signatures, or otherwise verified that it has not been modified.
Check carefully any configuration files restored from your tier 5 backups. As these will likely have been some of the most recently made copies, they are particularly at risk of having been unexpectedly modified. If in doubt, re-create them.
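Checking restored files against known good checksums, as called for in the steps above, might look something like the following Python sketch. The manifest here is assumed to be a simple mapping of relative path to SHA-256 digest obtained from a trusted source that is independent of the backups themselves; the function names and the three-way result are our own choices for illustration.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def verify_against_manifest(root, manifest):
    """Compare every file under root with a trusted manifest.

    manifest maps a relative POSIX-style path to its expected digest.
    Returns three sorted lists: (ok, mismatched, unlisted).
    """
    root = Path(root)
    ok, mismatched, unlisted = [], [], []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        expected = manifest.get(rel)
        if expected is None:
            unlisted.append(rel)       # no known good checksum: do not trust it
        elif sha256_of(path) == expected.lower():
            ok.append(rel)
        else:
            mismatched.append(rel)     # differs from the known good copy
    return ok, mismatched, unlisted
```

Only the files in the first list would be candidates for restoration. Anything mismatched or unlisted should be treated as suspect and replaced from a known good source instead.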

In both cases continue with:

Restore as much data as possible from tier 1 backups.
These should pre-date the start of the data corruption, unless it's been going on undetected for a very long time, but if there is any doubt then do appropriate data integrity checks.
Restore the most recent tier 2 backup, and check for files which are corrupted.
If there are any files which are corrupted, check the copies of these on the next oldest available tier 2 backup.
Repeat this process until known good versions of all files are found, or no older tier 2 backups are available.
Check very carefully any more recent copies of files on tier 3 backups, before restoring them in favour of the known good copies from tier 2.
Back up the newly installed system to a new set of tier 2 media.
Retain any tier 2 backups that you restored from for a period of time (such as six months), in case you later discover that errors were made during the restoration process, or the system fails again soon after the restoration.

Always clearly mark any backups that are suspected to contain malware in any form, and if they are not needed for later analysis then destroy them.
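The search through successively older tier 2 backups described above can be expressed as a short Python sketch. It assumes that each backup has been restored or mounted as a directory tree, that the list of backups is ordered newest first, and that you can supply your own test of whether a given copy is intact. The names find_good_copies and is_intact are hypothetical, introduced only for this example.

```python
from pathlib import Path

def find_good_copies(backup_roots, relative_paths, is_intact):
    """Locate the newest intact copy of each file.

    backup_roots: directories holding restored backups, newest first.
    relative_paths: files to look for, relative to each backup root.
    is_intact: caller-supplied function taking a Path and returning True or False.
    Returns (found, missing): found maps each relative path to the Path of
    its newest intact copy; missing lists files with no intact copy anywhere.
    """
    found = {}
    outstanding = list(relative_paths)
    for root in backup_roots:
        still_bad = []
        for rel in outstanding:
            candidate = Path(root) / rel
            if candidate.is_file() and is_intact(candidate):
                found[rel] = candidate
            else:
                still_bad.append(rel)   # try the next oldest backup
        outstanding = still_bad
        if not outstanding:
            break                       # every file has a known good version
    return found, outstanding
```

What counts as intact depends on the data. It might be a comparison with a stored checksum, a successful parse of the file format, or a manual inspection recorded beforehand. Any files left in the second list have no good copy on the available tier 2 media, and will need to be sought on tier 1 archives or re-created.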

Closing comments

We've discussed the importance of having an adequate backup schedule in place, how to reduce the size of the task by archiving unchanging data, presented a typical example of a comprehensive backup plan, and also touched on how to correctly store, label, and handle the backup media. We've seen that different approaches offer varying levels of protection against the diverse range of possible data loss scenarios, and talked a bit about how to approach the task of restoring.

If implemented correctly, with a high level of compliance, the five-tier backup schedule detailed in this article will not only protect your data from the vast majority of threats it faces, but also allow fast and reliable recovery in the event that disaster strikes.

Here's hoping that your backups are just as readable in 50 years' time as our trusty 9-track tape!