Deduplication and avoiding data redundancy

26 Mar 2019

When choosing a secure cloud backup provider, it’s worth ensuring they offer a suitable deduplication option. Yes, the clue’s in the name – this can help you avoid duplicating aspects of your data in your backup, thereby saving space, and potentially cutting costs.

For businesses and organisations that create and store large volumes of data where there are only small variations from one file to another, this can avoid what’s called ‘data redundancy’. In turn, this can free up significant storage space and reduce the time it takes to access backed up data. Sound good?

Let’s take a closer look…

What is data redundancy?

Data redundancy is what happens when the same data is repeated – or duplicated – across a database. For example, several files in a database might all have the same data in the same fields, such as a customer’s name or a product’s category, and only the values in the other fields distinguish one file from the rest. Of course, all data in a single file is important, but if precisely the same information appears in multiple files, it can quickly fill precious space in databases, and therefore in data backups.

Of course, you want all of your business-critical data in a backup. With data deduplication, data that would otherwise be repeated tens, hundreds or even thousands of times needs to be identified and securely backed up only once. All the ‘redundant’ duplicates are replaced with a simple reference to that data, called a ‘pointer’, which can potentially free up considerable disk, server, or cloud-based storage.

Confused? Here’s a simple example. Let’s say an email server contains 100 instances of the very same 1MB email attachment, because 100 staff have each backed up their email inbox. That’s 100MB of storage space for exactly the same file, so 99MB of that backed-up data is effectively ‘redundant’ and taking up space unnecessarily. Data deduplication identifies multiple instances of the exact same data, keeps one copy, and replaces the rest with pointers to it. This can reduce demands on storage capacity and retrieval bandwidth significantly. In this example alone it saves 99MB, a reduction of roughly 99%, so the savings at scale are pretty impressive.
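The email example can be sketched in a few lines of Python. This is only an illustration of the principle, not how any particular backup product works: it assumes whole-file deduplication, uses a SHA-256 hash as the file's fingerprint, and the function name `deduplicate` and the inbox paths are made up for the example.

```python
import hashlib


def deduplicate(files):
    """Store each distinct file once and give every name a pointer to it.

    files: dict mapping a file name to its content (bytes).
    Returns (store, pointers), where store maps fingerprint -> content
    and pointers maps file name -> fingerprint.
    """
    store = {}
    pointers = {}
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()  # fingerprint of the content
        store.setdefault(digest, data)             # keep only the first copy
        pointers[name] = digest                    # duplicates become pointers
    return store, pointers


# 100 inboxes, each holding the very same 1MB attachment (hypothetical names).
attachment = b"x" * 1_000_000
inboxes = {f"user{i:03d}/report.pdf": attachment for i in range(100)}

store, pointers = deduplicate(inboxes)

raw_bytes = sum(len(data) for data in inboxes.values())
stored_bytes = sum(len(data) for data in store.values())
saved_bytes = raw_bytes - stored_bytes

print(f"copies stored: {len(store)}, pointers: {len(pointers)}")
print(f"saved: {saved_bytes / 1_000_000:.0f}MB")
```

The pointers themselves take up a little space, so the real saving is slightly under 99MB, and many products deduplicate smaller blocks within files rather than whole files.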

This means the more redundant data you currently have on your servers and in the backup cloud, the greater the efficiencies you can achieve with deduplication. It has become a highly popular alternative to, and is often used alongside, compressing files to be backed up.

Why is deduplication important?

Dealing with data redundancy through deduplication helps you use storage and bandwidth far more efficiently, so you ultimately need less storage space for the same amount of data. This can cut expenditure on physical storage (and the power needed to run and cool it) and on cloud storage, even though the volume of data you’re creating is most likely growing.

Perhaps more importantly for business continuity, data deduplication can also make disaster recovery much faster, as there’s much less data to transfer – and as soon as it’s restored to systems and users, the data is ‘re-duplicated’. Referring back to the 1MB email attachment example above, in the event of an email system failure, everyone who originally received the email and its attachment would have it back faster, because the backup holds only one ‘master’ copy.
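To make ‘re-duplication’ concrete, here is a minimal Python sketch of a restore. It is an assumption-laden illustration rather than a real product's recovery process: the helper names `back_up` and `restore`, the staff identifiers and the small stand-in attachment are all hypothetical, and a SHA-256 hash is assumed as the pointer.

```python
import hashlib


def back_up(mailboxes):
    """Keep one master copy per distinct attachment, plus a pointer per owner."""
    store, pointers = {}, {}
    for owner, attachment in mailboxes.items():
        digest = hashlib.sha256(attachment).hexdigest()
        store.setdefault(digest, attachment)
        pointers[owner] = digest
    return store, pointers


def restore(store, pointers):
    """Re-duplicate: follow each pointer back to the single stored copy."""
    return {owner: store[digest] for owner, digest in pointers.items()}


# Small stand-in for the attachment so the example runs instantly.
attachment = b"quarterly report" * 1000
mailboxes = {f"staff{i:03d}": attachment for i in range(100)}

store, pointers = back_up(mailboxes)

# Only the master copy has to travel back from the backup.
bytes_to_transfer = sum(len(data) for data in store.values())

restored = restore(store, pointers)
print(f"transferred {bytes_to_transfer} bytes, restored {len(restored)} mailboxes")
```

Every member of staff gets the attachment back, but only one copy of it is read from the backup.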

Global deduplication vs local deduplication

What we’ve described so far is ‘local deduplication’. Global deduplication takes things a step further, applying the same principle to a series of onward backups across multiple deduplication devices, whether cloud-based or physical. At each onward stage of deduplication, more redundant data is stripped out, which keeps the space required to a minimum. It does mean, however, that recovering the data involves going back through each stage of deduplication, restoring ‘redundant data’ as it goes.

Local deduplication, on the other hand, evaluates data redundancy on each device before its data is backed up and stored in the cloud. So while global deduplication works across all the devices that held the original data, local deduplication considers only the redundancy within each specific device’s own data.

Because it works off a single deduplication index, global deduplication often has a better reduction rate. However, because the data is more easily accessible, local deduplication can result in better performance. Preferences will typically depend on the nature and volume of data a business creates, and how time-critical it is, but it can also be tied in with data governance and regulatory compliance.
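The difference in reduction rate can be shown with a short Python sketch. It is a simplified model under stated assumptions: each device is just a list of file contents, a SHA-256 hash stands in for the deduplication index key, and the device names and the functions `local_dedup` and `global_dedup` are invented for illustration.

```python
import hashlib


def fingerprint(data):
    """Content fingerprint used as the index key."""
    return hashlib.sha256(data).hexdigest()


def local_dedup(devices):
    """One index per device: duplicates are only spotted within a device."""
    return {
        device: {fingerprint(data): data for data in files}
        for device, files in devices.items()
    }


def global_dedup(devices):
    """A single shared index: duplicates are spotted across all devices."""
    index = {}
    for files in devices.values():
        for data in files:
            index.setdefault(fingerprint(data), data)
    return index


# Hypothetical devices holding overlapping files.
devices = {
    "laptop-a": [b"handbook", b"handbook", b"budget"],
    "laptop-b": [b"handbook", b"invoice"],
    "server": [b"handbook", b"budget", b"invoice"],
}

total_files = sum(len(files) for files in devices.values())
local_copies = sum(len(index) for index in local_dedup(devices).values())
global_copies = len(global_dedup(devices))

print(f"files: {total_files}, stored locally: {local_copies}, "
      f"stored globally: {global_copies}")
```

With these sample files, eight files shrink to seven stored copies with per-device indexes and to three with a single shared index, because the shared index also catches the copies that are repeated across devices.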

Can deduplication save my company money?

As we’ve covered, deduplication can certainly reduce costs by cutting the physical hardware capacity required for backing up increasing volumes of data. It can also reduce costs on cloud backup services, but you should always ensure your cloud backup provider fully understands the practicalities and applications of deduplication. In short, whatever approach you take, make sure the deduplication techniques you apply match the nature and volume of data you’re creating, and how quickly you’d need that data restored in full in the event of disaster recovery.

As this article has shown, deduplication is quite a simple proposition, but to make it work well for your requirements, it’s worth discussing it with your backup provider. It’s certainly growing in popularity as an effective way to manage increasing volumes of data, and it’s well suited to cloud-based backup environments.

About BackupVault

BackupVault provides fully automated, hassle-free, UK-based backup services to organisations all over the world – from small businesses to global brands, to public-sector clients and large corporate enterprises.