Resilience of research data#

Any number of mishaps can put research data at risk. The degree to which your data can survive them is called resilience. Obviously, you would like your data to be resilient to as many different types of mishap as possible. Equally obviously, this resilience usually comes at a cost (financial or otherwise), so a trade-off is often involved.

Data resilience can be defined along a number of dimensions. The two most obvious are existence (if your data ceases to exist, the only way to get it back is to somehow reproduce the same or similar data) and availability (your data still exists, but can no longer be accessed in a timely fashion). Another dimension that is often of interest is integrity: has your data been altered, whether through tampering or through undetected corruption?
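The integrity question can be made concrete with checksums. The following is a minimal sketch, assuming your data lives in an ordinary directory tree; the function names (`sha256_of`, `build_manifest`, `changed_files`) are illustrative and not part of any particular storage product. It records a SHA-256 digest per file and later reports files that are missing or whose content has changed.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file below root (by relative path) to its digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are now missing or have different content."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )
```

A manifest like this detects accidental corruption. It only helps against deliberate tampering if the manifest itself is kept somewhere the person tampering cannot modify it.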

To weigh cost against resilience, it helps to classify the different types of resilience involved in storing your research data.

Disaster resilience#

Disaster resilience is the ability of your data to survive major events that destroy infrastructure. A typical example is your data center burning to the ground, ending the existence of your data. Disasters are often geographically localized, so the obvious way to achieve resilience is to keep geographically separated copies of your data. Disasters can also affect the availability of your data: the data connections between your location and your data’s location may be down, making it impossible to retrieve the data.
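A separated copy only adds resilience if it actually matches the original. As a rough sketch, assuming the remote location is reachable as an ordinary directory (the `replica_root` parameter is a hypothetical mount point, not a real service path), a file can be copied and then compared byte by byte:

```python
import filecmp
import shutil
from pathlib import Path


def copy_and_verify(source: Path, replica_root: Path) -> Path:
    """Copy source into replica_root and confirm the copy is identical."""
    replica_root.mkdir(parents=True, exist_ok=True)
    replica = replica_root / source.name
    shutil.copy2(source, replica)  # copies content and file metadata
    # shallow=False forces a comparison of the file contents themselves.
    if not filecmp.cmp(source, replica, shallow=False):
        raise IOError(f"replica of {source} differs from the original")
    return replica
```

Real replication between sites is normally handled by the storage provider or a synchronization tool; the point here is only that a copy should be verified, not assumed.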

Failure resilience#

Failure resilience deals with the failure of one or more components of your data storage solution. A typical example is a hard disk failing. If that hard disk is the only medium holding your data, it may be impossible (or extremely expensive) to salvage the data from it. The most widely used way to achieve failure resilience is to build redundancy into the storage devices. That means your data is distributed in some form over a number of devices, and additional checksum (parity) data is kept so that the failure of a specified fraction of these devices can be tolerated. Failures can affect your data’s existence or availability, but also its integrity: the nasty type of error that does not show itself but does corrupt the contents of your files.
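How redundant data allows a failed device to be rebuilt can be illustrated with the simplest scheme, XOR parity, which tolerates the loss of any single device. This is a toy sketch with three hypothetical "disks" held in memory; production systems use more elaborate layouts and codes.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equally sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical disks, each holding one equally sized block of data.
data_blocks = [b"research", b"data is ", b"precious"]

# The parity block would be stored on a fourth device.
parity = xor_blocks(data_blocks)

# Simulate losing the second disk and rebuild its block
# from the surviving blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data_blocks[1]
```

The cost side of the trade-off is visible here as well: one extra device is needed for every group of data devices, and a second simultaneous failure in the same group cannot be recovered.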

Administrator error resilience#

Many storage solutions are not managed by the owner of the data, because of the technical complexity and amount of work involved. Professional staff are assigned this task instead. All staff (or at least the vast majority) are human, so these administrators can make mistakes and potentially delete your data. Professional storage providers have procedures in place to minimize the chance of this type of error, but it can still happen. Resilience to this type of error is achieved by copying your data to a repository that is administered by someone else (using a different storage provider usually guarantees this).

User error resilience#

You, as owner of the data, often need to work with that data on a daily basis. This involves moving things around, restructuring, copying and sometimes deleting. Or, if you are doing more advanced work, having automated scripts perform these actions for you. What will happen one day (and to most of us it already has) is that you press delete and regret it the very next second. Experience shows that there is no way to prevent this from happening, so in order to be resilient, your storage solution needs to give you a way to recover from it.
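For your own scripts, one simple way to keep a recovery option is to move files to a holding area instead of deleting them outright. The sketch below assumes a local trash directory of your choosing; `soft_delete` and `restore` are illustrative names, not functions of any existing tool.

```python
import shutil
import uuid
from pathlib import Path


def soft_delete(path: Path, trash_dir: Path) -> Path:
    """Move path into trash_dir instead of deleting it; return its new location."""
    trash_dir.mkdir(parents=True, exist_ok=True)
    # A random prefix keeps repeated deletions of the same file name apart.
    target = trash_dir / f"{uuid.uuid4().hex}_{path.name}"
    shutil.move(str(path), str(target))
    return target


def restore(trashed: Path, original: Path) -> None:
    """Move a previously trashed file back to its original location."""
    original.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(trashed), str(original))
```

A trash directory on the same disk protects against the regretted delete, but not against the other threats described here, so it complements rather than replaces snapshots or backups offered by the storage solution.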

Hacking resilience#

Hacking resilience is the ability of your data to survive a deliberate attack. Think of a hacker encrypting all your files and then demanding money to decrypt them.

Resilience of various storage scenarios#

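| Threat | Disaster | Failure | Admin error | User error | Hacking |
|---|---|---|---|---|---|
| iRods | | | | | |
| U: drive | | | | | |
| U: drive + weekly copy to Surfdrive | | | | | |
| Laptop | | | | | |
| Laptop with weekly copy to USB | | | | | |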