Backup Considerations and Tips
This page covers considerations and tips for managing backups of your data. For tool-specific information, see the pages on the individual backup tools.
Determine where data should be backed up
- For important data, we recommend keeping at least 3 copies in at least 2 locations
- Having multiple copies helps protect against hardware and software failures, while having copies in different locations helps protect against data loss due to natural disasters, infrastructure failures, or other large-scale events
- Note that all CHPC storage options are currently in the same physical location; for data resilience, we recommend researchers back up to another location (e.g., other institution, cloud, etc.)
Determine who should do the backup, keeping in mind the ownership of the backup
- The CHPC offers automatic backups with group space purchases for an additional cost
- Also keep in mind that the user making the backup must have access to the data, so make sure all group members set their Linux file permissions appropriately
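Before handing a backup to another group member, it can help to list the files that person would not be able to read. The following is a minimal sketch, assuming the person running the backup gets access through the group permission bits; the function name `not_group_readable` is hypothetical, and the check ignores group ownership and ACLs.

```python
import os
import stat


def not_group_readable(root):
    """Return paths under root whose group permission bits would block a backup.

    Files need the group read bit; directories need group read and execute.
    This only inspects mode bits (an assumption of this sketch); it does not
    check which group owns each path or whether ACLs grant extra access.
    """
    problems = []
    dir_bits = stat.S_IRGRP | stat.S_IXGRP
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if stat.S_ISDIR(mode) and (mode & dir_bits) != dir_bits:
                problems.append(path)
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if stat.S_ISREG(mode) and not (mode & stat.S_IRGRP):
                problems.append(path)
    return sorted(problems)
```

An empty list means every file and directory found has the group bits set; any path returned is a candidate for a permissions fix before the backup runs.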
Determine which data need to be backed up
- Do you need the full content of your group space backed up?
- Is there only a subset of the data that needs to be backed up?
- If you determine that only a subset of the data needs to be backed up, is it possible to group these data together into a directory (or a few directories) that can be backed up?
- How much data are you backing up, and how long will it take?
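To answer the last question, a rough estimate can be made from the size of the directory and the sustained transfer rate you expect. The sketch below is illustrative only: the function names are hypothetical, and the throughput is a number you must measure or assume yourself, since real rates vary with file count, network load, and the destination.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # The file vanished or is unreadable; leave it out of the total.
                continue
    return total


def estimated_hours(size_bytes, throughput_mb_per_s):
    """Hours needed to move size_bytes at a sustained rate in MB/s (1 MB = 10**6 bytes)."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = size_bytes / (throughput_mb_per_s * 1_000_000)
    return seconds / 3600
```

For example, `estimated_hours(2 * 10**12, 100)` gives roughly 5.6 hours for 2 TB at an assumed 100 MB/s. If the estimate approaches the interval between scheduled backups, consider backing up a smaller subset or backing up less often.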
Determine when data should be backed up
- If the data to be backed up are fairly static, making a point-in-time copy may be sufficient. You should keep a minimum of two copies (two “buckets”) in the archive space, keeping the last successful backup intact until the current backup has been completed and verified.
- This goes hand-in-hand with the amount being backed up; you need to make sure one backup of a dataset finishes before the next starts
- If the data change on a daily basis, running full backups periodically and using sync to add incremental differences would be a good choice
- Data may be backed up manually after changes have been made to “checkpoint” the data
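The two-bucket rule above can be expressed as a small decision: before each run, choose the bucket that is safe to overwrite so that the newest verified backup is never touched. This is a minimal sketch under stated assumptions; the `Bucket` record, its fields, and the bucket names are hypothetical, and how a backup gets marked as verified is left to your own process.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bucket:
    name: str
    completed_at: Optional[float]  # epoch seconds of the last finished backup, None if never run
    verified: bool                 # True once that backup has been checked


def is_good(bucket):
    """A bucket is worth protecting only if its backup finished and was verified."""
    return bucket.verified and bucket.completed_at is not None


def bucket_to_overwrite(first, second):
    """Pick the bucket the next backup should write into.

    Unverified or never-used buckets are chosen first; if both buckets hold
    verified backups, the older one is chosen.
    """
    return min((first, second), key=lambda b: (is_good(b), b.completed_at or 0.0))
```

With `Bucket("bucket-a", 100.0, True)` and `Bucket("bucket-b", 200.0, True)`, the function returns `bucket-a`, so the newer verified backup in `bucket-b` stays intact until the new run has finished and been verified.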
Determine how data should be backed up
- To minimize the archive space needed, tar (and zip) your files when appropriate, keeping in mind that doing so may lose metadata such as creation and modification times
- Sync vs. Copy
  - When the destination location does not already exist, copy and sync are identical in terms of content (but with copy the creation date is not preserved)
  - When the destination already exists with content, they are different, and sync is a faster option
    - Sync will make changes to make the destination identical to the source, including deleting files/folders that no longer exist at the source
    - Copy will only copy over files, but will not delete files at the destination if they no longer exist at the source
- Automate your backups using a script and a cron job
  - See templates on the Rclone page for both an example rclone sync and a cron job to run it on a regular basis
  - Remember to check the logs of these runs to confirm that your data were backed up!
- Backups should be performed from the Data Transfer Nodes (DTNs), not from other CHPC resources
  - intdtn01-4 are suitable for on-campus transfers
  - dtn04-8 are suitable for both off- and on-campus transfers
  - Note that UBox and OneDrive are off-campus resources
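The difference between sync and copy described above can be seen on a small local example. The sketch below imitates the two behaviors with Python's standard library only; it is not how rclone or any other backup tool is implemented, and the function names `copy_tree`, `sync_tree`, and `demo` are hypothetical.

```python
import os
import shutil
import tempfile


def copy_tree(src, dst):
    """Copy every file from src into dst; never delete anything already in dst."""
    shutil.copytree(src, dst, dirs_exist_ok=True)  # requires Python 3.8+


def sync_tree(src, dst):
    """Make dst match src: copy files over, then delete what src no longer has."""
    copy_tree(src, dst)
    for dirpath, dirnames, filenames in os.walk(dst):
        src_dir = os.path.join(src, os.path.relpath(dirpath, dst))
        for name in filenames:
            if not os.path.exists(os.path.join(src_dir, name)):
                os.remove(os.path.join(dirpath, name))
        for name in list(dirnames):
            if not os.path.isdir(os.path.join(src_dir, name)):
                shutil.rmtree(os.path.join(dirpath, name))
                dirnames.remove(name)  # do not descend into the removed directory


def write_file(path, text):
    with open(path, "w") as fh:
        fh.write(text)


def demo():
    """Back up a folder twice, deleting one source file in between."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "source")
        copied = os.path.join(tmp, "backup_copy")
        synced = os.path.join(tmp, "backup_sync")
        os.mkdir(src)
        write_file(os.path.join(src, "a.txt"), "a")
        write_file(os.path.join(src, "b.txt"), "b")
        # First backup: the destinations do not exist yet, so both end up identical.
        copy_tree(src, copied)
        sync_tree(src, synced)
        # The source changes: b.txt is deleted and c.txt is added.
        os.remove(os.path.join(src, "b.txt"))
        write_file(os.path.join(src, "c.txt"), "c")
        # Second backup: the two behaviors now diverge.
        copy_tree(src, copied)
        sync_tree(src, synced)
        return sorted(os.listdir(copied)), sorted(os.listdir(synced))
```

After `demo()` runs, the copy destination still holds `b.txt` alongside `a.txt` and `c.txt`, while the sync destination holds only `a.txt` and `c.txt`. Which behavior you want depends on whether the backup should mirror the source or retain files that were deleted from it.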
Other considerations
- Keeping copies of data in “cold storage” (external disks or tape, not connected to a computer) or as immutable backups (offered by some cloud storage providers) can help protect against ransomware
- All storage hardware has a finite lifespan that is affected by factors such as the frequency of reads and writes and temperature, so if you are storing data on your own hardware, plan to use redundant storage and replace storage media over time