Part 5: Long-Term Preservation and Archiving of Data

Kim Learns about Data Preservation and Access

Kim has come a long way in her understanding of data management planning. In the previous parts of this tutorial, she learned how to document data and about the importance of using standards in such documentation and description. She now appreciates the need to be clear in the DMP about how the data will be shared and made accessible, and what the time frame will be for making them available. Kim has also been discussing with Dr. Smart what kinds of derivatives from the data might be possible and interesting for related research communities. She is now in a position to finish the DMP by addressing storage and preservation of Dr. Smart's data.

Part 5 of this tutorial guides you through deciding where to store your data once your project concludes, so that the data are preserved over the long term and remain accessible. You'll be able to address the following in this part of the DMP:

Disciplinary data repositories that are applicable to the research data sets you will be collecting and sharing. (Note: some repositories have requirements in terms of types of data, descriptive standards, and size of data.)
Information about Penn State's repository service, ScholarSphere [1], where researchers have deposited data to share them and ensure persistent access to them.
Tips on how to store your data for safekeeping.

A fanned out stack of stickers that read Open Data

The U.S. government increasingly supports public access to the results of federally funded research.

Credit: By Jonathan Gray [CC0], via Wikimedia Commons [2]

5.1 Issues Regarding Long-Term Access

As you consider where to deposit your data, think also about how long you will keep the data available and accessible after your project ends. How much of your data will you make available: all of it, from the raw files to the processed outputs? How often will you need to access it? How will you enable other users to make use of it, particularly if using the data requires other tools or systems?

Ben Goldman discusses some of the challenges of preserving digital data and the resources available for doing so.

Link to YouTube video. [3]

5.2 Data Repositories

As mentioned earlier, there may be data repositories suitable for the data that your project will produce. To find such repositories, you may wish to consult Databib [4] - a growing list of repositories for research data primarily in the sciences and social sciences. Data sets stored in a disciplinary repository have some advantages, including a greater likelihood of discovery by other researchers. Another benefit of making your data available in a repository is that they become more widely accessible and citable. Andrew Stephenson attests to the advantages of data repositories in the following video.

Link to YouTube video. [5]

Examples of disciplinary data repositories:

Dryad [6] - for data sets in the applied biological sciences that are linked to published articles
ChemxSeer [7] - for chemistry data sets
ESA (Ecological Society of America) Data Registry [8] - for data sets in ecology
ICPSR (Inter-University Consortium for Political and Social Research) [9] - for social sciences data
IEDA (Integrated Earth Data Applications) [10] - for “observational solid earth data from the Ocean, Earth, and Polar Sciences”

For help in selecting the appropriate data repository for your data, consult the Libraries’ Data Management mailing list, l-data-mgmt@lists.psu.edu [11].

5.3 Penn State's Repository Service, ScholarSphere

Sometimes there is no disciplinary repository for your data, or the one that exists has deposit requirements (for size, format, documentation, etc.) that your data cannot meet. In such a case, consider depositing your data to ScholarSphere [1], Penn State's institutional repository. ScholarSphere takes any file format, and there is no maximum amount of data that users can deposit (although there are upload maximums because deposit occurs via the Web).

As Andrew Stephenson describes it, ScholarSphere is a time saver as well.

Link to YouTube video. [12]

ScholarSphere is a self-deposit repository service ensuring the long-term preservation of data for ongoing access. No registration or creation of an account is necessary. The service is available to anyone in the Penn State community - all that is required is a current Web Access ID.

To learn more about ScholarSphere, visit its Help page [13].

5.4 Taking a Distributed Approach to Data Storage

Depositing your data to a formal repository such as those mentioned in the previous section is a good practice for research projects.

However, preserving your data and making them accessible in only one place is not enough. A distributed approach to storing your data is highly recommended. By being part of a campus community, a researcher has options beyond local storage of her data. One should investigate options beyond campus as well. This is something librarians and archivists can help with, as described in the video below.

Link to YouTube video. [14]

Below are ways you can distribute storage of your data (based on U. Minnesota Libraries' "Storing Data Securely" [15]):

Local options - easy to access your data and control access, but you are responsible for backing up that data.
- Internal hard drive (computer hard drive)
- External hard drives - extensive storage capacity is increasingly inexpensive to purchase
- College or departmental servers, local networks
Campus-based options - some at no cost, others for a fee; facilitates collaboration; users have less control.
- Penn State Access Accounts Storage Space (PASS) [16] - administered by Information Technology Services (ITS), requires a Web Access Account ID; 500 MB to start, can be increased to 10 GB.
- Tivoli Storage Manager (TSM) [17] - administered by Applied Information Technologies in ITS, fee-based file backup service
- High-Performance Computing [18] - administered by Research Computing and Cyberinfrastructure [19] in ITS, fee-based.
Cloud-based options - someone else takes care of your data and manages it; not recommended for sensitive data, because it's third-party storage.
- Subject-based repositories such as GenBank
- Commercial services such as Amazon Web Services [20]
- Box.com [21], Dropbox [22], ElephantDrive [23], Google Drive [24], Jungle Disk [25], SpiderOak [26]

Quick Tips for Storage and Backup of Data

Keep at least three copies of your data
Have “master” or original files from which copies get made
Put files in external but local storage, such as an external hard drive (but not on optical media)
Also, put files in external but remote storage, or on remote servers

This way, files are physically (geographically) dispersed for disaster recovery purposes.
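The tips above can be sketched in a few lines of Python. This is only a minimal illustration, not a recommended backup tool: it copies a master file into several destination folders and compares SHA-256 checksums to confirm that each copy matches the original. The function names, the folder names ("external_drive", "remote_server"), and the sample file are hypothetical; in practice the destinations would be a mounted external drive and a remote or network location.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(master: Path, destinations: list[Path]) -> dict[Path, bool]:
    """Copy the master file into each destination folder and verify each copy.

    Returns a mapping from each copy's path to True if its checksum
    matches the master's checksum, False otherwise.
    """
    expected = sha256_of(master)
    results = {}
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy = folder / master.name
        shutil.copy2(master, copy)  # copy2 also preserves timestamps
        results[copy] = sha256_of(copy) == expected
    return results


if __name__ == "__main__":
    # Demonstration in a temporary directory; the folder names stand in
    # for an external hard drive and a remote server (hypothetical).
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        master = root / "master" / "observations.csv"
        master.parent.mkdir()
        master.write_text("site,value\nA,1\n")

        outcome = copy_and_verify(
            master, [root / "external_drive", root / "remote_server"]
        )
        for copy, ok in outcome.items():
            print(copy, "verified" if ok else "MISMATCH")
```

With the master file plus two verified copies, this yields the three copies suggested above. Re-running the checksum comparison periodically is one way to notice a copy that has become corrupted.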

5.5 Summary

In the last section of the DMP, be sure to discuss how the project will store and preserve the data. This entails mention not only of any data repositories where the project will deposit data but also of how, for the duration of the project, data storage will be handled, managed, and kept secure. A distributed approach to data storage is the standard to follow, which includes maintaining at least three copies of data, keeping a "master" file for the sole purpose of making copies, and keeping files both on external hard drives and in external but remote storage or on remote servers.

Data sets deposited into a disciplinary repository have some advantages, including a greater chance of discovery by other researchers in your field because of their familiarity with such a repository. Examples of disciplinary data repositories (more of which can be found in Databib):

Dryad - for data sets in the applied biological sciences that are linked to published articles
ChemxSeer - for chemistry data sets
ESA (Ecological Society of America) Data Registry - for data sets in ecology
ICPSR (Inter-University Consortium for Political and Social Research) - for social sciences data
IEDA (Integrated Earth Data Applications) - for “observational solid earth data from the Ocean, Earth, and Polar Sciences”

Occasionally, no disciplinary data repository is available for your data. In such cases, you should consider depositing your data sets to Penn State's repository service, ScholarSphere, which is a self-deposit service that takes any format and requires no creation of an account. The only requirement for deposit is that you have a current Penn State Web Access ID.

Check Your Understanding

True or False: ScholarSphere is a free service for all Penn State researchers.

(a) True
(b) False

ANSWER: (a) True. ScholarSphere, Penn State’s repository service for all its faculty, students, and staff, does not charge any fees for usage. All that is required is a Penn State Web Access ID. Once you log into ScholarSphere, you automatically become a user (whether you deposit files or not). Currently, there is also no limit to the number of files you may deposit (i.e., no maximum on storage size for deposited files). ScholarSphere takes any file format; it can accommodate web uploads of single files up to 500 MB and a folder of files up to 1 GB. Since it has Dropbox integration, files larger than 500 MB may be uploaded to ScholarSphere via Dropbox. Finally, while the default access level is open access, users may adjust access to “Penn State only” or to “private.”
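
Before depositing, it can help to check file sizes against the web upload limits quoted in the answer above. The following is a rough sketch only: the constant and function names are hypothetical, the limits are simply the figures stated in this tutorial (which may change), and treating "MB" and "GB" as decimal units is an assumption.

```python
from pathlib import Path

# Limits as quoted in this tutorial; decimal units are an assumption.
SINGLE_FILE_LIMIT_BYTES = 500 * 1000**2  # 500 MB per single web upload
FOLDER_LIMIT_BYTES = 1 * 1000**3         # 1 GB per folder web upload


def exceeds_single_file_limit(size_bytes: int) -> bool:
    """True if a file of this size is too large for a single web upload."""
    return size_bytes > SINGLE_FILE_LIMIT_BYTES


def folder_size(folder: Path) -> int:
    """Total size in bytes of all regular files under a folder."""
    return sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())


def fits_folder_upload(folder: Path) -> bool:
    """True if the folder's total size is within the folder upload limit."""
    return folder_size(folder) <= FOLDER_LIMIT_BYTES
```

A file for which `exceeds_single_file_limit` returns True would be a candidate for the Dropbox route mentioned above.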