Data management and curation
- Include metadata right in the file names. Example: "Sub01_strooptest_jan2019"
- Carefully document all phases of pre-processing and analysis of your data.
- Include a CHANGESlog.txt file with your project at the top level of the directory. This file should detail any changes to your dataset over time.
- Consider future access to your dataset. When possible, use non-proprietary file types that are widely used in your field. Proprietary files are not ideal because if the company that created the software that made the files goes out of business, there may be limited access to the software required to work with those files in the future. New versions of existing software may also not support those file types once use of the software falls out of popularity. Of course we realize this may not always be possible, especially in some modalities.
- As new standards in the field emerge over time, files in long-term storage may need to be updated to newer formats where appropriate. This is especially true of data that will see continued use, for example in a longitudinal study.
- Use only appropriate UiO storage and analysis environments to avoid threats to data security and integrity. Do not use your personal or office computer for storing or analysing data. See our guide Environments for Data Management and Analysis for more information about your options. Make sure files, including older datasets, are securely stored and backed up at regular intervals.
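File names that carry metadata are easier to keep consistent when they are generated rather than typed by hand. The following is a minimal sketch, assuming a hypothetical helper `build_filename` and the subject, task, month-and-year order used in the example above; adapt the fields to your own project.

```python
from datetime import date

# Month abbreviations are spelled out so the result does not depend on locale.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def build_filename(subject: int, task: str, session_date: date, ext: str = "") -> str:
    """Build a name such as 'Sub01_strooptest_jan2019' (hypothetical helper)."""
    stamp = f"{MONTHS[session_date.month - 1]}{session_date.year}"
    # Zero-pad the subject number so files sort correctly in a directory listing.
    return f"Sub{subject:02d}_{task}_{stamp}{ext}"


print(build_filename(1, "strooptest", date(2019, 1, 15)))  # Sub01_strooptest_jan2019
```

Only the month and year are written into the name, which matches the example and avoids exposing the exact session date.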
Prevent Data Degradation
Data degradation, or "bit rot," can occur when older files or the hardware they are stored on become corrupted in even small ways over time, rendering the files unusable.
- Avoid reading CDs of older data sets on old optical drives (they have a greater tendency to damage the disc than newer ones).
- Always eject flash drives correctly prior to removing them from the computer.
- Avoid leaving files open for long periods.
- Once a file has been overwritten multiple times, save it as a new version and document when that version was created.
- Save older data in archive files (.rar, .zip) using non-solid compression (the default). This allows you to right click and check for damage using 'Check Archive' and then repair if necessary. Regular, non-archive files cannot be repaired in this way, nor can solid-compression archive files.
- Another method is to use a checksum to check data integrity (see the recommended readings for more information).
- Verify back-up copies of files by checking file size, date of last edit, or the checksum.
- Perform backups on two separate forms of media. Copy files to a new storage medium every 2-5 years, as the disc/drive itself can also degrade over time.
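The checksum checks mentioned above can be sketched with Python's standard `hashlib` module. This is a minimal illustration, not a prescribed procedure: the function names `file_sha256` and `copies_match` are hypothetical, and SHA-256 is assumed as the checksum algorithm.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 checksum of a file as a hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original_path: str, backup_path: str) -> bool:
    """Return True if the backup has the same checksum as the original."""
    return file_sha256(original_path) == file_sha256(backup_path)
```

Storing the checksum next to the archived file when it is first written makes it possible to detect later corruption even when no second copy is at hand.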
- When possible, use open-source programming languages such as Python, R, or Ruby. MATLAB code may also be run in the open-source program GNU Octave with little to no adaptation.
- Use functions rather than large blocks of redundant code. Name your functions based on their purpose. This makes the debugging process easier and helps others to understand your code through simplification.
- Be clear about what dependencies are needed to run your code.
- Comment liberally! Comment at the beginning of a program to describe what it does and who created it. Comment prior to functions and other cohesive blocks of code to explain their specific purpose. BUT: Do not "comment out" parts of the code as a method of changing a program's purpose.
- We recommend the use of version control programs like Git to track and approve changes to the code, especially when coding with a team.
- Share your code with the broader community! Obtain a DOI for the program to publish along with related data or articles so it can be easily referenced and located.
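As an illustration of the commenting and function-naming advice above, here is a minimal sketch of how a small analysis script might be laid out. The script name, author line, and the function `mean_reaction_time` are hypothetical and only serve as an example.

```python
# mean_rt.py (hypothetical example script)
# Purpose: compute the mean reaction time for one participant.
# Author: <your name>, <date>
# Dependencies: Python 3 standard library only (statistics).

from statistics import mean


def mean_reaction_time(reaction_times_ms):
    """Return the mean of a list of reaction times given in milliseconds."""
    if not reaction_times_ms:
        raise ValueError("No reaction times were provided.")
    return mean(reaction_times_ms)


# One named function replaces repeated blocks of averaging code.
print(mean_reaction_time([400, 500, 600]))
```

The header states the purpose, author, and dependencies, and the function name describes what it does, so a reader can follow the script without tracing every line.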
- By default, anonymize file headers that contain embedded sensitive information at the source. For example, tick the box "suppress patient data" when exporting DICOM files from the scanner.
- Always keep the data key, which connects the participants to the ID linking to their data, stored separately from the data itself. Grant access only to those who truly need it.
- Generalize data such as age, birthdate, location, and other indirect identifiers when possible. Re-linkage can occur when metadata (such as name, age, or gender) can be used to identify the participant along with other data (for example, neuroimaging data).
- Shift dates for participation so that participants cannot be identified based on the date of the session(s).
- Consider the use of hash functions instead of a regular data key (see recommended resources for more information).
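The hash-function and date-shifting items above can be sketched as follows. This is one possible approach under stated assumptions, not a vetted anonymization procedure: it uses a keyed hash (HMAC-SHA256 from the Python standard library) rather than a plain hash, and the function names, the 12-character ID length, and the 365-day shift window are hypothetical choices.

```python
import hashlib
import hmac
from datetime import date, timedelta


def pseudonym(participant_id: str, secret_key: bytes, length: int = 12) -> str:
    """Derive a stable pseudonymous ID from a participant ID with a keyed hash."""
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


def shift_date(session_date: date, participant_id: str, secret_key: bytes,
               max_days: int = 365) -> date:
    """Shift a date back by a per-participant offset of 1 to max_days days."""
    digest = hmac.new(secret_key, b"date-shift:" + participant_id.encode("utf-8"),
                      hashlib.sha256).digest()
    # The same participant always gets the same offset, so the intervals
    # between that participant's sessions are preserved.
    offset = int.from_bytes(digest[:4], "big") % max_days + 1
    return session_date - timedelta(days=offset)
```

The secret key plays the same role as the data key: anyone who holds it can recompute the mapping, so it should be stored separately from the data and shared only with those who need it.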
- Plans for data management should be explicitly stated when sending applications for ethics approval. If long-term storage and sharing of the data is a goal, this must also be stated, both to review boards and to participants at the consent-gathering stage. If you later decide to retain and share your data set, you will need to not only send a change request to the ethics board, but also re-obtain consent from participants.
- Handling data on certain groups also presents special challenges when it comes to both data security and anonymization. Examples of this include certain patient groups (especially those who face stigmatization), minors, persons of certain ethnic origins, political background, sexual orientation, or gender minorities.
- Certain types of data are also inherently challenging when it comes to security, sharing and anonymization. These include medical imaging, recordings (audio and visual) and genetic materials. See our guide Environments for Data Management and Analysis for information about safe storage of sensitive data.
- Keep a data dictionary with well-defined coding.
- Always keep a copy of your raw data.
- Use software that checks for consistent formatting and ranges. OpenRefine is a good option for cleaning up datasets.
- Consider two-pass verification for manual data entry. Two members of the study's team enter data and then the two documents are reviewed for accuracy. Discrepancies can pinpoint errors that may have otherwise been overlooked.
- Keep audit records of what in a dataset was changed, why it was changed and by whom. This can be done either manually or using version control programs like Apache Subversion (https://subversion.apache.org/).
- Set access controls to read-only for those who do not need to change the data.
- Record all changes to master files and make sure collaborators always have the most recent version on hand.
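The two-pass verification described above comes down to comparing two independently entered copies of the same records. A minimal sketch, assuming each pass is a list of dictionaries keyed by a hypothetical `participant_id` field; the function name `find_discrepancies` is also hypothetical.

```python
def find_discrepancies(first_pass, second_pass, key="participant_id"):
    """Compare two data-entry passes and list every value that differs.

    Each result is a tuple: (record ID, field, first value, second value).
    """
    second_by_id = {row[key]: row for row in second_pass}
    first_ids = {row[key] for row in first_pass}
    discrepancies = []

    for row in first_pass:
        other = second_by_id.get(row[key])
        if other is None:
            # The record was entered in the first pass only.
            discrepancies.append((row[key], "<row>", "present", "missing"))
            continue
        for field, value in row.items():
            if other.get(field) != value:
                discrepancies.append((row[key], field, value, other.get(field)))

    for row in second_pass:
        if row[key] not in first_ids:
            # The record was entered in the second pass only.
            discrepancies.append((row[key], "<row>", "missing", "present"))

    return discrepancies
```

Each reported tuple points to a single cell to check against the source document, and the raw data is left untouched.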
Published May 20, 2020 6:07 PM - Last modified May 21, 2020 1:00 PM