Data Sharing

Quality data, open access.

Lesson 1:

Benefits of data sharing

Summary

This lesson introduces data sharing as one of the most impactful practices that a researcher can adopt early in their career. It explains how data sharing not only impacts one's individual career but also advances scientific discoveries. It shows learners how they can be part of the solution to the reproducibility crisis and describes how implementing data sharing principles now can help them adapt easily as mandatory data sharing requirements evolve. Through examples, learners will gain an understanding of how data sharing can lay the foundation for future scientific breakthroughs.

Goal

Gain an understanding of why data sharing is a necessity and evaluate how data sharing (or lack thereof) can impact the advancement of science.

1.1 Our future selves are our first collaborators!

Imagine you've just completed months or even years of rigorous research, collected valuable data, and published your findings. Your work is done, right? Not quite. What happens next with your data could be the difference between your research having a modest impact and creating waves that transform your field for decades to come.

Good science depends on good data, shared properly.

Outside of instances where sharing data is required, some researchers may wonder: “Why should I share my data?” The real question, however, is: “What am I missing if I don't?” 

Some of the most cited papers in major scientific journals don’t just cite the original research. Rather, they cite the datasets that made follow-up discoveries possible. In turn, researchers who share their data contribute to more than a single study, becoming part of dozens of future breakthroughs.

Data sharing serves our own research goals while advancing scientific knowledge.

We already know that sharing high-quality data is often a professional requirement. These requirements may come from funders, compliance frameworks, and institutional open science mandates.

For instance, NIH-funded research that generates scientific data must include a data management and sharing plan.

Data management and sharing plans describe:

  • What data will be shared (e.g., preprocessed fMRI datasets, behavioral data, analysis scripts), 
  • How (e.g., shared as BIDS-formatted datasets, CSV files, and Python scripts under CC-BY license), 
  • When (e.g., upon publication of primary results), and 
  • Where (e.g., OpenNeuro repository). 
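One lightweight way to keep these four answers close to the dataset itself is a small machine-readable summary file. Below is a minimal sketch in Python; the file name and field values are illustrative assumptions, not an NIH-required format.

  # Minimal sketch: record the what/how/when/where of a sharing plan
  # next to the data. Field values are illustrative, not prescribed.
  import json

  dms_summary = {
      "what": ["preprocessed fMRI (BIDS)", "behavioral data (CSV)", "analysis scripts (Python)"],
      "how": "BIDS-formatted dataset, CSV files, and Python scripts under a CC-BY license",
      "when": "upon publication of primary results",
      "where": "OpenNeuro repository",
  }

  with open("dms_plan_summary.json", "w") as f:
      json.dump(dms_summary, f, indent=2)

Writing the plan down in this form also makes it easy to check, at sharing time, that each commitment was actually met.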

Nonetheless, the most powerful benefit of preparing data for sharing is that it yields better data for ourselves. This is why successful data sharing starts with careful attention to data quality.

To prepare a dataset for sharing, we must:

  • Document variables, units, and collection methods.
  • Clean inconsistencies and errors we might otherwise tolerate.
  • Create metadata that explains what each column, file, or sample actually represents.
  • Organize files so that someone unfamiliar could understand them.

That “someone unfamiliar” is almost always future you.
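A data dictionary is one of the simplest artifacts that covers the first three items at once. Here is a minimal sketch in Python; the variable names, units, and descriptions are hypothetical.

  # Minimal sketch of a data dictionary: one row per variable, with
  # units and collection method spelled out. Variables are hypothetical.
  import csv

  rows = [
      {"variable": "rt_ms", "description": "response time per trial",
       "units": "milliseconds", "collection": "keyboard logging"},
      {"variable": "age", "description": "participant age at session 1",
       "units": "years", "collection": "intake questionnaire"},
      {"variable": "group", "description": "condition (0=control, 1=treatment)",
       "units": "coded category", "collection": "randomization script"},
  ]

  with open("data_dictionary.csv", "w", newline="") as f:
      writer = csv.DictWriter(f, fieldnames=list(rows[0]))
      writer.writeheader()
      writer.writerows(rows)

A file like this costs minutes to write while the details are fresh, and saves hours when you (or anyone else) return to the data later.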

Thus, while it is certainly important to share data to contribute to the greater good of science, the careful planning and thorough documentation required to make data useful to others also makes data sharing a major benefit to our own future research.

Six months from now, you will return to a dataset for a revision, a new analysis, or a related project. The quality standards you apply for sharing are the same ones that make your own reuse painless. Or even possible in the first place!

Thus, ensuring data quality is a prerequisite for sharing your data.

1.2 The benefits of shared data depend on data quality

Data quality refers to how well a dataset can be used to research specific questions. In turn, high-quality data is fit for its intended use and has several key characteristics.

High-quality data can be defined by interlocking practices that ensure reliability, interpretability, and trustworthiness. These practices span how data are collected, verified, documented, analyzed, and ultimately communicated.

For data to have these characteristics, we must plan for both data quality and data sharing well in advance of the sharing stage.

High-quality data characteristics:

Rigorously collected data begin with strong experimental foundations. This means following validated protocols and incorporating appropriate controls, rather than improvising methods on the fly. Researchers rely on calibrated instruments and thoughtful experimental design to reduce measurement error and bias. Just as importantly, they anticipate potential confounds (variables that could distort results) and take deliberate steps to identify and mitigate them during data collection rather than after the fact.

Proper validation ensures that the data hold up under scrutiny. Planned quality control checks are not optional add-ons but integral steps in the workflow. Measurements are compared against known standards to confirm accuracy, and internal consistency is assessed across replicates to ensure reproducibility. When irregularities arise (like outliers), they are not automatically discarded; instead, they are investigated and documented.

A documented provenance is a transparent history of the data. High-quality datasets include clear records of who collected the data, when, and under what conditions. Whether minor adjustments or unexpected issues occur, any deviations from the intended protocol are carefully noted. Maintaining a chain of custody for samples and datasets further ensures integrity, while all transformations or modifications are reported transparently so that others can trace how the final dataset came to be.

An appropriate analysis connects the data back to the research question. Analytical methods should align with both the underlying hypothesis and the structure of the data, rather than being chosen for convenience or convention. Researchers must also verify that the assumptions underlying their analyses are met. Importantly, high-quality analysis pushes against selectively highlighting favorable outcomes. Instead, it includes the reporting of negative and null results, which are essential for an accurate and complete interpretation.

Finally, transparent reporting ensures that findings can be evaluated and trusted by others. This involves openly acknowledging limitations and sources of uncertainty, rather than overstating conclusions. Negative or failed results are included when they provide meaningful context, helping to prevent misleading narratives. Transparency also extends to disclosing any potential conflicts of interest, allowing readers to assess the work with full awareness of possible influences.

Differentiating high-quality data from FAIR principles

A common misconception is that FAIR data automatically means high-quality data. However, FAIR refers to how the data is shared, while data quality refers to whether the data is worth sharing.

Let’s look at a few examples more closely!

First, recall that FAIR stands for data that is Findable, Accessible, Interoperable, and Reusable!

Example 1: Data that is FAIR but of poor quality.

Imagine that a researcher shares a calcium imaging dataset following all FAIR principles.

The Good:

  • The data is in a public repository. 
  • It is openly accessible.
  • It is in standardized file formats.
  • And it includes a detailed README file!

Uh-oh! Data-quality problem!
The researcher later discovers that the microscope’s temperature control was malfunctioning during data collection!

The Bad:

  • This caused signal drift that went unnoticed until after publication!
  • The shared data is FAIR, but the quality was compromised.

Example 2: High-quality data that is not FAIR.

In this scenario, a research team collects behavioral data with rigorous protocols, daily recorded calibrations, blinded scoring, and excellent quality controls.

The Ugly:

  • However, the stored data has a disorganized naming scheme.
  • With cryptic file names!
  • And no data dictionary!
  • Worse: the file types are for proprietary software, limiting who can interact with them!

The data is high-quality, but it is not FAIR. Ideally, shared data should be both high quality AND FAIR.
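To make Example 2 concrete, here is a minimal sketch of one fix: moving the team's data out of a proprietary format into an open one. It assumes the behavioral data were saved as SPSS .sav files (the file names are hypothetical); pandas reads .sav through the optional pyreadstat package.

  # Minimal sketch: rescue high-quality data from a proprietary format.
  # Requires the optional pyreadstat dependency for pandas.read_spss.
  import pandas as pd

  df = pd.read_spss("final_v2_REAL.sav")          # proprietary, cryptically named input
  df.to_csv("study01_behavior.csv", index=False)  # open format, descriptive name

Pairing the renamed, open-format file with a data dictionary (as sketched earlier in this lesson) would address the remaining gaps.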

DEEP DIVE: Finding Funder Requirements with FAIRsharing

Once you understand the difference between high-quality data and FAIR data, the next question is often: what do funders actually expect you to do? These expectations can vary widely across U.S. agencies, international funders, and private foundations.


The FAIRsharing registry is a helpful tool for navigating this landscape. It is a searchable, curated database of data sharing and management policies from funders, journals, and other organizations across disciplines. The tool is a community-driven project wherein “the FAIRsharing team precisely curates information on standards employed for the identification, citation and reporting of data and metadata” (Sansone et al., 2019).


What makes FAIRsharing especially useful is that, in addition to providing information about data preservation and sharing policies, it connects those policies to relevant standards, repositories, and best practices. This means you can move beyond identifying requirements to understanding how to meet them. For example, a funder policy might recommend specific data formats or repositories, and FAIRsharing helps you trace those recommendations and see how widely they are adopted.


This is particularly valuable if you are working across disciplines or applying to multiple funders. Instead of searching each policy in isolation, you can use FAIRsharing to compare expectations, identify common patterns, and anticipate requirements early in your project planning.


In practice, this tool helps researchers move from understanding FAIR principles to implementing them. It also shows how FAIR ideas are translated into real-world policy and what that means for your data.

1.3 Sharing high-quality data advances both individual careers and scientific discovery

Once we’ve prepared high-quality data (and applied FAIR principles), we can reap the benefits of the far-reaching impacts of data sharing. If you still have questions about how data sharing impacts your career and science broadly, review our impact overview!

Impact overview:

What do I gain?

Data sharing can meaningfully enhance your professional visibility and reputation. By making your data accessible, you increase opportunities for others to cite your work and recognize your contributions. Shared data also opens the door to new collaborations and networking opportunities, connecting you with researchers who build on or extend your work. Over time, this visibility helps establish your credibility as a contributor to open and rigorous science.

Where does my work go?
Shared data extends the reach of your research far beyond its original publication. It can inform policy decisions, contribute to clinical applications, or be reused in entirely different disciplinary contexts. Depositing data in repositories also ensures long-term accessibility, allowing your work to remain available and useful even as technologies and research questions evolve.

How does science change?
At a broader level, data sharing accelerates scientific discovery. When data are available to others, they can be validated, reanalyzed, and combined with other datasets in meta-analyses or large-scale syntheses. This not only strengthens existing findings but also generates new hypotheses and enables faster testing of ideas across multiple contexts and datasets.

Why should we trust it?
Transparency is a cornerstone of trustworthy science, and data sharing makes that transparency possible. When data are openly available, other researchers can independently validate and replicate findings, increasing confidence in the results. This open scrutiny strengthens the overall credibility of the research and helps address persistent challenges related to reproducibility.

Activity: Discover data sharing impact!

Let’s see the impact of sharing high-quality data in practice!

Evaluate the impact of individual participation in data sharing on one’s career as well as scientific progress.

Post-activity questions:

  • Did you observe any ripple effects from your choices that were surprising? How so?
  • How did your data sharing decisions affect scientific progress in your field?
  • Which path generated the most significant impact on knowledge dissemination?

The earlier and more thoroughly you share data, the more it amplifies your research impact, advances your career, and accelerates scientific discovery.

Researchers who resist sharing high-quality data miss opportunities and actively limit their career prospects. Conversely, choices to record and share reliable, clearly organized, and accessible data lead to greater impact in the scientific community year after year!

By increasing visibility of and access to your work, data sharing improves how science is done.

DEEP DIVE: What exactly do the papers by Colavizza et al. (2024) and Hemphill et al. (2022) reveal?

Curation level impact on download count in “How do properties of data, their curation, and their funding relate to reuse?” by Hemphill et al. (2022):


To examine how data properties, curation decisions, and repository funding models relate to data reuse, this paper evaluated data download logs from the Inter-university Consortium for Political and Social Research (ICPSR). It applied ICPSR’s “curation level” framework for standardizing common curation actions (ICPSR, 2020) to measure the impact of curation on downloads.


This framework designates 3 levels of curation:

All datasets go through disclosure risk review and remediation.

Level 1 (baseline curation):

  • Study website with metadata

  • PDF codebook (variable descriptions)

  • Data files in major formats (Stata, SPSS, plain text)

Level 2 (enhanced usability):

  • Includes Level 1

  • Data cleaning + standardization (e.g., fix formats, missing values, spelling, labels)

Level 3 (advanced curation):

  • Includes Levels 1–2

  • Customized documentation

  • Indexes survey questions in SSVD for searchability

  • Covers non-tabular data (e.g., qualitative, spatial)


Using this framework, they analyzed data usage for 380 studies released by ICPSR from January 1, 2017, to April 30, 2021 (excluding studies with restricted-use data). They computed the number of “data users” for each study by identifying unique downloaders based on IP addresses and non-ICPSR-staff logins.


They found that data attributes, curatorial decisions, and archive funding models correlated with data reuse, and the most frequently used datasets were those that were provided by institutions, shared with a high level of curation, and received external funding for their access and preservation. Specifically, higher curation levels generated 4x more downloads.


Helping others find the right data:

Interestingly, in some cases, making data available for online analysis is correlated with fewer downloads. This suggests two scenarios:

  1. Many users meet their data needs through online analysis and do not need to download or work with data locally, or

  2. Users explore the data through online analysis and decide they are not right for their project.


The paper notes that in Gregory et al. (2020), nearly 75% of surveyed researchers used exploratory analysis to determine a dataset's applicability to their work.


Correlation between data sharing and citation advantage in “An analysis of the effects of sharing research data, code, and preprints on citations” by Colavizza et al. (2024):


This paper explores whether adopting one or more Open Science practices (like early sharing of results through preprints or openly sharing outputs such as data and code) leads to significantly higher citations for an associated publication. 


To investigate this form of academic impact, the researchers used a novel dataset known as Open Science Indicators (produced by PLOS and DataSeer). It includes all PLOS publications from 2018 to 2023, as well as a comparison group sampled from the PMC Open Access Subset. In total, they analyzed about 122k publications, calculating publication- and author-level citation indicators, and used a broad set of control variables to isolate the effect of Open Science practices on received citations.


Dependent variable:

Citation counts were derived from the PMC Open Access Subset using identifier-linked references only. Preprint citations to their published versions are excluded (due to different IDs), though preprints can cite published OSI papers. The main dependent variable is total citations, with additional measures counting citations within 1-, 2-, and 3-year post-publication windows for robustness.


Independent variables:

Controls included publication year and month (to account for citation timing), number of authors, and reference count. Disciplinary differences were modeled using 18 Australian and New Zealand Standard Research Classification (ANZSRC) field categories (these were dummy-coded). Author reputation was controlled for and proxied by the mean H-index of authors (with simple name-based disambiguation). Journal effects were limited to indicators for PLOS and PLOS ONE. Multiple model types were tested, but linear and robust regressions were reported due to similar results and clearer interpretation.


The team found that Open Science practices are adopted to different degrees across scientific disciplines, and the early release of a publication as a preprint correlates with a significant positive citation advantage of about 20.2% (±0.7) on average.


Moreover, sharing data in an online repository correlates with a smaller yet still positive citation advantage of 4.3% (±0.8) on average. They did not, however, find a significant citation advantage for sharing code. The paper calls for further research on additional or alternative measures of impact beyond citations.


Takeaways:

  • Data sharing is a strategic career investment. Its benefits extend beyond meeting requirements, and it is not just an altruistic act. Think of the citation multiplier effect, network building, and the expanded visibility of your work.
  • Data sharing creates a multiplier effect that accelerates research discovery and scientific validity.
  • Data sharing is becoming a professional necessity.
  • Preparing your dataset to provide the best possible access and use works in your favor.

Reflection:

  • Reflect on a recent study that you were excited to learn about. Did you notice whether their data was publicly available or not?
  • Consider the data sharing practices in the lab you currently work in. Do these practices allow for transparency and reproducibility?
  • How would you advocate for data sharing when someone says it’s unnecessary?

Lesson 2:

How shared data is used

Summary

This lesson demonstrates how data may be shared and reused in ways we may never have considered. It exemplifies how different types of shared data need different types of supporting information to make them accessible. Through examples, learners will explore how one dataset can serve various purposes depending on how it is shared and documented.

Goal

Identify how data can be reused for different purposes and what supporting information each type of data requires.

2.1 How shared research is used: direct replication and validation

Imagine you have just published findings on the role of dopamine in early Parkinson’s disease. Your results are promising, and soon another research team expresses interest in replicating your study!

Replication is a cornerstone of scientific progress. For another team to meaningfully reproduce your findings, they must be able to follow your methods as closely as possible. This means more than understanding your conclusions. They need to be able to access the underlying details that shaped how your data were generated and analyzed.

The team begins by reviewing what you’ve shared. They find clear variable definitions and coding schemes, which is an important first step! These help them interpret the structure and meaning of your dataset. 

However, they quickly realize that this information alone is not sufficient to reproduce your work.

To replicate your study accurately, they also need a fuller picture of your research process. This includes detailed data collection protocols and timelines, so they can understand how and when measurements were taken. 

They also need insight into your quality control procedures to assess how data integrity was maintained. Additionally, they need to be able to follow any data cleaning or preprocessing steps that transformed your study’s raw data into the final dataset used for analysis.

Without this additional context, even well-intentioned researchers may arrive at different results. This is not because your findings are incorrect, but because the key details needed to reproduce what your study achieved are missing.

Shared data will only help others (and your future self) if it includes all of the relevant information to make use of it in further research.
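One dependable way to share those cleaning and preprocessing steps is to release them as a script rather than describing them from memory. Here is a minimal sketch in Python; the file names, column names, and exclusion thresholds are hypothetical.

  # Minimal sketch: preprocessing as a shareable, auditable script.
  import pandas as pd

  raw = pd.read_csv("raw_trials.csv")

  # Step 1: drop practice trials (protocol: first 10 trials per session).
  trials = raw[raw["trial_number"] > 10].copy()

  # Step 2: exclude implausible response times per the quality-control
  # rule used in the study (here, < 150 ms or > 3000 ms).
  trials = trials[trials["rt_ms"].between(150, 3000)]

  # Step 3: report what was removed so replicators can audit each step.
  print(f"Kept {len(trials)} of {len(raw)} rows")
  trials.to_csv("clean_trials.csv", index=False)

A replicating team that receives a script like this alongside the data no longer has to guess how the raw measurements became the analyzed dataset.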

In addition to the direct replication and validation example, researchers also use shared data for:

  • Investigating secondary research questions.
  • Meta-analyses and systematic reviews.
  • Developing novel methods and models.
  • Education and training.

Each of these cases will need information on study design, variables, and protocols, but may emphasize different aspects of the shared data in their use. For the remainder of this lesson, we will look more closely at secondary research questions and meta-analyses.

2.2 How shared research is used: secondary analysis and new research questions

Say you conducted a study on Alzheimer’s disease and, in the process, collected a rich set of lifestyle information on your patients. Your primary research question may have focused on one specific aspect of the disease, but your dataset likely contains much more information than you needed for that original analysis.

Now imagine another research team comes across your shared data. They are interested in studying changes in sleep patterns, medication adherence, or family history in similar patient populations. While these factors were not the focus of your work, your dataset still contains relevant information that could support their investigation. By performing a secondary analysis, they can explore entirely different research questions without needing to collect new data from scratch.

However, for this kind of reuse to be meaningful, the new team needs to understand the context in which your data were generated. They can have confidence in the relevance and reliability of your dataset only if they can access comprehensive information about its potential constraints. 

For example, they need to know the characteristics of your patient population and how participants were selected, since this shapes how broadly the findings can be applied. 

They will also need clarity on the temporal relationships between measurements, such as when and how often data were collected. This will allow them to interpret patterns over time.

Finally, understanding potential confounding factors, whether measured or unmeasured, is critical for drawing valid conclusions from a new analysis.

Without this context, secondary analyses risk misinterpretation. With it, your data can continue to generate new insights well beyond your original research question!

With the right context, your data can support entirely new questions, and extend the impact of your work far beyond its original aims.

DEEP DIVE: Want to learn more?

What if instead of using your data for a new question, other researchers wanted to explore what you did to develop new methods?


Methods development and validation

Biostatisticians and computational researchers use your data to develop new analytical techniques or validate existing methods. Well-documented datasets can serve as testing grounds where new tools are evaluated under realistic conditions rather than idealized scenarios.


What other researchers will look for:

  • Data structure and relationships between variables (what goes with what and when): Clear organization and metadata help researchers understand how variables interact across time, conditions, or participants, which is critical for building and testing models.

  • Known gold standard comparisons: Benchmarks such as expert annotations, clinical diagnoses, or validated reference measures allow researchers to meaningfully evaluate performance rather than relying on indirect proxies.

  • Missing data, missing data patterns, and why they are missing: Transparency about gaps in the data (e.g., dropout, equipment failure, exclusion criteria) helps others assess bias and test whether methods are robust to real-world imperfections.

  • Technical specifications of measurements: Details like instrument settings, preprocessing pipelines, and data acquisition protocols ensure that others can interpret variability correctly and replicate or stress-test their methods.


In addition, documenting preprocessing decisions and known sources of noise (e.g., motion artifacts in imaging data) allows researchers to determine whether their methods are sensitive to or resilient against these factors. Highlighting unusual cases or edge conditions can be especially valuable, as these often reveal the limits of existing approaches or the strengths of new ones.
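For the missing-data point above in particular, a small summary shipped with the dataset goes a long way. Here is a minimal sketch, assuming a tabular CSV with hypothetical file and column names:

  # Minimal sketch: summarize missing-data patterns for methods developers.
  import pandas as pd

  df = pd.read_csv("clean_trials.csv")

  # Fraction of missing values per variable.
  print(df.isna().mean().sort_values(ascending=False))

  # Co-occurrence of missingness: do variables drop out together,
  # as with session-level equipment failure?
  print(df.isna().astype(int).corr())

Summaries like these let methods developers test whether their tools tolerate the dataset's real imperfections rather than idealized ones.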


Real-world example: Your fMRI dataset helps validate a new algorithm for detecting early signs of cognitive decline in neurodegenerative diseases! The caveat is that this is true only if researchers can trace how the data were collected, processed, and labeled. With sufficient context, your dataset becomes part of the methodological infrastructure of the field, enabling more reliable and comparable tool development.

2.3 How shared research is used: meta-analysis and systematic review

Fast forward a few years, and researchers are conducting a meta-analysis on the efficacy of stroke interventions. Their goal is to synthesize findings across multiple studies to identify broader patterns and more robust conclusions than any single study can provide.

As they review the literature, they consider including your study’s dataset among the selected papers. This is unsurprising! 

Studies that are well-documented and transparently reported are far more likely to be included in meta-analyses and systematic reviews, because they allow researchers to make meaningful comparisons across datasets. In your case, you provided a thorough understanding of your study’s analyses and outcomes.

You also included standard outcome measures and clearly showed how they map to commonly used scales, making it easier to align your results with those from other studies. 

Additionally, you reported population demographics in enough detail to support subgroup analyses, allowing researchers to examine how effects may differ across patient groups. 

Your team also described your study design and potential sources of bias, which are critical for evaluating the quality and comparability of your work. 

Finally, by reporting effect sizes and confidence intervals, you enabled others to quantitatively integrate your findings into larger statistical analyses.

With this level of detail, your study becomes part of a larger body of evidence that shapes how the field understands and evaluates stroke interventions!

Activity: What does the next researcher need?

Let's see how varying levels of documentation impact data reuse!

Review three data usage scenarios. Identify what type of supporting information would be most critical for the data user to have.

Post-activity questions:

  • How are you currently keeping track of the information about your studies? How might you reorganize it for data sharing?
  • Why does the same dataset require different supporting information depending on how it is used?

The more completely you document and share contextual information, the more effectively others can reuse your data and avoid the pitfalls of missing or insufficient metadata.

A dataset with perfect measurements but poor documentation will not be useful for further analysis, while the same dataset with rich contextual information becomes a valuable resource for multiple research applications. The difference isn't in the data itself. Rather, the supporting information determines what other researchers can understand, trust, and properly apply from your work.

DEEP DIVE: Want to learn more?

Educational and training purposes

Your data becomes a teaching tool for students learning research methods, statistics, or domain-specific analysis techniques (think -omics data, imaging data, sequencing data, etc.). Well-prepared datasets can support both introductory learning and advanced skill development by providing realistic, hands-on experience.


What other researchers will look for:

  • Clear and complete variable names and descriptions: A well-structured data dictionary allows learners to quickly understand what each variable represents and how it should be used.

  • Context about the research problem and its significance: Providing background on the original study helps learners connect analytical techniques to meaningful scientific questions.

  • Appropriate complexity for audience and learning objectives: Datasets should balance realism and accessibility. Retaining some messiness (e.g., missing values, variability), while clearly signaling these features for transparency, can support productive learning.

  • Ethical considerations and limitations: Information about data use restrictions, privacy concerns, and responsible interpretation helps learners develop good research practices alongside technical skills.


To further support training, datasets can include suggested entry points such as example analyses, guiding questions, or starter code. These scaffolds help learners move from initial orientation to active exploration. Importantly, including clearly documented, real-world imperfections gives learners practice making analytical decisions rather than simply following idealized workflows.
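As an illustration of what such an entry point might look like, here is a minimal starter-code sketch in Python; the dataset, columns, and guiding question are hypothetical.

  # Minimal sketch of starter code for a teaching dataset: enough to
  # orient a learner without doing the analysis for them.
  import pandas as pd

  df = pd.read_csv("teaching_dataset.csv")

  print(df.shape)    # how many observations and variables?
  print(df.dtypes)   # what kind of data is in each column?
  print(df.head())   # what do the first rows look like?

  # Guiding question: the group sizes are unbalanced on purpose.
  # How should that shape the comparison you run next?
  print(df["group"].value_counts())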


Real-world example: Your transcriptomics data helps train research staff in a new core facility to analyze similar datasets, turning the technique into an efficient and accessible resource for multiple researchers! Over time, such datasets can evolve into reusable training resources that support onboarding, standardization, and continuous skill development across teams.

Understanding how shared data gets used reveals why metadata is the bridge between your data and its future impact (we will talk in more detail about metadata in Lesson 4: Metadata makes it useful). Every time someone successfully uses your data, they are building on your work and extending your scientific contribution. 

That also means that every time someone cannot use your data because of insufficient context, that is a missed opportunity for both scientific progress and your research legacy.

The value of shared data depends on how easily others can make sense of it.

Takeaways:

  • It’s hard to predict how your data may support other research questions, but by comprehensively sharing all the information from your study, you can support the efforts of other researchers years after collection. 
  • Sharing biomedical data serves many purposes, including for replication, secondary analysis, meta-analysis, methods development, and education.
  • Different uses of data require different supporting information.

Reflection:

  • Compare and contrast the five ways that research data may get used. Which would require the most supporting documentation?
  • Think about your own research project. What information would you need to include for someone else to meaningfully build upon your research and how may this differ from your original research purpose?
  • What motivates the members of your team to share their data? Do they consider how that data may be used?