Yesterday morning the NIH released their Data Management and Sharing (DMS) Policy, and of course there went the rest of MY week/month/year/2021. Naturally I have thoughts, summaries, and then questions. I’m deeply beholden to John Wilbanks of SageBio, whose quick read-through yesterday morning bailed me out: I had to spend 7 hours in Zoom calls but could still alert my community before I had the capacity to do a proper check of the changes and the final version. (Also, if you’re one of the people I sent emails to Thursday night, the comments here will sound shockingly familiar.)

Ten years ago almost to the day, as I was just starting at UIC, the National Science Foundation announced their data management plan policy, sparking what has become my professional work over the ensuing decade. In this, the NIH has been slower than I expected. There were two rounds of drafts spread across nearly four years. We’ve always known something was coming, but it was still a surprise to realize just how long it’s been between NSF and NIH.

The policy also doesn’t go into effect until January 25, 2023. I am deeply curious as to why we’re being given two years of preparation time. Truthfully, there are points of the policy where two years will not be nearly enough time for me, my institution, or anyone else’s to fully prepare, so I’m going to take it as a gift and do as much collaboration, education, training, infrastructure advocacy, and everything else I possibly can before then. For example, many of us need a year of lead time at minimum to create new procedures or policies at our institutions. If I start that in January 2021, I can have it implemented and familiar to people by late 2022. Maybe.

Very interestingly, and in a method I’ve not seen before, the released document first goes into extensive detail about what was modified from the draft released in December 2019, for which I worked with colleagues to draft a response that was signed by our Vice Chancellor of Research. At least one friend has said they didn’t like this format; personally, I really enjoyed it. The policy itself will live elsewhere; this was the “Here’s the Update and What We Changed, Plan to Go Forth.”

So, what have we here:

  • All NIH grant proposals will require a two-page data management and sharing plan at the point of initial submission (not at Just in Time, thank goodness)
  • Various Institutes, Centers, and Offices (ICOs) may supplement this; AHRQ has already given us much more detailed guidance
  • “Data” means enough to Validate and Replicate research findings “regardless of whether the data are used to support scholarly publications”
  • Said plan must respect patient security and privacy — we’re not just dumping data on the open web with cancer diagnoses etc. Tribal sovereignty is specifically called out.
  • Costs for long term storage can be prepaid
  • Costs for personnel for data management can be included in the budget
  • It explicitly states that compliance with the plan as approved by the ICO is required: what we promise, we will have to be able to deliver.
  • Specific Elements of a Plan are listed in a Supplement. Expect to see a lot of instruction tied to that.
  • Use of established repositories is recommended, with guidance on that in a supplement.
  • Data sharing: “Shared scientific data should be made accessible as soon as possible, and no later than the time of an associated publication, or the end of performance period, whichever comes first.” —this is important, we’ll come back to this.
  • There is a compliance section that holds institutions and investigators accountable to what is put into the plans.
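The “whichever comes first” sharing deadline quoted in the list above can be read as a simple date comparison. Here is a minimal sketch of that reading, not anything the NIH has published; the function name, parameter names, and dates are hypothetical, and it assumes a single award with at most one associated publication.

```python
from datetime import date
from typing import Optional


def sharing_deadline(end_of_performance: date,
                     publication_date: Optional[date] = None) -> date:
    """Latest date by which shared data should be accessible.

    Hypothetical helper: returns the earlier of the associated publication
    date (if there is one) and the end of the performance period.
    """
    if publication_date is None:
        # No associated publication: the end of the performance period governs.
        return end_of_performance
    return min(publication_date, end_of_performance)


# Illustrative dates only, not taken from any real award.
deadline = sharing_deadline(
    end_of_performance=date(2026, 6, 30),
    publication_date=date(2025, 3, 1),
)
print(deadline.isoformat())
```

The sketch deliberately ignores “as soon as possible,” which is a judgment call rather than a date, and it says nothing about data that spans several grants.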

Thoughts and questions:

  • This is a much more compliance-related policy than what we saw from NSF. That is not unexpected but it is very intriguing. NSF’s initial ask was –and remains– “what are you doing?” NIH is leading with “and we plan to hold you accountable to this plan.” This will likely make researchers cautious but it doesn’t excuse them from data management and sharing. And it gives something for universities to use when researchers initially draft something technologically impossible. It also gives us something to lean on when advocating for infrastructure.
  • Data to validate and replicate the findings, even if it wasn’t published. This significantly expands what data will need to be retained and includes data that led to null or negative results: data the authors weren’t writing up beyond the grant report, or which the journals weren’t interested in. It doesn’t demand publication (that’s not reasonable), but it says we have to plan to keep and share this data. To me this also means that researchers cannot treat “sharing” as only meaning “we will publish scholarly articles about this.” This could be especially useful on the front of clinical trials, where we are also seeing increased data governance, retention, and reporting requirements. I anticipate that I will be asking, “Is there enough here for X colleague down the hall or at another institution, and their first-year student, to replicate this? No? Great, what metadata (protocols, data dictionary, README) is missing?” No one’s methods sections are ever comprehensive enough.
  • Share Your Data: “Shared scientific data should be made accessible as soon as possible, and no later than the time of an associated publication, or the end of performance period, whichever comes first.” Reading this and then re-reading this… wow, I’m fascinated by this statement. As soon as possible, and no later than publication or the end of the performance period. I’m reminded of a non-biomedical researcher who was whining a few years ago about the potential of being obligated to share his (entirely NSF-funded) research data because “but I’m not DONE with IT! Even though I gathered it 15 years ago and haven’t touched it since!” [I only slightly exaggerate.] This in and of itself may be why we’ve been given two years to plan. It is going to require a significant change in how soon data is shared and what that looks like. Presently we have very few repositories, institutional or higher, that are prepared for sharing sensitive human subjects data. What mediation tools and authentication will we need? It will have to go far beyond IRB. How will we prevent this from swamping our investigators with navigating more of these demands? How will this work with patent research? What if you have one grant finishing but the data from the project is on several grants? Which one wins? I do not have answers for this but I am anticipating a thousand questions.
  • The plan requirement was initially proposed to be at the Just in Time stage and I was so very much against that. I foresaw too many data librarians being called upon at the 11th hour of a JIT and being told “oh, but I didn’t write anything in the budget about long term preservation, so the library/university/IT will just have to do that without money from my grant.” Fortunately, quite a few of us apparently raised that concern and NIH concurred that for budgeting and planning, institutions in particular needed investigators to be planning earlier in the grant submission stage.
  • Patient Privacy. Whether they go far enough with this, and how, is unclear. The policy clearly states that it does not supersede or override any federal or state protection laws already in place. It specifically notes the sovereignty of Tribal Nations over their data. The most promising clause I found was “access to scientific data derived from humans should be controlled, even if de-identified and lacking explicit limitations on subsequent use.” This is a change. I have seen assertions (less so in the biomedical sphere, but far too often in research) that “it’s anonymous!” when the data is at best partially de-identified. (Somewhere Kristin Briney just made a rude noise.) I force all of my students to read Latanya Sweeney’s work on k-anonymity to try to counter some of that. This statement will impact informatics work and Big Data work, or it had better impact them. I like this; this is good. Now we need to re-examine our controls and how best to use them to advance science and health and protect the human subjects who have generously shared their data with us.
  • “When you feel like it.” The NIH draft had a lot of hand-waving phrases like “as long as deemed useful” and “to be determined,” much of which would have let investigators put off stated decisions about retention and access until “later” and the inevitable “when I’m done.” Nearly all of that has been removed from the final policy. No single timeline is mandated, but there is far less allowance for plan writers to not commit.
  • No really, SHARE. NIH leans on putting data into established repositories and gives some suggestions (one is PubMed Central for small datasets tied to publications). They moved away from language that was more about researchers sharing on request, which we know is deeply problematic because it lets beliefs about the requester’s perceived race, gender, ethnicity, location, and wealth shape who gets the data. They’ve also updated their guidance about repositories (yes, I worked on a response about that too; it’s been a respond-y year).
  • Two pages isn’t enough. That was something expressed in the public comments and feedback. Yes, I agree. It was never going to be enough. It’s a statement of intentions that you include with the grant, the same way that the literature review you include isn’t all of the literature you consider when working on your research. It should be an extract of a much larger and more robust document/set of documents.
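For anyone who hasn’t met the k-anonymity idea mentioned in the patient privacy point above: a dataset is k-anonymous with respect to a set of quasi-identifiers if every combination of those values is shared by at least k records. The following is a minimal sketch of that check only. The field names and records are invented for illustration, and a real de-identification review involves far more than this count.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same values for the
    given quasi-identifier fields (0 for an empty dataset).

    The dataset is k-anonymous on those fields for any k up to this number.
    """
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values()) if groups else 0


# Invented example records; no real data.
records = [
    {"zip": "60607", "birth_year": 1980, "sex": "F", "diagnosis": "A"},
    {"zip": "60607", "birth_year": 1980, "sex": "F", "diagnosis": "B"},
    {"zip": "60612", "birth_year": 1975, "sex": "M", "diagnosis": "A"},
]

# The third record is unique on these fields, so the smallest group has size 1.
k = k_anonymity(records, ["zip", "birth_year", "sex"])
print(k)
```

A result of 1 means at least one person is unique on those fields, which is exactly the “it’s anonymous!” situation where the names are gone but the person can still be singled out.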

These are my first thoughts…. I’m sure I’ll have more in weeks to come. What were your first impressions?