Data: Cultural Transition

This has been floating around in my head, and out to several people in conversations, over the last month. I’m finally getting it into pixels for everyone else.

As we’re undergoing the change to open data, further exploration of data science, and the various other growing pains that go along with that, there are two cultural shifts that will need to occur.

Culture Status Quos

1) A scientist and researcher that I work with told me that the overarching opinion in scientific research is that “all the best science is new data.”

2) I have heard from administrator types that there is no value in creating a good data set; only the analysis and published journal articles are important and valuable for tenure and promotion considerations.

These are broad, sweeping statements and certainly there are many more factors that come into play, but I think this is a good place to start from as we’re improving the future of data use, citation, and acknowledgement.

As for the best data being new data, that’s both true and false and really depends on the field. Archaeology and paleontology leap to mind as areas where old data are particularly relevant, yet even there the search for new things (new ancient species, new tombs, newly uncovered civilizations, etc.) keeps scientists naming dinosaurs and keeps National Geographic in cover stories.  And to say that scientists aren’t considering old data at all would be incorrect.  They are; they’re just not necessarily looking at the old data set in its entirety.  Instead, researchers are building on the data that is published in the various journals, which is usually a subset of the data gathered during the study. But when we’re talking about grants and new research, we rarely see a push to pull together data sets already gathered or to check whether a suitable data set already exists. Instead, we see the desire to gather new subject data, despite the probably higher expense.

In medicine, I can point to one example where old data is being considered, but not fully: systematic reviews.  There, the authors gather hundreds of studies (where available; this is topic dependent), review the methods and results, and come up with overarching evidence-based treatment recommendations and guidelines.  While I’m sure it would mean a lot more work for those systematic reviewers, what if they could pull the datasets behind all of those studies together? What other trends might they see?

So how do we encourage scientists to go back to the actual data set? How do we promote the idea that a data set can be used more than once, perhaps by someone else?

Part of the challenge is accessibility and part of it is citation. If we can make data more easily available and findable, readers will be more willing to take a look at it. We can’t require every reader to make the effort of contacting the author, journal, or institution, waiting for a reply, negotiating to read the data (when it is not sensitive information; I know that health information will need more rules), and then finally getting the data some two weeks after they’ve read the article.  I have researchers who tell me that anything more than one click is onerous for them when using the library website to find our electronic journals, so I don’t have great optimism that they’ll find time to contact all of the authors.  Also, that puts a huge burden of response on the author(s), who have many other things they need to be working on as well.

The culture of citation? I think that relies on developing standard ways of citing data.  Some fields are citing data; many are not.  We need to get away from the current APA citation, which says only something like “unpublished raw data,” and find something a little more descriptive and unified.

Researchers also need motivation to make their data accessible and citable. While that is starting to come down from funders (NIH, NSF, etc.), it is also going to need to be driven by administrators and institutions that recognize the value of data.

Speaking of those administrators and moving on to point two, how do we get administrators to move beyond the idea that journal articles are the only tangible

Well, one of those ways is for scientists to start using and citing more datasets.  Ah, yes, it’s a bit of circular logic: researchers will cite data if administrators support it, but administrators won’t support it unless they see that researchers in the field are citing data and that it is a trend.  Truly though, it’s going to take well-respected people writing articles, blog posts, etc. that cite datasets, and writing new analyses of released datasets that show the further impact.

There is one way to hold the administrators’ feet to the fire, and that is the NSF Data Management Plan requirement.  When a major funding agency says that you must start thinking about data preservation or lose your grant funding, that’s a wake-up call that comes in very real terms of dollars and cents. That’s a place to start, but not a good one to finish.

Another opportunity is to look for a campus initiative that involves the words translational or cross-disciplinary work. If you want researchers to work together, how better than to break their data out of silos so that, at least institutionally, they can collaborate and share their knowledge across fields?

Student education through the use of good data sets is another opportunity. If students can work with data sets already formed, then a) they don’t have to take the time to gather new data, b) they can be taught to review the data sets that are available and, while learning, develop a culture of using and sharing data, and c) hopefully they’ll take the opportunity to learn what makes a good data set, so that when they do create their own, it is more accessible in the future.

Finally, I’m curious to know if there are any disciplinary awards for good data sets and the sharing thereof.  If you can get an award for it (and why shouldn’t you?), that begins to add prestige to the data set category.

Wise readers, what can you tell me?  Did I miss anything in those culture shifts? Are there other opportunities you are undertaking?  Are there awards out there that I’m missing?

One Comment

  1. Comment by Alan Schwartz:

    The Dataverse Network software (behind Harvard’s IQSS and other organizational sharing/archiving efforts) promotes a common citation format for data sets that seems very reasonable to me.