Data Fridays: An Academic’s Fears

So along with all of these thoughts and wonderings about data, I keep coming back to what the barriers and problems will be.

One of my biggest concerns is that, like the scholarly journals that faculty put so much time into, data science will become a hugely monetized and monopolized business built on the work of academic researchers, governments and private researchers. This would promote a closed, have/have-not system and, truly, we don’t have the time or money for another one like that. Our institutional budgets can’t support it.

When I talk to people about my expectations of data storage and curation in the future, I give the three possibilities that I see as most likely:

1)      Data will be hosted in local repositories. These will be institution-specific, and while they may encourage internal collaboration and make some progress, data will be hoarded at a school-by-school level and shared only begrudgingly, when people feel their grant funding won’t be threatened by sharing. This could very easily go the route of institutional repositories and become a Roach Motel, where data goes in but never comes out.

2)      Data will be hosted by consortia of universities.  No one has much money these days but I imagine the Ivy League schools could scramble up a few more dollars than the local community colleges and state schools. Here, data and funding could be pooled much like our library resources, so that if you participate and put in data, you’re also able to borrow data. This allows a wider exchange of ideas. We’re seeing things like this with various branches of science but there is great inconsistency in who/what/how and they aren’t necessarily crossing types of science, despite the need for data to be translated from one field to another.

3)      Data will end up stored by a publisher/distributor that charges an arm and a leg for it.

Now, what about the government? We’ve seen a few interesting things come out of there and yes, there is Data.gov* and this week I saw DataONE. But considering the budget crunches and the incredibly harsh rhetoric that comes out of every story tied to Washington DC at the moment, I’m not sure that’s something we can rely on. In June we saw Vivek Kundra leave Data.gov, which looked suspiciously ominous. However, open government data might be something that proceeds more locally. The Data Portal for Chicago is an example of a source of a lot of different and relatively current data. I can pull information from the Chicago Public Library and make all sorts of pronouncements on people’s reading habits, choice of books, exposure to books from movies or celebrities, etc. Someone from the social sciences might look at titles and determine our social well-being based on what we’re reading and how often. Urban planners might look at how circulation patterns change when a library is opened or closed and then compare that to school data.

But that is by no means my only concern:

How do we move fast enough to at least keep up with the innovations or the onslaught? Many of the librarians I talk to tell me that once they start offering data services, they get a flood of data. Do we have the resources to take it on and manage it? How do we manage permissions? How do we find enough petabytes just to store it and make a backup?

If a library feels they are prepared, and few of us seem to, how are we marketing our services? How are we working with IT departments?  What about public libraries whose budgets are already slashed?  How are they supposed to find time/money to hire someone just to manage incoming data from patrons? Who judges what data a library takes in and stores–particularly in a public setting?

How do we convince people to share their data openly? That was the first question I heard from a friend of mine who is headed into the medical field. What is their motivation? And that’s another huge question:

1)      Tenure presently relies heavily on publications. So I want to hold onto my data for as long as possible and squeeze as many papers as I can out of it before I share it with other people. More journals are requiring sharing of data, but at best it’s begrudgingly shared, and often in as inaccessible a way as possible.

2)      I’m not aware of anyplace giving equal tenure credit to someone who is putting out good data sets as opposed to a good article. Are you?

3)      Fear that it will be monetized by someone else. Someone looks at my dataset and finds a way to make money. Do they then owe me money for using my data?

4)      Fear of being proven wrong, or of being shown to have misused or misrepresented the data.

If we already have an information overload, will data overload be the next problem? We’re no longer just sorting through the articles and books and summaries; now we have the mountains and oceans of raw data to consider. And there are only 24 hours in a day.

Loss of unique branches of science. I heard this a lot at AAAS. We’re doing so much collaboration, and so much is integrated, that the barriers between sciences are getting a little fuzzy. If we blur the lines too much, will the disciplines improve or suffer? How will people identify?

If we start to rely on datasets that are free or at least readily available, what will it take to get funding for a new dataset? When will we agree that we need new and more data rather than relying on a set from five years before? How do new PIs convince funders that money to gather a new dataset is needed?

How do we prevent the paywall?

And I haven’t even gotten to the side of consumer fears...

What other questions am I not thinking to ask?

*Did you know about that? I didn’t until fairly recently. Tell me I’m not the last person in the loop on that. There’s also Health.Data.Gov and Law.Data.Gov.