It seems like every time I turn around these days, everyone is trying to figure out how to manage data, how to wrangle data. We’re swamped with it: trying to collect it, trying to tell the gems from the detritus, and trying to figure out how to store it. Some recent examples for me:
1) In a conversation with faculty and staff from the College of Dentistry at UIC, we talked about how they are capturing patient information, much of which could easily be scrubbed of identifiers, in Axiom, a dentistry-targeted electronic record. And yet, across the schools that use this resource, a lot of customization may happen: fields may be captured slightly differently, and some questions may not be included at all. So it’s unclear how to gather data across different sites, even though those sites seem to be using the same tool.
2) I recently took a three-week course on RDA and FRBR. A recurring theme was how much information we already have in MARC. We need to find a way to make that data available and more easily analyzed, used, searched, and navigated. Our current tools give us limited options. RDA was recently pushed back another year or 18 months, I think, but when it does finally hit, it may open a lot of new options for us.
3) I attended a Synthetic Biology Symposium at Northwestern University earlier this spring. Many of the presenters, and others chatting between sessions, talked about the desperate need to have data more openly available. Resources like GenBank were pointed to as places where people can build on each other’s work collaboratively, rather than working off in their own labs. The various speakers would talk about building on each other’s points, but they were also stunned at what the others were doing, because they hadn’t known about it. A lack of communication significantly impairs forward progress in the sciences.
4) And it comes up more casually. A query came on Friendfeed from Rothman about how universities were tracking faculty activity, specifically classes taught. It’s something that individual faculty members are expected to know and track, but not something we’re capturing well at an institutional level. Here is another way for universities to compare themselves to each other, along with the scholarly journals and books produced.
But a lot of this came out of the AAAS meeting that I attended in late February. There, Data Science was the word and the way. It’s being hailed as a new type of science and a new way of working, and repeatedly I heard people say that there was a need for training to handle data: gathering, preserving, archiving, and providing access. This is not necessarily being perceived as a library role and that, to me, is a shame. It seems like an ideal place for new growth of services for those with MLS degrees, even if it means shifting some of the focus of traditional academic libraries.
Data and data science will be a huge leap into a deep and dark new pond, though that is nothing new. Many librarians, myself included, have ended up working in disciplines that we do not fully understand and have had to start treading water very quickly. LPL is probably my best example. When I took over the chapter book collection, I could speak at best only to what I had read as a child. I had no formal children’s literature training, merely a childhood of incredibly avid reading. After taking it on, though, I immersed myself in reviews, in the collection, and in blogs. I learned rapidly what my community was seeking and how I might nudge them out of their comfort zones. I bought books that one day may be challenged because they weren’t mainstream pulp, though a bookshelf of Daisy Meadows books will attest that I bought those too.
My point is that right now I think it makes sense to grab librarians to start at the foundation of data science, pick out what we’re doing best and run with it.
In my mind, data science will blend the best of the sciences and the humanities, add a heavy dose of computer science, a healthy helping of librarianship, and statistics to taste. Researchers are going to give us data; increasingly, obtaining data is not the question. Whether it be from the NIH Mandate, the NSF Data Management Plans, or just the directions science is generally moving, researchers are thinking more and more about data. An increase in globalized research and the potential for credit for reuse of one’s data both push us towards openness. We will need computer scientists and programmers to help us build a structure to contain it. They are the architects, and I hope they win the turf war over who has to store all of this, so that they can manage the conversion to new systems as we go forward as well as help with backward access. Librarians can be tapped as the interior designers and organizers. We will, not unlike Hoarders, weed through to judge which items are most valuable for the future. We can help researchers know where the data is.
Collaboration is going to be huge. We’ll need not only servers on which to store these potential petabytes of data, but people who can speak the languages, translating between scientific fields. We’ll want data scientists who can look at data from one field, see an application in another field, and find people in that second or third field who can turn the data into something meaningful. A large consortium of research institutions feeding into and supporting a centralized data science center could take the pressure off of individual universities, and I think it is possible. It would provide a supermarket of data ingredients to be infinitely combined to create new recipes and research. We’re in the infancy of data science but, like early childhood, it’s not going to last.
Libraries and institutions will be examining, creating, and curating local data. Dorothea Salo gave an interesting presentation on Turning Collection Development Inside Out. The idea is that libraries will not merely be scanning the pool of available data and bringing it to their patrons; they will be scanning what is local and bringing it to the world. This, particularly, allows us to focus on the humanities. One of the themes I heard at ALA this summer was that we’re supposed to be becoming places where patrons can come to create and develop, rather than just gather information. Will we also offer those creators our services as data curators? Are local public libraries prepared to handle that? Will we outsource to a new vendor who offers us a data storage option, or grow our own in-house? How will those be shared?
Oh, certainly some will tell you I’m coming late to the game. All the library journals are scrambling to put out special editions on data curation, data management, etc. But I can’t think I’m alone in trying to figure out what is coming next. So, the plan is to have blog posts about Data on Fridays. These will probably be long posts while I try to wrap my head around this, but I’m hoping that by pulling things together once a week I might start to have slightly fewer tabs open in Firefox with “things I want to talk to people about.” Maybe….