When big data and complex workflows meet the reality of finite data storage: a discussion of best practices for data management

"tapes, backup" CC-BY-SA 2.0 by Martin Abblegen via flickr


Scientific workflows — for many of us, it’s a love/hate relationship. We love the fact that they help us keep our stuff organized, but hate the overhead required to maintain them. And then when we find out that our meticulously maintained workflow hasn’t captured some important detail? Oh the frustration!!

This discussion will be broadly about managing scientific workflows. I hope to hear from everyone about the tools and tricks you use to keep track of which outputs go with which inputs, models, and parameters, and which figures, papers, and projects all of those things are connected to. It would be great to hear about a wide range of strategies, from how you organize and name your files to how you’ve implemented a workflow management tool like Kepler.

I also hope that we can spin up ideas for workflow management problems people may be facing, so if you have a workflow-related issue or question that you’d like to get input on, please let me know. I’ll make sure you get a few minutes to describe your problem or question so that you can get ideas from the crowd.

And if you’re reading this and thinking “I’m a workflow management pro, and don’t need any help with or ideas for managing my workflow,” then please come to the discussion! We (well, at least I) need your help. I have a homegrown scripted workflow management system for the text analyses I do. It does a great job of capturing a lot of details and documenting relationships between inputs and outputs, but it requires me to purge unused outputs (e.g., outputs for all but selected runs of a model) manually. How do the rest of you keep track of which files you can throw away down the line and which need to be kept indefinitely? I need to downsize my data storage and am a little worried about making mistakes when I do this manually, so I would love to hear ideas about how to build functions like this into my system.
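To make the purge question concrete, here is a minimal Python sketch of one possible approach. It assumes each model run is recorded together with the files it produced and a flag saying whether anything downstream (a figure, paper, or project) depends on it. The `Run` record, the `plan_purge` and `purge` functions, and the file names are all hypothetical illustrations, not taken from any existing system.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Run:
    """One model run and the output files it produced (hypothetical layout)."""
    run_id: str
    outputs: list[str] = field(default_factory=list)
    keep: bool = False  # True when a figure, paper, or project depends on this run


def plan_purge(runs):
    """Return sorted output paths that belong only to runs not marked keep."""
    # Files referenced by any kept run are protected, even if an
    # unselected run also lists them (e.g. a shared intermediate file).
    protected = {p for r in runs if r.keep for p in r.outputs}
    candidates = {p for r in runs if not r.keep for p in r.outputs}
    return sorted(candidates - protected)


def purge(paths, dry_run=True):
    """Delete existing files in paths; with dry_run=True, only report them."""
    affected = []
    for p in paths:
        path = Path(p)
        if not path.exists():
            continue  # already gone, nothing to do
        if not dry_run:
            path.unlink()
        affected.append(p)
    return affected


# Usage with made-up file names: run01 was selected, run02 was not.
runs = [
    Run("run01", ["out/run01_topics.csv", "out/shared_vocab.txt"], keep=True),
    Run("run02", ["out/run02_topics.csv", "out/shared_vocab.txt"]),
]
print(plan_purge(runs))  # ['out/run02_topics.csv']
```

Separating the planning step from the deletion step, and defaulting to a dry run, means the list of files can be reviewed before anything is actually removed, which speaks to the worry about manual mistakes.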

Hope to see you all for a fun discussion!

Discussion: Obstacles faced by researchers who reuse, share and manage data, and strategies for overcoming them

Roundtable discussion for Wednesday, 17 Sept 2014
All too often it's an uphill battle for researchers who want to do the right thing. Photo Credit: Steve Garvie from Dunfermline, Fife, Scotland (Uphill struggle!) [CC-BY-SA-2.0], via Wikimedia Commons





Why is it often harder than it should be to do the right thing when it comes to data management, sharing and reuse? I will introduce seven common sources of conflict that present obstacles to researchers who work with data. These seven sources of conflict were identified through qualitative analysis of transcripts of interviews and focus groups involving more than 35 researchers.


Following a brief introduction of these sources of conflict and the resulting obstacles, we will discuss potential strategies for minimizing or overcoming them. Our conversation will focus on the following guiding questions:
  • For each source of conflict, what should be done to make it easier for researchers to do the right thing?
  • What can research centers like NCEAS do to prepare researchers and/or to improve the status quo?
  • What can I, as an individual researcher, do to avoid and/or prepare for potential obstacles, and to improve the status quo?

You may also be interested in checking out short stories based on some of the interviews: http://notebooks.dataone.org/data-stories/

I look forward to discussing ideas with you on Wednesday!


Big Data and the Future for Ecology

Great to see such a nice turnout for the Roundtable discussion on big data and ecology! I’m posting a slide set from the talk (ESA_Hampton_2012_public), but as you probably noticed, I don’t put much text on my slides, so it probably won’t make much sense to you if you didn’t see the talk! Feel free to email me if you want any details not included here. We are revising a paper on this topic for resubmission to Frontiers in Ecology and the Environment. Frontiers has a liberal copyright policy, so if the paper is accepted, please rest assured that the other authors and I will make it available on our websites.

I took out the cute xkcd images, but you can enjoy as many as you like by checking out xkcd.com yourself!

What?! You’re still reading this post after checking out xkcd.com? I doubt it, but if you are, then…

Here are some papers I mentioned:

Borer, E. T., et al. 2009. Some Simple Guidelines for Effective Data Management. Bulletin of the Ecological Society of America 90(2):205–214.

Heidorn, P. B. 2008. Shedding Light on the Dark Data in the Long Tail of Science. Library Trends 57:280–299.

Aronova, E., K. S. Baker, and N. Oreskes. 2010. Big Science and Big Data in Biology: From the International Geophysical Year through the International Biological Program to the Long Term Ecological Research (LTER) Network, 1957–Present. Historical Studies in the Natural Sciences 40:183–224.

Shaun shared a blog post that describes the 3 V’s of ‘big data’ – volume, velocity, and variety.