When big data and complex workflows meet the reality of finite data storage: a discussion of best practices for data management

"tapes, backup" CC-BY-SA 2.0 by Martin Abblegen via flickr

Scientific workflows: for many of us, it’s a love/hate relationship. We love that they help us keep our stuff organized, but hate the overhead required to maintain them. And then when we find out that our meticulously maintained workflow hasn’t captured some important detail? Oh, the frustration!

This discussion will be broadly about managing scientific workflows. I hope to hear from everyone about the tools and tricks you use to keep track of which outputs came from which inputs, models, and parameters, and which figures, papers, and projects all of those things are connected to. It would be great to hear about a wide range of strategies, from how you organize and name your files to how you’ve implemented a workflow management tool like Kepler.

I also hope that we can brainstorm solutions to workflow management problems people may be facing, so if you have a workflow-related issue or question that you’d like to get input on, please let me know. I’ll make sure you get a few minutes to describe your problem or question so that you can get ideas from the crowd.

And if you’re reading this and thinking “I’m a workflow management pro, and don’t need any help with or ideas for managing my workflow,” then please come to the discussion! We (well, at least I) need your help. I have a homegrown scripted workflow management system for the text analyses I do. It does a great job of capturing a lot of details and documenting relationships between inputs and outputs, but it requires me to purge unused outputs (e.g., outputs for all but selected runs of a model) manually. How do the rest of you keep track of which files you can throw away down the line and which need to be kept indefinitely? I need to downsize my data storage and am a little worried about making mistakes when I do this manually, so I would love to hear ideas about how to build functions like this into my system.
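To make the purge question concrete, here is a minimal Python sketch of one possible approach. It is not a description of any existing system. It assumes a hypothetical `manifest.json` that lists each model run, the output files it produced (as paths relative to an output directory), and a hypothetical `keep` flag set on the runs worth retaining. The function names and the manifest layout are illustrative assumptions only.

```python
import json
from pathlib import Path


def find_purgeable(manifest_path, output_dir):
    """List tracked output files that belong only to runs not marked keep.

    Assumed (hypothetical) manifest layout:
    {"runs": [{"id": "run1", "keep": true, "outputs": ["a.csv"]}, ...]}
    """
    manifest = json.loads(Path(manifest_path).read_text())
    kept, discardable = set(), set()
    for run in manifest["runs"]:
        # A run without a "keep" flag is treated as discardable.
        target = kept if run.get("keep", False) else discardable
        target.update(run["outputs"])
    output_dir = Path(output_dir)
    # A file shared with any kept run is never a candidate, and files
    # that the manifest does not mention are left alone entirely.
    return sorted(
        output_dir / rel
        for rel in discardable - kept
        if (output_dir / rel).is_file()
    )


def purge(manifest_path, output_dir, dry_run=True):
    """Delete purgeable files; with dry_run=True, only report them."""
    candidates = find_purgeable(manifest_path, output_dir)
    for path in candidates:
        if dry_run:
            print(f"would delete: {path}")
        else:
            path.unlink()
    return candidates
```

Calling `purge("manifest.json", "outputs")` only prints the candidates, so the list can be reviewed before rerunning with `dry_run=False`. Untracked files are deliberately never touched in this sketch, on the assumption that anything the manifest does not know about should be reviewed by hand.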

Hope to see you all for a fun discussion!