2011-03-18

Data Warehouse ETL for Data Scientists

At Strata I attended a panel discussion in which a number of speakers described the various types of work involved in data science: data scrubbing, data analysis, presentation, etc.  The general consensus was that data scrubbing is the most time-consuming task in data science.  I've found the same to be true on data warehousing and data mining projects, so no surprise here.

The most interesting part of the discussion was when an audience member asked if the panel could recommend any tools to help with the data scrubbing.  The answer was "no".

I spoke with the panel members afterwards and found that they were completely unfamiliar with data warehousing and, of course, with ETL.  So, it appears, is the author of "Data Analysis with Open Source Tools".  So is just about everyone I've met working in this field.

Of course, this is mostly because data warehousing came from the database community, while the new interest in data analysis has come from the programmer community.  There's certainly no problem with a different community re-exploring this space and possibly finding new and better solutions.  The problem is that the more likely outcome is a large number of projects that fail because of the poor performance, poor data quality, or high maintenance costs that come from solving this problem badly.