Living with Data

2011-03-18

Data Warehouse ETL for Data Scientists

At Strata I attended a discussion panel in which a number of speakers described the various types of work involved in data science: data scrubbing, data analysis, presentation, etc. The general consensus was that data scrubbing was the most time-consuming task in data science. I've also found this to be true on data warehousing and data mining projects, so no surprise here.

The most interesting part of the discussion was when an audience member asked if the panel could recommend any tools to help with the data scrubbing. The answer was "no".

I spoke with the panel members afterwards and found that they were completely unfamiliar with data warehousing and, of course, ETL. So, it appears, is the author of "Data Analysis with Open Source Tools". So is just about everyone I've met working in this field.

Of course, this is mostly because data warehousing came from the database community, while the new interest in data analysis has come from the programmer community. There's certainly no problem in having a different community re-explore this space and possibly find new and better solutions. The problem is that the more likely scenario is a vast number of projects that fail because of the performance, data quality, or maintenance costs that come with solving this problem poorly.

Analysis and the 'So What' Question

While at Strata I had an opportunity to participate in quite a few sessions that demonstrated how to take raw data and analyze it with various tools. The output was usually a set of graphs, charts, etc., though sometimes just simple tables. All of this was useful for getting a sense of how the tools work, but what was missing was the final step in the analysis - a powerful insight or understanding that one could use to make an intelligent change to a process. Generally, the presentation technique was fine and the tools were great, but the demonstrated impact of the tools was trivial.

One reason for this is that some of the presenters may have to hold back their most significant discoveries until the right time - and this just wasn't that time, or this wasn't the right audience. I can understand this, since most of my best analysis can't really be shown without getting NDAs and other agreements in place first. Another reason is that the presenters might have wanted to focus on the tool and not on the data or business being studied, which just serves as a necessary example to work on. But this is misguided, since delivering insights is the bottom line - not delivering pretty pictures. The last reason I can imagine is that delivering powerful insights is hard, and while these presenters are working on it they may not yet have a suitable example. I think that this is the most likely answer.

My concern is that people spend a lot of time building gorgeous but empty-headed analytical solutions that just don't have much to say. This is pretty similar to the chart junk problem that Edward Tufte complains about. To make this a little more clear I've included a few examples below.

Breadth of Data vs Depth of Analysis

One of the things that I felt was missing from O'Reilly's Strata Conference was a nuanced sense of the trade-offs between complex analysis and vast volumes of data. There is a trade-off, and I've seen it play out consistently. It works like this: where do you spend your investment?

deep analysis - with unpredictable costs and benefits
broad sets of data - with predictable (high) costs and benefits

Buy, Reuse or Build ETL Software?

While I was talking to someone today, he mentioned a concern about my team's "homegrown" software: that it would nickel-and-dime us to death compared to "more robust commercial software". I respected this guy - he was very bright and had a lot of successes under his belt. But I also felt that he was echoing a common corporate perception, and that he was quite wrong.

I've run into this notion so often that I now plan for it: in the minds of many, commercial software has more credibility than open source software, which in turn has more credibility than custom-built software. And since these perceptions are often held by those who control my budget - perceptions matter.

Mashups vs Data Warehouses

Mashups have come from the application side of IT, warehouses from the data side. They overlap quite a bit - but there's not a lot of thought about how to leverage the best of both worlds.

Parallel Database and Hadoop Costs

In all the hype around Hadoop, and maybe the "micro-hype" around parallel databases, it's pretty easy to find exciting anecdotes to support these architectures: numbers of nodes in a cluster, speed to calculate or move data, etc. Finding the costs is much more difficult - and without the costs, how does someone make a decision on the merits of the solution?

How absolute are absolutes?

When modeling a system that needs to last a long time or be used by a very large number of users, you can run into problems with absolutes. It turns out that very few of them aren't subject to occasional change or otherwise open to interpretation:

weight - the kilogram is based on a reference original as well as forty copies. However, simply touching or dropping one can make subtle differences in weight - and in 2009 it was discovered that none of these weigh exactly the same amount - each is off by a handful of micrograms. Some of these have grown steadily lighter over time.
length - the meter was initially defined in Paris in 1790 as the length of a pendulum with a half period of one second. In 1791 it was changed to be one ten-millionth of the earth's meridian along a quadrant running through Paris. In 1793 the previous definition was discovered to be long by one fifth of a millimeter. Between 1795 and 1799 two reference bars were created (first brass, then platinum). Then between 1889 and 2002 the definition was changed five times: to consider temperature, gravity, and atmosphere, then to be based on the wavelength of light emitted by an atom, and then on the distance light travels in a fraction of a second.
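
One way to live with shifting absolutes in a data model is to treat a unit's definition as something with an effective date, rather than as a constant. The following is only a minimal sketch of that idea in Python, not a recommendation of any particular tool: the class names (UnitDefinition, DefinitionHistory) and the method in_effect_on are hypothetical, the years come from the list above, and the month and day values are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class UnitDefinition:
    """One historical definition of a unit, valid from a given date."""
    unit: str
    valid_from: date
    description: str


class DefinitionHistory:
    """Keeps the definitions of a unit ordered by the date they took effect."""

    def __init__(self, definitions: List[UnitDefinition]):
        self._definitions = sorted(definitions, key=lambda d: d.valid_from)

    def in_effect_on(self, when: date) -> Optional[UnitDefinition]:
        """Return the latest definition that took effect on or before `when`."""
        current = None
        for definition in self._definitions:
            if definition.valid_from <= when:
                current = definition
            else:
                break
        return current


# Hypothetical, simplified entries: only the years are taken from the text;
# January 1st is a placeholder for the unknown month and day.
meter_history = DefinitionHistory([
    UnitDefinition("meter", date(1790, 1, 1),
                   "length of a pendulum with a half period of one second"),
    UnitDefinition("meter", date(1791, 1, 1),
                   "fraction of a meridian quadrant running through Paris"),
])

if __name__ == "__main__":
    # Look up which definition applied to a measurement taken on a given day.
    found = meter_history.in_effect_on(date(1790, 6, 1))
    print(found.description if found else "no definition on record")
```

A measurement stored alongside its observation date can then be interpreted against the definition that applied at the time, instead of silently assuming today's definition.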

What's Missing From Data Science? or A Partial Review of Data Analysis with Open Source Tools

I just got my copy of Data Analysis with Open Source Tools - by Philipp K. Janert, published this year by O'Reilly.

I'm very excited about getting this book since it addresses my weak area in data science and business intelligence: the math. The structure of the book is great, and the products that he covers are perfect - NumPy, matplotlib, R, gnuplot, etc. His writing style is fine. I'm looking forward to really digging into this more than any technical book I've come across in a few years.

As I slowly work through this book I'll analyze the content more thoroughly. But in the meantime, I've spot-checked his coverage of reporting, business intelligence and data. The gaps that I've found between the state of these domains and what the book describes are completely consistent with what I've generally found in data science writing: a vast lack of understanding of BI.

Premature Optimization of Analytical Architectures

I've been noticing a pattern over the last year - more developers talking about technology optimized for millions of users, and fewer developers talking about technology optimized for hundreds or thousands of users.

Since most engineering decisions are a matter of compromise, a billion-user requirement results in a loss somewhere else. Maybe data quality, maybe software maturity, maybe functionality, maybe hosting costs. And some of these compromises may interfere with the project ever getting to 10,000 users.

2011-03-18

Data Warehouse ETL for Data Scientists

2011-02-14

Analysis and the 'So What' Question

2011-02-10

Breadth of Data vs Depth of Analysis

2011-01-28

Buy, Reuse or Build ETL Software?

2010-12-26

Mashups vs Data Warehouses

2010-12-23

Parallel Database and Hadoop Costs

2010-12-21

How absolute are absolutes?

What's Missing From Data Science? or A Partial Review of Data Analysis with Open Source Tools

Premature Optimization of Analytical Architectures