The most interesting part of the discussion was when an audience member asked if the panel could recommend any tools to help with the data scrubbing. The answer was "no".
I spoke with the panel members afterwards and found that they were completely unfamiliar with data warehousing and, of course, ETL. So, it appears, is the author of "Data Analysis with Open Source Tools". So is just about everyone I've met working in this field.
Of course this is mostly just because data warehousing came from the database community while the new interest in data analysis has come from the programmer community. There's certainly no problem in having a different community re-explore this space and possibly find new and better solutions. The problem is that the more likely scenario is a vast number of projects that fail because of the performance, data quality, or maintenance costs associated with solving this problem poorly.
- (E)xtract, (T)ransform, (L)oad - a very structured process for replicating & remapping data into a data warehouse from external sources
- Capable of change detection
- Capable of remapping, transforming, and looking up data
- Capable of validating & scrubbing data
- Batch-file oriented (though some support micro-batches)
- Tools tend to be complex and expensive. Open source alternatives appear to be simplistic, reserving essential functionality for commercial versions.
- Tools tend to be specialty software that most good programmers view as a career dead-end.
- Tools are optimized to organize and manage a vast number of simple processes - they're overkill for a small number of simple processes, and hard to use for really difficult transforms. Bill Inmon ("father of data warehousing") recently said that if your ETL tool can handle 95% of your needs you're getting your money's worth. So, you should assume that ETL products are an incomplete solution.
- The ELT variant actually performs transformations within the database (sometimes using stored procedures). I never recommend this as databases are expensive to scale up and these solutions are not very scalable.
- Data Warehouse advocates always recommend the use of ETL tools (vs custom software) - but perhaps this has more to do with their backgrounds as "data guys" who don't write code than anything else. Note that back in the mid-90s quite a few people thought that you wouldn't have to be a programmer to use these tools - that business analysts would do it.
- The graphical visio-like front-ends on many tools reflect the lingering notion that ETL should be performed by non-programmers (and also help to sell the tools to management). However, in spite of the lack of evidence that a colors & shapes based graphical front end is more effective than a traditional programming interface we still see the tool vendors pushing it. Note that it would be like building all server processes using a Rational-like CASE tool and insisting that all web pages were made with Microsoft FrontPage.
- The activity tends to be completely distinct from data analysis tools. I think this is due to the nature of data warehouse projects - although it's been known since the very beginning that data warehouse projects should use iterative methods, they tend instead to use waterfall methods. I think this is due to the upfront costs and the difficulty of securing millions of dollars to pursue the project, resulting in a need to quantify the benefits long ahead of time. As the costs continue to grow the tools also grow more expensive and specialized. On most data warehouse projects the ETL developers and report developers are completely different people with different specialized skills.
- I've never encountered a book on ETL design patterns - but one is long overdue. The advent of higher-level languages has made the development of custom ETL solutions extremely practical. However, failure to follow recognized approaches usually leads to failure.
- The one thing that's really good about commercial tools is that for a team that's new to the activity the tooling may nudge them in the right technical direction. Maybe. And it's no silver bullet.
- In my opinion, unless an ETL solution is completely non-commercial there's no real value in it being "open source": one vendor I spoke with had two options, the free one and the commercial one - and there was no way I could implement anything significant without the commercial license. The notion that their product is "non-proprietary" just seems to be a gimmick.
- Additionally, at least right now the open source contenders are new and offer mostly light-weight ETL solutions at low cost. Just the thing for small projects, but non-optimal for most large projects.
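To make the capability list above concrete, here's a minimal sketch of change detection (the delta-processing capability mentioned above) in Python. The field names and feed layout are hypothetical, and a real solution would also persist the hash table between runs:

```python
import hashlib

def row_hash(row):
    """Stable hash of a record's values, used to detect changed rows."""
    joined = "\x1f".join(str(v) for v in row.values())
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def detect_changes(prev_hashes, feed_rows, key_field):
    """Compare today's feed against the prior run's {key: hash} map.
    Returns (inserts, updates, unchanged_count)."""
    inserts, updates, unchanged = [], [], 0
    for row in feed_rows:
        key = row[key_field]
        if key not in prev_hashes:
            inserts.append(row)            # brand-new record
        elif prev_hashes[key] != row_hash(row):
            updates.append(row)            # existing record, changed values
        else:
            unchanged += 1                 # no change - skip downstream work
    return inserts, updates, unchanged
```

The payoff is that downstream transform and load steps only ever see the rows that actually changed, which matters once feeds grow large.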
ETL Tools (small sample set)
- IBM's DataStage (was Torrent) - tons of parallelism, very fast
- Informatica - easy to use
- Ab Initio - tons of parallelism, very fast
- Pentaho's Kettle - cheap, immature
- Prism - one of the earliest tools; may have ended up dead at IBM (via Ardent, then Informix, then Ascential)
- ETI - may have ended up dead at IBM
- Oracle's Warehouse Builder (was Carleton) - ok
- Microsoft's DTS (Data Transformation Services) - horrific, and fortunately dead
- Microsoft's SSIS (SQL Server Integration Services) - if you want to stay on windows
- Talend - inexpensive, modest tool
In spite of twenty years of work in this space and a vast amount of literature describing the dangers of getting ETL wrong, the projects I see that get this right are a distinct minority. So, here are the top ways I've seen projects drive the ETL bus into the ditch:
- Using replication & stored procedures - replication (operationally fragile, may require a backup sync tool, poor scalability) combined with stored procedures (mostly terrible languages, mostly poor performance at volume, too much work pushed into a centralized resource) creates a perfect storm of nastiness.
- Combining extract, transform and load into a single program - there are occasional needs to combine some of these (like when handling concurrent delta processing), but mixing these together typically just results in a very complex and hard to test code base that doesn't perform or distribute well.
- Using the database for transformations - transformations can be performance-intensive, databases can be expensive to scale. Don't mix the two.
- Very object-oriented, real-time ETL - programmers without any experience on a data warehouse usually go this route (and also combine all activities into a single program). They don't understand that the real-time code is hard to scale, doesn't lend itself to use as a recovery tool, etc.
- Up-front $250,000 cost - most projects with a high up-front cost never make it past the feasibility study phase. This just increases the chances of being deferred and delayed.
- Up-front political battles over tool - many enterprises have defined approved tools, and any tool purchase will involve either an easy rubber stamp (perhaps for an inappropriate tool) or a battle. Battles up-front can completely delay & derail the project.
- Failing to manage data quality - poor data quality is one of the principal reasons for analytical project failure. Given a large number of feeds, each with its own processing rules and dozens of fields with specific data quality rules - and most (in my experience) coming from undocumented and sloppy systems - DQ starts out horrible and gets worse from there. Failing to plan for this and build the right kind of features to aid in DQ manageability can really increase the chance of project failure.
- Buying a solution - you can't "buy" ETL - you can at best buy an ETL tool. There's no silver bullet here: ETL is simply hard. Purchasing a tool may help (especially for an experienced team), but in general if it's powerful it's complex, and if it's easy to use it's simplistic. The tool doesn't do the analysis & design (the hardest parts), nor does it prevent the team from designing baroque catastrophes.
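As an illustration of the data-quality point above, here's a minimal sketch of declarative per-field DQ rules in Python, assuming records arrive as dicts. The field names and rules are hypothetical; the point is that bad records get quarantined with reasons rather than silently loaded:

```python
# Hypothetical per-feed rule table: field name -> validation predicate.
RULES = {
    "customer_id": lambda v: v.isdigit(),
    "email":       lambda v: "@" in v,
    "amount":      lambda v: v.replace(".", "", 1).isdigit(),
}

def validate(row, rules=RULES):
    """Return (is_valid, list of failed field names) for one record."""
    failures = [f for f, ok in rules.items()
                if f in row and not ok(str(row[f]))]
    return (not failures, failures)

def split_feed(rows, rules=RULES):
    """Route records to a good pile and a reject pile with reasons,
    so bad data can be audited and reported instead of silently loaded."""
    good, rejects = [], []
    for row in rows:
        ok, failed = validate(row, rules)
        if ok:
            good.append(row)
        else:
            rejects.append((row, failed))
    return good, rejects
```

Keeping the rules as data (rather than burying them in transform code) also makes it cheap to report DQ trends per feed over time.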
ETL Best Practices (IMHO)
- Assembly-line metaphor
- Separate extract, load, transform processes
- Autonomous and independent processes rather than long sequences of complex workflows
- Ability to handle duplicates
- Process & rule auditing
- Ultimate commodity interfaces - flat files
- Minimal coupling - should be easy to work on multiple feeds simultaneously, run an extract on another network without connectivity to rest of system, etc.
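The practices above can be sketched as three autonomous steps that communicate only through flat files, so each can be run, tested, and re-run independently. This is a minimal illustration, not a real tool; the file layout and fields are hypothetical:

```python
import csv

def extract(source_rows, out_path):
    """Extract: dump source records to a flat file - nothing more."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name"])
        writer.writeheader()
        writer.writerows(source_rows)

def transform(in_path, out_path):
    """Transform: read the extract file, remap, write a new flat file."""
    with open(in_path, newline="") as f:
        rows = [{"id": r["id"], "name": r["name"].strip().upper()}
                for r in csv.DictReader(f)]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name"])
        writer.writeheader()
        writer.writerows(rows)

def load(in_path, target):
    """Load: append the transformed file into the target store
    (a plain list here, standing in for a database loader)."""
    with open(in_path, newline="") as f:
        target.extend(csv.DictReader(f))
```

Because each step's only interface is a file, an extract can run on another network with no connectivity to the rest of the system, and a failed transform can be re-run without re-extracting.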
ETL Relevance to Data Science
- Data science doesn't need to improve its "Data Scrubbing" activity - it needs to completely throw it out and adopt the far more mature ETL best practices.
- Except...the graphics-driven commercial tooling may not be the best fit for a fast-moving team looking at a lot of different sets of data.
- So the idea applies, even if the current crop of tools aren't quite perfect.
So, What Does the Ideal Solution Look Like for Data Science Applications?
It's easy to describe the typical data warehouse project: it usually has 1-2 data feeds for metrics and another 4-8 feeds for reference data, but could have 20+ metric feeds and 20-30+ reference data feeds. The typical costs and cross-departmental nature of these projects, and perhaps the inflexibility of relational databases, encourage waterfall methods and an approach to data feeds that separates up-front analysis, feed development, and data analysis.
Data science projects on the other hand are different by degree in four critical ways that affect the choice of ETL product:
- Less specialization - everyone is primarily a programmer - there are few DBAs, reporting specialists, etc. So the division between ETL & reporting developers is not so clear-cut.
- Less fear of programming - again, everyone is a programmer, though they may also specialize in some other software (hadoop, cassandra, etc) or skills (stats, machine learning, etc). This means there's no need for a shapes & colors UI for ETL.
- Lower upfront costs - many of the projects are start-ups, and enough of the tools (and data) are freely available that the incremental costs do not mount up nearly so quickly as in a data warehouse. Because of this the projects are more likely to be funded out of discretionary money in the corporate world, but they're also more sensitive to even modest ETL licensing costs.
- More feeds - far more data is available today than was available ten years ago, and I think this explosion of both public & private data is still in its infancy. The result is both an opportunity and a need to rapidly analyze and transform many more feeds than we considered ten years ago on data warehouse projects.
With this in mind, I think many data science projects may find that the ideal ETL solution has these characteristics:
- Free, at least at first.
- Trivial to install
- Trivial to learn
- Trivial to extend
- Easy to run on a desktop and then migrate to dedicated servers
- Libraries & utilities rather than frameworks
- Designed for programmers, not analysts
- Automatically acquires & reuses metadata
- Allows an incremental approach between transform/validate <-> analysis
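A minimal sketch of the "libraries & utilities rather than frameworks" idea above: small composable transform functions a programmer can chain interactively, moving back and forth between transformation and analysis. All names here are hypothetical:

```python
def pipe(rows, *steps):
    """Apply a series of row-level transform steps to a list of dicts."""
    for step in steps:
        rows = [step(r) for r in rows]
    return rows

def rename(mapping):
    """Step factory: rename fields per an {old: new} mapping."""
    return lambda r: {mapping.get(k, k): v for k, v in r.items()}

def coerce(field, func):
    """Step factory: apply func to one field (e.g. str -> float)."""
    def step(r):
        r = dict(r)                 # copy so the input rows stay untouched
        r[field] = func(r[field])
        return r
    return step

# Interactive use: reshape a raw feed, look at it, add another step, repeat.
# rows = pipe(raw_rows, rename({"amt": "amount"}), coerce("amount", float))
```

Because each step is just a function, there's nothing to install or configure - which is exactly the "trivial to learn, trivial to extend" property the list above asks for.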