Buy, Reuse or Build ETL Software?

While talking to someone today he mentioned a concern about my team's "homegrown" software: that it would nickle & dime us to death compared to "more robust commercial software". I respected this guy - he was very bright and had a lot of successes under his belt. But I also felt that he was both echoing a common corporate perception, and was quite wrong.

I've run into this notion so often that I now plan for it: in the minds of many commercial software has more credibility than open source software, which in turn has more credibility than custom-built software. And since these perceptions are often held by those that control my budget - perceptions matter.

In the past I've tried to address this perception by doing my homework - careful analysis, benchmarking and testing of multiple solutions, reviews by industry experts, etc. In the end none of it mattered because this perception was rooted in something emotional or unconscious. I'm now long past the naive notion of "using the best tool for the job". These days that determination is quite a bit more nuanced.

The particular piece of software that my colleague was questioning was our ETL solution - the extract, transform, load process that manages the integration assembly line between operational source systems and the destination data warehouse. There are quite a few commercial and a few open source frameworks to manage this kind of a process. However, I chose a custom-built python library approach instead. This is a risky approach if you're not extremely familiar with ETL processing - the performance, reliability, manageability, and data quality challenges are substantial when you're talking about 150 million rows a day across a diverse set of data types and dozens of different sources.

And the industry is of no help here - there are no books I am aware of that describe how to build an ETL solution from scratch, there are no established patterns to work from, and accepted "best practices" simply involve purchasing commercial software. But ETL is hardly rocket science. It may be complex, and it may be non-intuitive, but it's definitely a manageable challenge - if you are already familiar with the domain. Without this hard-won prior experience the likelihood of success is extremely small.

Beyond the lack of ready information for building you own ETL solution there are quite a few arguments in favor of using a pre-built solution:
  • It may have extract adapters already built for some of your sources
  • There may be no purchase costs involved - through either open source or a site license agreement
  • You may lack the skills to build your own
  • You may already have skill with a tool
  • You may need the bulk of the solution in place ASAP
But even if we give the commercial/opens source solutions the benefit of the doubt - there's still quite a few scenarios in which none of the above apply. And at this point the next big question is "which is more productive: a framework or a library?" The available solutions are all frameworks which generally do a great job of the easy 80% of the work. Unfortunately, the very consistency that they enforce gets in the way for that last 20% - where you have to deploy extract programs on remote networks or on obscure operating systems; where you have to handle vast data volumes, or perform incredibly complex transformations. It's there - in the hardest 20% that the custom solution really excels.

And in terms of build-speed the commercial & open source solutions lose out every time. Back in the mid-nineties analysts discovered that COBOL programmers were more productive than developers using popular ETL products. With today's far more productive languages like python that gap has only increased. Some argue that the real benefit of the commercial and open source tools is in maintenance - where metadata repositories are powerful tools for impact analysis. But I believe this is only true for the simplest 80% of the work. That crazy last 20% is usually far, far more complex in the framework than it would be with a general purpose language and set of libraries.

So, count me as an advocate of building your own ETL solution in certain circumstances. And if we had more established patterns open source libraries for ETL - then I'd be an advocate for this approach in almost all circumstances.

No comments:

Post a Comment