I've run into this notion so often that I now plan for it: in the minds of many, commercial software has more credibility than open source software, which in turn has more credibility than custom-built software. And since these perceptions are often held by those who control my budget, perceptions matter.
In the past I've tried to address this perception by doing my homework - careful analysis, benchmarking and testing of multiple solutions, reviews by industry experts, etc. In the end none of it mattered because this perception was rooted in something emotional or unconscious. I'm now long past the naive notion of "using the best tool for the job". These days that determination is quite a bit more nuanced.
The particular piece of software that my colleague was questioning was our ETL solution - the extract, transform, load process that manages the integration assembly line between operational source systems and the destination data warehouse. There are quite a few commercial and a few open source frameworks to manage this kind of process. However, I chose a custom-built Python library approach instead. This is a risky approach if you're not extremely familiar with ETL processing - the performance, reliability, manageability, and data quality challenges are substantial when you're talking about 150 million rows a day across a diverse set of data types and dozens of different sources.
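To make the three stages concrete, here is a minimal sketch in Python. It is not the library described above - the function names, the CSV source, the "customer" and "amount" fields, and the plain list standing in for a warehouse table are all assumptions for illustration. It streams rows through generators so that nothing is held in memory longer than needed, which is one common way to structure this kind of pipeline.

```python
import csv
import io
from typing import Iterable, Iterator


def extract(source: Iterable[str]) -> Iterator[dict]:
    """Extract: stream rows from a CSV source one at a time."""
    yield from csv.DictReader(source)


def transform(rows: Iterable[dict]) -> Iterator[dict]:
    """Transform: normalize fields and drop rows that fail a basic quality check."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, ValueError, TypeError):
            continue  # reject rows with a missing or non-numeric amount
        customer = (row.get("customer") or "").strip().lower()
        yield {"customer": customer, "amount": amount}


def load(rows: Iterable[dict], target: list) -> int:
    """Load: append rows to the target (a list standing in for a warehouse table)."""
    count = 0
    for row in rows:
        target.append(row)
        count += 1
    return count


def run_pipeline(source: Iterable[str], target: list) -> int:
    """Chain the three stages and return the number of rows loaded."""
    return load(transform(extract(source)), target)


if __name__ == "__main__":
    # Hypothetical sample data: the second row has a bad amount and is rejected.
    sample = io.StringIO("customer,amount\n Acme ,10.5\nBeta,not-a-number\nGamma,3\n")
    warehouse = []
    print(run_pipeline(sample, warehouse), "rows loaded")
```

A real solution at the volumes mentioned above would also need logging of rejected rows, restartability, and bulk loading, none of which this sketch attempts.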
And the industry is of no help here - there are no books I am aware of that describe how to build an ETL solution from scratch, there are no established patterns to work from, and accepted "best practices" simply involve purchasing commercial software. But ETL is hardly rocket science. It may be complex, and it may be non-intuitive, but it's definitely a manageable challenge - if you are already familiar with the domain. Without this hard-won prior experience the likelihood of success is extremely small.
Beyond the lack of ready information for building your own ETL solution, there are quite a few arguments in favor of using a pre-built solution:
- It may have extract adapters already built for some of your sources
- There may be no purchase costs involved - through either open source or a site license agreement
- You may lack the skills to build your own
- You may already have skill with a tool
- You may need the bulk of the solution in place ASAP
But in terms of build speed, the commercial and open source solutions lose out every time. Back in the mid-nineties analysts discovered that COBOL programmers were more productive than developers using popular ETL products. With today's far more productive languages like Python, that gap has only increased. Some argue that the real benefit of the commercial and open source tools is in maintenance - where metadata repositories are powerful tools for impact analysis. But I believe this is only true for the simplest 80% of the work. That crazy last 20% is usually far, far more complex in the framework than it would be with a general-purpose language and set of libraries.
So, count me as an advocate of building your own ETL solution in certain circumstances. And if we had more established patterns and open source libraries for ETL, then I'd be an advocate for this approach in almost all circumstances.