Data Warehouse ETL for Data Scientists

At Strata I attended a discussion panel in which a number of speakers described the various types of work involved in data science:  data scrubbing, data analysis, presentation, etc.  The general consensus was that data scrubbing was the most time-consuming task in data science.   I've also found this to be true on data warehousing and data mining projects, so no surprise here.

The most interesting part of the discussion was when an audience member asked if the panel could recommend any tools to help with the data scrubbing.  The answer was "no".

I spoke with the panel members afterwards and found that they were completely unfamiliar with data warehousing and of course, ETL.  So, appears the author of "Data Analysis with Open Source Tools".  So, is just about everyone I've met working in this field.

Of course this is mostly just because data warehousing came from the database community and the new interest in data analysis has come from the programmer community.   There's certainly no problem in having a different community re-explore this space and possibly find new and better solutions.  The problem is that the more likely scenario is a vast number of failed projects that fail because of performance, data quality, or maintenance costs associated with solving this problem poorly.

ETL Defined
  • (E)xtract, (T)ransform, (L)oad - a very structured process for replicating & remapping data into a data warehouse from external sources
  • Capable of change detection
  • Capable of remapping, transforming, and looking up data data 
  • Capable of validating & scrubbing data
  • Batch-file oriented (though some support micro-batches) 

    ETL Characteristics
    • Tools tend to be complex and expensive.  Open source alternatives appear to be simplistic reserve essential functionality for commercial versions.
    • Tools tend to be specialty software that most good programmers view as a career dead-end.
    • Tools are optimized to organize and manage a vast amount of simple processes - are over kill for a small number of simple processes, and are hard to use for really hard transforms.   Bill Inmon ("father of data warehousing") recently said that if your ETL tool can handle 95% of your needs you're getting your money's worth.   So, you should assume that ETL products are an incomplete solution.
    • The ELT variant actually performs transformations within the database (sometimes using stored procedures).  I never recommend this as databases are expensive to scale up and these solutions are not very scalable.
    • Data Warehouse advocates always recommend the use of ETL tools (vs custom software) - but perhaps this has more to do with their backgrounds as "data guys" who don't write code than anything else.   Note that back in the mid-90s quite a few people thought that you wouldn't have to be a programmer to use these tools - that business analysts would do it.
    • The graphical visio-like front-ends on many tools reflect the lingering notion that ETL should be performed by non-programmers (and also help to sell the tools to management).  However, in spite of the lack of evidence that a colors & shapes based graphical front end is more effective than a traditional programming interface we still see the tool vendors pushing it.   Note that it would be like building all server processes using a Rational-like CASE tool and insisting that all web pages were made with Microsoft FrontPage.
    • The activity tends to be completely distinct from data analysis tools.  I think this is due to the nature of data warehouse projects - although it's been known since the very beginning that data warehouse projects should use iterative methods they tend instead to use waterfall methods.  I think this is due to the upfront costs, difficulties in securing millions of dollars to pursue the project, then resulting is a need to quantify the benefits long ahead of time.  As the costs continue to grow the tools also grow more expensive and specialized.  On most data warehouse projects the ETL developers and report developers are completely different people with different specialized skills.
    • I've never encountered a book on ETL design patterns - but one is long over due.  The advent of higher-level languages has made the development of custom ETL solutions extremely practical.  However, failure to follow recognized approaches usually leads to failure.
    • The one thing that's really good about commercial tools is that for a team that's new to the activity the tooling may nudge them in the right technical direction.  Maybe.  And it's no silver bullet. 
     Open Source vs Closed Source Options
    • In my opinion unless an ETL solution is completely non-commercial - then there's no real value to it being "open source":  one vendor I spoke with had two options:  the free one and the commercial one - and there was no way that I could really implement anything significant without the commercial license.  The notion that their product is "non-proprietary" just seems to be a gimmick.
    • Additionally, at least right now the open source contenders are new and offer mostly light-weight ETL solutions at low cost.  Just the thing for small projects, but non-optimal for most large projects.

    ETL Tools (small sample set)
    • IBM's Data Stage (was Torrent) - tons of parallelism, very fast
    • Ascential's Informatica - easy to use
    • Ab Initio - tons of parallelism, very fast
    • Pentaho's Kettle - cheap, immature
    • Prizm - earliest tool may have ended up dead at IBM (via Ardent, then Informix, then Ascential)
    • ETI - may have ended up dead at IBM
    • Oracle's Warehouse Builder (was Carleton)- ok
    • Microsft's DTS (Data Transformation Service) - horrific, and fortunately dead
    • Microsoft's  SSIS (SQL Server Integration Services) - if you want to stay on windows
    • Talend - inexpensive, modest tool
    ETL Anti-Patterns:

    In spite of twenty years of work in this space and a vast amount of literature that describes the dangers of getting ETL wrong the number of projects that I see that get this right are a distinct minority.   So, here's a the top ways I've seen projects drive the ETL bus into the ditch:
    • Using replication & stored procedures - replication (operationally fragile, may require backup sync tool, poor scalability) when combined with stored procedures (mostly terrible languages, mostly perform poorly at volume, push too much work into centralized resource) create a perfect storm of nastiness.
    • Combining extract, transform and load into a single program - there are occasional needs to combine some of these (like when handling concurrent delta processing), but mixing these together typically just results in a very complex and hard to test code base that doesn't perform or distribute well.
    • Using the database for transformations - transformations can be performance-intensive, databases can be expensive to scale.  Don't mix the two.
    • Very object-oriented, real-time ETL - programmers without any experience on a data warehouse usually go this route (and also combine all activities into a single program).  They don't understand that the real-time code is hard to scale, doesn't lend itself to use as a recovery tool, etc.
    • Up-front $250,000 cost - most projects with high up-front cost never make it past the feasibility study phase.  This just increases the changes of being deferred and delayed.
    • Up-front political battles over tool - many enterprises have defined approved tools, and any tool purchase will involve either an easy rubber stamp (perhaps for an inappropriate tool) or a battle.   Battles up-front can completely delay & derail the project.
    • Failing to manage data quality - poor data quality is one of the principle reasons for analytical project failure.   Given a large number of feeds, each having various rules for processing and dozens of fields with specific data quality rules, and most (in my experience) coming from undocumented and sloppy systems DQ starts out horrible and gets worse from there.  Failing to plan for this and build the right kind of features to aid in DQ manageability can really increase the chance of project failure.
    • Buying a solution - you can't "buy" ETL - you can at best buy an ETL tool.   There's no silver bullet in ETL, it's simply hard.  Purchasing a tool may help (especially an experienced team), but it's not a silver bullet.  In general: if it's powerful then it's complex, if it's easy to use then it's simplistic.  The tool doesn't do the analysis & design (the hardest parts), nor does it prevent the team from designing baroque catastrophes.

    ETL Best Practices (IMHO)
    • Assembly-line metaphor
    • Separate extract, load, transform processes
    • Autonomous and independent processes rather than long sequences of complex workflows
    • Self-Healing
    • Ability to handle duplicates
    • Process & rule auditing
    • Ultimate commodity interfaces - flat files
    • Minimal coupling - should be easy to work on multiple feeds simultaneously, run an extract on another network without connectivity to rest of system, etc.

    ETL Relevance to Data Science
    • Data science doesn't need to improve its "Data Scrubbing" activity - it needs to completely throw it out and adopt the far more mature ETL best practices.
    • Except...the graphics-driven, commercial-tooling may not be the best tool for a fast-moving team looking at a lot of different sets of data.   
    • So the idea applies, even if the current crop of tools aren't quite perfect.

    So, What Does the Ideal Solution Look Like for Data Science Applications? 

    It's easy to describe the typical data warehouse project:  usually have 1-2 data feeds for metrics and another 4-8 feeds for reference data, but could have 20+ metric feeds and 20-30+ reference data feeds.   The typical costs and cross-departmental nature of these projects, and perhaps the inflexibility of relational databases encourage waterfall methods and an approach to data feeds that involves a separation of up-front analysis, feed development, and data analysis.

    Data science projects on the other hand are different by degree in four critical ways that affect the choice of ETL product:
    • Less specialization - everyone is primarily a programmer - there are few DBAs, reporting specialists, etc.  So the division between ETL & Reporting developers are not so clear-cut.
    • Less fear of programming - again, everyone is a programmer though they may specialize in some other software (hadoop, cassandra, etc) or skills (stats, machine learning, etc) as well.  This means that there's not a necessity in a shapes & colors UI for ETL.
    • Less upfront costs - many of the projects are start-ups,  and enough of the tools (and data) are freely available that the incremental costs do not mount up nearly so quickly as in a data warehouse.  Because of this the projects are more likely to get funded in the corporate world out of discretionary funding, but also more sensitive to even modest ETL licensing costs.
    • More feeds - far more data is available today than was available ten years ago and I think this explosion of both public & private data is still just in the infant stage.  The result of this is both an opportunity and aa need to rapidly analyze and transform many more feeds than we considered ten years ago on data warehouse projects.

    With this in mind, I think many data science projects may find that the ideal ETL solution has these characteristics:
    • Free, at least at first.
    • Trivial to install
    • Trivial to learn
    • Trivial to extend
    • Easy to run on a desktop and then migrate to dedicated servers
    • Multi-platform
    • Libraries & utilities rather than frameworks
    • Designed for programmers, not analysts
    • Automatically acquires & reuses metadata 
    • Allows an incremental approach between transform/validate <-> analysis


      1. Nice blog for more other experiment related to mud battery please visit MudWatt

      2. I like it your blog. I have read your blog its very attractive and impressive.
        corporate training institutes in chennai corporate training companies in chennai corporate training companies in india corporate training companies in india corporate training in chennai corporate training companies in chennai corporate training in chennai | Angular 2 Corporate Training Angular 2 Corporate Training

      3. The blog gave me idea about Data ware house ETL Thanks for sharing it
        SAS Training in Chennai

      4. Excellent Sharing. You have done great job. I gathered lots of new information. . Devops Online Training | Data Science Online Training

      5. Hi, I am really happy to found such a helpful and fascinating post that is written in well manner. Thanks for sharing such an informative post.R Programming Online Training | Hadoop Online Training

      6. Thank you for this valuable information. I have got some important suggestions from it. I'm working in Brave Technologies Private Limited. We provides lowest price of ERP Software for our clients. Contact us on info@bravetechnologies.in.ERP software Chennai

      7. Nice blog. Thank you for sharing. The information you shared is very effective for learners I have got some important suggestions from it. erp providers in chennai.

      8. really informative blog. thank you for sharing this blog.
        android training in chennai

      9. This is seriously good, you have really highlighted some of the great points one should know. Awesome work
        Wonderful suggestions and guidance
        selenium training in chennai

      10. Data warehousing projects are part of the IT infrastructure of most large enterprises. ETL processes are part of these projects, but developers sometimes make the same mistakes when designing and maintaining these processes. Сompany . Active Wizards identifies a number of such mistakes:
        - A data warehouse (Datawarehouse, DWH) is a traditional SQL DWH (Oracle Database, MS SQL Server, etc.);
        - The modeling of DWH usually refers to the concepts of single version of truth (a single version of truth) and historical truth (historical truth);
        - The ETL process (Extraction-Transformation-Loading) refers to the process of loading data from one or more source systems (source systems) into DWH.
        - DWH was not created yesterday and at the moment it is independently run by several development teams with their projects.

      11. It is amazing and wonderful to visit your site.Thanks for sharing this information,this is useful to me...
        Android Training in Chennai
        Ios Training in Chennai

      12. Great blog.you put Good stuff.All the topics were explained briefly.so quickly understand for me.I am waiting for your next fantastic blog.Thanks for sharing.Any coures related details learn...
        Data Science Online Training

      13. I have learned data warehouse concepts from your blog. You have explained it clearly, keep on sharing more like this.
        SAS Training in Chennai | SAS course in Chennai | SAS Training Institutes in Chennai

      14. I am reading your post from the beginning, it was so interesting to read & I feel thanks to you for posting such a good blog, keep updates regularly.

        R Programming Online Training|

        Tableau Online Training|

        SAS Online Training |

      15. Nice blog has been shared by you. it will be really helpful to many peoples who are all working under the technology.thank you for sharing this blog.

        Data science training in Marathahalli|
        Data science training in Bangalore|
        Hadoop Training in Marathahalli|
        Hadoop Training in Bangalore|

      16. The Blog share valuable Information which was very much useful My sincere thanks for sharing this post please Continue to share this kind of post
        Devops Training in Bangalore
        Devops Training in BTM Layout
        Devops Training in Marathahalli

      17. Besant Technologies Bangalore is one of the best training institutes in Bangalore which provides more than 45 training courses for students, working professionals, job seekers and corporate employees with placement assured training in bangalore. Our training programs are designed to fulfill each student requirements and catering to achieve the desired career goals. Our courses are tought by realtime trainers working in MNC and Corporate companies in chennai with strong technical skills and practical knoweldge to focus more on placement training in bangalore.--- AWS Training in Bangalore |
        DataScience Training in Bangalore |

      18. This comment has been removed by the author.

      19. Nice! Before I had an doubt about Data Warehouse ETL for Data Scientists. Now, I got clear by doubts through your explanation. thanks to share this blog.

      20. This is a great inspiring article.I am pretty much pleased with your good work.You put really very helpful information....
        Data Science Training in BTM Layout
        Java Training in Marathahalli

      21. I have read your blog and I gathered some needful information from your blog. Keep update your blog. Awaiting for your next update. Data Science Training in Chennai | Java Training in Chennai

      22. Training in Bangalore is one of the best Selenium Testing Training Institute in bangalore. We offer Online Training and Classroom Training. Our Trainers are very professional and more than 8 years of experience.Training in Bangalore

      23. As such, the data warehouse can be the foundation of an important strategic tool, which assists management in both seeing historical trends and predicting future ones. Armed with this information, management can create an effective strategy for success.seo dashboard

      24. Interesting post! This is really helpful for me. I like it! Thanks for sharing!

        Webseiten Gestaltung Lüdenscheid

      25. The website is looking bit flashy and it catches the visitors eyes. A design is pretty simple .
        office 2010 professional plus key deutsch

      26. I am commenting to let you know what a terrific experience my daughter enjoyed reading through your web page. She noticed a wide variety of pieces, with the inclusion of what it is like to have an awesome helping style to have the rest without hassle grasp some grueling matters.

        selenium training in chennai|
        selenium training in bangalore|

      27. Thanks for one marvelous posting! I enjoyed reading it; you are a great author. I will make sure to bookmark your blog and may come back someday. I want to encourage that you continue your great posts, have a nice weekend!


      28. I am reading your post from the beginning, it was so interesting to read & I feel thanks to you for posting such a good blog, keep updates regularly. 
Software Testing Training Center in Chennai | | Best Selenium Training Institute in Chennai | Six Sigma Certification Training in Chennai

      29. Thanks for one marvelous posting! I enjoyed reading it; you are a great author. I will make sure to bookmark your blog and may come back someday. I want to encourage that you continue your great posts, have a nice weekend!

        Data Science Training in Chennai

      30. your blog contain very useful info and good points were stated in the blog which are very helpful one. Thank you for sharing.. Graphic Designing Training with placement in Chennai | Multimedia Training with Placement in Chennai

      31. Your good knowledge and kindness in playing with all the pieces were
        very useful. I don’t know what I would have done if I had not
        encountered such a step like this.

        AWS Training in Bangalore

        AWS Training in Bangalore

      32. Awesome Blog. Data Science is one of the most important concept, such a important concept you explained very well. Thanks for sharing such a nice blog..Android Training Institute in Chennai | IOS Training Institute in Chennai | Core Java Training Institute in Chennai

      33. I am reading your post from the beginning, it was so interesting to read & I feel thanks to you for posting such a good blog, keep updates regularly. 
No.1 Software Testing Training Institute in Chennai | Best Selenium Training Institute in Chennai | ISTQB Certification Center in Velachery

      34. a pride for me to be able to discuss on a quality website because I just learned to make an article on
        cara menggugurkan hamil

      35. really enjoyed while reading your article, the information you have delivered in this post was damn good. Keep sharing your post with efficient news.
        No.1 Software Testing Training Institute in Chennai | Best Selenium Training Institute in Chennai | ISTQB Certification Center in Velachery

      36. It has been simply incredibly generous with you to provide openly what exactly many individuals would’ve marketed for an eBook to end up making some cash for their end, primarily given that you could have tried it in the event you wanted.

        Java Training in Bangalore

      37. Interesting post! This is really helpful for me. I like it! Thanks for sharing!
        Mobile application developers in Chennai | PHP developers Chennai

      38. This is Really great, you have truly featured a portion of the colossal focuses one should know. Wonderful work
        Get more Customer Reconciliation
        Market Audit
        CFA Audit

      39. This comment has been removed by the author.

      40. Your very own commitment to getting the message throughout came
        to be rather powerful and have consistently enabled employees just
        like me to arrive at their desired goals.

        aws training in chennai

      41. Needed to compose you a very little word to thank you yet again regarding the nice suggestions you’ve contributed here.

        java training in bangalore

      42. I believe there are many more pleasurable opportunities ahead for individuals that looked at your site.
        python training in bangalore

      43. I simply wanted to thank you so much again. I am not sure the things that I might have gone through without the type of hints revealed by you regarding that situation.
        big data training in Chennai

      44. I feel really happy to have seen your webpage and look forward to so many more entertaining times reading here. Thanks once more for all the details.

      45. Really useful information. we are providing best data science online training from industry experts.

      46. Your new valuable key points imply much a person like me and extremely more to my office workers. With thanks.

        Informatica Online Training|ETL Testing Online Training|Hadoop online Training

      47. Good blog. Much obliged to you for sharing. The data you shared is extremely viable for learners I have some essential proposals from it. Duplicate Payment Recovery | Daily Transaction Verification | Duplicate Payment Recovery| Fraud Dectection