2010-12-21

What's Missing From Data Science? or A Partial Review of Data Analysis with Open Source Tools


I just got my copy of Data Analysis with Open Source Tools by Philipp K. Janert, published this year by O'Reilly.

I'm very excited about getting this book, since it addresses my weak area in data science and business intelligence: the math. The structure of the book is great, and the products he covers are perfect - NumPy, matplotlib, R, gnuplot, etc. His writing style is fine. I'm looking forward to digging into this more than any technical book I've come across in a few years.

As I slowly work through this book I'll analyze the content more thoroughly. But in the meanwhile, I've done some spot-checking of his coverage of reporting, business intelligence, and data warehousing. The gaps that I've found between the state of these domains and what the book describes are completely consistent with what I've generally found in data science writing: a vast lack of understanding of BI.



I don't think that this detracts too much from Janert's book though. Business intelligence and data warehousing are large enough topics that they deserve their own book anyway. And the rest of the content of Data Analysis is also more than sufficient for a complete book.

But an analysis of the gaps may help to identify misunderstandings within the data science community. As Philipp Janert correctly stated:
One of the uncomfortable (and easily overlooked) truths of working with data is that usually only a small fraction of the time is spent on the actual analysis. Often a far greater amount of time and effort is expended on a variety of tasks that appear "menial" by comparison but that are absolutely critical nevertheless: obtaining the data, verifying, cleaning and possibly reformatting it; and dealing with updates, storage and archiving.
Is it safe to say that 90% of a data analysis or data science project is data logistics? That was the number we used to estimate for data mining projects. If this is true, and if data warehouses are the primary data logistics repositories for data analysis, then shouldn't a serious effort be made to develop expertise and skill in leveraging data warehouses?

Here are a few observations about what I've found in Data Analysis with Open Source Tools - as well as in other data science writings:

Data Warehouse Performance - Janert describes data warehouses as places where queries take overnight to complete. While it's certainly true that this can be the case, it's usually a sign of an architectural failure. Over the past twenty years I've worked on quite a few data warehouses - and most provided response time in under a minute. Even my current warehouse - 40 billion rows a year on completely outdated and undersized hardware - can complete almost any query in a few hours, and the vast majority complete in seconds.

But of course, there are defective warehouses. The top reasons for warehouse failure are insufficient sponsorship to overcome organizational resistance, bad data quality, and poor performance. Given that warehouse architectures have hardly changed in 15 years, one wouldn't think that technical failures should happen. But they do all the time - mostly when teams with too little domain knowledge try to build one based on their experience with OLTP systems.

And then there are warehouses optimized for a different problem. The warehouse fact tables may be partitioned by day, and maybe by customer. This allows for easy loading concurrency and archival/backup/restore, while also providing great performance for customer-specific reporting that covers some amount of time (a few days, maybe a month). But if the analyst needs to run queries against the entire customer base, or needs to select data on some other criterion, that partitioning strategy won't help performance at all.
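To make that concrete, here's a minimal sketch in Python - the partition counts and keys are made up - of why a day/customer partitioning scheme helps one query shape and does nothing for another:

    from datetime import date, timedelta

    # Hypothetical fact table partitioned by (day, customer_id):
    # each partition holds one day of one customer's rows.
    partitions = {(date(2010, 12, 1) + timedelta(days=d), cust)
                  for d in range(31)
                  for cust in range(1, 1001)}   # 1,000 customers, 31 days

    def partitions_scanned(day_range=None, customer=None):
        """Count the partitions a query must touch, given its predicates."""
        hits = partitions
        if day_range:
            lo, hi = day_range
            hits = {p for p in hits if lo <= p[0] <= hi}
        if customer:
            hits = {p for p in hits if p[1] == customer}
        return len(hits)

    # One customer's last week: pruning cuts the scan to 7 partitions.
    week = (date(2010, 12, 25), date(2010, 12, 31))
    print(partitions_scanned(day_range=week, customer=42))   # 7

    # A query filtered only on a non-partition column (say, product)
    # can't be pruned at all: all 31,000 partitions get scanned.
    print(partitions_scanned())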

Note that some of the focus on sampling methods in this book comes from a failure or inability to take advantage of data warehouse performance features. Analyzing just 5% of the total data may provide a considerable performance benefit - but it may also raise data quality or completeness objections. And the performance benefit of only using 5% of your data may be smaller than the advantage of leveraging data warehouse features like parallelism on a hypothetical system with 20 cores and 320 disks, and partitioning on days, accounts, devices, etc.
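A quick back-of-envelope comparison makes the point. Every number below is an illustrative assumption (table size, per-disk scan rate), not a benchmark, but it shows how a parallel full scan can beat a serial 5% sample while still using all of the data:

    # Back-of-envelope scan times; every number here is an assumption.
    table_gb = 4000          # hypothetical fact table size
    disk_mb_per_sec = 50     # assumed sequential read rate per disk
    disks = 320              # the hypothetical warehouse above

    serial_scan = table_gb * 1024 / disk_mb_per_sec      # seconds, one disk
    sampled_scan = serial_scan * 0.05                    # 5% sample, still serial
    parallel_scan = serial_scan / disks                  # full scan, all disks

    print(f"serial full scan:    {serial_scan / 60:6.1f} min")    # ~1365 min
    print(f"5% sample, serial:   {sampled_scan / 60:6.1f} min")   # ~68 min
    print(f"full scan, parallel: {parallel_scan / 60:6.1f} min")  # ~4 min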


Data Warehouse Agility - As he describes in the BI & Reporting chapter, there is a difference in mindsets: warehouses are run like infrastructure and treated with great caution. Absolutely - if you're using your warehouse to drive vital business processes and you crater it through human error, you could be looking at anywhere from hours to a week of downtime.

The challenges associated with managing a warehouse are generally unknown to data analysts, but relevant to most of them. A warehouse may take dozens of feeds from often undocumented source systems, most of which are susceptible to change without reliable prior notification. The database generally has to support unpredictable query sizes and volumes. And if the warehouse is to truly affect the organization by providing data to everyone that needs it, performance and quality SLAs need to take top priority - ahead of rapid changes to enable discovery analysis.

The operational challenges are driven more by the system integration than by the products or database size. The maturity of data warehouse ETL methods and products provides considerable help here - but hardly guarantees perfection. Whether this integration happens in the warehouse or in the "analyst's data zoo", the challenge will be there. If the results have a significant effect, then the integration needs to be handled with quite a bit of formality. Sometimes a warehouse can take this too far and become almost exclusively operational and impossible to change. I think this is driven more by the organizational culture than by domain best practices - and it would affect any solution that affected the business.

Data Warehouse Maintenance - Janert describes the care & feeding of the "data zoo", offering tips for how data analysts can keep their data fresh by building a poor man's data warehouse. While most of the ideas are fine, in reality there's an entire body of knowledge about how best to maintain data of this type that should be examined: ETL. Further, if the data analyst's results need to turn into production processes, then they really shouldn't run off his desktop computer with hacked-together R scripts. They really should get moved into the warehouse.
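Even a poor man's load benefits from basic ETL discipline - validate before you load, and quarantine what fails rather than silently dropping it. Here's a minimal sketch in Python; the feed, field names, and rules are all hypothetical:

    import csv
    import sqlite3

    FEED = "daily_sales.csv"   # hypothetical daily extract from a source system

    def extract_and_validate(path):
        """Yield typed rows; report rows that would corrupt the mart."""
        with open(path, newline="") as f:
            for lineno, row in enumerate(csv.DictReader(f), start=2):
                try:
                    yield (int(row["customer_id"]),
                           row["sale_date"],            # kept as ISO-format text
                           float(row["amount"]))
                except (KeyError, ValueError):
                    # Quarantine for review instead of silently dropping.
                    print(f"rejected line {lineno}: {row}")

    conn = sqlite3.connect("analyst_mart.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS sales
                    (customer_id INTEGER, sale_date TEXT, amount REAL)""")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                     extract_and_validate(FEED))
    conn.commit()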

Data Analyst vs Data Warehouse Priorities - data analysts need to understand what the organizational and project priorities are.
  • Basic reporting comes first - before spending money on data mining, linear regressions, simulations, predictive analysis, etc., basic & fundamental reporting needs to happen. It's a cart-before-the-horse thing.
  • Operational vs Exploratory Reporting - when a warehouse provides reporting in support of vital business processes or decisions, there will be an intense and appropriate focus on managing the infrastructure. Exploratory analysis seldom has a price tag associated with delays and failures. It might be necessary - but it's unlikely to get nearly the priority that a data analyst will want.
So, what does a recipe for success look like?

  • Start with the basics: data scrubbing, standardization, integration, consolidation, and reporting (see the scrubbing sketch after this list). Don't worry about finding the unknown questions until the known questions have been answered.
  • Team up warehouse & data analyst activities - the warehouse should be the provider of a vast amount of cleansed, validated, integrated, historical, and standardized business data. Additionally, it should provide the rich reference data also needed for analysis, already integrated. Organizationally, these groups shouldn't be too far apart - otherwise, they'll find themselves duplicating each other's efforts anyway.
  • Consider how data analysis efforts with a wide variety of specialty tools might integrate with more standard reporting, dashboards and OLAP. Ideally, users can navigate to the results of multiple tools from a single topical dashboard.
  • If necessary, provide data analyst teams with their own data marts - fed by the warehouse but with access controls to allow the analysts to easily add their own data, and file system space to allow exporting the data into flat files.
  • Use data analyst teams on the cutting edge of warehouse reporting & BI - once you've hit diminishing returns with OLAP, it may be time to have your data analysts apply more sophisticated methods.
  • Transition proven and valuable data-analyst projects into warehouse projects - don't put the data analyst team in charge of data logistics. As their efforts mature, have them hand off the solution to the warehouse team for conversion into a maintainable solution with steady-state support. Of course, if this transition is going to go well, the data warehouse team should have a small part in the initial development - it can make suggestions up front that could avoid major conversions later.
  • Get the data warehouse team reading books from the data science space, and the data analyst team reading books from the data warehouse space. They've each got something to teach the other.
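On that first bullet, here's a tiny sketch in Python of the kind of scrubbing and standardization work that has to precede any analysis - the feeds, fields, and rules are hypothetical:

    import re

    # Hypothetical raw customer records from two un-integrated feeds.
    raw = [
        {"name": "  ACME corp.", "state": "ny",       "phone": "212.555.0100"},
        {"name": "Acme Corp",    "state": "New York", "phone": "(212) 555-0100"},
    ]

    STATES = {"ny": "NY", "new york": "NY"}   # stand-in for a reference table

    def scrub(rec):
        """Standardize one record so duplicates become comparable."""
        return {
            "name": " ".join(rec["name"].replace(".", "").split()).title(),
            "state": STATES.get(rec["state"].strip().lower(), rec["state"]),
            "phone": re.sub(r"\D", "", rec["phone"]),   # digits only
        }

    cleaned = [scrub(r) for r in raw]
    print(cleaned[0] == cleaned[1])   # True: both feeds describe one customer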
And along with that last point about cross-reading - I've definitely got a lot more reading ahead of me as well...
