2011-02-10

Breadth of Data vs Depth of Analysis

One of the things that I felt was missing from O'Reilly's Strata Conference was a nuanced sense of the trade-offs between complex analysis and vast volumes of data.  Because there is a trade-off, and I've seen it play out consistently.  It comes down to this: where do you spend your investment?
  • deep analysis - with unpredictable costs and benefits
  • broad sets of data - with predictable (high) costs and benefits
Analysis Depth

Deep analysis gets the most attention from mathematicians, computer science grads, and data scientists.  Complex algorithms can be sexy, surprising, elegant, and beautiful.  What's not to love about them?  Well, one issue is that the complexity can be an obstacle to adoption if your users don't understand the algorithms or their visual interface.  Another issue is that once you deliver a very sophisticated analysis, your users will want to sanity-check it against simpler analysis - which you need to have ready, and which should be done first.  The last issue is predictability: it's difficult to know in advance whether the analysis will succeed, or which approach will work best.  Because of this, it's difficult to estimate how long the analysis will take and whether the results will be useful.

Data Breadth

Broad sets of data mostly get attention from evangelists who want bragging rights.  Aside from the ability to brag about how a process can handle X million events per minute, there isn't much that's sexy or beautiful about data management: it's logistics, like working in the supply department.  Plus, it's hard: you're managing data from dozens of undocumented interfaces that can change without notice and are subject to data loss, duplication, and corruption - and then the data still has to be transformed, integrated, validated, and possibly reconciled.  It's a surprisingly difficult problem.
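To make that logistics work concrete, here is a minimal sketch in Python - the feed fields and validation rules are hypothetical, not from any system mentioned here - of the kind of dedup-and-validate pass this plumbing involves:

```python
from collections import OrderedDict

# Hypothetical feed records: duplicated, incomplete, and partially corrupt.
raw_events = [
    {"event_id": "a1", "server": "web01", "bytes": "1024"},
    {"event_id": "a1", "server": "web01", "bytes": "1024"},   # duplicate
    {"event_id": "a2", "server": "",      "bytes": "512"},    # missing server
    {"event_id": "a3", "server": "db01",  "bytes": "oops"},   # corrupt numeric field
]

def clean(events):
    """Drop duplicates, reject invalid records, and normalize types."""
    deduped = OrderedDict((e["event_id"], e) for e in events).values()
    good, rejected = [], []
    for e in deduped:
        if not e["server"] or not e["bytes"].isdigit():
            rejected.append(e)          # in practice this would be logged and audited
            continue
        good.append({**e, "bytes": int(e["bytes"])})
    return good, rejected

good, rejected = clean(raw_events)
print(len(good), "valid records,", len(rejected), "rejected")
```

Multiply that by dozens of feeds, each with its own quirks, and the hours add up quickly.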

Besides the lack of glamor, data logistics also comes at a very high cost: complex data requires extensive time for analysis, development, and testing, while large data requires extensive time to engineer around performance needs.  Both require time spent on logging, auditing, and so on.  It all adds up.  But here's the thing - it's relatively predictable.  Perhaps not down to the number of hours a given pipeline will take to build, but often down to the number of days.

Analysis Impact

And more importantly, broad data with simple analysis produces about the same value as less data with complex analysis:
  • At a major telco I built a pair of data warehouses - one for finance and one for inventory.  Although there were quite a few opportunities to deliver better (more complex) analysis than what was previously in use, it was consistently rejected by users who couldn't understand it; the only improvements accepted were incremental.  However, once we integrated the two sets of data, we discovered $150 million in lost assets through the simplest of analysis.  That was obviously not ignored or rejected, and it was the greatest achievement of those projects.
  • Security reporting is often limited by the lack of reference data.  When looking at anti-virus, vulnerability scanning, and patch data, for example, one of the very top requests is to know which servers aren't covered by these tools and why.  The tools themselves can't answer this question (well, vulnerability scanning can to a limited extent).  However, a simple integration with asset/config data can not only answer that question but also shed light on dozens of others, by adding production vs. test, business process, and other attributes to the prior analysis - a sketch of this kind of join follows this list.
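To illustrate the security example, here is a minimal sketch with entirely made-up server names and attributes: an asset/config inventory checked against anti-virus coverage to find unmonitored servers, with the asset attributes on hand to explain each gap.

```python
# Hypothetical asset/config inventory - server names and attributes are made up.
assets = [
    {"server": "web01", "env": "production", "process": "billing"},
    {"server": "web02", "env": "test",       "process": "billing"},
    {"server": "db01",  "env": "production", "process": "inventory"},
    {"server": "lab07", "env": "lab",        "process": "research"},
]

# Servers that the (hypothetical) anti-virus tool reports it is protecting.
av_coverage = {"web01", "db01"}

# The "simple analysis": an anti-join of the inventory against the coverage list.
uncovered = [a for a in assets if a["server"] not in av_coverage]

for a in uncovered:
    print(f"{a['server']}: no anti-virus coverage ({a['env']}, {a['process']})")
```

The same join answers the follow-on questions almost for free: the environment and business-process attributes immediately show whether a gap is a lab box or an unprotected production billing server.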
Pragmatic Prioritization

At Strata's last session, one of the speakers said two interesting things:
  • 90% of his users could not work with complex visualization
  • Multiple simple analyses were preferable to a single complex analysis
I spoke with the facilitator, Drew Conway, about this to get his perspective on the latter point, since he seems to be more of a math guy than a data guy.  His explanation was that, especially with sparse data, there is a strong possibility that the analysis could be wrong, so sanity-checking is even more critical than usual.  This is a cool insight that I didn't expect, but I'd like to generalize it a bit.  Perhaps sparsity is just another dimension at play in all analysis - and as it increases, the likelihood of errors in the analysis increases correspondingly.
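As a rough illustration of that generalization (a toy example of mine, not anything Conway presented), here is a small simulation showing how the error of even the simplest estimate - a sample mean - grows as the data gets sparser:

```python
import random
import statistics

random.seed(2011)
true_mean = 100.0

# Draw repeated samples of shrinking size from the same population and measure how
# far the sample mean typically lands from the true mean - sparser data, bigger errors.
for n in (10000, 1000, 100, 10):
    errors = [
        abs(statistics.mean(random.gauss(true_mean, 25) for _ in range(n)) - true_mean)
        for _ in range(200)
    ]
    print(f"n={n:>5}: typical error ~ {statistics.mean(errors):.2f}")
```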

The way I approach analytical projects is typically by iterating through two alternating types of upgrades: data, analysis, data, analysis, and so on.  This usually results in simple analysis on new combinations of data, with more sophisticated analysis eventually getting deployed - but not until the simple options have been exhausted.  This isn't always the best approach: if there is complex but proven, high-impact analysis that can be performed on a single set of data, then that will often be the cheapest and fastest way to get started.  It's just that, in my opinion, this happens in a minority of cases.

So, that's what I hope to hear next year at Strata - a pragmatic approach to balancing these two priorities in a way that maximizes the speed of delivery.
