Premature Optimization of Analytical Architectures

I've been noticing a pattern over the last year - more developers talking about technology optimized for millions of users, and fewer developers talking about technology optimized for hundreds or thousands of users.

Most engineering decisions are a matter of compromises - a billion-user requirement results in a loss somewhere else. Maybe data quality, maybe software maturity, maybe functionality, maybe hosting costs. And some of these compromises may interfere with the project ever getting to 10,000 users.

Probably a few factors involved:
  • Programmer optimism - most innovative people I know are inherently optimistic. If they weren't optimistic, they often wouldn't bother. And spending x number of hours creating something naturally increases one's optimism about it.
  • VC funding, goals and rewards - ok, just a guess, but I suspect many startups and VCs are more interested in gambling on sites that could go very large than on ones that might just be moderately successful. I've also heard that VCs are pushing startups away from relational databases.
  • The hype-cycle - the fashionable trend is definitely in support of very large big-data applications.
  • False sense of scalability limits in relational databases - the partitioning, optimization and parallelism limitations of MySQL have convinced quite a few people that relational databases are very slow at processing large volumes of data. But MySQL hardly defines the limits of the technology.
  • Career building - ok, there's clearly a personal advantage to working on the newest and most exciting technology. Few good developers look forward to working with Visual Basic, COBOL, etc any longer, and I don't think we should be surprised that we follow the incentives.
In the IT space, an example of this kind of thinking I've seen is Hadoop clusters used to query structured data, processing such small volumes that a single dual-core server with a single direct-attached disk array could easily handle the load. Given that some servers can be scaled up to four quad-core processors, 128 GB of memory and eight direct-attached SAS or Fibre Channel arrays, there's plenty of scalability within a single system. When you add the cost of managing Hadoop and the hosting cost of the cluster (which may exceed $10k/year in many large enterprises), you can see how a small project may be put at risk by going for the sexy but unnecessary technology.
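Before reaching for a cluster, a quick sizing check often settles the question. A minimal sketch of that back-of-envelope reasoning - all capacities, growth rates, and the example figures here are hypothetical assumptions for illustration, not numbers from any real project:

```python
# Back-of-envelope check: will the projected data volume fit on one server?
# The defaults mirror the "scaled-up single system" described above:
# 128 GB of RAM and eight direct-attached 2 TB arrays (hypothetical sizes).

def fits_single_server(data_gb, daily_growth_gb, horizon_days,
                       server_ram_gb=128, server_disk_gb=8 * 2000):
    """Return (fits_in_ram, fits_on_disk) for the projected data size."""
    projected_gb = data_gb + daily_growth_gb * horizon_days
    return projected_gb <= server_ram_gb, projected_gb <= server_disk_gb

# Example: 50 GB today, growing 0.5 GB/day, planned out three years.
in_ram, on_disk = fits_single_server(50, 0.5, 3 * 365)
print(in_ram, on_disk)  # projected ~597.5 GB: too big for RAM, fine on disk
```

If the answer to the disk question is "yes" with room to spare, the cluster is solving a problem the project doesn't have yet.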

Of course, there are other reasons besides scalability that may warrant going for the big-data architecture right out of the gate. Testing a potential future architecture, PR, establishing an iron-clad reputation for scalability, etc all have some value. But I think it's best if decisions like this are very deliberate and conscious - more of a calculated risk than an unnecessary one.

At the end of the day, when practical competes with sexy, the outcome is predictable. But not always pretty.
