2010-12-23

Parallel Database and Hadoop Costs

In all the hype around Hadoop, and maybe the "micro-hype" around parallel databases, it's pretty easy to find exciting anecdotes to support these architectures: numbers of nodes in a cluster, speed to calculate or move data, etc. Finding the costs is much more difficult - and without the costs, how does someone make a decision on the merits of the solution?



In the case of the commercial solutions you can get list software and hardware prices - though nobody should pay list price anyway. With Hadoop there are no software prices, since it's free, and many assume the hardware cost is negligible too. But if you don't want an IO-bound Hadoop cluster then you're really not talking about commodity hardware. It's probably going to have 12-48 drives and cost at least $8k per node. Some will bypass this and assume that it'll run on a virtual cloud, especially if they don't care too much about performance.

The development and maintenance costs are typically very high. Whether it's a parallel database or Hadoop, you're looking at specialists or at least enthusiasts. Most analytical solutions involve a large number of system interfaces - and these tend to be maintenance-heavy.

And then there's the hosting cost. I have yet to see a case study bring this up. Organizations like Google and Facebook that are already hosting vast data centers can achieve per-node costs at a fraction of what most organizations can. In my experience, most large mature enterprises are looking at per-node costs between $5k and $15k per year for space, HVAC, network, hardware and OS support. At an average of $10k/year, a 100-node cluster will cost $1m/year - as much as a medium-sized development team. A 1000-node cluster will cost $10m/year - as much as a vast development team. Note that hosting can easily cost more than the software and hardware costs combined, and in a worst-case scenario more than everything else combined.
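The hosting arithmetic above is easy to rerun with your own numbers. Here's a minimal Python sketch that just multiplies node count by the per-node yearly figures quoted above; the function and constant names are my own, made up for illustration.

```python
# Per-node yearly hosting cost range quoted in the text, in USD
# (space, HVAC, network, hardware and OS support).
COST_LOW = 5_000
COST_AVG = 10_000
COST_HIGH = 15_000


def annual_hosting_cost(nodes, cost_per_node):
    """Yearly hosting cost in USD for a cluster of `nodes` machines."""
    return nodes * cost_per_node


# Print the low/average/high yearly cost for the two cluster sizes discussed.
for nodes in (100, 1000):
    low = annual_hosting_cost(nodes, COST_LOW)
    avg = annual_hosting_cost(nodes, COST_AVG)
    high = annual_hosting_cost(nodes, COST_HIGH)
    print(f"{nodes:>5} nodes: ${low:,} to ${high:,} per year (${avg:,} at the average)")
```

At the $10k average this gives $1m/year for 100 nodes and $10m/year for 1000 nodes. Swap in your own per-node figure if your organization actually tracks one.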

Of course, start-ups have fewer people, fewer procedures, fewer regulatory controls, shallower pockets - and so far lower costs. But even here, picking the wrong hardware and ending up with a low MTBF on 2000+ disks, frequent patches, etc. can put a lot of strain on your typical small start-up admin team (or person).
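To get a feel for what disk reliability means at this scale, here's a rough Python sketch. It assumes a constant failure rate, so expected failures per year is roughly disks x hours per year / MTBF. The MTBF values below are hypothetical, picked only for illustration, and are not vendor or measured figures.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def expected_disk_failures_per_year(disks, mtbf_hours):
    """Rough expected number of disk failures per year.

    Assumes a constant failure rate, so each disk fails on average
    once every `mtbf_hours` of operation.
    """
    return disks * HOURS_PER_YEAR / mtbf_hours


# Hypothetical MTBF values (hours), for illustration only.
for mtbf_hours in (1_000_000, 500_000, 200_000):
    failures = expected_disk_failures_per_year(2000, mtbf_hours)
    print(f"MTBF {mtbf_hours:>9,} h: about {failures:.1f} failed disks per year")
```

Under these assumptions, every halving of MTBF doubles the number of disk swaps the admin team (or person) has to handle each year.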

So, what's the point? Well, this stuff is expensive, and scaling out doesn't come for free. In enterprises with limited and very expensive hosting options, going with fewer, larger nodes is probably going to be a winning scenario, and one that should play into the hands of parallel databases over Hadoop for volumes into the hundreds of terabytes, due to their faster performance per node.
