2010-12-26

Mashups vs Data Warehouses

Mashups have come from the application side of IT, warehouses from the data side. They overlap quite a bit - but not much thought has gone into how to leverage the best of both worlds.

2010-12-23

Parallel Database and Hadoop Costs

In all the hype around Hadoop, and maybe the "micro-hype" around parallel databases, it's pretty easy to find exciting anecdotes to support these architectures: numbers of nodes in a cluster, speed to calculate or move data, etc. Finding the costs is much more difficult - and without the costs how does someone make a decision on the merits of the solution?

2010-12-21

How absolute are absolutes?

When modeling a system that needs to last a long time or be used by a very large number of users - you can run into problems with absolutes. It turns out that very few aren't subject to occasional change or are otherwise subject to interpretation:
  • weight - the kilogram is based on a reference original as well as forty copies. However, simply touching or dropping one can make subtle differences in weight - and in 2009 it was discovered that none of these weigh exactly the same amount - each is off by a handful of micrograms. Some of these have grown steadily lighter over time.
  • length - the meter was initially defined in Paris in 1790 as the length of a pendulum with a half period of one second. In 1791 it was changed to be one ten-millionth of a quadrant of the earth's meridian running through Paris. In 1793 the previous definition was discovered to be long by one fifth of a millimeter. Between 1795 and 1799 two reference bars were created (first brass then platinum). Then between 1889 and 2002 the definition was changed five times to consider temperature, gravity, and atmosphere, to be based on the wavelength of an atomic emission line, then to be based on the speed of light over a fraction of a second.
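One way a data model can cope with shifting definitions like the ones above is to store each measurement with its date and look up whichever definition was in effect at the time, rather than treating the unit as fixed. The following is only a minimal sketch of that idea in Python: the names UnitDefinition, Measurement, METER_DEFINITIONS and definition_in_effect are made up for illustration, and the January 1 dates are placeholders attached to the years mentioned in the list.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class UnitDefinition:
    unit: str
    effective_from: date  # first day this definition applies
    basis: str            # free-text note on what the unit was based on


# Hypothetical history table. The years come from the list above;
# the January 1 dates are placeholders, not historical facts.
# Entries must be sorted by effective_from for the lookup to work.
METER_DEFINITIONS = [
    UnitDefinition("meter", date(1790, 1, 1), "seconds pendulum"),
    UnitDefinition("meter", date(1791, 1, 1), "fraction of the Paris meridian"),
]


def definition_in_effect(definitions, on):
    """Return the definition with the latest effective_from that is <= on."""
    starts = [d.effective_from for d in definitions]
    index = bisect_right(starts, on)
    if index == 0:
        raise ValueError(f"no definition in effect on {on}")
    return definitions[index - 1]


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str
    measured_on: date

    def definition(self, definitions):
        # Resolve the unit definition by the measurement date,
        # so old values keep their original meaning.
        return definition_in_effect(definitions, self.measured_on)
```

The design choice here is simply effective dating: a measurement never points at "the meter", only at the meter as it was defined on the day the measurement was taken.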

What's Missing From Data Science? or A Partial Review of Data Analysis with Open Source Tools


I just got my copy of Data Analysis with Open Source Tools by Philipp K. Janert, published this year by O'Reilly.

I'm very excited about getting this book since it addresses my weak area in data science and business intelligence: the math. The structure of the book is great, the products that he covers are perfect - NumPy, matplotlib, R, gnuplot, etc. His writing style is fine. I'm looking forward to really digging into this more than any technical book I've come across in a few years.

As I slowly work through this book I'll analyze the content more thoroughly. But in the meantime, I've done a spot check of his coverage of reporting, business intelligence and data. And the gaps that I've found between the state of these domains and what the book describes are completely consistent with what I've generally found in data science writing: a vast lack of understanding of BI.

Premature Optimization of Analytical Architectures

I've been noticing a pattern over the last year - more developers talking about technology optimized for millions of users, and fewer developers talking about technology optimized for hundreds or thousands of users.

Since most engineering decisions are a matter of compromise, a billion-user requirement results in a loss somewhere else. Maybe data quality, maybe software maturity, maybe functionality, maybe hosting costs. And some of these compromises may interfere with the project getting to 10,000 users.