2010-12-26

Mashups vs Data Warehouses

Mashups have come from the application side of IT, warehouses from the data side. They overlap quite a bit - but not much thought has been given to how to leverage the best of both worlds.



Warehouses provide consolidation, validation, standardization, integration, distribution and presentation of data. That's a vast amount of functionality - but since they're a centralized resource they can also become a bottleneck for the enterprise if managed poorly. The two primary challenges are:
  • Data structures may be difficult to change - changes to significant business rules (sales regions for example) may require conversion of vast amounts of historical data. A change of this type can get deferred for years.
  • ETL interfaces may be slow to add - assuming that each ETL interface takes 1-2 months, a six to twelve month backlog can easily develop due to a loss of staff, a sudden surge in needed interfaces, etc.
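The backlog arithmetic in the second point can be sketched in a few lines. This is only an illustration: the 1-2 months per interface comes from the text above, while the queue of six pending interfaces and the single developer are assumed numbers.

```python
def backlog_months(pending_interfaces, months_per_interface, developers=1):
    """Months of queued ETL work, assuming the interfaces are split evenly
    across developers and each developer builds one interface at a time."""
    return pending_interfaces * months_per_interface / developers


# Assumed scenario: six interfaces waiting, one ETL developer available.
low = backlog_months(6, 1)   # every interface takes 1 month
high = backlog_months(6, 2)  # every interface takes 2 months
print(f"Backlog: {low:.0f} to {high:.0f} months")
```

Under these assumptions, losing one of two developers doubles the backlog overnight, which is how a queue of this size can appear without any change in demand.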
Mashups provide integration and presentation of data. Since they're decentralized, they are far less likely to slow down the enterprise. But they have much less functionality - mostly due to their lightweight integration and lack of a cleansed, standardized, historical set of data:
  • Very little data actually has common keys: most systems that serve as a source of record are developed independently and without much thought about strategic data sharing, or are picked up through an acquisition. So accounts, customers, sites, products, services, users, departments, assets, tickets, etc. are likely to have completely different ids in different systems.
  • Business rules are inconsistent across systems: definitions, formats, and rules may be different across systems for either legitimate or incompetent reasons. A legitimate example is the difference in the definition of a customer: sales may define a customer as a prospect or someone who has placed an order, finance may define a customer as someone for whom payment has been received, delivery/operations/support may define a customer as someone for whom a service is currently being provided. Attempting to consolidate all orgs into a single definition generally fails. Incompetent reasons can be simply a matter of poor communication between departments, and are the rule rather than the exception in large organizations.
  • Operational systems are not designed for analytical queries - most operational systems are designed to handle some number of very small transactions very quickly. They aren't designed to handle vast queries, and especially not large numbers of them. Performance is usually poor at best, but can easily be far worse than poor and put the availability of the operational process at risk.
  • Operational systems seldom keep historical data - few operational systems keep full historical data due to the impacts to performance, capacity, and development time. However, one of the most common forms of analysis is time-series analysis.
  • Operational systems seldom include vital reference data necessary for powerful analysis - extended attributes that can augment system data are often essential for powerful analysis. However, this data also needs to be acquired and staged somewhere.
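The common-key problem in the first point is easiest to see with a small example. The sketch below joins customers from one system to accounts in another through a crosswalk table, which is roughly the kind of mapping a warehouse maintains and a mashup usually lacks. All system names, field names, and ids here are hypothetical.

```python
# Hypothetical records from two systems of record; each uses its own ids.
crm_customers = [
    {"crm_id": "C-100", "name": "Acme Corp"},
    {"crm_id": "C-200", "name": "Globex"},
]
billing_accounts = [
    {"acct_no": 9001, "balance": 1250.0},
    {"acct_no": 9002, "balance": 0.0},
]

# Crosswalk from CRM id to billing account number (assumed to be maintained
# by hand or by a matching process). Globex has no known billing match.
crosswalk = {"C-100": 9001}


def join_via_crosswalk(customers, accounts, crosswalk):
    """Return (matched rows, unmatched CRM ids) using the crosswalk as the join key."""
    accounts_by_id = {a["acct_no"]: a for a in accounts}
    matched, unmatched = [], []
    for customer in customers:
        account = accounts_by_id.get(crosswalk.get(customer["crm_id"]))
        if account is None:
            unmatched.append(customer["crm_id"])
        else:
            matched.append({**customer, **account})
    return matched, unmatched


matched, unmatched = join_via_crosswalk(crm_customers, billing_accounts, crosswalk)
print(matched)
print("No billing match for:", unmatched)
```

Without the crosswalk there is nothing to join on, and building and maintaining that mapping is usually the expensive part.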
These solutions are so different that neither is a complete replacement for the other. But they can complement each other well:
  • The fast development capabilities of mashups can compensate for the slow development speed of warehouses.
  • Mashups can combine real-time data from operational systems with high-latency data from the warehouse. Typical data warehouse reporting tools (OLAP, ROLAP, etc) aren't very effective at this without database federation - which can be problematic with complex queries or large data volumes.
  • The rich data functionality of warehouses can compensate for the weak back-end functionality of mashups and their operational sources.
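As a rough sketch of the second point, a mashup can append today's live figure from an operational system to a history that the warehouse loaded through yesterday. The two fetch functions are stand-ins for a warehouse query and a live service call, and the ticket counts are made up.

```python
from datetime import date


def fetch_warehouse_history():
    """Stand-in for a warehouse query: daily ticket counts, loaded through yesterday."""
    return {
        date(2010, 12, 23): 40,
        date(2010, 12, 24): 35,
        date(2010, 12, 25): 12,
    }


def fetch_operational_today():
    """Stand-in for a live call to the operational ticketing system."""
    return {date(2010, 12, 26): 7}


def mashup_series(history, live):
    """Combine both sources into one date-ordered series.

    Where both sources cover the same day, the warehouse value is kept,
    on the assumption that it has been cleansed and standardized.
    """
    combined = dict(history)
    for day, count in live.items():
        combined.setdefault(day, count)
    return sorted(combined.items())


for day, count in mashup_series(fetch_warehouse_history(), fetch_operational_today()):
    print(day.isoformat(), count)
```

Preferring the warehouse value on overlap is a design choice for this sketch; a mashup that needs the freshest number would reverse it.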
In these cases the mashup produced may be an application maintained indefinitely (especially when mixing warehouse & real-time data), may be an initial version - subject to change as soon as more data is added to the warehouse, or may be a temporary solution until more data is added to the warehouse and the functionality is moved into a more strategic reporting tool (Cognos, Business Objects, MicroStrategy, etc). In any case, the cost of combining these technologies may easily be less than the cost of slowing down the business by waiting on the warehouse, or of delivering an unstable & incomplete product by not using it.
