Thursday, October 28, 2010

Challenging Stonebraker’s Assertions On Data Warehouses - Part 1

I have tremendous respect for Michael Stonebraker. He is an apt visionary. What I like most about him is his drive and passion to commercialize academic concepts. ACM recently published his article “My Top 10 Assertions About Data Warehouses.” If you haven’t read it, I would encourage you to do so.

I agree with some of his assertions and disagree with a few. I am grounded in reality, but I do have a progressive viewpoint on this topic. This is my attempt to bring an alternate perspective to the rapidly changing BI world that I am seeing. I hope readers take it as constructive criticism. This post has been sitting in my draft folder for a while, and I finally managed to publish it. This is Part 1, covering assertions 1 to 5. Part 2, with the rest of the assertions, will follow in a few days.

“Please note that I have a financial interest in several database companies, and may be biased in a number of different ways.”

I appreciate Stonebraker’s disclaimer. I do believe that his view is skewed to what he has seen and has invested into. I don’t believe there is anything wrong with it. I like when people put money where their mouth is.

As you might know, I work for SAP, but this is my independent blog and these are my views and not those of SAP. I also try hard not to have SAP product or strategy references on this blog to maintain my neutral perspective and avoid any possible conflict of interest.

Assertion 1: Star and snowflake schemas are a good idea in the data warehouse world.

This reads like an incomplete statement. The star and snowflake schemas are a good idea because they have been proven to perform well in the data warehouse world with row and column stores. However, I have started to see emergent NoSQL-based data warehouse architectures that are far from a star or a snowflake. They are in fact schemaless.

“Star and Snowflake schemas are clean, simple, easy to parallelize, and usually result in very high-performance database management system (DBMS) applications.”

The following statement contradicts the statement above.

“However, you will often come up with a design having a large number of attributes in the fact table; 40 attributes are routine and 200 are not uncommon. Current data warehouse administrators usually stand on their heads to make "fat" fact tables perform on current relational database management systems (RDBMSs).”

There are a couple of problems with this assertion:
  1. The schema is not simple: 200 attributes, fact tables, and complex joins. What exactly is simple?
  2. Efficient parallelization of a query depends on many factors beyond the schema: how the data is stored and partitioned, the performance of the database engine, and the hardware configuration, to name a few.

“If you are a data warehouse designer and come up with something other than a snowflake schema, you should probably rethink your design.”

Really?

The requirement that the schema has to be perfect upfront has introduced most of the problems in the BI world. I call it the design time latency. This is the time between a business user deciding what report or information to request and the moment she gets it (mostly the wrong one). The problem is that you can only report based on what you have in your DW and what’s tuned.

This is why the schemaless approach seems more promising as it can cut down the design time latency by allowing the business users to explore the data and run ad hoc queries without locking down on a specific structure.
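To make the idea of ad hoc exploration without a fixed structure a little more concrete, here is a minimal sketch in Python. It does not represent any particular NoSQL product; the `events` records, their field names, and the `ad_hoc_sum` helper are all hypothetical and exist only for illustration. The point is simply that records with different shapes can be grouped and summed on whatever field the user picks at query time.

```python
from collections import defaultdict

# Hypothetical schemaless records: each event carries whatever fields it has.
events = [
    {"region": "EMEA", "product": "A", "revenue": 120.0},
    {"region": "APAC", "revenue": 80.0, "channel": "web"},
    {"region": "EMEA", "product": "B", "revenue": 50.0, "channel": "retail"},
    {"product": "A", "revenue": 30.0},  # no region recorded
]


def ad_hoc_sum(records, group_by, measure):
    """Sum `measure` per value of `group_by`, tolerating missing fields."""
    totals = defaultdict(float)
    for record in records:
        if measure not in record:
            continue  # skip records that do not carry the measure at all
        key = record.get(group_by, "(unknown)")
        totals[key] += record[measure]
    return dict(totals)


# The grouping field is chosen at query time, not at design time.
by_region = ad_hoc_sum(events, "region", "revenue")
by_channel = ad_hoc_sum(events, "channel", "revenue")
```

Nothing here had to be modeled upfront: asking for revenue by channel instead of by region is a different argument, not a schema change. A real system would of course also have to deal with scale, indexing, and data quality, which this sketch ignores.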

Assertion 2: Column stores will dominate the data warehouse market over time, replacing row stores.

This assertion assumes that there are only two ways of organizing data, either in a row store or in a column store. This is not true. Look at my NoSQL explanation above and also in my post “The Future Of BI In The Cloud”, for an alternate storage approach.

This assertion also assumes that the access performance is tightly dependent on how the data is stored. While this is true in most cases, many vendors are challenging this assumption by introducing an acceleration layer on top of the storage layer. This approach makes it feasible to achieve consistent query performance through a clever acceleration architecture that acts as an access layer and does not depend on how the data is stored and organized.

“Since fact tables are getting fatter over time as business analysts want access to more and more information, this architectural difference will become increasingly significant. Even when "skinny" fact tables occur or where many attributes are read, a column store is still likely to be advantageous because of its superior compression ability.”

I don’t agree with the solution that we should have fatter fact tables when business analysts want more information. Even if this is true, how will a column store be advantageous when the data grows beyond the limit where compression isn’t that useful?
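For readers who want to see what the compression argument in the quote rests on, here is a minimal sketch in Python. It uses run-length encoding, which is only one of several compression schemes and is my choice for illustration; the `rows` data and the `run_length_encode` helper are hypothetical. It compares how many runs the same data produces when laid out row by row versus column by column.

```python
from itertools import groupby

# Hypothetical fact rows: (region, product, quantity), sorted by region.
rows = [
    ("EMEA", "A", 3), ("EMEA", "B", 1), ("EMEA", "A", 7), ("EMEA", "C", 2),
    ("APAC", "A", 5), ("APAC", "B", 4), ("APAC", "B", 9), ("APAC", "C", 6),
]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Row layout: values of different attributes are interleaved.
row_layout = [value for row in rows for value in row]

# Column layout: each attribute's values are stored contiguously.
region_column = [row[0] for row in rows]
quantity_column = [row[2] for row in rows]

row_runs = run_length_encode(row_layout)
region_runs = run_length_encode(region_column)
quantity_runs = run_length_encode(quantity_column)
```

In this toy data set the sorted, low-cardinality `region` column collapses to two runs, the interleaved row layout does not compress at all, and neither does the `quantity` column, whose values rarely repeat. So the advantage depends on what the columns look like, which is part of why I question how far compression alone carries the argument.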

“For these reasons, over time, column stores will clearly win”

Even if it is only about rows versus columns, the column store may not be a clear commercial winner in the marketplace. Runtime performance is just one of many factors that the customers consider while investing in DW and business intelligence.

“Note that almost all traditional RDBMSs are row stores, including Oracle, SQLServer, Postgres, MySQL, and DB2.”

Exactly!

The row stores, with optimization and acceleration, have demonstrated reasonably good performance to stay competitive. Not that I favor one over the other, but not all row-based DWs are so large, growing so rapidly, or suffering from such serious performance issues that they warrant a switch from a row store to a column store.

This leads me to my last issue with this assertion. What about a hybrid store – row and column? Many vendors are trying to figure this one out and if they are successful, this could change the BI outlook. I will wait and watch.

Assertion 3: The vast majority of data warehouses are not candidates for main memory or flash memory.

I am assuming that he is referring to flash memory used as an extension of main memory and not flash memory as storage. That said, SSD block storage has huge potential in the BI world.

“It will take a long time before main memory or flash memory becomes cheap enough to handle most warehouse problems.”

Not all DWs are growing at the same speed. One size does not fit all. Even if I agree that the price won’t go down significantly, at the current price point, main memory and flash memory can speed up many DWs without breaking the bank.

The cost of flash memory is a small fraction of the overall cost of a DW: hardware, licenses, maintenance, and people. If the added cost of flash memory makes the business more agile, reduces maintenance cost, and allows companies to make faster decisions based on smarter insights, it’s worth it. The upfront capital cost is not the only deciding factor for BI systems.

“As such, non-disk technology should only be considered for temporary tables, very "hot" data elements, or very small data warehouses.”

This is easier said than done. The customers will spend significantly more time and energy, on a complicated architecture, to isolate the hot elements and run them on a different software/hardware configuration.
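To give a feel for the moving parts involved, here is a deliberately naive sketch in Python of deciding which tables count as "hot". Everything in it is an assumption made for illustration: the `access_log` contents, the `assign_tiers` helper, and the `memory_slots` capacity parameter are hypothetical and do not describe how any vendor implements tiering.

```python
from collections import Counter

# Hypothetical access log: one table name per query that touched it.
access_log = ["sales", "sales", "inventory", "sales", "returns",
              "inventory", "sales", "audit"]


def assign_tiers(log, memory_slots):
    """Place the most frequently accessed tables in the fast tier."""
    counts = Counter(log)
    # Sort by descending access count, then by name, so ties are deterministic.
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    hot = set(ranked[:memory_slots])
    return {name: ("memory" if name in hot else "disk") for name in counts}


tiers = assign_tiers(access_log, memory_slots=2)
```

Even this toy version needs an access log, a ranking policy, and a capacity limit. In a real deployment the hot set also shifts over time, so data has to be moved between tiers and queries have to be routed accordingly, which is where the extra time and energy go.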

Assertion 4: Massively parallel processor (MPP) systems will be omnipresent in this market.

Yes, MPP is the future. No disagreements. The assertion is not about on-premise or the cloud, but I truly believe that the cloud is the future for MPP. There are other BI issues that need to be addressed before the cloud becomes a good BI platform for a massive-scale DW, but the cloud will beat any other platform when it comes to MPP with computational elasticity.

Assertion 5: "No knobs" is the only thing that makes any sense.

“In other words, look for "no knobs" as the only way to cut down DBA costs.”

I agree that “no knobs” is what the customers should strive for to simplify and streamline their DW administration, but I don’t expect it to significantly drive down the overall operational cost, or even the cost just associated with the DBAs. Not all DBAs spend their full time managing and tuning the DW. The DW deployments go through a cycle where the tasks include schema design, requirements gathering, ETL design, etc. Tuning or using the “knobs” is just one of many tasks that the DBAs perform. I absolutely agree that no knobs would certainly take some burden off the shoulders of a DBA, but I disagree that it would result in significant DBA cost savings.

For a fairly large deployment, there is significant cost associated with the number of IT layers that are responsible for channeling the reports to the business users. There is an opportunity to invest in the right kind of architecture, the technology stack for the DW, and the tools on top of that to help increase the ratio of business users to BI IT staff. This should also help speed up the decision-making process based on the insights gained from the data. Isn’t that the purpose of having a DW to begin with? I see self-service BI as the only way to make IT scale. Instead of cutting the DBA cost, I would rather focus on scaling the BI IT with the same budget and a broader coverage amongst the business users in an organization.
