cloud computing: Challenging Stonebraker’s Assertions On Data Warehouses

Check out Part 1 if you haven’t already read it, to better understand the context and my disclaimer. This is Part 2, covering assertions 6 through 10.

Assertion 6: Appliances should be "software only."

“In my 40 years of experience as a computer science professional in the DBMS field, I have yet to see a specialized hardware architecture—a so-called database machine—that wins.”

This is a black swan effect; just because someone hasn’t seen an event occur in his or her lifetime, it doesn’t mean that it won’t happen. This statement could also be rewritten as “In my 40 years of experience, I have yet to see a social network that is used by 500 million people.” You get the point. I would be the first to vote for commodity hardware over specialized hardware, but there are very specific reasons why specialized hardware makes sense in some cases.

“In other words, one can buy general purpose CPU cycles from the major chip vendors or specialized CPU cycles from a database machine vendor.”

Specialized machines don’t necessarily mean specialized CPU cycles. I hope the term “CPU cycles” is used as a metaphor and is not meant literally.

“Since the volume of the general purpose vendors are 10,000 or 100,000 times the volume of the specialized vendors, their prices are an order of magnitude under those of the specialized vendor.”

This isn’t true. The vendors who make general-purpose hardware also make specialized hardware, and no, it’s not an order of magnitude more expensive.

“To be a price-performance winner, the specialized vendor must be at least a factor of 20-30 faster.”

It’s wrong to assume that BI vendors use specialized hardware purely for performance reasons. In many cases, the “specialized” in an appliance is simply a specialized configuration. The appliance vendors also leverage their relationships with the hardware vendors to fine-tune the configuration to their requirements, negotiate a hefty discount, and execute a joint go-to-market strategy.

Enterprise software follows value-based pricing, not cost-based pricing. The price difference between a commodity system and a specialized appliance is not just the difference in the cost of the hardware it runs on.
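
The arithmetic behind the quoted claim can be made concrete with a minimal sketch. The function name and the sample numbers are hypothetical, chosen only for illustration; the 10x price ratio is the quote's premise, which I dispute above.

```python
def price_performance_advantage(speedup: float, price_ratio: float) -> float:
    """Performance per dollar of a specialized system relative to commodity.

    speedup     -- how many times faster the specialized system is
    price_ratio -- how many times more it costs
    A result above 1.0 means the specialized system wins on price-performance.
    """
    return speedup / price_ratio


# Under the quote's premise (10x the price), a 5x speedup loses.
premise_case = price_performance_advantage(speedup=5, price_ratio=10)

# Illustrative only: if the real premium were closer to 2x, the same 5x speedup wins.
smaller_premium_case = price_performance_advantage(speedup=5, price_ratio=2)

print(premise_case, smaller_premium_case)
```

The required speedup scales directly with the assumed price premium, so the conclusion depends entirely on whether the order-of-magnitude premise holds.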

“However, every decade several vendors try (and fail).”

I am not sure what success criteria this assertion uses to declare someone a winner or a failure. The acquisitions of Netezza, Greenplum, and Kickfire are recent examples of how well the appliance companies have performed. The incumbent appliance vendors are doing great, too.

“Put differently, I think database appliances are a packaging exercise”

Appliances are far more than a packaging exercise. Besides making sure that the software works on the selected hardware, commoditized or otherwise, they give customers a black-box approach to lifecycle management. The upfront cost of an appliance is a small fraction of the money that customers end up spending over the entire lifecycle of the appliance and the related BI efforts. Customers do welcome an approach where they are responsible for managing one appliance rather than five different systems at ten different levels with fifteen different technology stack versions.

Assertion 7: Hybrid workloads are not optimized by "one-size fits all."

Yes, I agree, but that’s not the point. It’s difficult to optimize hybrid workloads for a row store or a column store, but it is not as difficult for a hybrid store.

“Put differently, two specialized systems can each be a factor of 50 faster than the single "one size fits all" system in solution 1.”

Once again, I agree, but it does not apply to all situations. As I discussed earlier, performance is not the only criterion that matters in the BI world. In fact, I would argue the opposite. Because OLTP and OLAP systems are orthogonal, the vendors compromised everything else to gain performance. Now that’s changing. Let’s take the example of an operational report. This is the kind of report that only has value if it is consumed in real time. For such reports, the users can’t wait until the data is extracted out of the OLTP system, cleaned up, and transferred into the OLAP system. Yes, it could be 50 times faster, but completely useless, since you missed the boat.

The hybrid systems, the ones that combine OLTP and OLAP, are fairly new, but they promise to solve a very specific problem, which is real real-time. While the hybrid systems evolve, the computational capabilities of OLTP and OLAP systems have started to change as well. I now see OLAP systems supporting write-backs with reasonable throughput and OLTP systems with good BI-style query performance, all of it achieved through modern hardware and clever use of architectural components.

Let’s not forget what optimization really is. It means the desired functionality at reasonable performance. A real-time report that takes 10 seconds to run could be far more valuable than a report that runs in under ten milliseconds, three days later.
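
To make the comparison concrete, here is a minimal sketch that adds the data delay to the query runtime. The function name and the two scenarios are hypothetical; they simply restate the 10-second and three-day figures above.

```python
from datetime import timedelta


def time_to_insight(data_delay: timedelta, query_runtime: timedelta) -> timedelta:
    """Total wait between an event happening and the report reflecting it."""
    return data_delay + query_runtime


# A hybrid system: the data is queryable immediately, but the query is slow.
hybrid = time_to_insight(timedelta(0), timedelta(seconds=10))

# Separate OLTP and OLAP systems: a fast query over data that arrives three days late.
warehouse = time_to_insight(timedelta(days=3), timedelta(milliseconds=10))

print("hybrid:   ", hybrid)
print("warehouse:", warehouse)
print("hybrid answers first:", hybrid < warehouse)
```

The query runtime is only one term in the sum; for an operational report, the data delay dominates it.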

“A factor of 50 is nothing to sneeze at.”

Yes, point taken. :-)

Assertion 8: Essentially all data warehouse installations want high availability (HA).

No, they don’t. This is like saying all customers want a five-nines SLA on the cloud. I don’t underestimate the business impact when a DW goes down, but not all DWs are used 24x7 or are mission critical. One size doesn’t fit all. And if your DW is not required to be highly available, you need to ask yourself whether it is fair to pay the architectural cost of HA when you don’t need it. Tiered SLAs are not new, and tiered HA is not a terrible idea.

Let’s talk about the DWs that do need to be highly available.
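
As a rough illustration of what the tiers mean, this sketch converts an availability percentage into the downtime it allows per year. The tier names and the 365-day year are assumptions made for the example, not figures from any vendor's SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Hypothetical tiers, from relaxed to strict.
tiers = {
    "two nines": 99.0,
    "three nines": 99.9,
    "four nines": 99.99,
    "five nines": 99.999,
}

for name, percent in tiers.items():
    minutes = allowed_downtime_minutes(percent)
    print(f"{name} ({percent}%): {minutes:,.1f} minutes/year")
```

Five nines leaves a little over five minutes of downtime a year, while two nines leaves more than three and a half days. A DW that is only queried during business hours may be perfectly well served by a lower tier.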

“Moreover, there is no reason to write a DBMS log if this is going to be the recovery tactic. As such, a source of run-time overhead can be avoided.”

I am a little confused by how this is worded. Which logs are we referring to: those of the source systems or those of the target systems? The source systems are beyond the control of a BI vendor. There are newer approaches to designing an OLTP system without a log, but that’s not up for discussion under this assertion. If the assertion refers to the logs of the target system, how does that become a run-time overhead? Traditional DW systems are read-only at run time; they don’t write logs back to the system. If he is referring to the logs written while the data is being moved into the DW, that’s not really run time, unless we are talking about a hot transfer.

There is one more approach, NoSQL, where eventual consistency is achieved over a period of time and the concept of a “corrupted system” is going away. Incomplete data is an expected behavior and people should plan for it. That’s the norm, regardless of a system being HA or not. Recently Netflix moved some of its applications to the cloud, where they have designed a background data fixer to deal with data inconsistencies.
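
A background fixer of this kind can be sketched in a few lines. This is not Netflix's implementation, which I have not seen; it is a minimal illustration that assumes each record carries a logical timestamp and that the newest version wins (last-write-wins). The `Versioned` class and the `repair` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int  # assumed logical clock; a higher number means a newer write


def repair(replicas: list[dict[str, Versioned]]) -> int:
    """Copy the newest version of every key to all replicas.

    Returns the number of entries that had to be fixed.
    """
    fixes = 0
    all_keys = set().union(*replicas)  # every key seen on any replica
    for key in all_keys:
        newest = max(
            (replica[key] for replica in replicas if key in replica),
            key=lambda version: version.timestamp,
        )
        for replica in replicas:
            if replica.get(key) != newest:  # missing or stale
                replica[key] = newest
                fixes += 1
    return fixes


replica_a = {"order-1": Versioned("shipped", 2)}
replica_b = {"order-1": Versioned("pending", 1), "order-2": Versioned("new", 1)}

fixes = repair([replica_a, replica_b])
print(fixes, replica_a == replica_b)
```

Until the fixer runs, a reader may see the stale "pending" value or miss "order-2" entirely, which is exactly the incomplete data that applications on such systems have to plan for.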

HA is not black and white, and there are many more approaches, beyond logs, to achieve the desired outcome.

Assertion 9: DBMSs should support online reprovisioning.

“Hardly anybody wants to take the required amount of down time to dump and reload the DBMS. Likewise, it is a DBA hassle to do so. A much better solution is for the DBMS to support reprovisioning, without going offline. Few systems have this capability today, but vendors should be encouraged to move quickly to provide this feature.”

I agree. I would add one thing. Even today, the vendors have trouble supporting offline provisioning to cater to increasing load. Online reprovisioning is not trivial since, in many cases, it requires the vendors to re-architect their systems. The vendors typically get away with this because most customers don’t do capacity planning in real time. Unfortunately, traditional BI systems are not a commodity where customers can plug in more blades when they want them and take them out when they don’t.

This is the fundamental premise behind why the cloud makes a great BI platform: elastic computing addresses such reprovisioning issues. Read my post “The Future Of BI In The Cloud” if you want to understand how horizontal scale-out systems can help.

Assertion 10: Virtualization often has performance problems in a DBMS world.

This assertion, and the one before it, made me write the post “The Future Of BI In The Cloud”. I will not repeat what I wrote there, but I will quickly highlight what is relevant.

“Until better and cheaper networking makes remote I/O as fast as local I/O at a reasonable cost, one should be very careful about virtualizing DBMS software.”

Virtualizing I/O is not a solution for a large DW with complex queries. However, as I wrote in the post, a good solution is not to make remote I/O faster, but rather to tap into the innovation of software-only SSD block I/O that is local.

“Of course, the benefits of a virtualized environment are not insignificant, and they may outweigh the performance hit. My only point is to note that virtualizing I/O is not cheap.”

This is what a disruption initially looks like. You start seeing good enough value in an approach for certain types of solutions, while it still seems expensive for another set of solutions. Over a period of time, rapid innovation and economies of scale remove this price barrier. I think that’s where virtualization stands today. Organizations have started to use the cloud for IaaS and SaaS for a variety of solutions, including good enough self-service BI and performance optimization solutions. I expect to see more and more innovation in this area, to the point where a traditional large DW will be able to get enough value out of the cloud even after paying the virtualization overhead.

Tuesday, November 9, 2010

