Friday, March 5, 2010

NoSQL Is Not SQL And That’s A Problem

I do recognize the thrust behind the NoSQL movement. While some are announcing the end of an era for MySQL and memcached, others are questioning the arguments behind Cassandra’s OLTP claims and the scalability and universal applicability of NoSQL. It is great to see innovative data persistence and access solutions that challenge the long-lasting legacy of the RDBMS. Competition between HBase and Cassandra is heating up. Amazon now supports a variety of consistency models on EC2.

However, none of the NoSQL solutions solves a fundamental underlying problem – a developer has to pick persistence, consistency, and access options for an application up front.

I would argue that the RDBMS has been popular for the last 30 years because SQL is ubiquitous. Whenever developers wanted to design an application, they put an RDBMS underneath and used SQL from all possible layers. Over time the RDBMS grew in functions and features such as binary storage, faster access, and clusters, and the applications reaped these benefits.

I still remember the days when you had to use a rule-based optimizer to teach the database how best to execute a query. These days cost-based optimizers can find the best plan for a SQL statement and take the guesswork out of the equation. This evolution teaches us an important lesson. Application developers, and to some extent even database developers, should not have to learn the underlying data access and optimization techniques. They should expect an abstraction that allows them to consume data where consistency and persistence are optimized based on the application's needs and the content being persisted.

SQL did a great job as a non-procedural language (what to do) compared with many past and current procedural languages (how to do it). SQL did not solve the problem of staying independent of the schema. Developers still had to learn how to model the data. When I first saw schema-less data stores I thought we would finally solve the age-old problem of deciding up front how data is organized. We did solve this problem, but we introduced a new one: a lack of ubiquitous access and consistency options for schema-less data stores. Each of these data stores came with its own set of access APIs that are not necessarily complicated but are uniquely tailored to address parts of the mighty CAP theorem. Some solutions went even further and optimized for specific consistency models such as eventual consistency and weak consistency.
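To make the "what" versus "how" contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table and its rows are invented purely for illustration; the point is only that the SQL version states the desired result while the loop spells out the access steps.

```python
import sqlite3

# Hypothetical sample data, invented purely for illustration.
rows = [("alice", 30), ("bob", 17), ("carol", 45)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)

# Non-procedural: state what is wanted; the engine decides how to fetch it.
declarative = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM users WHERE age >= 18 ORDER BY name"
    )
]

# Procedural: spell out how to scan, filter, and sort.
procedural = []
for name, age in rows:
    if age >= 18:
        procedural.append(name)
procedural.sort()

# Both approaches produce the same list of names.
print(declarative == procedural)
```

If an index is later added on `age`, the SQL statement stays the same, whereas the hand-written loop would have to be rewritten to take advantage of it.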

I am always in favor of giving more options to the developers. It’s usually a good thing. However, what worries me about NoSQL is that it is not SQL. There simply isn’t enough push for ubiquitous and universal design-time abstractions. The runtime is certainly getting better, cheaper, and faster, but it is being pushed directly to the developers, skipping a whole lot of layers in between. Google designed BigTable and MapReduce. Facebook took the best of BigTable and Dynamo to design Cassandra, and Yahoo wanted scripting rather than programming on Hadoop and hence designed Pig. These vendors spent significant time and resources for one reason – to make their applications run faster and better. What about the rest of the world? Not all applications share the same characteristics as Facebook and Twitter, and enterprise software is certainly quite different.

I would like to throw out a challenge. Design a data store that has a ubiquitous interface for application developers and is independent of consistency models, upfront data modeling (schema), and access algorithms. As a developer you start storing, accessing, and manipulating information, treating everything underneath as a service. As a data store provider you would gather upstream application and content metadata to configure, optimize, and localize your data store to provide a ubiquitous experience to the developers. As an ecosystem partner you would plug your hot-swappable modules into the data stores; these modules are designed to meet the specific data access and optimization needs of the applications.
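One way to picture the three roles in this challenge is the rough sketch below. It is not a design for a real data store: every name in it (`StorageModule`, `ContentHints`, `DataStore`, `describe`, `swap`) is hypothetical, and the two in-memory modules are toy stand-ins for real strongly and eventually consistent backends. The developer only calls `put` and `get`; the provider supplies metadata; the partner supplies modules.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, Optional


class StorageModule(ABC):
    """Hypothetical plug-in contract an ecosystem partner would implement."""

    @abstractmethod
    def put(self, key: str, value: Any) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[Any]: ...


class StrongModule(StorageModule):
    """Toy stand-in: writes are visible immediately."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        return self._data.get(key)


class EventualModule(StorageModule):
    """Toy stand-in: writes become visible only after sync()."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._pending: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._pending[key] = value

    def get(self, key: str) -> Optional[Any]:
        return self._data.get(key)

    def sync(self) -> None:
        self._data.update(self._pending)
        self._pending.clear()


@dataclass
class ContentHints:
    """Assumed shape of the metadata a provider might gather."""
    needs_strong_consistency: bool = False


class DataStore:
    """The single interface the application developer sees."""

    def __init__(self, strong: StorageModule, eventual: StorageModule) -> None:
        self._modules = {"strong": strong, "eventual": eventual}
        self._hints: Dict[str, ContentHints] = {}

    def describe(self, collection: str, hints: ContentHints) -> None:
        # Provider side: record metadata about a collection.
        self._hints[collection] = hints

    def swap(self, name: str, module: StorageModule) -> None:
        # Partner side: replace a module without touching application code.
        self._modules[name] = module

    def _route(self, collection: str) -> StorageModule:
        hints = self._hints.get(collection, ContentHints())
        name = "strong" if hints.needs_strong_consistency else "eventual"
        return self._modules[name]

    def put(self, collection: str, key: str, value: Any) -> None:
        self._route(collection).put(f"{collection}/{key}", value)

    def get(self, collection: str, key: str) -> Optional[Any]:
        return self._route(collection).get(f"{collection}/{key}")


eventual = EventualModule()
store = DataStore(StrongModule(), eventual)
store.describe("payments", ContentHints(needs_strong_consistency=True))

# Application code never names a consistency model.
store.put("payments", "order-1", 100)
store.put("status_updates", "u1", "hello")
print(store.get("payments", "order-1"))
print(store.get("status_updates", "u1"))
```

In this toy version the payment is readable right away, while the status update stays invisible until `eventual.sync()` runs. The application code is identical for both collections, which is the property the challenge asks for; the hard part in a real system is gathering the metadata automatically rather than through an explicit `describe` call.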

Are you up for the challenge?

4 comments:

Alok Bhargava said...

An excellent article, but it left me a little baffled about the challenge. Why would you require independence of consistency models as part of the challenge? It seems that, given the impossibility of working around the CAP theorem, consistency is the objective chosen for sacrifice (primarily for business reasons) in order to build a working system. So why would building consistency-safe systems for general use be the goal? Are you suggesting that such a system be designed so that it could be used as an application/service by others? Would it not concern the users how their consistency issues are resolved by the underlying infrastructure?

Thanks very much for the writeup!

Dheeraj Saxena said...

I would like to add a twist to the argument, which to me seems to suggest that developers and designers only have a mutually exclusive choice when it comes to RDBMS vs. NoSQL. Either fix and design your schema ahead of time and then run around designing/maintaining caching/optimization layers, or dump it all in some document-oriented key/value store and then scratch your head when it is time to maintain consistency, transactionality and, most importantly, reporting and analytics. I would opine that a more astute architectural approach would be to contextually combine the legacy mentality (and experience) of using pure-play RDBMS-based data stores with the flexibility and out-of-the-box performance of NoSQL engines, something like the approach taken by Content Management Systems that maintain metadata in relational tables while the actual content can be stored in a variety of formats. Now I am no insider to Facebook or Twitter, but I doubt that if I purchase virtual currency on Facebook my information would be dumped in a couchdb/mongodb/hbase-powered database, or that if I view my profile page on Twitter the information is being dynamically generated from multiple RDBMS tables at runtime.
To address the larger theme behind Chirag's original comment '....a developer upfront has to pick persistence, consistency, and access options for an application.....', I guess one way forward could be to micro-model transactional/analytical metadata requirements into an RDBMS while keeping a NoSQL engine for storing 'dataset dumps'. The ingenuity and relative competitive advantage of an application would likely lie in the quality of its metadata stores (which presumably would involve much less upfront effort than designing an entire RDBMS schema) in such hybrid architectures.

Unknown said...

Great information shared. Technology is very much developed. I am very much interested in the computing field, which is why I am collecting information about it. I have planned to attend the upcoming 2nd Annual Virtual Conference, which is going to be hosted online in March 2010. I believe I would benefit much from that cloud computing conference.

Watchdog said...

Chirag,

You have laid down the gauntlet we picked up two years ago!

GenieDB was designed to provide developers, like ourselves, a way of building apps that scale out invisibly, without us having to worry about eventual consistency after porting parts or all of our applications to a webscale / NoSQL architecture. SQL & NoSQL working together, scaling together, without changing the way we work... and there's a working GenieDB for MySQL storage engine in beta to prove our point.

I know you saw our #UTR presentation on this very subject. The full details of how we've achieved an engineering solution to the CAP problem are in our whitepapers on www.geniedb.com.

Thanks again for your very kind tweets regarding our pitch :). It was tough out there!

Very best,

Jack Kreindler (Founder, GenieDB)