Thursday, November 21, 2013

Rise Of Big Data On Cloud


Growing up as an engineer and as a programmer I was reminded every step along the way that resources—computing as well as memory—are scarce. The programs were designed on these constraints. Then the cloud revolution happened and we told people not to worry about scarce computing. We saw rise of MapReduce, Hadoop, and countless other NoSQL technology. Software was the new hardware. We owe it to all the software development, especially computing frameworks, that allowed developers to leverage the cloud—computational elasticity—without having to understand the complexity underneath it. What has changed in the last two to three years is a) the underlying file systems and computational frameworks have matured b) adoption of Big Data is driving the demand for scale out and responsive I/Os in the cloud.

Three years back, I wrote a post, The Future Of The BI In Cloud where I had highlighted two challenges of using cloud as a natural platform for Big Data. The first one was to create a large scale data warehouse and the second was lack of scale out computing for I/O intensive applications.

A year back Amazon announced RedShift, a data warehouse service in the cloud, and last week they announced high I/O instances for EC2. We have come a long way and more and more I look at the current capabilities and trends, Big Data, at scale, on the cloud, seems much closer to reality.

From a batched data warehouse to interactive analytic applications:

Hadoop was never designed for I/O intensive applications, but Hadoop being a compelling computational scale out platform developers had a strong desire to use it for their data warehousing needs. This made Hive and HiveQL popular analytic frameworks but this was a sub optimal solution that worked well for batch loads and wasn't suitable for responsive and interactive analytic applications. Several vendors realized there's no real reason to stick to the original style of MapReduce. They still stuck to the HDFS but significantly invested into alternatives to Hive which are way faster.

There are series of such projects/products that are being developed on HDFS and MapReduce as a foundation but by adding special data management layers on top of it to run interactive queries much faster compared to plain vanilla Hive. Some of those examples are Impala from Cloudera and Apache Drill from MapR (both based on Dremel), HAWQ from EMC, Stinger from Hortonworks and many other start-ups. Not only vendors but the early adopters such as Facebook created Hive projects such as Presto, an accelerated Hive, which they recently open sourced.

From raw data access frameworks to higher level abstraction tools: 

As vendors continue to build more and more Hive alternatives I am also observing vendors investing in higher level abstraction frameworks. Pig was amongst those first higher level frameworks that made it easier to express data analysis programs. But, now, we are witnessing even higher layer rich frameworks such as Cascading and Cascalog not only to write SQL queries but write interactive programs in higher level languages such as Clojure and Java. I'm a big believer in empowering developers with right tools. Working directly against Hadoop has a significant learning curve and developers often end up spending time on plumbing and other things that can be abstracted out in a tool. For web development, popularity of Angular and Bootstrap are examples of how right frameworks and tools can make developers way more efficient not having to deal with raw HTML, CSS, and Javascript controls.

From solid state drives to in-memory data structures: 

Solid state drives were the first step in upstream innovation to make I/Os much faster but I am observing this trend go further where vendors are investing into building in-memory resident data management layers on top of HDFS. Shark and Spark are amongst the popular ones. Databricks has made big bets on Spark and recently raised $14M. Shark (and hence Spark) is designed to be compatible with Hive but designed to run queries 100x times faster by using in-memory data structures, columnar representation, and optimizing MapReduce not to write intermediate results back to disk. This looks a lot like MapReduce Online which was a research paper published a few years back. I do see a UC Berkeley connection here.

Photo courtesy: Trey Ratcliff