cloud computing: April 2009

Tuesday, April 28, 2009

Pre-seeding CRM With Data-as-a-service To Accelerate Adoption

I had a discussion with Jim Fowler, the CEO of JigSaw, a couple of weeks back, in which he walked me through Data Fusion, the new offering that JigSaw announced today. Data Fusion is a data-as-a-service offering that allows Salesforce.com customers to buy a large list of prospects with detailed, verified contact information provided by JigSaw. There are plenty of legal and ethical issues around how JigSaw acquires the business contact information. Michael Arrington does not like JigSaw, and Rafe Needleman calls it one of the creepiest products that he has ever seen. I don't want to argue about these ethical and legal aspects; I will let other people, users, and customers sort that out.

I find the idea of acquiring such a list to pre-seed a CRM instance with vetted data an interesting use of data-as-a-service. A pre-seeded CRM instance speeds up adoption of the tool inside an organization, since salespeople suddenly start seeing value in the tool and are willing to invest their time in it. This could cause network effects similar to those of a social network, where more people and more data bring in a lot more data. Salespeople inside an organization typically view a CRM system as administrative overhead since they don't get any value out of it. They are compensated based on the deals closed and not on the data being correct. This dynamic affects the adoption of a CRM tool and the quality of the data in it. Pre-seeding a CRM instance with some good data and keeping it clean moving forward could solve this problem.

Jim Fowler asks: if organizations have made a conscious decision not to own software by moving to SaaS, why should they even own data? That's certainly an interesting take on data-as-a-service to enable SaaS 2.0. There has been an ongoing tension between LOBs and IT since most SaaS purchase decisions are initially driven by LOBs, and IT is brought into the discussion late in the game. LOBs will continue looking for an easy solution that meets their needs and does not require upfront IT involvement. Data-as-a-service certainly adds value on top of SaaS. I can imagine some "come to Jesus" meetings between LOBs and IT. I would love to be a fly on the wall for some of those meetings!

Friday, April 24, 2009

Database Continuum On The Cloud - From Schemaless To Full-Schema

A recent paper by Mike Stonebraker and others compared relational and columnar databases in a parallel configuration with MapReduce. The paper concludes that MapReduce is easy to configure and easy to use, whereas the other data stores, relational and columnar databases, pay the upfront price of organizing the data but outperform MapReduce at runtime. This study does highlight the fact that a chosen option does not necessarily dictate or limit the scale as long as other attributes, such as an effective parallelism algorithm, B-tree indices, main-memory computation, and compression, can help achieve the desired scale.

The real issue, which is not being addressed, is that even if the chosen approach does not limit the scale, it still significantly impacts the design-time decisions that developers and architects have to make. These upfront decisions limit the functionality of the applications built on these data stores and reduce the overall design agility of the system. Let's look at a brief history of the evolution of DBMS, a data mining renaissance, and what we really need in order to design a data store that makes sense from the consumption viewpoint and not the production viewpoint.

Brief history of evolution of DBMS

Traditionally the relational database systems were designed to meet the needs of transactional applications such as ERP, SCM, CRM etc. also known as OLTP. These database systems provided row-store, indexes that work for selective queries, and high transactional throughput.

Then came the BI age that required accessing all the rows but fewer columns and had the need to apply mathematical functions such as aggregation, average etc. on the data that was being queried. Relational DBMS did not seem to be the right choice but the vendors figured out creative ways to use the same relational DBMS for the BI systems.

As the popularity of BI systems and the volume of data grew, two kinds of solutions emerged: one that still used the relational DBMS but accelerated performance via innovative schemas and specialized hardware, and the other, the columnar database, that used a column-store instead of a row-store. A columnar DBMS stores data grouped by column so that a typical BI query can read all the rows but fewer columns in a single read operation. Columnar vendors also started adding compression and main-memory computation to accelerate runtime performance. The overall runtime performance of BI systems certainly got better.
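To make the row-store versus column-store distinction concrete, here is a minimal in-memory sketch. The order records and function names are hypothetical and chosen only for illustration; a real columnar DBMS adds on-disk layout, compression, and much more.

```python
from statistics import mean

# Hypothetical order records, used only for illustration.
rows = [
    {"order_id": 1, "region": "west", "amount": 120.0},
    {"order_id": 2, "region": "east", "amount": 80.0},
    {"order_id": 3, "region": "west", "amount": 100.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def average_from_rows(records, column):
    """Row-store style: every full record is visited to read one field."""
    return mean(record[column] for record in records)


def average_from_columns(columns, column):
    """Column-store style: only the requested column is read."""
    return mean(columns[column])


columns = to_columns(rows)
row_avg = average_from_rows(rows, "amount")
col_avg = average_from_columns(columns, "amount")
```

Both functions return the same average; the difference is how much data each layout has to touch to answer a typical BI query over one column.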

Both the approaches, row-based and columnar, still required ETL - a process to extract data out of the transactional systems, apply some transformation functions, and load data into a separate BI store. They did not solve the issue of "design latency" - upfront time consumed to design a BI report due to the required transformation and a series of complicated steps to model a report.

Companies such as Greenplum and Aster Data decided to solve some of these legacy issues. Greenplum provides design-time agility by adopting a dump-all-your-data approach and applying transformations on the fly only when needed. Aster Data has three layers to address the query, load, and execute aspects of the data. These are certainly better approaches that use parallelism really well and have cloud-like behavior, but they are still designed to patch up the legacy issues and do not provide a clean design-time data abstraction.

What do we really need?

MapReduce is powerful since it is extremely simple to use. The developer supplies only two functions, map and reduce, and the framework takes care of splitting the input and grouping the intermediate results. Such schemaless approaches have lately grown in popularity because developers don't want to lock themselves into a specific data model. They also want to explore ad hoc computing before optimizing performance. There are also extreme scenarios such as FriendFeed using the relational database MySQL to store schemaless data. MapReduce has a very low barrier to entry. On the other hand, the fully-defined schema approach of relational and columnar DBMS offers great runtime performance once the data is loaded and indexed for transactional access and for executing BI functions such as aggregation, average, mean etc.
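As a rough illustration of how little the programming model asks of a developer, here is a single-process sketch of a word count, assuming that task as the example. It is not how Hadoop or any real framework is implemented; the shuffle function stands in for the grouping work a framework would do across machines, and all names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key; a real framework performs this step itself."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    """Run map over every document, group by key, then reduce each group."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = map_reduce(["the cloud is elastic", "the cloud scales"])
```

Note that nothing here declares a schema: the input is raw text, and structure emerges only in the output of the reduce step.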

What we really need is a continuum from a schemaless to a full-schema database based on the context, action, and access patterns of the data. A declarative, abstracted persistence layer to access and manipulate the database, optimized locally for various actions and access patterns, is the right approach. This will allow developers to fetch and manipulate the data independent of the storage and access mechanism. For example, developers can design an application where a single page can perform a complex structured and unstructured search, create a traditional transaction, and display rich analytics information from a single logical data store without worrying about what algorithms are being used to fetch and store data and how the system is designed to scale. This might require a hybrid data store architecture that optimizes the physical storage of data for certain access patterns and, for other patterns, uses redundant storage replicated in real time and other mechanisms such as accelerators, to provide unified data access to the applications upstream.

Schemaless databases such as SimpleDB, CouchDB, and Dovetail are in their infancy, but the cloud is a good platform to support the key requirements of schemaless databases: incremental provisioning and progressive structure. The cloud is also a great platform for the full-schema DBMS, since it offers utility-style incremental computing to accelerate runtime performance. A continuum on the cloud may not be that far-fetched after all.

Friday, April 10, 2009

Amazon's Re-designed Review System Generates More Revenue But Has Plenty Of Untapped Potential

Amazon's design tweaks to its review system have resulted in $2.7 billion of new revenue, argues Jared Spool. Other people have also picked up this story with their own analysis. I am wary of absolute revenue numbers tied to a feature to derive lost opportunity cost since a variety of other things could have driven the sales. It is wrong to assume that people would not have bought the products had the feature not existed. However, I do believe it is a great step in the direction of making the review system more useful and driving more clickthroughs and conversions. Simply the presence of the reviews, magic number 20 in this case, motivates consumers to drill down into the details of a product and its reviews.

Amazon has made significant progress in collaborative filtering through its review system, and it is an exemplar of a long tail business model. It has helped consumers gain transparency and has also helped expose issues with products. This is not enough. As an e-commerce market leader, Amazon should continue innovating around its review system. This is what I specifically would like to see in Amazon's review system:

Mining social media channels: Amazon.com is not the only place where consumers talk about products. Consumers discuss product features and frustrations on Facebook, Twitter, and other social media outlets. Amazon has an opportunity to provide a unified product review experience, a tool similar to ConvoTrack, by tapping into these social media channels for all the product conversations.

Tag cloud as a visual filter: One of the ways to make sense of a large number of reviews is to generate a tag cloud from the raw text of the reviews. A tag cloud acts as a great visual filter to narrow down the reviews that consumers are looking for, e.g. looking only at rebooting issues and nothing else while buying a router.

Provide diverse search options: I want to search for the routers that have 4- or 5-star ratings in the last 6 months. I cannot do that today. This search criterion makes sense. Manufacturers fix defects via firmware updates and models tend to improve as they mature. If an item had many negative reviews early on, there is no way to find out whether the issues have been fixed without reading the later positive reviews. Higher recent ratings tend to correlate with a mature product and satisfied customers.

Re-think one-size-fits-all format: All the products sold on Amazon, ranging from a book to a TV, have the exact same review format. It does not have to be that way. Book reviews tend to be more subjective and philosophical, whereas gadget reviews are generally more fact-based, e.g. watch out, this monitor does not come with a DVI cable. Re-thinking the format for the types of products being sold makes sense, e.g. a pros and cons section for gadgets, similar books to the one that I am reviewing etc.

Incentivize people to write reviews: A few days after consumers receive a product, ask them whether they are satisfied with their purchase. Incentivize them to write reviews of the product; not only does this help generate more reviews per product, but it also brings people back to Amazon to make more purchases. Make promotional email personal and relevant, e.g.

How are you liking the "Tipping Point"? Malcolm Gladwell has authored his latest book called "Outliers" and we are positive you will enjoy that as well. Would you mind writing a brief review of "Tipping Point"? In return, we will discount "Outliers" for you by 5%.

Closed-loop feedback channel: The current comments structure does not allow manufacturers, authors, and publishers to identify themselves, clarify features and issues, and respond to consumers' concerns. The reviews are a great platform and a closed-loop feedback channel for vendors to converse with consumers. Amazon could certainly extend the review system to help create a dialogue between consumers and manufacturers.
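Two of the suggestions above, the tag cloud filter and the search on recent ratings, are simple enough to sketch. The review records, field names, stop-word list, and the reading of "4 or 5 stars in the last 6 months" as an average of at least 4 over roughly 182 days are all assumptions made for illustration; none of this reflects Amazon's actual data model or API.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical reviews; field names are assumptions, not Amazon's data model.
reviews = [
    {"product": "router-a", "stars": 2, "posted": date(2008, 6, 1),
     "text": "Keeps rebooting every hour"},
    {"product": "router-a", "stars": 5, "posted": date(2009, 3, 10),
     "text": "Firmware update fixed the rebooting"},
    {"product": "router-a", "stars": 4, "posted": date(2009, 4, 2),
     "text": "Solid range and easy setup"},
    {"product": "router-b", "stars": 5, "posted": date(2008, 1, 15),
     "text": "Great range"},
]

STOP_WORDS = {"the", "and", "every", "keeps"}  # tiny illustrative list


def tag_counts(items):
    """Count words across review text; the counts would drive tag size."""
    words = (w.strip(".,!").lower() for r in items for w in r["text"].split())
    return Counter(w for w in words if w and w not in STOP_WORDS)


def filter_by_tag(items, tag):
    """Keep only the reviews whose text mentions the selected tag."""
    return [r for r in items if tag in r["text"].lower()]


def recent_average(items, today, days=182):
    """Average star rating per product over roughly the last six months."""
    cutoff = today - timedelta(days=days)
    stars = {}
    for r in items:
        if r["posted"] >= cutoff:
            stars.setdefault(r["product"], []).append(r["stars"])
    return {p: sum(s) / len(s) for p, s in stars.items()}


def highly_rated_recently(items, today, minimum=4.0):
    """Products whose recent average rating meets the minimum."""
    averages = recent_average(items, today)
    return sorted(p for p, avg in averages.items() if avg >= minimum)


today = date(2009, 4, 10)
tags = tag_counts(reviews)
rebooting_reviews = filter_by_tag(reviews, "rebooting")
top_recent = highly_rated_recently(reviews, today)
```

In this toy data, router-a has an early 2-star review about rebooting, but its recent reviews average 4.5 stars, which is exactly the signal the recent-ratings search is meant to surface.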

Monday, April 6, 2009

Accelerating Social Computing: Web 2.0 + Cloud = Web²

I was at the Web 2.0 expo in San Francisco last week. It was not very different from the previous year except that I could see the impact of the slow economy: shrinking attendance, less crowded booths, and "Hire Me" ribbons. Tim O'Reilly's keynote was interesting. He said that Web 2.0 was never about the version number (read: he does not like people calling Web 3.0 a successor of Web 2.0). He had the equation Web 2.0 + World = Web Squared. I changed it to Web 2.0 + Cloud = Web². The cloud seems more appropriate and the superscript is much cooler. If this catches on, remember, you read it here first!

The biggest shift that I have observed in Web 2.0 is the exponential growth of social media. This was evident at the Web 2.0 expo from the number of participating social computing companies. Web 2.0 is certainly taking the direction of social computing. Tim mentioned in his keynote that the immense data gathered by sensors and other means has hidden meaning in it and that applications have begun to understand this meaning. I could not agree more; that justifies replacing the word "World" with "Cloud". Amazon's recent announcement to offer MapReduce on EC2 and Cloudera's $5M series A funding to commercialize Hadoop are early indicators of the rising demand for data-centric massively parallel processing. The cloud is a natural enabler of this evolution that will help gather data, context, and interactions to amplify social conversations and create network effects. As John Maeda said in his keynote, people want to be human again. As Bill Buxton says:

User-centered design commonly tries to take into account different canonical user types through the use of persona. Perhaps one thing we need to do is to augment this tool with the notion of "placona," that is, capturing the canonical set of physical and social spaces within which any activity we are trying to support might be situated. After all, cognition does not reside exclusively in the brain. Rather, it is also distributed in the space in which we exercise that knowledge—in the location itself, the tools, devices, and materials that we use, and the people and social context in which all of this exists.

If one of the purposes of design and innovation is to improve our lives—for business, artistic, or familial purposes—then design that does not consider the larger social, cultural, and physical ecosystem is going to miss the mark.

Social computing is fundamentally a distributed problem that requires making sense of people's social and physical interactions with other people and objects, including the context. The cloud can make this feasible and we can truly accelerate towards Web². I think Tim will most likely drop the 2.0 from Web 2.0 next year - that in itself would be a great first step in leaving Web 2.0 behind and starting the journey towards Web².