cloud computing

Sunday, June 30, 2013

Celebrating Failures

Being a passionate design thinker I am a big believer in failing fast and failing often. I have taken this one step further; I celebrate one failure every week. Here's why:

You get more comfortable looking for failures, analyzing them, and learning from them

I have sat through numerous post-mortem workshops and concluded that the root causes of failures are usually the same: abstract concepts such as lack of communication, unrealistic scope, insufficient training, and so on. If that’s true, why do we repeat the same mistakes, so that failure remains a common situation? Primarily because many people find it hard to imagine and react to abstractions, but can relate much better when these concepts are contextualized into their own situation. A post-mortem of a project tells you what you already suspected; it's hindsight, and it comes a little too late. I have always advocated a "pre-mortem workshop" at the beginning of a project to prepare for failure. Visualize all the things that could go wrong by imagining that the project has failed. This gives the team an opportunity to proactively look at risks and prepare to prevent and mitigate them.

Failures, just like successes, become nothing more than events with different outcomes

A failure or a success is nothing but an event. Just as in sports, you can put in your best effort and still fail, because you control your effort, dedication, and passion but not the outcome. While it is absolutely essential to analyze mistakes and make sure you don't repeat them, in some cases, looking back, you would not have done anything differently. When you look at more failures more often, they tend to become events with different outcomes as opposed to one-off situations that you regret.

It changes your attitude toward taking risks because you are not afraid of the outcome

When failures are not one-off events and you anticipate and celebrate them more often, it changes how you think about many things, personally as well as professionally. It helps you minimize regret rather than failures.

I don't want to imply that failure is actually a good thing. No one really wants to fail, and yet failure is the only certainty. But it's all about failing fast, failing often, and correcting course before it's too late. Each failure presents us with an opportunity to learn from it. Don't waste a failure; celebrate it.

About the picture: I took this picture inside Notre Dame in Paris. I see lights as a medium to celebrate everything: the victory of good over evil, as celebrated during the Hindu festival Diwali, and a candlelight vigil to show support and motivate people for a change.

Thursday, June 13, 2013

Hacking Into The Indian Education System Reveals Score Tampering

Debarghya Das has a fascinating story on how he managed to bypass a silly web security layer to get access to the results of 150,000 ICSE (10th grade) and 65,000 ISC (12th grade) students in India. While the lack of security and the complete failure to safeguard sensitive information is an interesting topic, what is more fascinating about this episode is the analysis of the results, which unearthed score tampering. The school boards changed the scores of the students to give them "grace" points to bump them up to the passing level. The boards also seem to have tampered with some other scores, but the motive for that tampering remains unclear (at least to me).

I would encourage you to read the entire analysis and the comments, but a tl;dr version is:

32, 33 and 34 were visibly absent. This chain of 3 consecutive numbers is the longest chain of absent numbers. Coincidentally, 35 happens to be the pass mark.
Here's a complete list of unattained marks -
36, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 56, 57, 59, 61, 63, 65, 67, 68, 70, 71, 73, 75, 77, 79, 81, 82, 84, 85, 87, 89, 91, 93. Yes, that's 33 numbers!
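The gap analysis quoted above is easy to reproduce in principle. The following is a minimal sketch, not the code from the original analysis: it assumes the results are available as a plain list of integer marks, and the helper names `absent_marks` and `longest_run` as well as the toy data are made up for illustration.

```python
def absent_marks(scores, lo=0, hi=100):
    """Return the marks in [lo, hi] that nobody attained, in ascending order."""
    attained = set(scores)
    return [m for m in range(lo, hi + 1) if m not in attained]


def longest_run(values):
    """Return the longest run of consecutive integers in an ascending list."""
    best, current = [], []
    for v in values:
        if current and v == current[-1] + 1:
            current.append(v)
        else:
            current = [v]
        if len(current) > len(best):
            best = list(current)
    return best


# Toy data (an assumption, not the real results): marks between 30 and 40
# with 32, 33, 34 and 36 never occurring.
toy_scores = [30, 31, 35, 35, 37, 38, 39, 40]
missing = absent_marks(toy_scores, lo=30, hi=40)
print(missing)               # [32, 33, 34, 36]
print(longest_run(missing))  # [32, 33, 34]
```

On a real data set the same two calls would list every unattained mark and the longest chain of consecutive absent marks.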

The comments are even more fascinating: people point out flaws in his approach and challenge the CLT (central limit theorem) with rebuttals. If there had been no tampering with the scores, the results would defy the CLT with a probability so high that I can't even compute it. In other words, the chances of this guy being wrong about his inferences and conclusions are almost zero, if not zero.

He is using fairly simple statistical techniques and MapReduce-style computing to analyze a fairly decent-sized data set to infer and prove a specific hypothesis (most people, including me, believed that grace points existed but we had no evidence to prove it). He even created a public GitHub repository of his work, which he later made private.
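To give a flavor of what MapReduce-style computing means here, the sketch below builds a histogram of marks in two phases: a map step that counts marks within each chunk of records, and a reduce step that merges the partial counts. It is an illustration under assumptions, not his actual code; the `(student_id, mark)` record layout and the sample chunks are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_chunk(records):
    """Map step: count how often each mark occurs within one chunk of records."""
    return Counter(mark for _student_id, mark in records)


def reduce_counts(left, right):
    """Reduce step: merge two partial histograms by adding their counts."""
    return left + right


# Hypothetical chunks of (student_id, mark) pairs; real data would be far larger
# and the map step could run on each chunk in parallel.
chunks = [
    [("s1", 35), ("s2", 72), ("s3", 35)],
    [("s4", 88), ("s5", 72)],
    [("s6", 35)],
]

histogram = reduce(reduce_counts, map(map_chunk, chunks), Counter())
print(sorted(histogram.items()))  # [(35, 3), (72, 2), (88, 1)]
```

Because each chunk is counted independently, the approach scales to data sets that do not fit comfortably in a single pass on one machine.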

I am not a lawyer and I don't know whether what he did is legal, but I do admire his courage in not posting this anonymously, as many people in the comments have suggested. I hope he doesn't get into any trouble.

Spending a little more time trying to comprehend this situation I have two thoughts:

The first observation, shocking but unfortunately not surprising, is how careless the school boards are in making such sensitive information available on their website without basic security. It is not as if it is hard to find web developers in India who understand basic or even advanced security; it's simply laziness and carelessness on the school boards' side not to bother with this. I am hoping that all government as well as non-government institutes will learn from this breach and tighten up their access and data security.

The second revelation is that it's not a terribly bad idea to publicly distribute this and similar datasets, after removing PII (personally identifiable information) from them, to let people legitimately go crazy on them. If this dataset were publicly available, people would analyze it, find patterns, and challenge the fundamental education practices. Open source has been living proof that software becomes more secure when it is opened up to the public to hack and find flaws in, so that they can be fixed. Knowing the Indian bureaucracy, I don't see them going in this direction. Turns out I have seen this movie before. I have been an advocate of making electronic voting machines available to researchers to examine the validity of a fair election process. Instead of allowing security researchers to have access to an electronic voting machine, Indian officials accused a researcher of stealing a voting machine and arrested him. However, if India is serious about competing globally in education, this might very well be the first step to bring in transparency.
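As a rough illustration of what removing PII before a public release could look like, here is a minimal sketch. The field names (`name`, `date_of_birth`, `address`, `student_id`) and the sample record are assumptions for the example, not the boards' actual schema. Dropping direct identifiers alone may not be enough to prevent re-identification, so treat this as a starting point rather than a complete anonymization scheme.

```python
import hashlib
import secrets

PII_FIELDS = {"name", "date_of_birth", "address"}  # hypothetical field names


def pseudonymize(records, id_field="student_id"):
    """Drop PII fields and replace the identifier with a salted hash.

    The salt is generated once per release and then discarded, so the
    pseudonyms cannot be recomputed later from known identifiers.
    """
    salt = secrets.token_bytes(16)
    cleaned = []
    for record in records:
        raw_id = str(record[id_field]).encode("utf-8")
        digest = hashlib.sha256(salt + raw_id).hexdigest()
        row = {k: v for k, v in record.items()
               if k not in PII_FIELDS and k != id_field}
        row["pseudonym"] = digest[:16]
        cleaned.append(row)
    return cleaned


sample = [
    {"student_id": 1001, "name": "A. Student", "date_of_birth": "1997-01-01",
     "school": "S1", "maths": 78},
]
print(pseudonymize(sample))
```

The scores and school stay intact for analysis, while the name, date of birth, and original identifier never leave the source system.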

Friday, May 31, 2013

Unsupervised Machine Learning, Most Promising Ingredient Of Big Data

Orange (France Telecom), one of the largest mobile operators in the world, issued a challenge, "Data for Development," by releasing a dataset of their subscribers in Ivory Coast. The dataset contained 2.5 billion records of calls and text messages exchanged between 5 million anonymous users in Ivory Coast, Africa. Various researchers got access to this dataset and submitted their proposals on how this data can be used for development purposes in Ivory Coast. It would be an understatement to say these proposals and projects were mind-blowing. I have never seen so many different ways of looking at the same data to accomplish so many different things. Here's a book [very large pdf] that contains all the proposals. My personal favorite is AllAboard, where IBM researchers used the cell-phone data to redraw optimal bus routes. The researchers have used several algorithms, including supervised and unsupervised machine learning, to analyze the dataset, resulting in a variety of scenarios.

In my conversations and work with CIOs and LOB executives, the breakthrough scenarios always come from a problem that they didn't even know existed or could be solved. For example, the point-of-sale data that you use for your out-of-stock analysis could give you new hyper segments that you didn't even know existed, using clustering algorithms such as k-means, and could also help you build a recommendation system using collaborative filtering. The data that you use to manage your fleet could help you identify outliers or unproductive routes using SOM (self-organizing maps) with dimensionality reduction. Smart meter data that you use for billing could help you identify outliers and prevent thefts using a variety of ART (Adaptive Resonance Theory) algorithms. I see endless scenarios based on a variety of unsupervised machine learning algorithms similar to using cell phone data to redraw optimal bus routes.
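To make the k-means example concrete, here is a minimal sketch of the standard algorithm in plain Python. The customer features (visits per month, average basket value) and the data are hypothetical; a real segmentation would use many more features, scale them first, and most likely rely on a library implementation.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points (tuples of numbers) into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Hypothetical per-customer features: (visits per month, average basket value).
customers = [(2, 15.0), (3, 18.0), (2, 20.0), (12, 95.0), (11, 110.0), (13, 102.0)]
centroids, labels = kmeans(customers, k=2)
print(labels)
```

No labels are supplied up front; the two segments (occasional small-basket shoppers and frequent large-basket shoppers) emerge from the data alone.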

Supervised and semi-supervised machine learning algorithms are equally useful, and I see them complementing unsupervised machine learning in many cases. For example, in retail, you could start with k-means to unearth new shopping behavior and end up with Bayesian regression followed by exponential smoothing to predict future behavior based on targeted campaigns, to further monetize this newly discovered shopping behavior. However, unsupervised machine learning algorithms are by far the best that I have seen for unearthing breakthrough scenarios, because by their very nature they don't require you to know a lot of details upfront (such as labels) about the data to be analyzed. In most cases you don't even know what questions you could ask.
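Of the techniques mentioned, exponential smoothing is the simplest to show in code. The sketch below implements simple (single) exponential smoothing; the weekly sales figures and the smoothing factor are hypothetical, and a production forecast would likely use a tuned or library-provided model.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing.

    s[0] = x[0] and s[t] = alpha * x[t] + (1 - alpha) * s[t - 1].
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        return []
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


weekly_sales = [100, 120, 110, 130]  # hypothetical figures
smoothed = exponential_smoothing(weekly_sales, alpha=0.5)
print(smoothed)  # [100, 110.0, 110.0, 120.0]
```

The last smoothed value serves as the one-step-ahead forecast; a larger `alpha` reacts faster to recent changes, a smaller one smooths more.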

Traditionally, BI has been built on pillars of highly structured data that has well-understood semantics. This legacy has made most enterprise people operate on a narrow mindset, which is: I know the exact problem that I want to solve and I know the exact question that I want to ask, and, Big Data is going to make all this possible and even faster. This is the biggest challenge that I see in embracing and realizing the full potential of Big Data. With Big Data there's an opportunity to ask a question that you never thought or imagined you could ask. Unsupervised machine learning is the most promising ingredient of Big Data.

Wednesday, May 22, 2013

Lead, Follow, Or Get Out Of The Way

If you have been following this blog you would know that I mainly blog about enterprise software, cloud, and big data, with a few occasional posts on design and design thinking. That's what I am most passionate about. Having spent my entire career building enterprise software, I have realized that success and competitive differentiation in the marketplace boil down to an organization's unique ability to get three things right, where management plays a key role: 1) people who can continuously learn and adapt to change, 2) processes that are nimble and evolve as the company evolves, and 3) products that solve a real problem and delight the end users. While I continue to blog about enterprise software, I have decided to evolve this blog further by adding a few management posts going forward.

There is a series of management topics that I am interested in, but let's start with the basic one, which is my core management philosophy: "lead, follow, or get out of the way." In any situation I ask myself whether I should be leading, or following someone else's lead and extending my full support to them. If neither makes sense, I simply get out of the way and let people do their job. Building, selling, and supporting software, like many other things, requires a loosely coupled (to put it in software terms) organization where there are leaders who lead and follow other leaders at the same time. This gets more and more complicated as the size and portfolio of an organization grow over the years. People draw artificial boundaries and lose sight of the mission and the big picture.

Leading is hard, following is harder, and getting out of the way is the hardest; it requires a conscious attempt to empower people to do their job without getting in their way. But it is an approach that does work, and I encourage you to try it out and share it with others.

Photo courtesy: Pison Jaujip

Tuesday, April 30, 2013

Justifying Big Data Investment

Traditionally, companies invest in software that has been proven to meet their needs and has a clear ROI. This model falls apart when a disruptive technology such as Big Data comes around. Most CIOs have started to hear about Big Data and, based on their position on the spectrum from conservative to progressive, they have either started to think about investing or have already started investing. The challenge these CIOs face is not so much whether they should invest in Big Data but what they should do with it. Large companies have complex landscapes that serve multiple LOBs, and all these LOBs have their own ideas about what they want to get out of Big Data. Most of these LOB executives are even more excited about the potential of Big Data but are less informed about the upstream technical impact and the change of mindset that IT will have to go through to embrace it. But these LOBs do have a stronger lever: money to spend if they see that technology can help them accomplish something that they could not accomplish before.

As more and more IT executives get excited over the potential of Big Data, they are underestimating the challenge of getting access to meaningful data in a single repository. Data movement has been one of the most painful problems of a traditional BI system, and it continues to stay that way for Big Data systems. A vast majority of companies have most of their data locked into their on-premise systems. Not only is it inconvenient, it's actually impractical to move this data to the cloud for the purposes of analyzing it if the Big Data platform happens to be a cloud platform. These companies also have a hybrid landscape where a subset of data resides in the cloud inside some of the cloud solutions that they use. It's even harder to get data out of these systems to move it to either a cloud-based or an on-premise Big Data platform. Most SaaS solutions are designed to support ad hoc point-to-point or hub-and-spoke RESTful integration, but they are not designed to efficiently dump data for external consumption.

Integrating semantics is yet another challenge. As organizations start to combine several data sources, the quality as well as the semantics of data still remain big challenges. Managing semantics for a single source isn't easy in itself. When you add multiple similar or dissimilar sources to the mix, this challenge is further amplified. It has been the job of an application layer to make sense of the underlying data, but when that layer goes away the underlying semantics become more challenging.

If you're a vendor, you should think hard about the business value of your Big Data technology: not what it is to you but what it could do for your customers. The spending pie for customers hasn't changed, and coming up with money to spend on (yet another) technology is quite a challenge. My humble opinion on this situation is that vendors have to go beyond technology talk, start understanding the impact of Big Data and the magnitude of these challenges, and then educate customers on the potential and especially help them with a business case. I would disagree with people who think that Big Data is a technology play/sale. It is not.

Photo Courtesy: Kurtis Garbutt

Sunday, March 31, 2013

Thrive For Precision Not Accuracy

Jake Porway, who was a data scientist at the New York Times R&D labs, has a great perspective on why multi-disciplinary teams are important to avoid bias and bring different perspectives into data analysis. He discusses a story where data gathered by Über in Oakland suggested that prostitution arrests increased in Oakland on Wednesdays, but increased arrests didn't necessarily imply increased crime. He also outlines the data analysis done by Grameen Foundation, where the analysis of Ugandan farm workers could result in the farmers being "good" or "bad" depending on which perspective you consider. This story validates one more attribute of my point of view regarding data scientists: data scientists should be design thinkers. Working in a multi-disciplinary team to let people champion their perspective is one of the core tenets of design thinking.

One of the viewpoints of Jake that I don't agree with:

"Any data scientist worth their salary will tell you that you should start with a question, NOT the data."

In many cases you don't even know what question to ask. Sometimes an anomaly or a pattern in data tells a story. This story informs us what questions we might ask. I do see that many data scientists start with knowing a question ahead of time and then pull in the data they need, but I advocate the other side, where you bring in the sources and let the data tell you a story. Referring to design, Henry Ford once said, "Every object tells a story if you know how to read it." Listen to the data, which is a story, without any preconceived bias and see where it leads you.

You can only ask what you know to ask. It limits your ability to unearth groundbreaking insights. Chasing a perfect answer to a perfect question is a trap that many data scientists fall into. In reality, what the business wants is to get to a good enough answer to a question, or an insight that is actionable. In most cases getting to an answer that is 95% accurate requires little effort, but getting the remaining 5% requires disproportionately more time with disproportionately low return.

Thrive for precision, not accuracy. The first answer could really be of low precision. It's perfectly acceptable as long as you know what the precision is and you can continuously refine it to make it good enough. Being able to rapidly iterate and reframe the question is far more important than knowing upfront what question to ask; data analysis is a journey and not a step in the process.
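One way to read "know what the precision is and refine it" is to report every estimate together with its margin of error and watch that margin shrink as more data comes in. The sketch below does this for a simple mean using a normal approximation; the synthetic population and the helper name `estimate_with_margin` are assumptions made for illustration.

```python
import math
import random
import statistics


def estimate_with_margin(sample, z=1.96):
    """Return (mean, margin), where margin is the approximate 95% half-width."""
    mean = statistics.mean(sample)
    margin = z * statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, margin


rng = random.Random(42)
population = [rng.gauss(50, 10) for _ in range(10_000)]  # synthetic data

# Each iteration uses ten times more data and yields a tighter answer.
for n in (100, 1_000, 10_000):
    mean, margin = estimate_with_margin(population[:n])
    print(f"n={n:>6}: {mean:.2f} +/- {margin:.2f}")
```

The first answer is rough but usable, and each refinement costs roughly ten times more data for about a threefold tighter margin, which is the diminishing return described above.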

Photo credit: Mario Klingemann

Friday, March 15, 2013

We Got Hacked, Now What?

Hopefully you really have a good answer for this. Getting hacked is no longer a distant probability; it's a harsh reality. The most recent incident was Evernote losing customer information including email addresses and passwords to a hacker. I'm an Evernote customer and I watched the drama unfold from the perspective of an end user. I have no visibility into what level of security response planning Evernote had in place but this is what I would encourage all the critical services to have:

Prevent

You are as secure as your weakest link; do anything and everything that you can to prevent such incidents. This includes hardening your systems, educating employees on social engineering, and enforcing security policies. Broadly speaking there are two kinds of incidents: hijacking of specific accounts and getting unauthorized access to a large set of data. Both of these could be devastating, and they need to be prevented differently. In the case of Evernote they did turn on two-factor authentication, but it doesn't solve the problem of data being stolen from their systems. Google has done an outstanding job hardening their security to prevent account hijacking. Explore shared-secret options where partial data loss doesn't lead to compromised accounts.
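One possible reading of the shared-secret idea is secret splitting: a sensitive value is stored as several pieces in separate systems, and an attacker who steals only some of the pieces learns nothing. The sketch below uses simple XOR splitting where all pieces are required; it is an illustration under that assumption, not a description of what any particular service does, and the secret shown is hypothetical.

```python
import secrets


def split_secret(secret: bytes, shares: int = 2):
    """Split secret into `shares` pieces; all pieces are needed to rebuild it."""
    if shares < 2:
        raise ValueError("need at least two shares")
    pieces = [secrets.token_bytes(len(secret)) for _ in range(shares - 1)]
    last = secret
    for piece in pieces:
        last = bytes(a ^ b for a, b in zip(last, piece))
    return pieces + [last]


def combine_shares(pieces):
    """XOR all pieces together to recover the original secret."""
    result = bytes(len(pieces[0]))
    for piece in pieces:
        result = bytes(a ^ b for a, b in zip(result, piece))
    return result


key = b"account-recovery-key"  # hypothetical secret
parts = split_secret(key, shares=3)
print(combine_shares(parts) == key)  # True
```

Each piece on its own is indistinguishable from random bytes, so losing one data store does not expose the secret.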

Mitigate

If you do get hacked, is your system instrumented to respond to such an incident? This includes locking accounts down, taking critical systems offline, assessing the extent of the damage, and so on. In the case of Evernote I found out about the breach from Twitter long before Evernote sent me an email asking me to change my password. This approach has a major flaw: if someone already had my password (it is hard to recover a password from a salted and hashed value, but still) they could have logged in, changed the password, and had full access to my account. And this move, logging in and changing the password, wouldn't have raised any alarms on the Evernote side since that's exactly what they would expect users to do. A pretty weak approach. A slightly better way would have been to ask users to reset the password and then follow up with an email verification process before users could access the account.
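The "slightly better way" can be sketched as follows: after a breach, lock the account and unlock it only with a one-time token delivered to the owner's email, so that knowing the old password is not enough. This is a minimal illustration; the `Account` class and its methods are hypothetical, and a real system would also expire tokens, rate-limit attempts, and force a new password.

```python
import hashlib
import hmac
import secrets


class Account:
    def __init__(self, email):
        self.email = email
        self.locked = False
        self._token_hash = None

    def force_reset(self):
        """Lock the account and return a one-time token to email to the owner."""
        token = secrets.token_urlsafe(32)
        # Store only a hash so a stolen database does not reveal valid tokens.
        self._token_hash = hashlib.sha256(token.encode("utf-8")).digest()
        self.locked = True
        return token

    def confirm_reset(self, token):
        """Unlock only if the emailed token matches; each token works once."""
        if self._token_hash is None:
            return False
        supplied = hashlib.sha256(token.encode("utf-8")).digest()
        if hmac.compare_digest(supplied, self._token_hash):
            self.locked = False
            self._token_hash = None
            return True
        return False


account = Account("user@example.com")
emailed_token = account.force_reset()
print(account.confirm_reset("guess"))        # False, account stays locked
print(account.confirm_reset(emailed_token))  # True, account unlocked
```

An attacker holding only the stolen password never sees the emailed token, so the account stays locked until the legitimate owner proves control of the mailbox.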

Manage

If the accounts did get hacked and the hackers did get control over certain accounts and access to certain sensitive information, what would you do? It turns out that companies don't have a good answer, or any answer, for this. They just hope such things won't happen to them, but that is no longer a realistic hope. There have been horror stories of people losing access to their Google accounts. Such accounts are further used for malicious activities such as sending out emails to all contacts asking to wire you money due to you being robbed in . Do you have a multi-disciplinary SWAT team (tech, support, and communication) identified for when you end up in such a situation? And, lastly, have you tested your security response? The impact of many catastrophes, natural or otherwise, such as floods, earthquakes, and terrorist attacks can be reduced if people are prepared to anticipate and respond. Getting hacked is no different.

Photo courtesy: Daniele Margaroli