My ongoing conversations with several people continue to reaffirm my belief that Data Science is still perceived to be a sacred discipline and data scientists are perceived to be highly skilled statisticians who walk around wearing white lab coats. The best data scientists are not the ones who know the most about data but they are the ones who are flexible enough to take on any domain with their curiosity to unearth insights. Apparently this is not well-understood. There are two parts to data science: domain and algorithms or in other words knowledge about the problem and knowledge about how to solve it.
One of the main aspects of Big Data that I get excited about is an opportunity to commoditize this data science—the how—by making it mainstream.
The rise of interest in Big Data platform—disruptive technology and desire to do something interesting about data—opens up opportunities to write some of these known algorithms that are easy to execute without any performance penalty. Run K Means if you want and if you don't like the result run Bayesian linear regression or something else. The access to algorithms should not be limited to the "scientists," instead any one who wants to look at their data to know the unknown should be able to execute those algorithms without any sophisticated training, experience, and skills. You don't have to be a statistician to find a standard deviation of a data set. Do you really have to be a statistician to run a classification algorithm?
Data science should not be a sacred discipline and data scientists shouldn't be voodoos.
There should not be any performance penalty or an upfront hesitation to decide what to do with data. People should be able to iterate as fast as possible to get to the result that they want without worrying about how to set up a "data experiment." Data scientists should be design thinkers.
So, what about traditional data scientists? What will they do?
I expect people that are "scientists" in a traditional sense would elevate themselves in their Maslow's hierarchy by focusing more on advanced aspects of data science and machine learning such as designing tools that would recommend algorithms that might fit the data (we have already witnessed this trend for visualization). There's also significant potential to invent new algorithms based on existing machine learning algorithms that have been into existence for a while. What algorithms to execute when could still be a science to some extent but that's what the data scientists should focus on and not on sampling, preparing, and waiting for hours to analyze their data sets. We finally have Big Data for that.
Image courtesy: scikit-learn