Thursday, February 28, 2013

A Data Scientist's View On Skills, Tools, And Attitude

I recently came across this interview (thanks Dharini for the link!) with Nick Chamandy, a statistician, a.k.a. a data scientist, at Google. I would encourage you to read it; it makes some great points. I found the following snippets interesting:

Recruiting data scientists:
When posting job opportunities, we are cognizant that people from different academic fields tend to use different language, and we don’t want to miss out on a great candidate because he or she comes from a non-statistics background and doesn’t search for the right keyword. On my team alone, we have had successful “statisticians” with degrees in statistics, electrical engineering, econometrics, mathematics, computer science, and even physics. All are passionate about data and about tackling challenging inference problems.
I share the same view. The best data scientists I have met are not statisticians by academic training. They are domain experts and design thinkers, and they all share one common trait: they love data! When people ask me how they might build a team of data scientists, I strongly recommend that they look beyond conventional wisdom. You will be in good shape as long as you don't end up in a situation like this :-)

The engineers at Google have also developed a truly impressive package for massive parallelization of R computations on hundreds or thousands of machines. I typically use shell or python scripts for chaining together data aggregation and analysis steps into “pipelines.”
Most companies won't have the kind of highly skilled development army that Google has, but then not all companies have Google-scale problems to deal with. Still, I suggest two things: a) build a very strong community of data scientists using social tools so that they can collaborate on the challenges they face and the tools they use; b) make sure that the chief data scientist (if you have one) has a very high level of management buy-in to make things happen; otherwise he/she will spend all the time in "alignment" meetings instead of doing the real work.

Data preparation:
There is a strong belief that without becoming intimate with the raw data structure, and the many considerations involved in filtering, cleaning, and aggregating the data, the statistician can never truly hope to have a complete understanding of the data.
I disagree. I strongly believe the tools need to evolve to do some of these things, and data scientists should not be spending their time compensating for the inefficiencies of the tools. Becoming intimate with the data, and having empathy for the problem, is certainly a necessity, but pulling, fixing, and aggregating data is not the best use of their time.

To me, it is less about what skills one must brush up on, and much more about a willingness to adaptively learn new skills and adjust one’s attitude to be in tune with the statistical nuances and tradeoffs relevant to this New Frontier of statistics.
As I would say: bring tools and knowledge, but leave bias and expectations aside. The best data scientists are the ones who are passionate about data, can quickly learn a new domain, and are willing to make and fail and fail and make.

Image courtesy: xkcd


DCI Corporation said...

It seems that it would be highly beneficial to have the data scientists spend time on "pulling, fixing, and aggregating data" to have a base to work from. Using the data they have pulled, they can determine what improvements need to be made. But I do agree that they should not spend a great deal of time on this task, as they need to spend more time on actual production.

On another note, I read one of your earlier posts on SQL and was wondering whether you are Microsoft Certified. You made some excellent points, and this question came to me purely out of curiosity. At my company, DCI, we have several Microsoft Certified staff; check out our site if you are interested:
