Tuesday, March 20, 2012

What is Data Science?

File under: machine room

The term “data science” started to appear a few years ago and has continually gained traction. So what is it?

First of all, there is no such thing as “Data Science”. There is no scientific discipline called “data science”. You can’t go to a university to study data science. On the other hand, I agree that there is such a thing as a data scientist. Whenever I see someone calling himself a data scientist, I think that my own profile would probably also match that description. But what is it that a data scientist actually does?

The way I see it, “data science” is a term coined to describe a special set of requirements and a certain role within web-based companies that accumulate huge amounts of data and wish to make use of that information. Google was probably one of the first companies to become hugely successful based on a clever data analysis algorithm. Their PageRank algorithm provided much more accurate search results than other search engines at the time (and the site was much faster, too), showing the value of data analysis. At some point, the media picked up the term “data science”, which sounded a bit like “rocket science”, together with the term “big data”, and a new hype was born.

As I understand it, a data scientist is someone whose task is to develop data analysis solutions which add value to the business, and to implement these solutions in a production environment.

Of course, it’s not like people have suddenly invented a whole new way of dealing with data. In a way, science has always been “data science” in the sense that you collect data to support or disprove your hypotheses about nature. I think there are at least three different fields which have traditionally worked on topics which now form a huge part of what is called data science:

As I see it, what is new about data science is mostly the application area and the level of technical expertise required. Data scientists mostly deal with data in the form of links in a social network, click data, or some other kind of data generated from user behavior and interactions. So the data doesn’t come from some physics, biology or sociology experiment as in statistics, but is just something collected as part of your business.

A data scientist also needs to think about how to run such analyses in real time in a production environment. This can be quite a challenge, as you have to deal with huge data volumes (“web scale”) and you also have to more or less keep up with the data as it arrives.
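To make the real-time requirement a bit more concrete, here is a minimal sketch in Python of one common way to keep up with a stream: update the statistics one value at a time instead of storing all the data and recomputing. It uses Welford's online algorithm for mean and variance. The class name `RunningStats` and the page-load-time stream are made up for illustration and do not come from any particular production system.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running count, mean and variance, updated one value at a time."""
    count: int = 0
    mean: float = 0.0
    _m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Welford's update: constant time and memory per incoming value.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; reported as 0.0 until there are two values.
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


# Hypothetical stream of page-load times in milliseconds.
stats = RunningStats()
for load_time_ms in [120.0, 95.0, 130.0, 110.0]:
    stats.update(load_time_ms)
    # At any point, stats.mean and stats.variance cover everything seen so far.
```

The point of the design is that memory use stays constant no matter how many events arrive, which is what makes this kind of update suitable for data that never stops coming in.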

In the face of these two requirements, the three groups of people also have different profiles as data scientists (I’m simplifying a lot here, please don’t feel offended if you belong to one of the three groups).

All of these groups will make good data scientists, but only if they have invested some time to study outside the range of topics usually covered in their field.

So in summary, here are some of the things a data scientist should know how to do:

As you see, you have to cover quite a range, from a firm understanding of data analysis methods to keeping up with a field as fast-moving as scalable storage and clustering technology. But to me personally, this is exactly what defines the appeal of a field like data science.

For more information, I have a data science stack on delicious where I’m compiling links to data science related articles, blogs, and so on.

Posted by Mikio L. Braun at 2012-03-20 14:28:00 +0100
