Tuesday, March 20, 2012
What is Data Science?
File under: machine room
The term “data science” started to appear a few years ago and has continually gained traction. So what is it?
First of all, there is no such thing as “Data Science”. There is no scientific discipline called “data science”. You can’t go to a university to study data science. On the other hand, I agree that there is such a thing as a data scientist. Whenever I see someone calling himself a data scientist, I think that my own profile would probably also match that description. But what is it that a data scientist does?
The way I see it, “data science” is a term coined to describe a special set of requirements and a certain role within web-based companies which accumulate a huge amount of data and wish to make use of that information. Google was probably one of the first companies which became hugely successful based on a clever data analysis algorithm. Their PageRank algorithm provided much more accurate search results than other search engines at the time (and the site was much faster, too), showing the value of data analysis. At some point, the media took up the term “data science”, which sounded a bit like “rocket science”, together with the term “big data”, and a new hype was born.
As I understand it, a data scientist is someone whose task is to develop data analysis solutions which add value to the business, and to implement these solutions in a production environment.
Of course, it’s not like people have suddenly invented a whole new way of dealing with data. In a way, science has always been “data science” in the sense that you collect data to support or disprove your hypotheses about nature. I think there are at least three different fields which have traditionally worked on topics which now form a huge part of what is called data science:
Statisticians, in particular computational statisticians, have been working with data for the last one or two centuries.
Machine learners have dealt with how to make sense of data since their beginnings in artificial intelligence in the 1960s.
Data mining and information retrieval people have studied how to use database systems to extract meaningful information for the last few decades.
As I see it, what is new about data science is mostly the application area and the level of technical expertise required. Data scientists mostly deal with data in the form of links in a social network, click data, or some other kind of data generated from user behavior and interactions. So the data doesn’t come from some physics, biology or sociology experiment as in statistics, but is just something collected as part of your business.
A data scientist also needs to think about how to run such analyses in real time in a production environment. This can be quite a challenge, as you have to deal with incredibly huge data volumes (“web scale”) and you also have to keep up with the data more or less in real time.
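To make the real-time requirement a bit more concrete, here is a minimal sketch of an analysis that keeps up with a stream: it updates a mean and variance one value at a time instead of storing the whole data set. It uses Welford's online algorithm, which is a standard technique picked here purely for illustration. The class name `RunningStats` and the response time values are made up for the example.

```python
class RunningStats:
    """Running mean and sample variance, updated one observation at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        # Welford's update: constant time and memory per observation.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Sample variance is undefined for fewer than two observations.
        if self.count < 2:
            return float("nan")
        return self._m2 / (self.count - 1)


# Hypothetical stream of response times in milliseconds.
stats = RunningStats()
for response_time_ms in [120.0, 80.0, 100.0, 140.0, 60.0]:
    stats.update(response_time_ms)
```

Because each update only touches three numbers, the same object could sit behind a message queue or request handler and be fed indefinitely, which is the property that matters when the data never stops arriving.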
In the face of these two requirements, the three groups of people also have different profiles as data scientists (I’m simplifying a lot here, please don’t feel offended if you belong to one of the three groups).
Statisticians have very good knowledge of analysis methods which deal with noisy data. They know very well how to model stochastic data and processes, resulting in robust analysis methods. On the other hand, statisticians are less versatile when it comes to scalability and data analysis technology. Statisticians mostly work within platforms such as R, importing and massaging data from various sources (CSV files, databases, spreadsheets and any other odd textual format used to store data), and then using the vast number of available libraries to analyze and visualize the data. However, a platform like R, for all its versatility, is hardly fit to run in a production environment and scale out to meet the data volume requirements.
Machine learners are usually also quite good at statistical modelling, as statistical methods form one of the main ways of devising machine learning algorithms. But often, machine learning methods are also a bit more heuristic and hands-on. If some method leads to high prediction accuracy, machine learners are fine with that, no matter what the statistical underpinnings say. Scalability is also a problem considered in machine learning, often in the context of optimization. On the other hand, machine learners tend to think in terms of vector spaces and linear algebra, which leaves quite a gap to databases. In fact, most of the machine learners I know don’t regularly work with databases but with text files or other custom data formats which contain their data. Also, just like statisticians, machine learners tend to work in settings where the data is fixed and needs to be analyzed, but not in a closed loop where the machine learning method is part of a larger system which needs to continually analyze data.
Data mining people (at least in my view) are quite familiar with database systems, but less knowledgeable about dealing with noisy data. Data mining people are probably closer to theoretical computer scientists who like to formally define a problem and then look for the most efficient algorithm to solve the problem (whereas statisticians and machine learners think about how to define the problem such that it can deal with noisy details, while the computational aspects are often quite clear and already solved).
All of these groups will make good data scientists, but only if they have invested some time to study outside the range of topics usually covered in their field.
So in summary, here are some of the things a data scientist should know how to do:
Understand the business of a company and develop new ways to add value by analyzing the data.
Develop an analysis method based on the state of the art in statistics, machine learning, data mining, information retrieval, natural language processing, etc. This might require extending or adapting an existing method.
Transfer the method to a production environment. This will usually mean reimplementing the method in Java, C# or another “server grade” language (IMHO Python or Ruby isn’t enough here, sorry).
Devise ways to scale the method using any of the recent clustering technologies, for example NoSQL databases, stream processing frameworks, messaging, MapReduce, etc.
Build ways to monitor the system to keep it running in production. Operations might eventually be passed on to other people, but since the systems are usually custom-made, one might also have to build a special monitoring system.
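As an illustration of the scaling item in the list above, here is a rough sketch of the map reduce pattern applied to click data. It runs in a single plain Python process for readability only; in a real cluster the map and reduce steps would be spread over many machines by a framework. The log format (`user,page`) and all function names are assumptions made up for this example.

```python
from collections import defaultdict
from functools import reduce


def map_click(log_line):
    """Map step: turn one 'user,page' log line into a (page, 1) pair."""
    _user, page = log_line.split(",")
    return (page, 1)


def shuffle(pairs):
    """Group all values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(values):
    """Reduce step: sum the partial counts for one key."""
    return reduce(lambda a, b: a + b, values, 0)


def count_clicks(log_lines):
    pairs = map(map_click, log_lines)
    grouped = shuffle(pairs)
    return {page: reduce_counts(values) for page, values in grouped.items()}


# Hypothetical click log.
click_log = ["alice,/home", "bob,/home", "alice,/pricing", "carol,/home"]
click_counts = count_clicks(click_log)
```

The point of the split is that `map_click` looks at one record at a time and `reduce_counts` looks at one key at a time, so both steps can be run in parallel on separate chunks of the data without changing the result.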
As you see, you have to cover quite a range, starting from a firm understanding of data analysis methods, to keeping up with such a fast-moving field as scalable storage and clustering technology. But to me personally, this is exactly what defines the appeal of a field like data science.
For more information, I have a data science stack on delicious where I’m compiling links to data science related articles, blogs, and so on.
Posted by Mikio L. Braun at 2012-03-20 14:28:00 +0100