Here are the slides of the talk I gave at the Apache Hadoop Get Together yesterday. The session was hosted at ImmobilienScout24 right next to the Ostbahnhof in Berlin.
The talk was on real-time Twitter analysis.
I’ve been to a number of such meetings now which sit somewhere in the intersection between academia and industry, and I have to say, I very much like the way they initiate exchange between researchers and practitioners. There seems to be considerable interest in data science and big data on both sides of the table: companies are interested in the latest developments, and researchers are very much interested in the demands of real-world applications.
So if you’re not afraid to get your hands dirty with some real technical challenges besides all that nice theory, you should definitely check whether such meetings also exist in your town and check them out!
The talk was filmed, check Isabel Drost’s blog for updates.
The term “data science” started to appear a few years ago and has continually gained traction. So what is it?
First of all, there is no such thing as “Data Science”. There is no scientific discipline called “data science”, and you can’t go to a university to study it. On the other hand, I agree that there is such a thing as a data scientist. Whenever I see someone calling himself a data scientist, I think that my own profile would probably also match that description. But what is it a data scientist does?
The way I see it, “data science” is a term coined to describe a special set of requirements and a certain role within web-based companies which accumulate huge amounts of data and wish to make use of that information. Google was probably one of the first companies to become hugely successful based on a clever data analysis algorithm. Its PageRank algorithm provided much more accurate search results than other search engines at the time (and the site was much faster, too), showing the value of data analysis. At some point, the media took up the term “data science”, which sounded a bit like “rocket science”, together with the term “big data”, and a new hype was born.
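To make the data analysis point a bit more tangible, here is a minimal sketch of the textbook power-iteration idea behind PageRank (not Google’s actual implementation, of course; the tiny example graph is made up):

```python
import numpy as np

def pagerank(links, damping=0.85, tol=1e-10):
    """Textbook power iteration for PageRank on a small link graph.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = sorted(links)
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)

    # Column-stochastic transition matrix: M[j, i] is the probability
    # of moving from page i to page j by following a random outlink.
    M = np.zeros((n, n))
    for page, outlinks in links.items():
        if outlinks:
            for target in outlinks:
                M[index[target], index[page]] = 1.0 / len(outlinks)
        else:
            M[:, index[page]] = 1.0 / n  # dangling page: jump anywhere

    r = np.full(n, 1.0 / n)  # start with a uniform score vector
    while True:
        r_next = damping * M @ r + (1 - damping) / n
        if np.abs(r_next - r).sum() < tol:
            return dict(zip(pages, r_next))
        r = r_next

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))
```

The simple idea (a random surfer following links) combined with a robust numerical method is a nice early example of the kind of algorithmic data analysis that data science is about.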
As I understand it, a data scientist is someone whose task is to add value to the business by developing data analysis solutions, and to implement these solutions in a production environment.
Of course, it’s not like people have suddenly invented a whole new way of dealing with data. In a way, science has always been “data science” in the sense that you collect data to support or disprove your hypotheses about nature. I think there are at least three different fields which have traditionally worked on topics which now form a huge part of what is called data science:
Statisticians, in particular computational statisticians, have been working with data for the last one or two centuries.
Machine learners have dealt with how to make sense of data since the field’s beginnings in artificial intelligence in the 1960s.
Data mining and information retrieval people have studied how to use database systems to extract meaningful information for the last few decades.
As I see it, what is new about data science is mostly the application area and the level of technical expertise required. Data scientists mostly deal with data in the form of links in a social network, click data, or some other kind of data generated from user behavior and interactions. So the data doesn’t come from some physics, biology or sociology experiment as in statistics, but is just something collected as part of your business.
A data scientist also needs to think about how to run such analyses in real time in a production environment. This can be quite a challenge, as you have to deal with incredibly huge data volumes (“web scale”) and you more or less have to keep up with the data in real time.
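As a toy illustration of the “keeping up with the data” part (my own hypothetical sketch, not production code), this is roughly what maintaining hashtag counts over a sliding window of a tweet stream could look like:

```python
import time
from collections import deque, Counter

class SlidingWindowCounter:
    """Counts items seen in the last `window` seconds of a stream."""

    def __init__(self, window=300):
        self.window = window
        self.events = deque()   # (timestamp, item) pairs, oldest first
        self.counts = Counter()

    def add(self, item, now=None):
        now = time.time() if now is None else now
        self.events.append((now, item))
        self.counts[item] += 1
        self._expire(now)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] < now - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top(self, k=10):
        return self.counts.most_common(k)

counter = SlidingWindowCounter(window=300)
for tag in ["#hadoop", "#berlin", "#hadoop"]:  # stand-in for a tweet stream
    counter.add(tag)
print(counter.top(3))
```

The point is not this particular data structure, but that the analysis has to run continuously inside a system, not as a one-off batch job on a fixed data set.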
In the face of these two requirements, the three groups of people also have different profiles as data scientists (I’m simplifying a lot here, please don’t feel offended if you belong to one of the three groups).
Statisticians have very good knowledge of analysis methods which deal with noisy data. They know very well how to model stochastic data and processes, resulting in robust analysis methods. On the other hand, statisticians are less versatile when it comes to scalability and data analysis technology. Statisticians mostly work within platforms such as R, importing and massaging data from various sources (CSV files, databases, spreadsheets, and any other odd textual format used to store data), and then using the vast number of available libraries to analyze and visualize the data. However, a platform like R, for all its versatility, is hardly fit to run in a production environment and scale out to meet the data volume requirements.
Machine learners are usually also quite good at statistical modelling, as statistical methods form one of the main approaches to devising machine learning algorithms. But often, machine learning methods are also a bit more heuristic and hands-on. If some method leads to high prediction accuracy, machine learners are fine with that, no matter what the statistical underpinnings say. Scalability is also a problem considered in machine learning, often in the context of optimization. On the other hand, machine learners tend to think in terms of vector spaces and linear algebra, which leaves quite a gap to databases. In fact, most of the machine learners I know don’t regularly work with databases but with text files or other custom data formats which contain their data. Also, just like statisticians, machine learners tend to work in settings where the data is fixed and needs to be analyzed, not in a closed loop where the machine learning method is part of a larger system which needs to continually analyze data.
Data mining people (at least in my view) are quite familiar with database systems, but less knowledgeable about dealing with noisy data. Data mining people are probably closer to theoretical computer scientists, who like to formally define a problem and then look for the most efficient algorithm to solve it (whereas statisticians and machine learners think about how to define the problem such that it can deal with noisy data, while the computational aspects are often quite clear and already solved).
All of these groups will make good data scientists, but only if they have invested some time to study outside the range of topics usually covered in their field.
So in summary, here are some of the things a data scientist should know how to do:
Understand the business of a company and develop new ways to add value by analyzing the data.
Develop an analysis method based on the state of the art in statistics, machine learning, data mining, information retrieval, natural language processing, etc. This might require extending or adapting an existing method.
Prototype such a method using some platform like R, scipy, or matlab. While such platforms might not be fit for production, they are a great playground for trying out ideas and playing around with data (see the first sketch after this list).
Transfer the method to a production environment. This will usually mean reimplementing the method in Java, C#, or another “server-grade” language (IMHO Python or Ruby isn’t enough here, sorry).
Devise ways to scale the method using any of the recent clustering technology, for example NoSQL databases, stream processing frameworks, messaging, MapReduce, etc. (see the second sketch after this list).
Build ways to monitor the system to keep it running in production. Operations might eventually be passed on to other people, but since the systems are usually custom-made, one might also have to build a special monitoring system.
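To illustrate the prototyping step from the list above, here is a minimal, hypothetical scipy session (the data and the question are made up): checking whether one quantity predicts another before investing any engineering effort.

```python
import numpy as np
from scipy import stats

# Hypothetical prototyping question: does user activity predict revenue?
# In practice, the data would be imported from CSV files or a database.
rng = np.random.default_rng(0)
activity = rng.uniform(0, 100, size=200)
revenue = 0.5 * activity + rng.normal(0, 10, size=200)

# A simple linear fit is often enough to see whether a signal is there.
result = stats.linregress(activity, revenue)
print(f"slope={result.slope:.3f}, r={result.rvalue:.3f}, p={result.pvalue:.2g}")
```

Ten lines like these can settle whether an idea is worth pursuing at all, which is exactly what such platforms are good for.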
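And for the scaling step, the classic entry point is Hadoop’s streaming interface, which lets you write the map and reduce phases as plain scripts reading from stdin and writing to stdout. A minimal, hypothetical word-count-style job (here: counting events per user) might look like this:

```python
#!/usr/bin/env python
# mapper.py -- emit one "key TAB 1" line per input record.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if fields and fields[0]:
        print(fields[0] + "\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts per key; Hadoop streaming sorts the
# mapper output by key, so identical keys arrive consecutively.
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current and current is not None:
        print(current + "\t" + str(total))
        total = 0
    current = key
    total += int(value)
if current is not None:
    print(current + "\t" + str(total))
```

Such a job is launched via the hadoop-streaming jar, passing the two scripts with the -mapper and -reducer options; the same pair can be tested locally with `cat data | ./mapper.py | sort | ./reducer.py`.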
As you see, you have to cover quite a range, from a firm understanding of data analysis methods to keeping up with a fast-moving field like scalable storage and clustering technology. But to me personally, this is exactly what makes a field like data science so appealing.
For more information, I have a data science stack on delicious where I’m compiling links to data science related articles, blogs, and so on.
Most people who decide to do a Ph.D. are well aware that it will mean a lot of work. You have to learn a lot of new stuff, possibly also outside of the topics you have studied so far. Taking machine learning as an example, you probably need to learn much more math than you’ve already been exposed to, including a mix of linear algebra, optimization theory, probability theory, statistics, and so on. But you also need to learn something about the area where you apply your methods, for example, bioinformatics, linguistics, and so on.
But at the same time, doing a Ph.D. also poses some psychological challenges, and from my experience I can say that many students are quite surprised by the kind of problems they face. In contrast to a Bachelor’s or Master’s degree, which requires you to learn some topic and be able to apply what you’ve learned to new, similar problems, doing a Ph.D. means doing something which hasn’t been done before. You need to solve a problem which hasn’t been solved before.
Now this may not sound that surprising, because that’s what research is all about: exploring questions, solving problems, advancing the state of the art. But you only realize what this really means when you’re one or two years into your graduate studies, you have learned quite a lot and come to understand the nature of the problem, and you realize that you have no idea how to solve it.
There is of course a lot you can do to hedge the risk of failing. For example, you can start with simpler subproblems and work your way up towards the full problem. You can work on a number of smaller problems so that you build up a collection of finished work. But at some point you will invariably find yourself in a situation where you have to admit that you really cannot know whether you’ll be able to solve the problem, or whether any of your usual strategies will help.
And this doesn’t even include the social aspects of doing a Ph.D., of getting published, getting cited, building up some form of reputation in the community.
I found myself in exactly this situation towards the end of my studies. I had to switch topics along the way because the original idea didn’t quite turn out as expected. I wrote my thesis about the convergence of eigenvalues and eigenvectors of the kernel matrix. But until the very end, a central proof was missing. I had run extensive numerical simulations, so I was quite sure about what I wanted to prove, but only at the very end did I manage to put the proof together. So here I was, with a few months left before my position ended, trying to solve that problem every day but not knowing whether I would be able to do so in the end. To illustrate my state of mind: when I moved to a different town, I couldn’t rent a truck of the size I had reserved, only one which was about a meter shorter. All my friends told me “Mikio, forget it, we’ll never get all your stuff in there”, but I was just like “ah, impossible, well, yes…”. In the end, everything except for one cupboard went in, which was OK, and showed that we had both been wrong.
Actually, I have come to believe that this experience is part of what it means to do a Ph.D. Eventually, you will succeed in one way or another, and you will have learned a very valuable lesson. You will see how the problem slowly sinks into your mind until your understanding of it leads you to a solution, or reveals that a solution is not possible, in which case you will also have understood why.
In the end, doing a Ph.D. is exactly about this: learning to do what no one has done before, and staying confident even when there is only a limited amount of time and you have no idea whether you will be able to solve the problem. And that is an important part of what science is about.