Friday, January 30, 2015
To make a long story short, I’ve decided to scale back my involvement with the streamdrill company to a purely advisory role. The reasons for this are naturally very complex, but in the end, I wasn’t seeing the kind of traction or the prospect of traction necessary to keep going at the pace I was going, splitting time between family, the university jobs, which paid my bills, and doing the dev work and marketing for streamdrill.
In fact I still believe the base technology is pretty compelling, so we’re going to open source the core, to allow me to continue to work on it. That’s something I had been wanting to do for some time, because in the Big Data community, having some part as open-source is necessary to get people to try this out. At streamdrill, we always had more of a focus on providing some directly usable end product, so this won’t hurt the company (which Leo is planning to continue.)
So the big question (or maybe not) is what to do now. In fact, I’ve already got plenty to do…
So I’m still at the TU Berlin, and let me whine about the situation here for one paragraph ;) It’s not ideal. I sort of have accepted for myself that my interests are just too applied for academia (one simply does not write software at my level anymore, people told me it’s suspicious and I should stop it). In terms of career I have moved up to a point where the work I’m expected to do is mostly teaching, advising students, and stuff like grant proposals and project management. And while I seem to do OK, this makes me deal with stuff I find extremely painful. On the plus side, it provides good job security and somewhat fair pay, but that will only get you so far, soulwise.
And the workload is pretty high. I have to do about a professor’s level of teaching, and am currently supervising about 5 students writing their master’s theses and something like two to three Ph.D. students.
I’m sort of managing our side of the Berlin Big Data Center project. Luckily this project aligns well with my interests. It’s about bringing together machine learning people and people who build scalable distributed infrastructure. We’re closely related to the Apache Flink project, which is also really picking up lately. There’s lots of mutual interest, so I’m definitely looking forward to that.
There is also another project which is potentially coming up, so my current workload is two projects, half a dozen students, and about 20 or so students to supervise in four teaching courses.
I’ve recently joined the InfoQ editorial board and try to cover about one Big Data related news item per week. And I’m again taking part in the 3rd batch of the Data Science Retreat starting in February.
And there’s still more stuff I’m interested in:
- jblas needs some love. My last serious updates are two years old, but with all that JVM based data analysis happening, jblas usage has picked up recently. I have some ideas to unclutter the code, make the whole build process more manageable, and maybe look into some new ideas to make use of native code also in cases where copying would be prohibitive, maybe by using caches or explicit memory handling.
- open source streamdrill, of course. Use of probabilistic data structures has been picking up recently, and I always thought that it’s time to take it to the next level and write analysis algorithms which naturally use these structures as building blocks.
- There’s a lot of talk about data science / Big Data convergence, but based on the people who are doing Ph.D.s in machine learning at TU Berlin, the existing technology is still much too unwieldy to use. Ever tried setting up Hadoop from the sources? I simply cannot see that someone who is used to Python would want to do that. Spark, for example, is investing a lot in that area, but their machine learning efforts are still very rough and somewhat premature.
- Likewise, there is a lot of training under way to get more Data Scientists, but I think that the way data analysis is taught at universities is a very bad guideline, because that’s really trying to teach people to become researchers and create new data analysis methods, not use them reasonably. I think similar to the division between people who build tools and those who use tools to do something valuable with them, there needs to be a separation of training programs. And for that, existing tools need to mature more. Scikit-learn, for example, is an awesome collection of many, many methods, but it has very little in terms of high-level stuff to support the process of data analysis.
- Notebooks are the new Excel. I’m seeing a lot of use of IPython style notebooks lately to get to a more “literate” style of data analysis and to get data analysis and business people to collaborate. Also the integration of code, plots, and results is really nice.
- Moving out of out-of-core-learning. After working with streaming for so long, the classical Python/R way of doing data analysis feels so weird. Why do I have to load all that data into memory? I understand that learning methods are so complex and data access patterns so random that this is the only way, but it now feels like a big restriction that your data set needs to fit into memory. Machine learning should be more like UNIX where stuff is file based and 10k C programs can work with gigabytes of data with 32MB of RAM if they need to (ok, I’m thinking of how it was back in 1994, but you get my point). And I’m not simply talking about data science on the command line, we probably need new algorithms for that, too.
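To make the last point a bit more concrete, here is a minimal sketch of the file-based, constant-memory style: a single pass over a stream of numbers that keeps only three running values (Welford’s method for mean and standard deviation). The function name `running_stats` and the sample values are made up for illustration, and the `io.StringIO` object merely stands in for a large file opened with `open(...)`.

```python
import io
import math


def running_stats(lines):
    """Compute count, mean and sample standard deviation in one pass.

    Only three numbers are kept in memory, no matter how long the input is.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x = float(line)
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, math.sqrt(variance)


# File objects are iterated line by line, so only the current line is held in memory.
# The StringIO below is a stand-in for a real (possibly huge) file.
fake_file = io.StringIO("2\n4\n4\n4\n5\n5\n7\n9\n")
count, mean, std = running_stats(fake_file)
print(count, mean, round(std, 3))
```

Simple statistics like these are the easy case, of course. The harder question raised above is which learning algorithms can be restructured to work this way.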
And then there are even other odds and bits. I mean why is everything so complex nowadays? Just frameworks wrapping frameworks. CSS frameworks? I mean, c’mon! What about things which did one thing well and weren’t a pain to set up?
I want to keep attending more non-academic meetings. I’ll try to go to QCon London for at least one day, and I’ll also be speaking at Strata in London in May.
Still, the whole situation is hardly ideal. Maybe it’s asking too much of a job to have perfect alignment between interests and job related activities, but I think there’s room for improvement. Stay tuned.
Posted by Mikio L. Braun at 2015-01-30 17:03:00 +0100.
Monday, December 01, 2014
Giving a one day tutorial on data science is something I’ve been considering
in different contexts from time to time, but for different reasons it never
really happened. Finally, last Friday, the tutorial took place as a workshop
in the data2day conference, and I think it went pretty well. In this post I’d
like to talk a bit about our approach and our experiences.
The conference was organized by the heise publisher, well known in Germany for
their print magazines c’t and iX, which have been household names in IT since
the eighties. It was the first conference in the Big Data/Data Science
context organized by them, but already brought together over 150 participants.
For the workshop, I was happy to team up with Jan Müller and Paul Bünau from
idalab. In fact, Paul and I had developed a similar kind of hands-on
introduction to data analysis a few years ago while he was working on his PhD
at TU Berlin. Designed as a summer long course, the idea was to have students
implement a number of machine learning algorithms themselves. Each method
would first be presented by focussing on the main ideas, without going into
the theory too much. Then, the students would have two to three weeks’ time to
implement the method and play around with it on some toy data. During that
phase, we would have a weekly office hour where we would go around and talk to
the students individually to help them where they got stuck.
This course seemed to be quite popular with the students. We would still
randomly get praise for the course years later with students telling us that
this was among the courses where they learned most.
So when designing this one day workshop, the idea was from the beginning to
keep these two ingredients: focus on main ideas and context, and a hands-on part.
It was particularly important to us to not just go through a bunch of learning
algorithms, but also stress how important it is to know what you are doing. As I
have discussed before, it is too easy to put together some data analysis
pipeline and then not properly evaluate. Everything looks great, but in the
end you have just looked at training error, resulting in really bad
performance on future data.
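A toy illustration of that trap, as a sketch under deliberately artificial assumptions: the features are random noise and the labels are coin flips, so there is nothing to learn. A 1-nearest-neighbor classifier (written by hand here, all names hypothetical) still looks perfect if you only check it on the data it has memorized.

```python
import random


def nearest_label(x, points, labels):
    """Predict the label of the training point closest to x (1-nearest neighbor)."""
    best = min(range(len(points)), key=lambda i: (points[i] - x) ** 2)
    return labels[best]


def accuracy(xs, ys, train_x, train_y):
    """Fraction of (x, y) pairs for which the 1-NN prediction matches y."""
    hits = sum(nearest_label(x, train_x, train_y) == y for x, y in zip(xs, ys))
    return hits / len(xs)


rng = random.Random(0)
# Pure noise: the labels are independent of the features.
train_x = [rng.random() for _ in range(200)]
train_y = [rng.randint(0, 1) for _ in range(200)]
test_x = [rng.random() for _ in range(200)]
test_y = [rng.randint(0, 1) for _ in range(200)]

# Evaluating on the training data: every point is its own nearest neighbor.
train_acc = accuracy(train_x, train_y, train_x, train_y)
# Evaluating on data the model has not seen.
test_acc = accuracy(test_x, test_y, train_x, train_y)

print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {test_acc:.2f}")
```

The training accuracy is perfect by construction, while the held-out accuracy hovers around chance level, which is the honest estimate of how the model would do on future data.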
For the hands-on part, we chose to work with IPython notebooks. These are
available on all major operating systems, notebooks can be saved and loaded
easily, they integrate with plotting, and so on. Toolwise we chose to work with
numpy, pandas, scikit-learn, and matplotlib. Originally the plan was
to have one session where we go through the basics of the tools and then two
use cases, but while putting the material together it became apparent that
there wasn’t enough time for two use cases, so we just stuck with a simple
example based on MNIST character recognition, and decision trees.
So in the end the course went like this:
- about one hour of introductory course on what is data science/machine learning, and things like supervised vs. unsupervised learning, evaluation, cross-validation, etc.
- one hour of going through the basics of numpy and pandas in an interactive IPython session
- one hour of doing some exercises with numpy and pandas
- another hour of going through an example with scikit-learn
- two hours of doing the use case
The notebooks from the example sessions were handed out at the beginning of the
exercises, and the exercises were prepared as IPython notebooks themselves
with free cells where you could put down your solutions.
As it is with all such things, you never know whether you thought of
everything, but all in all, we felt the workshop went very well. With three of
us, there was enough time to help each of the participants individually,
including fixing issues like finding out where IPython was keeping its files
under Windows, dealing with oddities of Python’s indexing scheme, and so on.
In the end, all participants had a running notebook which loaded the MNIST
data, learned a decision tree whose hyperparameter was adjusted by cross-
validation, giving them about 83% accuracy. Of course that is not optimal, but
already pretty good for a few lines of code. Most importantly, everyone now
has a complete framework from which they can start exploring other approaches,
try out new methods, and so on.
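For readers who want a starting point of that kind, here is a rough sketch in the same spirit. It is not the workshop notebook: it uses scikit-learn’s small bundled digits data set (`load_digits`) instead of MNIST so that nothing has to be downloaded, and it assumes that the hyperparameter being tuned is the tree depth (`max_depth`), which the text above does not specify. The grid of depths and the 75/25 split are arbitrary choices as well.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small 8x8 digit images bundled with scikit-learn (a stand-in for MNIST).
X, y = load_digits(return_X_y=True)

# Hold out a test set that is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Choose the tree depth by 5-fold cross-validation on the training set only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8, 10, 12]},
    cv=5,
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
# GridSearchCV refits the best model on all training data; score it on the held-out set.
test_accuracy = search.score(X_test, y_test)

print("best max_depth:", best_depth)
print("held-out accuracy: %.3f" % test_accuracy)
```

The important part is the structure rather than the numbers: the hyperparameter is chosen by cross-validation on the training data, and the final accuracy is reported on data that played no role in that choice.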
Next time, we would probably intersperse the background talk with the
solutions, such that there isn’t such a monolithic block at the beginning, and
be more careful with Python 3 vs Python 2. But overall I think our approach
worked out very well (also based on the feedback we got).
The workshop also showed that there is a real need for teaching people the more
high level concepts like proper validation. Unfortunately, even at
universities, the focus is too much on the methods themselves. Students often
learn the process and things like proper validation only when they work on
their master’s thesis. On the other hand, for doing robust and reliable data analyses,
these things are absolutely essential.
Posted by Mikio L. Braun at 2014-12-01 12:15:00 +0100.
Thursday, October 02, 2014
What it takes to build a Big Data Solution
One question which pops up again and again when I talk about streamdrill is
whether that cannot be done by X, where X is one of Hadoop, Spark, Go, or some
other piece of Big Data infrastructure.
Of course, the reason why I find it hard to respond to that question is that the
engineer in me is tempted to say “in principle, yes”, which sort of questions
why I put in all that work to rebuild something which apparently already exists.
But the truth is that there’s a huge gap between “in principle” and “in
reality”, and I’d like to spell this difference out in this post.
The bottom line is that all those pieces of Big Data infrastructure which
exist today provide you with a lot of pretty impressive functionality,
distributed storage, scalable computing, resilience, and so on, but not in a
way which solves your data analysis problems out of the box. The analogy I
like is that Big Data is a lot like providing you with an engine, a
transmission, some tires, a gearbox, and so on, but no car.
So let us consider an example where you have some clickstream and you want to
extract some information about your users. Think, for example, recommendation,
or churn prediction. So what steps are actually involved in putting together
such a system?
First comes the hardware, either on the cloud or by buying or finding some
spare machines, and then setting up the basic infrastructure. Nowadays, this
would mean installing Linux, HDFS, the distributed filesystem of Hadoop, and
YARN, the resource manager which allows you to run different kinds of compute
jobs on the cluster. Especially when you go for the raw Open Source version of
Hadoop, this step requires a lot of manual configuration, and unless you
already did this a few times, this might take a while to get to work.
Then, you need to take in the data in some way, for example, by something
like Apache Kafka, which is essentially a mixture of a distributed log
storage and an event transport platform.
Next, you need to process the data, which could either be done by a system
like Apache Storm, a stream processing framework which lets you distribute
computing once you have it broken down to pieces of computation taking in
an event at a time. Or you use Apache Spark, which lets you describe
computation on a higher level with something like a functional collection API
and can also be fed a stream of data.
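The difference between the two styles can be sketched in plain Python. This is only an illustration of the programming models, not the Storm or Spark API: the `ClickCounter` class, the `merge` helper, and the sample events are all made up, and nothing here is distributed.

```python
from collections import Counter
from functools import reduce

# A tiny, made-up clickstream.
events = [
    {"user": "alice", "page": "/home"},
    {"user": "bob", "page": "/home"},
    {"user": "alice", "page": "/pricing"},
    {"user": "carol", "page": "/docs"},
    {"user": "alice", "page": "/docs"},
    {"user": "bob", "page": "/pricing"},
]


# Style 1: event at a time. The framework would call on_event for each
# incoming event; all state lives inside the processing component.
class ClickCounter:
    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["user"]] += 1


counter = ClickCounter()
for event in events:
    counter.on_event(event)


# Style 2: functional collection API. The computation is described as
# transformations on the whole collection (map to pairs, then reduce by key).
def merge(acc, pair):
    user, n = pair
    acc[user] = acc.get(user, 0) + n
    return acc


pairs = map(lambda e: (e["user"], 1), events)
collection_counts = reduce(merge, pairs, {})

print(dict(counter.counts))
print(collection_counts)
```

Both versions compute the same clicks-per-user counts. What a real framework adds is the ability to spread either style over many machines.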
Unfortunately, this still does nothing useful out of the box. Both Storm and
Spark are just frameworks for distributed computing, meaning that they allow
you to scale computation, but you need to tell them what you want to compute.
So you first need to figure out what to do with your data and this involves
looking at data, identifying the kind of statistical analysis which is suited
to solve your problem, and so on, and probably requires a skilled data
scientist to spend one to two months working on the data. There are projects
like MLlib which provide more advanced analytics, but again these projects
don’t provide full solutions to application problems but are tools for a data
scientist to work with (And they are still somewhat early stage IMHO.)
Still, there’s more work to do. One thing people are often unaware of is that
Storm and Spark have no storage layer. This means that they both perform computation, but to get
to the result of the computation, you have to store it somewhere and have some
means to query it. This usually means storing the result in a database,
something like Redis, if you want the speed of a memory based data storage, or
in some other way.
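As a rough sketch of why that extra piece is needed: the processing code below only pushes updates into a store, and anything that wants to see results has to ask the store. The `ResultStore` class is a hypothetical in-memory stand-in written for this example, not a Redis client, and the clickstream is made up.

```python
class ResultStore:
    """Hypothetical in-memory stand-in for an external store such as a database."""

    def __init__(self):
        self._scores = {}

    def increment(self, key, amount=1):
        self._scores[key] = self._scores.get(key, 0) + amount

    def top(self, k):
        # Highest counts first; ties broken alphabetically for stable output.
        return sorted(self._scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def process(stream, store):
    """Update the store for each event; yields so callers can query in between."""
    for page in stream:
        store.increment(page)
        yield page


store = ResultStore()
clicks = ["/home", "/pricing", "/home", "/docs", "/home", "/pricing"]

# Query the store while the computation is still running (every third event).
snapshots = []
for i, _ in enumerate(process(clicks, store), start=1):
    if i % 3 == 0:
        snapshots.append(store.top(2))

print(snapshots)
```

In a real deployment the store would be a separate service, which is exactly one more moving part to set up, deploy, and maintain.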
So by now we have taken care of how to get the data in, what to do with it and
how, and how to store the result such that we can query it while the
computation is going on. Conservatively talking, we’re already down six man
months, probably less if you have done it before and/or are lucky. Finally,
you also need to have some way to visualize the results, or if your main
access is via an API, to monitor what the system is doing. For this, more
coding is required, to create a web backend with graphs written in d3.js in
The resulting system probably looks a bit like this.
Lots of moving parts which need to be deployed and maintained. Contrast this
with an integrated solution. To me this is the difference between a bunch of parts
and a car.
Posted by Mikio L. Braun at 2014-10-02 10:45:00 +0200.