Marginally Interesting · Mikio L. Braun

Data Analysis: The Hard Parts

I don't know whether this word exists, but mainstreamification is what's happening to data analysis right now. Projects like Pandas or scikit-learn are open source, free, and allow anyone with some Python skills to do some serious data analysis. Projects like MLbase or Apache Mahout work to make data analysis scalable so that you can tackle those terabytes of old log data right away. Events like the Urban Data Hack, which just took place in London, show how easy it has become to do some pretty impressive stuff with data.

The general message is: Data analysis has become super easy. But has it? I think people want it to be, because they have understood what data analysis can do for them, but there is a real shortage of people who are good at it. So the usual technological solution is to write tools which empower more people to do it. And for many problems, I agree that this is how it works. You don't need to know TCP/IP to fetch some data from the Internet because there are libraries for that, right?

For a number of reasons, I don't think you can "toolify" data analysis that easily. I wish it were that simple, but from my hard-won experience with my own work and with teaching people this stuff, I'd say it takes a lot of experience to do properly, and you need to know what you're doing. Otherwise you will build things which break horribly once put into action on real data.

And I don't write this because I don't like the projects which exist, but because I think it is important to understand that you can't just give a few coders new tools and expect them to produce something which works. And depending on how you want to use data analysis in your company, this might make or break your company.

So my top four reasons are:

  1. data analysis is so easy to get wrong
  2. it's too easy to lie to yourself about it working
  3. it's very hard to tell whether it could work if it doesn't
  4. there is no free lunch

Let's take these one at a time.

Data Analysis is so easy to get wrong

If you use a library to go fetch some data from the Internet, it will give you all kinds of error messages when you do something wrong. It will tell you if the host doesn't exist or if you called the methods in the wrong order. The same is not true for most data analysis methods, because these are numerical algorithms which will produce some output even if the input data doesn't make sense.

In a sense, Garbage In, Garbage Out is even more true for data analysis. And there are so many ways to get this wrong, like discarding important information in a preprocessing step, or accidentally working on the wrong variables. The algorithms don't care; they'll give you a result anyway.

The main problem here is that you'll probably not even notice it, apart from the fact that the performance of your algorithms isn't what you expect it to be. In particular, when you work with many input features, there is really no way to just look at the data. You are basically working with large tables.

This is not just hypothetical: I have experienced many situations where exactly that happened, where people accidentally permuted all their data because they messed up reading it from the files, or did some other non-obvious preprocessing which destroyed all the information in the data.
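To make this concrete, here is a minimal sketch using scikit-learn (the data and the "bug" are invented for illustration): shuffling the labels out of sync with the features, which is roughly what a botched reading or preprocessing step amounts to, raises no error whatsoever. The only symptom is that honestly measured performance drops to chance level.

```python
# A classifier happily fits garbage without complaining. Permuting the labels
# out of sync with the features simulates a botched preprocessing step.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(500, 20)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels really do depend on X

y_broken = rng.permutation(y)             # the "bug": labels no longer match the rows

for name, labels in [("intact", y), ("permuted", y_broken)]:
    score = cross_val_score(LogisticRegression(), X, labels, cv=5).mean()
    print(name, round(score, 2))          # intact: high, permuted: around 0.5 (chance)
```

Nothing here throws an exception; the only hint that something is wrong is a number that doesn't match your expectation.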

So you always need to be aware of what you are doing and mentally trace the steps to build a well-informed expectation of the results. It's debugging without an error message; often you just have a gut feeling that something is quite wrong.

Sometimes the problem isn't even that the performance is bad, but that it's suspiciously good. Let's come to that next.

It's too easy to lie to yourself about it working

The goal in data analysis is always good performance on future, unseen data. This is quite a challenge. Usually you start working from collected data, which you hope is representative of the future data. But it is so easy to fool yourself into thinking it works.

The most important rule is that only performance measured on data you haven't used in any way during training is a reliable indicator of future performance. However, this rule can be violated in many, sometimes subtle, ways.

The classical novice mistake is to take the whole data set, train an SVM or some other algorithm, and look at the performance on the very data you used for training. Obviously, it will be quite good. In fact, you can achieve perfect predictions by just outputting the values you saw during training (ok, provided they are unambiguous) without any real learning taking place at all.
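As a small sketch of what that looks like (synthetic data, pure noise, so there is genuinely nothing to learn), compare the score on the training data with the score on a held-out split:

```python
# The novice mistake: evaluating on the training data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = rng.randn(300, 10)
y = rng.randint(0, 2, size=300)           # pure noise: nothing can be learned

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

model = SVC(gamma=10).fit(X_tr, y_tr)     # flexible enough to memorize the training set
print("train accuracy:", model.score(X_tr, y_tr))   # close to 1.0
print("test accuracy: ", model.score(X_te, y_te))   # around 0.5, i.e. chance
```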

But even if you split your data right, people often make the mistake of using information from the test data in the preprocessing (for example for centering, or for building dictionaries, etc.). So the actual training happens only on the training data, but through the preprocessing, information from the test data has silently crept into your model, giving results which are much better than what you can realistically expect on real data.
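The effect is easiest to demonstrate with a feature-selection step rather than centering, but the mechanism is exactly the same: any preprocessing fitted on the full data set has already seen the test folds. A sketch, again on made-up, pure-noise data:

```python
# Preprocessing leakage: fitting the feature selection on the full data set
# before cross-validation makes pure noise look informative.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(2)
X = rng.randn(100, 1000)                  # many noise features, few samples
y = rng.randint(0, 2, size=100)           # labels are pure noise

# Wrong: the selection step sees all the data, the split happens afterwards.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# Right: the selection step is refit inside each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
honest = cross_val_score(pipe, X, y, cv=5).mean()

print("leaky estimate: ", round(leaky, 2))   # typically well above chance
print("honest estimate:", round(honest, 2))  # around 0.5, i.e. chance
```

scikit-learn's Pipeline exists precisely so that preprocessing steps get refit on each training fold instead of on the whole data set.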

Finally, even if you do proper testing and evaluation of your method, your estimates of future performance will become optimistic as you try out many different approaches, because you implicitly optimize for the test set as well. This is called multiple testing and is something one has to be aware of, too.
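You can simulate this effect directly. In the sketch below (a deliberately artificial setup in plain NumPy), I score many candidate "models" on the same test set and keep the best one; the kept score looks clearly better than chance even though the data is pure noise:

```python
# Multiple testing: evaluate many candidates on the same test set, keep the
# best, and the kept score is optimistic -- even though nothing can be learned.
import numpy as np

rng = np.random.RandomState(3)
n, d, n_attempts = 200, 10, 200

X_test, y_test = rng.randn(n, d), rng.randint(0, 2, size=n)     # pure noise
X_fresh, y_fresh = rng.randn(n, d), rng.randint(0, 2, size=n)   # data nobody peeked at

def accuracy(w, X, y):
    return np.mean((X @ w > 0).astype(int) == y)

# "Trying many approaches": here, many random linear classifiers, each scored
# on the same test set; we keep whichever happens to look best.
candidates = [rng.randn(d) for _ in range(n_attempts)]
best_w = max(candidates, key=lambda w: accuracy(w, X_test, y_test))

print("best score on the reused test set:", accuracy(best_w, X_test, y_test))    # around 0.6
print("same model on fresh data:         ", accuracy(best_w, X_fresh, y_fresh))  # around 0.5
```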

One can be trained to do all this properly, but if you are under pressure to produce results, you have to resist the temptation to just run with the first thing which gives good numbers. And it helps if you've gone down that route once and failed miserably.

And even if you did evaluate according to all the secrets of the trade, the question still remains whether the data you worked on was really representative of future data.

It's very hard to tell whether it could work if it doesn't

A different problem is that it is fundamentally difficult to know whether you can do better if your current approach doesn't work well. The first thing you try will most likely not work, and probably neither will the next, and then you need someone with experience to tell you whether there is a chance or not.

There is really no way to automatically tell whether a certain approach works or not. The algorithms just extract whatever information fits their model and the representation of the data, but there are many, many ways to do this differently, and that's when you need a human expert.

Over time you develop a feeling for whether a certain piece of information is contained in the data or not, and ways to make that information more prominent through some form of preprocessing.
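As a small, hedged example of what "making information more prominent" can mean (made-up data, and a log transform is just one of many possible choices): the information is in the data either way, but the right transformation makes it much easier for a simple model to use.

```python
# The same information, before and after a simple preprocessing step.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(4)
x = rng.lognormal(mean=0.0, sigma=1.5, size=1000)   # heavily skewed feature
y = 3.0 * np.log(x) + rng.randn(1000)               # target depends on log(x)

raw = cross_val_score(LinearRegression(), x.reshape(-1, 1), y, cv=5).mean()
logged = cross_val_score(LinearRegression(), np.log(x).reshape(-1, 1), y, cv=5).mean()

print("R^2 on the raw feature:        ", round(raw, 2))     # noticeably lower
print("R^2 on the log-transformed one:", round(logged, 2))  # close to the achievable maximum
```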

Tools only provide you with possibilities, but you need to know how to use them.

There is no free lunch

Now you might think "but can't we build all that into the tools?" Self-healing tools which tell you when you make mistakes and automatically find the right preprocessing? I'm not saying that it's impossible, but these are problems which are still hard and unsolved in research.

Also, there is no universally optimal learning algorithm, as shown by the No Free Lunch Theorem: there is no algorithm which is better than all the rest for all kinds of data.
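A toy illustration of that point (my own synthetic examples, obviously not a proof): a linear model and a nearest-neighbour model each win on one of the two data sets below, and neither dominates.

```python
# No free lunch, toy version: which algorithm is "better" depends on the data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(5)

# Problem A: a clean linear boundary hidden among many irrelevant dimensions.
X_a = rng.randn(200, 50)
y_a = (X_a[:, 0] + X_a[:, 1] > 0).astype(int)

# Problem B: a two-dimensional XOR-like structure, not linearly separable.
X_b = rng.randn(500, 2)
y_b = ((X_b[:, 0] > 0) ^ (X_b[:, 1] > 0)).astype(int)

for name, X, y in [("linear-ish", X_a, y_a), ("XOR-like", X_b, y_b)]:
    for model in [LogisticRegression(max_iter=1000), KNeighborsClassifier()]:
        score = cross_val_score(model, X, y, cv=5).mean()
        print(name, model.__class__.__name__, round(score, 2))
# The linear model typically wins on problem A, the nearest-neighbour model on B.
```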

No way around learning data analysis skills

So in essence, there is no way around properly learning data analysis skills. Just as you wouldn't hand a blowtorch to just anyone, you need proper training so that you know what you're doing and can produce robust and reliable results which deliver in the real world. Unfortunately, this training is hard, as it requires familiarity with at least linear algebra and the concepts of statistics and probability theory, stuff which classical coders are often not that well trained in.

Still, it's pretty awesome to have those tools around; back when I started my Ph.D., everyone had their own private stack of code, which is ok if you're in research and need to implement methods from scratch anyway. So we're definitely better off nowadays.

Comments (26)

Trey 2014-02-17

Great post! FYI -- the book links do not appear if you are using an ad blocker.

mikiobraun 2014-02-17

Thanks, I know. Amazon affiliate links, sort of broken as designed...

dv 2014-02-19

Agree with the general sentiment, but as the big data tsunami reaches mainstream customers, the hope is they will act far more conservatively wrt "data management", as they have over the past few decades dealing with relational data. Plus, data governance issues kick in, too.

It is more likely that "data analysis" will not become mainstream, but that the subsequent "data pre-processing" stage using some map/reduce infrastructure (e.g. Hadoop, Storm, Spark, ETL or homegrown) will rapidly commoditize, as it is essentially applying compute power to process large amounts of data as quickly as possible using map/reduce schemas from the data analysis phase. Once the data has been pre-processed, the machine learning application can be built, tested and deployed.

Currently the big data noise is dominated by the Hadoop et al. vendors, but that will change as mainstream customers realize that the real value is at the front with data analysis and at the back with machine learning methods.

Bill Shannon 2014-03-05

I agree with the point that data analysis will not become mainstream but that data preprocessing will. It is very important to distinguish between these stages.

In my lab and LLC we work with groups who have huge data pipeline preprocessing needs which are automated -- brain imaging data, microbiome data, and proteomic data. Within the preprocessing pipelines is huge expertise in mathematics, engineering, computer science, and the subject area.

Our involvement starts when this data is to be analyzed in relation to patient outcomes -- known as translational research. The tools in this step (what I call data analysis) are very different than pipeline tools (what I call data preprocessing).

Hadoop gets the data to a format where translational analysis needs to step in.

mikiobraun 2014-03-12

Hi Bill,

I agree, although when I say "data analysis", I'm thinking more of "data analysis design". Once you've set up your data and found something which solves your case well, that can be automated, of course. Or scaled up, etc.

-M

Rolf 2014-02-23

I already read ESL and plan to buy a Bayesian book plus the McKay one. If you could only own one, which do you prefer: Bishop or Barber? Barber seems more modern / didactic at first sight.

mikiobraun 2014-02-23

Hi, both books are also available online, so you can have a first look and then buy the book if you like it.

David McKay's book: http://www.inference.phy.ca...

David Barber's book: http://web4.cs.ucl.ac.uk/st...

I'd say that David Barber focusses more on the ML part while David McKay goes beyond into Information Theory and Coding Theory. So I'd probably go with David Barber's book.

alexgmcm 2014-02-24

I have used both on my ML course and I find Barber much more comprehensible than Bishop. (also it is available online)

McKay is also extremely clear and well written. I haven't read either of the first two, but I've seen the first one 'Statistical Learning' recommended in many places.

Al 2014-03-04

Good post. Pandas/R, etc. are great and help us to experiment quickly. I think we should also include the fact that the results need to be implemented by people. Check the good writeup in the new book Thinking with Data, which covers these important aspects of data science.

Max 2014-03-08

70% data prep. 30% modelling and interpretation.

joshuaadelman 2014-03-10

The folks who brought you ESL also have a simpler book that they are using to teach their Statistical Learning MOOC through Stanford, and it is available online for free: http://www-bcf.usc.edu/~gar...

Keith Trnka 2014-03-12

Great post. I have to wonder if some of those things can be easier. Perhaps unit testing for preprocessing code can reduce effort in #1. I can't imagine it ever going away; after all debugging non-ML code hasn't gone away or anything.

On #2 I agree entirely and I'd go further - understanding the dangers of poor testing is important but it's incredibly important to be able to explain the value of rigorous testing to a non-technical person.

Andy 2014-03-12

Agree with the sentiment, but there seems to be a short circuit in the argumentation: data analysis and prediction are not necessarily completely overlapping spheres.
A lot of the work I do is not about finding good predictors, but about analysing and finding answers in existing data volumes. Predicting the future is only one of many outcomes of thorough data analysis.

Toolkitifying is good, but even the best tool chest in the world will not make you a good carpenter until you learn how to use your tools and even better in which situations to use which tool. This still requires some knowledge, and the current crop of tools doesn't change that.

Statwing and some of the other attempts are cool, but still don't remove the knowledge of the user from the loop.

mikiobraun 2014-03-12

Hi Andy,

yeah, thanks for pointing that out. Data analysis is of course a lot more than prediction. Things like what LinkedIn is doing to mine their relation graph to find relevant suggestions come to mind.

For the sake of the argument, I was mostly speaking of prediction only, but of course there's much more to it. Although I think the general argument that you need to know what you're doing still holds. Probably all the more for unsupervised methods where there often is no obvious evaluation criterion.

-M

Brent Sitterly 2014-03-19

Super sweet post. I think in the current world of Google Analytics style ease of access to data this topic is so important to keep at the surface. Having conversations with folks recently about this topic revealed to me a real world example of how our math/science education is totally lacking....

Lee Jones 2014-03-27

Great post.

One other book to throw into the mix that I've recommended to several people -- "Guide to Intelligent Data Analysis" with a link to the author site below.

http://www.informatik.uni-k...

Importantly, it covers a process and framework for people to try and guard as much as possible against fooling yourself, where the algorithms themselves are kind of secondary. You end up finding a process that works for you, but it's important to realize that there is an iterative process involved, with constant validation and challenging of results. I've been doing this stuff a long time in the context of learning systems, computational biology, computer security work, etc., and there really isn't an easy way to get good at _data analysis_ short of doing it on real problems and at scale.

Algorithms knowledge is important and is the basic material necessary to start the process. I'd also up-vote your selection as I've got all but the Barber book and they are great resources.

mikiobraun 2014-03-28

Thanks! I fully agree on the "it's best to learn from actual experience". That book also sounds very interesting. I'm currently reading "Data Science for Business" http://shop.oreilly.com/pro... which also emphasizes the process side very much (while being somewhat superficial concerning methods and technology, though).

Philo Janus 2014-04-07

In the mid-90s, Microsoft Access was supposed to "make databases easy" - then high-ticket DBAs got to spend ten years paying their mortgages cleaning up badly designed Access applications.

SharePoint was supposed to "make enterprise content management easy" - now we have eDiscovery teams that bill out at $500/hr to deal with the legal ramifications of business documents scattered across sites because SharePoint was thrown in place then treated like a shared drive.

Every time something becomes "toolified" the problem is that the business doesn't understand that the hard work is still ahead of them - solving the business problems, answering the business questions, and performing the actual governance it takes to make the tools useful. Home Depot sells hammers and lumber, and while some people have the skill and dedication to build their own house, most folks are smart enough to hire someone that knows what they're doing so the thing doesn't fall in and kill their family.

mikiobraun 2014-04-10

Hi Philo,

thanks for your comment, I totally agree. There's always the hope you can somehow control the complexity with just the right tools, but as you have pointed out, again and again blind faith in the power of tools leads to the exact opposite!

-M

Mike 2014-05-21

Hmm, I disagree. This is from the perspective of a programmer:

Visual Basic changed the world. It didn't help anyone solve harder problems, and it didn't make you a better programmer - but it greatly expanded the number of people who *could* program, because the new, simplified tools lowered the barrier to entry for people who wouldn't otherwise be able to get started (and get paid for it).

The result was a huge increase in the pool of programmers (whether or not they `deeply understood' computer science or some other such nonsense), and a consequent huge increase in the amount of programming work that businesses could reasonably hope to get done. There was an unfilled need, a hunger, to automate far more things than had been possible while programming was still `hard'. So businesses simply gobbled these programmers up as soon as they became available and put them to work building their apps. It didn't matter whether or not they were stellar programmers - most business needs are not rocket science (well, not when you have great tools).

The result is that the industry hugely expanded, lots more stuff got done (and more needs got filled) because of it. At a guess, these kinds of "entry level" programmers probably did 80% of all of the programming work that was getting done in the world - which freed the specialists to work on the harder stuff - stuff which wouldn't have gotten started but for the people that could now shoulder the load of taking care of these other business needs. Note that this demand has NOT yet been sated - every new simplification has led to new applications becoming (economically) reasonable, and consequent industry expansion.

I've been programming for 30 years or so, and have seen this happen multiple times. I'm grateful for all those people, and for the tools that helped them become productive - I get to concentrate on more interesting things, which otherwise might not have been possible. And, yes, occasionally I have to clean up after people who got in over their heads - but they usually *don't* fail, and if they do it's just another business need (and it's actually pretty rare).

So look at where your industry is today - the previously arcane, primitive tools (and the high price of really good ones), the slowly emerging toolkits, and the burgeoning pool of talent that they're helping enable. You're about to have your VB moment - that is, become economically feasible for the mass market - and exactly the same thing is about to happen. You'll never regret it.

Disclaimer: I don't even know Visual Basic (or Excel, for that matter).

Cheers :)

Mike (again) 2014-05-21

I might add: yes, it's a simplified story (lots of other stuff helped make this happen - like the emergence of PCs, and, particularly, Lotus 123 - because tools like that helped everyone get used to the idea that this sort of automation was possible). The same is unfolding now, in data science. Enjoy it.

David Hawley 2014-05-30

The problem is how to tell whether the answers are right or not.

Keeping with your example: in software development, you have to ask the user, who is a kind of "oracle" who usually can tell you whether the result is sensible. Many projects fail because the user is not asked (until the last minute) and the results were not correct. So what if there was no knowledgeable user at all?

Louis Dorard 2014-06-23

Great point. By the way, this "VB moment" is now. To stay in the Microsoft world, check out Azure ML: http://azure.microsoft.com/... It hasn't launched publicly yet, but it already has competitors such as BigML and Google Prediction API.

mikiobraun 2014-06-23

Oh, yeah! You're so right! Hadn't seen this connection.

Louis Dorard 2014-05-26

Regarding the No Free Lunch theorem: you could always try a bunch of algorithms, cross-validate, and automatically see which work on the data and which do not? Also, what's your take on Zoubin Ghahramani's Automatic Statistician (http://mlg.eng.cam.ac.uk/?p...

Tirthankar 2014-05-27

Some material on inferential.blogspot.com.au may be relevant to this discussion.
