Wednesday, July 09, 2008

Data Mangling Basics

When you’re new to machine learning, there usually comes a point where you first have to face real data. At that point you suddenly realize that all your training was, well, academic. Real data doesn’t come in matrices, real data has missing values, and so on. Real data is usually just not fit for being directly digested by your favorite machine learning method (well, if it is, consider yourself very lucky).

So you spend some time massaging the data formats until you have something which can be used in, say, MATLAB, but the results you get just aren’t that good. If you’re lucky, there is some senior guy in your lab you can ask for help, and he will usually preprocess the data in ways you’ve never heard of in class.

Actually, I think that this should be taught in class, even though there is no systematic methodology behind it, and no fancy theorems to prove. So here is my by no means exhaustive set of useful preprocessing steps.

So in summary, there are a lot of things to do with your data before you plug it into your learning method. If you do it correctly, you will learn quite a lot about the nature of the data, and you will have transformed your data so that learning becomes more robust.

I’d be interested to know what initial procedures you apply to your data, so feel free to add some comments.

Posted by at July 9, 2008, 15:04
