Data Mangling Basics

When you're new to machine learning, there usually comes a point where you first have to face real data. At that point you suddenly realize that all your training was, well, academic. Real data doesn't come in matrices, real data has missing values, and so on. Real data is usually just not fit for being directly digested by your favorite machine learning method (and if it is, consider yourself very lucky).

So you spend some time massaging the data formats until you have something which can be used in, say, MATLAB, but the results you get just aren't that good. If you're lucky, there is some senior person in your lab you can ask for help, and they will usually preprocess the data in ways you've never heard of in class.

Actually, I think these things should be taught in class, even though there is no systematic methodology behind them and there are no fancy theorems to prove. So here is my by no means exhaustive set of useful preprocessing steps.

Take subsets: You might find this pretty obvious, but I've seen students debugging their methods directly on the full 10,000-instance data set often enough to doubt that. So when you have a new data set and you want to quickly try out several methods, take a random subset of the data that is small enough for your method to handle in seconds, not minutes or hours. The complexity of most data sets is also such that you usually get somewhat close to the achievable accuracy with a few hundred examples.
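
A minimal sketch of taking such a subset, assuming Python with NumPy and a synthetic stand-in for the data. The helper name `random_subset` and the subset size of 300 are illustrative choices, not a recommendation.

```python
import numpy as np

def random_subset(X, y, n_samples, seed=0):
    """Return n_samples randomly chosen rows of X with their matching labels."""
    rng = np.random.default_rng(seed)
    n_samples = min(n_samples, len(X))
    # Sample row indices without replacement so no instance appears twice.
    idx = rng.choice(len(X), size=n_samples, replace=False)
    return X[idx], y[idx]

# Synthetic stand-in for a 10,000-instance data set (hypothetical data).
rng = np.random.default_rng(42)
X_full = rng.normal(size=(10000, 5))
y_full = (X_full[:, 0] > 0).astype(int)

# A few hundred examples are usually enough for quick experiments.
X_small, y_small = random_subset(X_full, y_full, n_samples=300)
print(X_small.shape, y_small.shape)
```

Fixing the seed keeps the subset reproducible, which helps when you compare several methods on it.
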
Plot histograms, look at scatter plots: Even when you have high-dimensional data, it can be very informative to look at histograms of individual coordinates, or scatter plots of pairs of coordinates. This already tells you a lot about the data: What is its range? Is it discrete or continuous? Which directions are highly correlated, and which are not? And so on. Again, this might seem pretty obvious, but often students just run the specified method without looking at the data first.
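
Before opening a plotting tool, the same questions can be answered numerically. The following is only a sketch, assuming Python with NumPy; the three features (`height`, `weight`, `n_children`) are synthetic and made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Hypothetical features: two correlated continuous ones and a discrete one.
height = rng.normal(170, 10, size=n)
weight = 0.9 * (height - 170) + 70 + rng.normal(0, 5, size=n)
n_children = rng.integers(0, 5, size=n).astype(float)
X = np.column_stack([height, weight, n_children])
names = ["height", "weight", "n_children"]

for j, name in enumerate(names):
    col = X[:, j]
    # Histogram counts show the shape of the distribution of one coordinate.
    counts, edges = np.histogram(col, bins=10)
    # Very few distinct values hint at a discrete feature.
    n_unique = len(np.unique(col))
    print(f"{name}: range [{col.min():.1f}, {col.max():.1f}], "
          f"{n_unique} distinct values")
    print("  histogram counts:", counts)

# Pairwise correlations tell you which coordinates move together.
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

The counts returned by `np.histogram` and the columns of `X` can be passed to any plotting library to get the actual histograms and scatter plots.
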
Center and normalize: While most method papers make you think that the method "just works", in reality you often have to do some preprocessing to make it work well. One such step is to center your data and normalize it to unit variance in each direction. Don't forget to save the offset and normalization factors: you will need them to process new data in exactly the same way at prediction time!
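
One way this could look, as a sketch assuming Python with NumPy and synthetic data. The function names `fit_standardizer` and `apply_standardizer` are hypothetical. The point is that the offset and scale are computed on the training data only and then reused unchanged on new data.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature offset and scale from the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled instead of dividing by zero
    return mean, std

def apply_standardizer(X, mean, std):
    """Apply previously saved offset and scale to any data."""
    return (X - mean) / std

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales.
X_train = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(200, 2))
X_test = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(50, 2))

mean, std = fit_standardizer(X_train)
Z_train = apply_standardizer(X_train, mean, std)
# Reuse the saved factors; do not recompute them on the test data.
Z_test = apply_standardizer(X_test, mean, std)

print(Z_train.mean(axis=0), Z_train.std(axis=0))
```
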
Take Logarithms: Sometimes you have to take the logarithm of your data, in particular when the range of the data is extremely large and the density of your data points decreases as the values become larger. Interestingly, many of our own senses, for example hearing and vision, work on a logarithmic scale. Loudness, for example, is measured in decibels, which is a logarithmic scale.
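
A small illustration of the effect, assuming Python with NumPy. The log-normal sample is a synthetic stand-in for a strictly positive, heavy-tailed feature, and `skewness` is a helper defined here only for the comparison.

```python
import numpy as np

def skewness(v):
    """Sample skewness: third moment of the standardized values."""
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
# Hypothetical heavy-tailed feature spanning many orders of magnitude.
x = rng.lognormal(mean=3.0, sigma=2.0, size=1000)

# The plain logarithm requires strictly positive values;
# for data containing zeros, np.log1p(x) is a common alternative.
x_log = np.log(x)

print("max/min before:", x.max() / x.min())
print("skewness before:", skewness(x))
print("skewness after: ", skewness(x_log))
```

After the transform the values are spread much more evenly, which is typically what methods relying on distances or variances need.
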
Remove Irrelevant Features: Some kernels are particularly sensitive to irrelevant features. For example, with a Gaussian kernel, each feature which is irrelevant for prediction increases the number of data points you need to predict well, because for practically each realization of the irrelevant feature you need additional data points in order to learn well.
Plot Principal Values: Often, many of your features are highly correlated. For example, the height and weight of a person are usually quite correlated. A large number of correlated features means that the effective dimensionality of your data set is much smaller than the number of features. If you plot the principal values (the eigenvalues of the sample covariance matrix $C^TC$, where $C$ is the centered data matrix and normalization constants are ignored), you will usually notice that there are only a few large values, meaning that a low-dimensional subspace already contains most of the variance of your data. Also note that a projection onto those dimensions is the best low-dimensional approximation with respect to the squared error, so using this information, you can transform your problem into one with fewer features.
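
The computation can be sketched as follows, assuming Python with NumPy. The data is synthetic (ten features driven by two hidden factors), and the 99% variance threshold is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical data: 10 observed features generated from 2 latent factors.
latent = rng.normal(size=(n, 2))
mixing = np.vstack([np.ones(10), np.linspace(-1.0, 1.0, 10)])
X = latent @ mixing + 0.05 * rng.normal(size=(n, 10))

# Center the data, then form the sample covariance matrix C^T C / (n - 1).
C = X - X.mean(axis=0)
cov = C.T @ C / (n - 1)

# eigh is meant for symmetric matrices and returns ascending eigenvalues.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Cumulative fraction of the variance captured by the leading directions.
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.99) + 1)

# Project onto the k leading principal directions.
X_reduced = C @ eigvecs[:, :k]
print("principal values:", np.round(eigvals, 3))
print("components needed for 99% of the variance:", k)
```

Plotting `eigvals` (often on a logarithmic axis) gives the picture described above: a few large values followed by a long flat tail.
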
Plot Scalar Products of the output variable with Eigenvectors of the Kernel Matrix: While principal values only tell you something about the variance in the input features, the scalar products between the eigenvectors of the kernel matrix and the output variable in a supervised learning setting show you how many (kernel) principal components you need to capture the relevant information about the learning problem. The general shape of these coefficients can also tell you whether your kernel makes sense or not. Watch out for an upcoming JMLR paper for more details.
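
A rough sketch of this diagnostic, assuming Python with NumPy, a Gaussian kernel of width 1 (an arbitrary choice) and a synthetic one-dimensional regression problem. It illustrates the idea only and is not taken from the paper mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical regression problem: a noisy sine on [-3, 3].
x = np.sort(rng.uniform(-3.0, 3.0, size=n))
y = np.sin(x) + 0.1 * rng.normal(size=n)

# Gaussian kernel matrix; the width is an assumption made for this example.
width = 1.0
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * width ** 2))

# Eigenvectors of the symmetric kernel matrix, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scalar products between each eigenvector and the output variable.
coeffs = eigvecs.T @ y

# Fraction of the squared norm of y captured by the leading components.
energy = np.cumsum(coeffs ** 2) / np.sum(coeffs ** 2)
top10_fraction = energy[9]
print("share of y in the 10 leading components:", top10_fraction)
```

Plotting `np.abs(coeffs)` against the component index shows the shape in question: when the kernel fits the problem, a handful of leading coefficients are large and the rest sit at noise level.
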
Compare with k-nearest neighbors: Finally, before you apply your favorite method to the data set, try something really simple like k-nearest neighbors. This can give you a good idea of what kind of accuracy you can expect.
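
A baseline of this kind fits in a few lines. The sketch below assumes Python with NumPy, Euclidean distance, non-negative integer class labels and a synthetic two-class problem; `knn_predict` is a hypothetical helper, and the brute-force distance matrix is only suitable for small subsets.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Classify each test point by majority vote among its k nearest training points."""
    # Squared Euclidean distances, shape (n_test, n_train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k closest training points for every test point.
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest]
    # Majority vote per row; requires non-negative integer labels.
    return np.array([np.bincount(row).argmax() for row in votes])

rng = np.random.default_rng(0)

def make_blobs(n_per_class):
    """Hypothetical data: two Gaussian blobs centered at (-3, 0) and (3, 0)."""
    X0 = rng.normal(loc=[-3.0, 0.0], size=(n_per_class, 2))
    X1 = rng.normal(loc=[3.0, 0.0], size=(n_per_class, 2))
    return np.vstack([X0, X1]), np.repeat([0, 1], n_per_class)

X_train, y_train = make_blobs(100)
X_test, y_test = make_blobs(50)

y_pred = knn_predict(X_train, y_train, X_test, k=5)
accuracy = float(np.mean(y_pred == y_test))
print("k-NN baseline accuracy:", accuracy)
```

If a more elaborate method cannot beat this number, it is worth checking the preprocessing before tuning the method.
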

So in summary, there are a lot of things to do with your data before you plug it into your learning method. If you do them correctly, you will learn quite a lot about the nature of the data, and you will have transformed it so that learning becomes more robust.

I'd be interested to know what initial procedures you apply to your data, so feel free to add some comments.
