Data Base vs. Data Science
One thing Big Data has certainly accomplished is bringing the database/infrastructure community and the data analysis/statistics/machine learning communities closer together. As always, each community had its own set of models, methods, and ideas about how to structure and interpret the world. You can still tell these differences when looking at current Big Data projects, and I think it's important to be aware of the distinctions in order to better understand the relationships between different projects.
Because, let's face it, every project claims to re-invent Big Data. With Hadoop and MapReduce being something like the founding fathers of Big Data, other projects have since appeared. Most notably, there are stream processing projects like Twitter's Storm, which move from batch-oriented processing to event-based processing better suited for real-time, low-latency work. Spark is yet something different: a bit like Hadoop, but with greater emphasis on iterative algorithms and in-memory processing, which is how it achieves that landmark "100x faster than Hadoop" every current project seems to need to sport. Twitter's summingbird project tries to bridge the gap between MapReduce and stream processing by providing a high-level set of operators which can then run on either MapReduce or Storm.
However, both Spark and summingbird leave me somewhat flat, because you can see that they come from a database background, which means that there will still be a considerable gap to serious machine learning.
So what exactly is the difference? In the end, it's the difference between relational algebra and linear algebra. In the database world, you model relationships between objects, which you encode in tables, using foreign keys to link up entries between different tables. Probably the most important insight of the database world was to develop a query language: a declarative description of what you want to extract from your database, leaving the optimization of the query and the exact details of how to perform it efficiently to the database guys.
The machine learning community, on the other hand, has its roots in linear algebra and probability theory. Objects are usually encoded as feature vectors, that is, lists of numbers describing different properties of an object. Data is often collected in matrices where each row corresponds to an object and each column to a feature, not unlike a table in a database.
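To make the contrast concrete, here is a minimal sketch of the two views of the same data. It's illustrative only and assumes Python with pandas and numpy (neither of which appears in the post); the column names are made up.

```python
import numpy as np
import pandas as pd

# Database view: a table of objects, one row per object, columns as attributes,
# keys to link this table to others.
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "age":     [23, 45, 31],
    "income":  [38000.0, 92000.0, 55000.0],
})

# Machine learning view: the same objects as a numerical feature matrix,
# one row per object, one column per feature, ready for linear algebra.
X = users[["age", "income"]].to_numpy()   # shape (n, d) = (3, 2)
print(X @ X.T)                            # a matrix product, not a join
```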
However, the operations you perform in order to do data analysis are quite different from the database world. Take something as basic as linear regression: you try to learn a linear function $f(x) = \sum_{i=1}^d w_ix_i$ in a $d$-dimensional space (that is, where your objects are described by a $d$-dimensional vector), given $n$ examples $X_i$ and $Y_i$, where $X_i$ are the features describing your objects and $Y_i$ is the real number you attach to $X_i$. One way to "learn" $w$ is to tune it such that the quadratic error on the training examples is minimal. The solution can be written in closed form as $$ w = (X X^T)^{-1}X Y $$ where $X$ is the matrix built from the $X_i$ (putting the $X_i$ in the columns of $X$), and $Y$ is the vector of outputs $Y_i$.
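Just to illustrate, here's what the closed-form solution looks like when taken literally. This is a minimal numpy sketch (numpy is my choice, not something from the post), following the convention above of putting the examples $X_i$ in the columns of $X$, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1000

X = rng.normal(size=(d, n))                    # examples X_i as columns
w_true = rng.normal(size=d)
Y = X.T @ w_true + 0.1 * rng.normal(size=n)    # noisy targets Y_i

# The closed form w = (X X^T)^{-1} X Y, taken literally. In practice you would
# not form the inverse explicitly but solve the linear system, as discussed next.
w = np.linalg.inv(X @ X.T) @ (X @ Y)
print(np.abs(w - w_true).max())                # small estimation error
```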
In order to solve this, you need to solve the linear equation $(X X^T)w = XY$, which can be done by one of a large number of algorithms: Gaussian elimination, which you probably learned in your undergrad studies, the conjugate gradient algorithm, or first computing a Cholesky decomposition. All of these algorithms have in common that they are iterative: they go through a number of operations, for example $O(d^3)$ of them in the Gaussian elimination case, and they need to store intermediate results. Gaussian elimination and Cholesky decomposition use rather elementary operations acting on individual entries, while the conjugate gradient algorithm performs a matrix-vector multiplication in each iteration.
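As an illustration of that last point, here is a textbook conjugate gradient sketch (plain numpy, not from the post) for a symmetric positive definite system such as $(X X^T)w = XY$; note that the only non-trivial operation per iteration is a matrix-vector product.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive definite matrix A (textbook CG)."""
    x = np.zeros_like(b)
    r = b - A @ x                     # residual
    p = r.copy()                      # search direction
    rs = r @ r
    for _ in range(max_iter or len(b)):
        Ap = A @ p                    # the one matrix-vector product per iteration
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # small SPD example
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))           # ~ [0.0909, 0.6364]
```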
Most importantly, these algorithms can only be expressed very badly in SQL! It's certainly not impossible, but you'd need to store your data in very different ways than you would in idiomatic database usage.
So it's not about whether or not your framework can support iterative algorithms without significant latency; it's about understanding that joins, group-bys, and count() won't get you far: you need scalar products, matrix-vector and matrix-matrix multiplications. You also don't need indices for most ML algorithms (except maybe for quickly finding the k-nearest neighbors), because most algorithms either take in the whole data set in each iteration or stream the whole set past some model which is iteratively updated, as in stochastic gradient descent. I'm not sure projects like Spark or Stratosphere have fully grasped the significance of this yet.
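To make the "stream the whole set past some model" idea concrete, here is a minimal stochastic gradient descent sketch for the same linear regression (plain Python/numpy, made-up data, arbitrarily chosen learning rate): the model sees one example at a time and updates the weight vector in place; the only operations are scalar products and vector additions, with no joins or indices anywhere.

```python
import numpy as np

def sgd_linear_regression(stream, d, lr=0.01):
    """Stream (x, y) pairs past a linear model, updating w after each example."""
    w = np.zeros(d)
    for x, y in stream:
        err = w @ x - y           # scalar product: the prediction error
        w -= lr * err * x         # gradient step on the squared error
    return w

# Toy usage: a generator that "streams" synthetic examples past the model.
rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0, 0.5])
stream = ((x, x @ w_true) for x in rng.normal(size=(5000, 3)))
print(sgd_linear_regression(stream, d=3))   # approaches w_true
```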
Database/infrastructure-inspired Big Data has its place when it comes to extracting and preprocessing data, but eventually you move from database land to machine learning land, which invariably means linear algebra land (or probability theory land, which often also reduces to linear-algebra-like computations). What often happens today is that you either have to painstakingly break down your linear algebra into MapReduce jobs, or you actively look for algorithms which fit the database view better.
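For illustration, here is roughly what breaking the linear algebra down into MapReduce-style jobs amounts to for the regression example above: the normal equations are assembled from per-example partial sums via a map and a reduce. This is plain Python (functools, numpy) on synthetic data, not tied to any actual MapReduce framework.

```python
import numpy as np
from functools import reduce

def mapper(record):
    # Emit the per-example partial statistics x x^T and y * x.
    x, y = record
    return np.outer(x, x), y * x

def reducer(a, b):
    # Sum up the partial statistics; the result is (X X^T, X Y).
    return a[0] + b[0], a[1] + b[1]

rng = np.random.default_rng(2)
w_true = np.array([1.0, 2.0, 3.0])
examples = [(x, x @ w_true) for x in rng.normal(size=(1000, 3))]

XXt, XY = reduce(reducer, map(mapper, examples))
w = np.linalg.solve(XXt, XY)
print(w)    # ~ [1.0, 2.0, 3.0]
```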
I think we're still at the beginning of what is possible. Or, to be a bit more aggressive: claims that existing (infrastructure-, database-, parallelism-inspired) frameworks provide you with sophisticated data analytics are wildly exaggerated. They take care of a very important problem by giving you a reliable infrastructure to scale your data analysis code, but there's still a lot of work that needs to be done on your side. High-level DSLs like Apache Hive or Pig are a first step in this direction, but still too much rooted in the database world IMHO.
In summary, one should be aware of the difference between a framework which is mostly concerned with scaling and a tool which actually provides some piece of data analysis. And even if it comes with basic database-like analytics mechanisms, there is still a long way to go to do serious data science.
That's why we also think that streamdrill occupies an interesting spot: it is a bit of infrastructure, allowing you to process a serious amount of event data, but it also provides valuable analysis, based on algorithms you wouldn't want to implement yourself, even if you had some Big Data framework like Hadoop at hand. That's a direction I would like to see more of in the future.
Note: Just saw that Spark has a logistic regression example on their landing page. Well, doing matrix operations explicitly via map() on collections doesn't count in my view ;)
Comments (8)
Good stuff, I definitely agree with you, Mikio. The difference between traditional databases and Big Data lies in the different approach to machine learning and knowledge extraction, which can only be achieved by analyzing the whole dataset (not just a subset) in order to exploit the inner and unknown relations within the data itself...
Very interesting. I was just reading an article with a somewhat different perspective, here's a quote regarding future directions for SQL and RDBMS.
"Co-optimizing storage and queries for linear algebra: There are many choices for laying out matrices across nodes in a cluster [2]. We believe that all of them can be captured in records, with linear algebra methods written over them in SQL or MapReduce syntax." http://db.cs.berkeley.edu/p...
I am just learning about this area and not able to comment, but I'd be interested in your perspective on this other viewpoint?
thanks for the link, looks interesting. At first sight it looks like they've added storage-optimized linear algebra functions to SQL. I'm not really sure whether SQL is still the right language to encode this kind of computation. Interestingly, iterative procedures are still missing. I also wonder what they do when there are conditionals in the code. Right now, the code fragments show little bits and pieces of real algorithms, but at first sight it looks like they stayed clear of the complex things.
Hi Mikio!
So - SciDB (http://www.scidb.org) architect here. For readers wondering if they should be interested in reading the long note that follows, the TL;DR story is that SciDB is an open-source DBMS (like Postgres/MySQL only our license is GPL3) built on a massively-parallel architecture (like Teradata and DB2) but supporting an array data model (everything in SciDB is an n-dimensional array, and the query language's basic building blocks are based on an array-algebra). Still interested? Read on ...
Up front, I want to say that your particular set of observations/complaints about how SQL engines and map/reduce frameworks function--both at the data model level and the detailed-design-choice level--are exactly the set of complaints that motivated the development of SciDB. Within SciDB, we support linear algebra operators at the level of "GESVD ( Array, 'left|right|U')" or "multiply ( A, B, C )", which computes "A * B + C". But it's important to grok that these operators fall out of the fact that we start with an array data model, and these are obvious ways of exploiting it.
Now, in contrast with some of the other approaches to supporting massively parallel linear algebra--which start with an execution framework like a relational tuple calculus or a map/reduce model--we went with the approach of asking the small but tremendously experienced community of people who did large scale linear algebra for a living (ScaLAPACK, for example) what to do, and copied them. In other words, our architectural approach tries to follow the best practices of that community, rather than trying to shoe-horn complex algorithms into the latest "kewl dewd!" architecture. Talk to those guys, and the entire conversation is about block-cyclic chunking, block partitioned algorithms, and so on. They don't talk about mappers and reducers, or about tuple streams, sorts and joins.
From our point of view, having single architecture goggles is very limiting. Pick a platform that most naturally fits the kinds of algorithms you're running and use that.
Of course, having bashed 'em, I'm now going to walk back some of the "those guys are really dumb" flavor in the previous remarks. Once you try building a platform that's a general purpose tool for something like analytics or machine learning, you pretty quickly figure out that to be useful, you need to do a lot more than just run the algorithms. Most analytic systems combine data from multiple sources, have multiple users, with multiple analytic objectives, multiple client tools. Consequently, you end up needing a lot of the things that good ol' SQL systems provide--transactions, query languages, open APIs--and a lot of things that Hadoop provides--elasticity, resilience in the face of failure, open stack physical building block computers.
From our point of view, it's important not to throw babies out with the bath water. We haven't built a "linear algebra only" system, because the overall task is going to involve a lot of other features that SQL & Map/Reduce have figured out how to provide.
Make sense?
thanks for this insightful comment. I fully agree that linear algebra alone is not the right way to go, as the "classical" MapReduce kind of jobs are very well suited for the initial data extraction and feature transformation step. SciDB looks very interesting, thanks for sharing!
Hi Mikio, thanks for that insightful article!
I agree very much with your observation that there is a big discrepancy between the language and thinking of DB software engineers vs Data Scientists. That's actually something I experience in my own job a lot, either in talking to the IT department of the customer or to our own DB engineers. I often have to do a lot of explaining what it is that I actually do and how having a Hadoop or SQL environment by itself doesn't really enable me to conduct proper analytical work.
streamdrill indeed seems like an interesting step forward. What algorithms do you currently have implemented? From what I understand, I imagine it could be interesting to implement online learning algorithms on top of it.
I also want to give you a pointer to MADlib (www.madlib.net), which is an Open-Source SQL library for Machine Learning innovated in my organization. I'm also not really a big fan of the SQL syntax, but MADlib scales very well on top of our DB and we use it successfully in a lot of customer engagements. Still, I'm very excited to see what future developments we will see in this space.
Best, Alex
Hi Alex,
interesting insights. There is a technology demo of streamdrill which you can download here: http://streamdrill.com. This talk http://de.scribd.com/doc/13... has some ideas about what's possible.
Best, Mikio
Well, that gap is what keeps engineers like myself gainfully employed! :-)