Tuesday, February 1, 2011

MapReduce and LSI: Stochastic projection, or SSVD - part I

Over the past half a year or so, my company has had a need for a good, scalable method for Latent Semantic Analysis and related techniques. Since we are essentially a Hadoop shop, everything MapReduce comes naturally to us. So the hope was to find an open source LSA/LSI implementation built as a series of Hadoop jobs.

The need wasn't the most pressing one, though, so I took my time researching the options. Actually, I started looking for a good scalable SVD algorithm even before I joined my current company. Mahout seemed to be the most promising project in this direction. The only problem was that Mahout hadn't had (and, chances are, at the time you are reading this, still doesn't have) an end-to-end LSA pipeline. It did have most of what one would need: a bigram/trigram (or even n-gram) selection method based on the log-likelihood ratio (sketched below); pluggable lemma analysis; and a vectorization framework over sequence files that one can run directly in MapReduce. So the only missing parts were a fast, parallelizable SVD method and a more or less helpful vector-space index builder, so we could turn LSA into LSI.
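Just to give a flavor of that collocation scoring (this is not Mahout's code; the class and method names here are mine, purely for illustration), here is a tiny, self-contained Java sketch of Dunning's log-likelihood ratio computed from the 2x2 contingency table of a candidate bigram:

// Illustrative only: Dunning's log-likelihood ratio (G^2) for scoring a candidate
// bigram "A B" from its 2x2 contingency table of counts. Mahout ships its own
// implementation of this statistic; this is just a self-contained sketch.
public class LlrSketch {

    // "Entropy" of raw counts in the un-normalized form used for G^2:
    // xLogX(sum) - sum(xLogX(count)).
    private static double entropy(long... counts) {
        long total = 0;
        double sumXLogX = 0.0;
        for (long k : counts) {
            total += k;
            if (k > 0) {
                sumXLogX += k * Math.log(k);
            }
        }
        return total == 0 ? 0.0 : total * Math.log(total) - sumXLogX;
    }

    // k11 = count("A B"), k12 = count("A not-B"),
    // k21 = count("not-A B"), k22 = count of all remaining bigrams.
    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    public static void main(String[] args) {
        // Hypothetical counts for one candidate bigram in a small corpus.
        System.out.println(logLikelihoodRatio(110, 2442, 111, 29114));
    }
}

Roughly speaking, candidate n-grams whose score clears a chosen threshold get kept as single tokens during vectorization; the rest are dropped.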

Over the years LSA has been a desirable and easy-to-understand target for researchers and NLP engineers alike. The problem with LSA, though, has always been that the SVD algorithms underlying it were inherently slow, in-RAM or supercomputer methods. The numerical analysis community wasn't paying enough attention to cloud environments (and perhaps still isn't). So an SVD method for MapReduce or similar cloud processing environments largely remained an unattainable target.

Truth be told, Mahout did gain a distributed Lanczos method, which is now available in Mahout 0.4. Compared to any stochastic method, it certainly provides outstanding computational precision. In discussions with the Mahout community, however, the stochastic method emerged as more suitable for bigger document corpora. Primarily, it is not clear that the Lanczos method can be called fast enough to be practical for a big corpus -- or truly parallel in a cloud sense: from what I read, it is still iterative even in its MapReduce incarnation and requires a separate MapReduce run for every singular value computed. (Things may have changed since I read about this, and I apologize in advance if this does not accurately depict current reality.) It is also presumably computationally heavier (including on the front end) than using stochastic projection preprocessing.

Stochastic projection methods have been outlined really well recently; there is an excellent published study [Halko, et al.]. To get an idea of how fast this approach can be, check out the performance numbers published for the redsvd project. Those are really amazing numbers. I don't think Lanczos or ScaLAPACK methods could come close to those in terms of speed, core for core. (I suspect, though, that Java-based code would never be able to match that performance exactly, since pure Java doesn't expose SIMD instructions.)

The idea of stochastic SVD is to reduce the problem's dimensions by using a random projection while keeping the major driving factors of the dataset more or less intact. As a result, the problem may be reduced in size quite a bit. In terms of a MapReduce implementation, we'd run the projection and preprocessing as MapReduce jobs (the bulk number crunching) and then solve the small problem in a jiffy inside the front-end process (the so-called "driver" in MapReduce lingo); a rough outline of the projection scheme follows below. The trade-off is a rather heavy loss in precision compared to other methods. However, problems like LSI are quite approximate to begin with: they are based on term collocation frequency analysis, which is circumstantial for a given corpus (or a given person's lifetime corpus, if we really try to compare the method's output to human assessments). Bottom line: as such, the stochastic SVD method is unlikely to be suitable for crunching numbers for a rocket booster design, but it is quite likely what bulk computational linguistics needs.
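For a rough sense of what the projection buys us, here is the basic scheme from [Halko, et al.] (the notation is mine, and this is a sketch of the general idea rather than a description of any particular implementation). A is the m x n input matrix, k is the number of singular values we want, and p is a small oversampling amount:

\begin{align*}
\Omega &\in \mathbb{R}^{n \times (k+p)} && \text{random Gaussian matrix} \\
Y &= A\,\Omega && \text{projection, } Y \in \mathbb{R}^{m \times (k+p)} \\
Y &= Q\,R && \text{thin QR decomposition, } Q \text{ orthonormal} \\
B &= Q^{\top} A && \text{small } (k+p) \times n \text{ matrix} \\
B &= \hat{U}\,\Sigma\,V^{\top} && \text{exact SVD of the small problem} \\
U &= Q\,\hat{U}, \qquad A \approx U\,\Sigma\,V^{\top}
\end{align*}

Forming Y and B requires bulk passes over the data, which map naturally onto MapReduce jobs; the SVD of the small matrix B is what the driver can solve locally.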
