Here's the doc; it's also attached to Mahout-593. A wiki update is due at some point, once we know what it's all good for.
Also, from my email regarding the -s parameter:
There are two cases where you might want to adjust -s:

1. You are running really huge input that produces more than ~1,000 map tasks, and/or is causing OOMs in some tasks in some situations. It looks like your input is far from that now.

2. You have quite wide input -- realistically, more than 30k non-zero elements in a row. The way the current algorithm works, it tries to do a blocking QR of the stochastically projected rows in the mappers, which means it needs to read at least k+p rows in each split (map task). This can be fixed, and I have a branch that should eventually address it. In your case, if there happen to be splits that contain fewer than 110 rows of input (i.e., fewer than your k+p), that is when you might want to start setting -s greater than the DFS block size (64 MB); it has no effect below that, which is why Hadoop calls it the _minimum_ split size. I don't remember Hadoop's exact definition of this parameter, but I think it is in bytes, so you would probably need to specify something like 100,000,000 to start seeing a decrease in the number of map tasks. Honestly, though, I have never tried this, since I have never had input wide enough to require it. A rough sizing sketch and the underlying Hadoop setting follow below.
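
To make the arithmetic concrete, here is a back-of-envelope sketch. The numbers are made up for illustration (k=100, p=10 so k+p=110, ~60k non-zeros per row at roughly 12 bytes per element); none of them come from the solver itself:

    // Back-of-envelope check: does a default 64 MB split hold at least k+p rows?
    // All constants below are hypothetical, chosen only to illustrate the math.
    public class SplitSizeCheck {
      public static void main(String[] args) {
        int k = 100;                   // decomposition rank (hypothetical)
        int p = 10;                    // oversampling (hypothetical)
        long nonZerosPerRow = 60000L;  // a "wide" row (hypothetical)
        long bytesPerElement = 12L;    // rough index+value cost (hypothetical)
        long dfsBlockSize = 64L * 1024 * 1024;

        long bytesPerRow = nonZerosPerRow * bytesPerElement; // ~720 KB per row
        long rowsPerSplit = dfsBlockSize / bytesPerRow;      // ~93 rows < k+p

        if (rowsPerSplit < k + p) {
          // Splits must hold at least k+p rows; add ~25% headroom (arbitrary).
          long minSplit = (long) ((k + p) * bytesPerRow * 1.25);
          System.out.println("set -s to at least " + minSplit + " bytes"); // ~99,000,000
        } else {
          System.out.println("default splits already hold enough rows");
        }
      }
    }

With those assumed row widths, a 64 MB split holds only ~93 rows, so you land right around the 100,000,000-byte ballpark mentioned above.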
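
For reference, what -s boils down to at the Hadoop level is, as far as I understand, just the job's minimum-split-size configuration value in bytes (the exact wiring inside the SSVD CLI is my assumption here, and the config key name depends on the Hadoop version):

    // Minimal sketch of setting the minimum split size directly on a Hadoop
    // Configuration; assumes hadoop-core is on the classpath.
    import org.apache.hadoop.conf.Configuration;

    public class MinSplitConf {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Old-API key (Hadoop 0.20/1.x); newer Hadoop versions use
        // "mapreduce.input.fileinputformat.split.minsize" instead.
        conf.setLong("mapred.min.split.size", 100000000L); // value is in bytes
        System.out.println(conf.getLong("mapred.min.split.size", 0L));
      }
    }

Again, since it is a minimum, any value at or below the DFS block size leaves the split layout unchanged.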