Algorithmic theory of MapReduce?
Last Friday, Il Professore gave a talk at Google NY on 'algorithmic theory of mapreduce'. As is his style, the talk was to draw attention to a topic and provide food for thought, technical results riding along mostly to make a point. Muthu spent a good chunk of time talking about the model and results from A Model of Computation for MapReduce by Karloff, Suri and Vassilvitskii. Let me chime in with some comments.
On a practical note, the MapReduce library we have at Google (with Hadoop being the open source equivalent) is a great tool that simplifies many data processing tasks. The typical use of MapReduce is to process logged data of some sort: search queries, records of auctions being run to sell ads on a page etc. Typically, the input data file contains a lot of extraneous information that is irrelevant to the task at hand. For example, information identifying machine serving the request, request timestamp or the list of matching ads is irrelevant if we are simply interested in counting how many times a certain query string was submitted to a search engine. Simply filtering out unneeded data (or better, not even reading it) drops the input size by an order of magnitude. Real world pipelines need to join data from multiple sources. For example, list of ad impressions found to be spam may be kept separately from a log file recording all ads shown, and determining the number of non-spammy impressions requires joining these two data sources. Similarly, if each ad impression is recorded in the log only by the ad-id, figuring out the country of the advertiser requires joining with a data store. If you are dealing with a log of transactions among participants in different timezones (e.g. the timezone of the viewer of an ad is typically different from the timezone of the advertiser, which is again likely different from the timezone of the server) and want to keep statistics in each participant's own timezone, you have yet another headache to deal with.
The previous paragraph was meant to say that for 95% of its users, MapReduce is a data manipulation tool, not a computational model: think SQL, not a Turing machine. Questions in this area are those of expressing queries, composition, query optimization etc. Dremel and Flume are useful examples in this direction.
Of the remaining 5% of users, a few do crazy stuff like using MapReduce as a convenient tool to launch hundreds of workers with a single command, and a few brave souls may even do things like training a machine learning model via MapReduce. Some care about computing stuff on large graphs, and may find tools like Pregel a better fit for their needs.
Now, for the remaining 1% of us who profess being theoretical algorithmicists. I am not terribly well versed in models of parallel/distributed computation, but I do have a feeling that quite a few papers have been written over the past 30+ years (here's one by Leslie Valiant mentioned in the Pregel post).
Here's a model of MapReduce for the theoretical algorithmicist. We have N machines, named 0..N-1 but otherwise identical. Each machine has local memory of O(N) words of O(log N) bits. Each machine is capable of arbitrary (randomized) polynomial time computation. Initially, the input to the computation is distributed among the machines, and each machine starts with a portion of input in its memory. A MapReduction is a computation consisting of a two local computation steps separated by a communication step. In the first computation step, each machine gets to run any local computation on its input that runs in poly time and fits within its local memory. The output of the first computation step is a set of messages to be sent to other machines, each message being annotated with the id of the recipient. In the communication step, each message gets delivered to its recipient. In the second local computation step, each machine processes messages it received using an arbitrary algorithm (that fits in space O(N) and time poly(N)). (If a machine in the second round wishes to access its local state from the first round, it can send it to itself as a message).
Comments