Distributed Programming with MapReduce > Other MapReduce Examples

23.3. Other MapReduce Examples

We'll examine the implementation of a much more sophisticated driver program that automatically runs MapReduce programs on large-scale clusters of machines in a moment, but first, let's consider a few other problems and how they can be solved using Map-Reduce:


Distributed grep

The Map function emits a line if it matches a supplied regular expression pattern. The Reduce function is an identity function that just copies the supplied intermediate data to the output.


Reverse web-link graph

A forward web-link graph is a graph that has an edge from node URL1 to node URL2 if the web page found at URL1 has a hyperlink to URL2. A reverse web-link graph is the same graph with the edges reversed. MapReduce can easily be used to construct a reverse web-link graph. The Map function outputs <target, source> pairs for each link to a target URL found in a document named source. The Reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list of source URLs>.


Term vector per host

A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. The Map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document). The Reduce function is passed all per-document term vectors for a given host. It adds these term vectors, throwing away infrequent terms, and then emits a final <hostname, term vector> pair.


Inverted index

An inverted index is a data structure that maps from each unique word to a list of documents that contain the word (where the documents are typically identified with a numeric identifier to keep the inverted index data relatively compact). The Map function parses each document and emits a sequence of <word, docid> pairs. The Reduce function accepts all docids for a given word, sorts the corresponding document IDs, and emits a <word, list of docids> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions within each document.


Distributed sort

MapReduce can also be used to sort data by a particular key. The Map function extracts the key from each record, and emits a <key, record> pair. The Reduce function emits all pairs unchanged (i.e., the identity Reduce function). This computation depends on the partitioning facilities and ordering properties described later in this chapter.

There are many more examples of computations that can easily be expressed as a Map-Reduce computation. For more complex computations, it is often easy to express them as a sequence of MapReduce steps or as an iterative application of a MapReduce computation, where the output of one MapReduce step is the input to the next MapReduce step.

One you start thinking of data processing problems in terms of MapReduce, they are often relatively easy to express. As some testament to this, over the last four years, the number of MapReduce programs at Google has gone from a small handful of candidate problems in March 2003 (when we started to design MapReduce) to more than 6,000 distinct MapReduce programs in December 2006. These programs were written by more than a thousand different software developers, many of whom had never written a parallel or distributed program before using MapReduce.