SQL GROUP BY allows you to perform aggregate functions on data
sets; To count all of the stores in each state, to average a series of
related numbers, etc. MongoDB has
some aggregate functions but they are fairly limited in scope. The MongoDB group function also suffers from the fact that it does not work on
sharded
configurations. So how do you perform grouped queries using MongoDB? By
using MapReduce functions of course (you read the title right?)
Understanding MapReduce
Understanding MapReduce requires, or at least is made much easier by, understanding functional programming concepts.
map and
reduce (fold, inject) are functions that come from Lisp and have been inherited by a lot of languages (Scheme, Smalltalk, Ruby, Python).
- map
- A higher-order function which transforms a list by applying a
function to each of its elements. Its return value is the transformed
list. In MongoDB terms, the map is a function that is run for each
Document in a collection and can return a value for that row to be
included in the transformed list.
- reduce
- A higher-order function that iterates an arbitrary function over a
data structure and builds up a return value. The reduce function takes
the values returned by map and allows you to run a function to
manipulate those values in some way.
Some Examples
Let’s start with some sample data:
db.factories.insert( { name: "Miller", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Lakefront", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Point", metro: { city: "Steven's Point", state: "WI" } } );
db.factories.insert( { name: "Pabst", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Blatz", metro: { city: "Milwaukee", state: "WI" } } );
db.factories.insert( { name: "Coors", metro: { city: "Golden Springs", state: "CO" } } );
db.factories.find()
Lets say I want to count the number of factories in each of the
cities (ignore the fact that I could have the same city in more than one
state, I don’t in my data). For a count, I write a function that
“emits” the group by key and a value that you can count. It can be any
value, but for simplicity I’ll make it 1.
emit() is a MongoDB server-side function that you use to identify a value in a row that should be added to the transformed list. If
emit() is not called then the values for that row will be excluded from the results.
mapCity = function () {
emit(this.metro.city, 1);
}
The next piece is the
reduce() function. The reduce function will be passed a key and an array of values that were collected by the
map()
function. I know my map function returns a 1 for each row keyed by
city. So the reduce function will be called with a key “Golden Springs”
and a single-element array containing a 1. For “Milwaukee” it will be
passed an 4-element array of 1s.
reduceCount = function (k, vals) {
var sum = 0;
for (var i in vals) {
sum += vals[i];
}
return sum;
}
With those 2 functions I can call the mapReduce function to perform my Query.
res = db.factories.mapReduce(mapCity, reduceCount)
db[res.result].find()
This results in:
{ "_id" : "Golden Springs", "value" : 1 }
{ "_id" : "Milwaukee", "value" : 4 }
{ "_id" : "Steven's Point", "value" : 1 }
Counting is not the only thing I can do of course. Anything can be
returned by the map function including complex JSON objects. In this
example I combine the names of all of the Factories in a given City into
a simple comma-separated list.
mapCity = function () {
emit(this.metro.city, this.name);
}
reduceNames = function (k, vals) {
return vals.join(",");
}
res = db.factories.mapReduce(mapCity, reduceNames)
db[res.result].find()
Give you:
{ "_id" : "Golden Springs", "value" : "Coors" }
{ "_id" : "Milwaukee", "value" : "Miller,Lakefront,Pabst,Blatz" }
{ "_id" : "Steven's Point", "value" : "Point" }