mincemeat.py is a Python implementation of the MapReduce distributed computing framework.
mincemeat.py is:
- Lightweight - All of the code is contained in a single Python file (currently weighing in at <13kB) that depends only on the Python Standard Library. Any computer with Python and mincemeat.py can be a part of your cluster.
- Fault tolerant - Workers (clients) can join and leave the cluster at any time without affecting the entire process.
- Secure - mincemeat.py authenticates both ends of every connection, ensuring that only authorized code is executed.
- Open source - mincemeat.py is distributed under the MIT License, and consequently is free for all use, including commercial, personal, and academic, and can be modified and redistributed without restriction.
Download:
- Just mincemeat.py (v 0.1.2)
- The full 0.1.2 release (includes documentation and examples)
- Clone this git repository:
https://github.com/michaelfairley/mincemeatpy.git
Let's look at the canonical MapReduce example, word counting:
example.py:
#!/usr/bin/env python
import mincemeat

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again",
        ]

def mapfn(k, v):
    for w in v.split():
        yield w, 1

def reducefn(k, vs):
    result = 0
    for v in vs:
        result += v
    return result

s = mincemeat.Server()
# The data source can be any dictionary-like object
s.datasource = dict(enumerate(data))
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme")
print results
Execute this script on the server:
python example.py
Run mincemeat.py as a worker on a client:
python mincemeat.py -p changeme [server address]
And the server will print out:
{'a': 2, 'on': 1, 'great': 1, 'Humpty': 3, 'again': 1, 'wall': 1, 'Dumpty': 2, 'men': 1, 'had': 1, 'all': 1, 'together': 1, "King's": 2, 'horses': 1, 'All': 1, "Couldn't": 1, 'fall': 1, 'and': 1, 'the': 2, 'put': 1, 'sat': 1}
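To see where those counts come from, the map, group, and reduce phases can be sketched in a single process, with no server or workers involved. This is only an illustration of the MapReduce flow, not how mincemeat.py distributes work internally; `run_local` is a hypothetical helper that is not part of mincemeat.py, and `reducefn` is shortened to use `sum`.

```python
from collections import defaultdict

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again",
        ]

def mapfn(k, v):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for w in v.split():
        yield w, 1

def reducefn(k, vs):
    # Add up all the 1s emitted for a single word.
    return sum(vs)

def run_local(datasource, mapfn, reducefn):
    """Hypothetical helper (not part of mincemeat.py): map, group, reduce."""
    grouped = defaultdict(list)
    # Map phase: call mapfn on every (key, value) pair in the datasource
    # and group the emitted values by their output key.
    for key in datasource:
        for out_key, out_value in mapfn(key, datasource[key]):
            grouped[out_key].append(out_value)
    # Reduce phase: collapse each group of values into a single result.
    return dict((k, reducefn(k, vs)) for k, vs in grouped.items())

results = run_local(dict(enumerate(data)), mapfn, reducefn)
print(results["Humpty"])  # 3
```

Running `mapfn` and `reducefn` locally like this is also a quick way to check them before starting a server and workers.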
This example is deliberately simple, but the same code works just as well when the datasource is a collection of large files and the client runs on multiple machines. In fact, mincemeat.py has been used to produce word frequency lists for many gigabytes of text using a slightly modified version of this code.
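As a rough sketch of what a file-backed datasource might look like, the class below exposes a directory as a dictionary-like object: keys are file names and values are file contents, read only when requested. `FileDataSource` is a hypothetical name that is not part of mincemeat.py, and the sketch assumes the server only needs to iterate over the keys and look values up by key.

```python
import os
import tempfile

class FileDataSource(object):
    """Hypothetical dictionary-like view of a directory of text files."""

    def __init__(self, directory):
        self.directory = directory
        # Keep only regular files; sort for a predictable key order.
        self.names = sorted(
            name for name in os.listdir(directory)
            if os.path.isfile(os.path.join(directory, name))
        )

    def __iter__(self):
        # Iterating yields the keys (file names), like a dict.
        return iter(self.names)

    def __len__(self):
        return len(self.names)

    def __getitem__(self, name):
        # Read the file lazily, so only one file is in memory per lookup.
        with open(os.path.join(self.directory, name)) as f:
            return f.read()

# Small demonstration using a throwaway directory.
demo_dir = tempfile.mkdtemp()
for name, text in [("a.txt", "Humpty Dumpty"), ("b.txt", "sat on a wall")]:
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write(text)

source = FileDataSource(demo_dir)
print(list(source))     # ['a.txt', 'b.txt']
print(source["b.txt"])  # sat on a wall
```

Under that assumption, the object would take the place of `dict(enumerate(data))` in example.py, and `mapfn` would receive a file name as `k` and the file's text as `v`.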