Learning Functional Programming in Go
Lex Sheehan
MapReduce example
Suppose we have an Apache web server access log file with entries that look like this one:
198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] "GET /example.com/music/js/main.js HTTP/1.1" 200 614 "http://www.example.com/music/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"
What if we are interested in knowing the top 5 most accessed JSON files?
We could perform a MapReduce directly from the terminal using standard Unix string processing commands:
$ cat access10k.log | while read line; do echo "$line" | awk '{print $7}' | grep "\.json";done | sort | uniq -c | sort -nr
234 /example.com/music/data/artist.json
232 /example.com/music/data/songs.json
227 /example.com/music/data/influencers.json
28 /example.com/music-no-links/data/songs.json
28 /example.com/music-no-links/data/influencers.json
28 /example.com/music-no-links/data/artist.json
8 /example.com/music/data/influencers2.json
That works great for a few thousand lines. If we prefix that last command with time, we get something like the following:
real 1m3.932s
user 0m38.125s
sys 0m42.863s
But what if each server has millions of lines of logs and we have a lot of servers?
Time for MapReduce!
On each server we can perform our mapping, taking log file entries as input and producing a set of key-value pairs:
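Here is a minimal sketch of what that map phase could look like in Go (not the book's code; mapLogFile and KV are hypothetical names chosen for illustration). It extracts the request path the same way awk's $7 did in the pipeline above:

```go
package main

import (
	"bufio"
	"os"
	"strings"
)

// KV is one intermediate key-value pair emitted by the map phase.
type KV struct {
	Key   string
	Count int
}

// mapLogFile scans one access log and emits a (path, 1) pair for every
// request whose path contains ".json". In the combined log format, the
// request path is the seventh whitespace-separated field, just like
// awk's $7 in the shell pipeline above.
func mapLogFile(filename string) ([]KV, error) {
	f, err := os.Open(filename)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var pairs []KV
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) > 6 && strings.Contains(fields[6], ".json") {
			pairs = append(pairs, KV{Key: fields[6], Count: 1})
		}
	}
	return pairs, scanner.Err()
}
```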
Next, we take the intermediate results from each server and feed them into our reduce function, which then spits out the results:
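A matching reduce sketch, again with hypothetical names and reusing the KV type from above. It assumes the intermediate pairs from every server have been gathered into one slice, and it needs "sort" added to the imports:

```go
// reduceCounts merges the intermediate pairs from every server, sums
// the counts per key, and returns the n keys with the highest totals.
func reduceCounts(intermediate [][]KV, n int) []KV {
	totals := make(map[string]int)
	for _, pairs := range intermediate {
		for _, kv := range pairs {
			totals[kv.Key] += kv.Count
		}
	}

	results := make([]KV, 0, len(totals))
	for key, count := range totals {
		results = append(results, KV{Key: key, Count: count})
	}
	// Highest counts first; "sort" must be added to the imports.
	sort.Slice(results, func(i, j int) bool {
		return results[i].Count > results[j].Count
	})
	if len(results) > n {
		results = results[:n]
	}
	return results
}
```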
Our top 5 most requested JSON files might look like this:
85733 /example.com/music/data/artist.json
71938 /example.com/music/data/songs.json
57837 /example.com/music/data/influencers.json
17500 /example.com/music-no-links/data/songs.json
17500 /example.com/music-no-links/data/influencers.json
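Putting the two sketches together, a driver on the coordinating node could look like the following. The file names are placeholders, and "fmt" and "log" join the imports:

```go
// A driver tying the two sketches above together.
func main() {
	// Hypothetical per-server log files gathered onto one node.
	files := []string{"server1.log", "server2.log", "server3.log"}

	var intermediate [][]KV
	for _, name := range files {
		pairs, err := mapLogFile(name) // map phase, once per server
		if err != nil {
			log.Fatal(err)
		}
		intermediate = append(intermediate, pairs)
	}

	// Reduce phase: merge the per-server pairs and print the top 5.
	for _, kv := range reduceCounts(intermediate, 5) {
		fmt.Printf("%6d %s\n", kv.Count, kv.Key)
	}
}
```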
What can we glean from this example? Good candidates for MapReduce appear to include use cases where:
- We have so much data that running it all sequentially on one server would take too long
- Our output from the map phase consists of a list of key-value pairs
- We can run each map or reduce function in isolation, knowing that the output of our function relies only on its input
But what else is going on here that might not be readily apparent?
What else makes this process of Map/Reduce work?
What FP patterns are lurking in the shadows? (Hint: We've already seen it, and it has to do with data types.)