Learning Functional Programming in Go
Lex Sheehan
MapReduce example
Suppose we have an Apache web server access log file with entries that look like this one:
198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] "GET /example.com/music/js/main.js HTTP/1.1" 200 614 "http://www.example.com/music/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"
What if we are interested in knowing the top 5 most accessed JSON files?
We could perform a MapReduce directly from the terminal using standard Unix string processing commands:
$ cat access10k.log | while read line; do echo "$line" | awk '{print $7}' | grep "\.json";done | sort | uniq -c | sort -nr
234 /example.com/music/data/artist.json
232 /example.com/music/data/songs.json
227 /example.com/music/data/influencers.json
28 /example.com/music-no-links/data/songs.json
28 /example.com/music-no-links/data/influencers.json
28 /example.com/music-no-links/data/artist.json
8 /example.com/music/data/influencers2.json
That works great for a few thousand lines. If we prefix that last command with time, we get something like the following:
real 1m3.932s
user 0m38.125s
sys 0m42.863s
But what if each server has millions of lines of logs and we have a lot of servers?
Time for MapReduce!
On each server we can perform our mapping, taking log file entries as input and producing a set of key-value pairs:
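Here is a minimal sketch of what that map phase could look like in Go (not the book's code; mapLogFile and KV are hypothetical names chosen for illustration). It extracts the request path the same way awk's $7 did in the pipeline above:

```go
package main

import (
	"bufio"
	"os"
	"strings"
)

// KV is one intermediate key-value pair emitted by the map phase.
type KV struct {
	Key   string
	Count int
}

// mapLogFile scans one access log and emits a (path, 1) pair for every
// request whose path contains ".json". In the combined log format, the
// request path is the seventh whitespace-separated field, just like
// awk's $7 in the shell pipeline above.
func mapLogFile(filename string) ([]KV, error) {
	f, err := os.Open(filename)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var pairs []KV
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) > 6 && strings.Contains(fields[6], ".json") {
			pairs = append(pairs, KV{Key: fields[6], Count: 1})
		}
	}
	return pairs, scanner.Err()
}
```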
Next, we take the intermediate results from each server and feed them into our reduce function, which then spits out the results:
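A matching reduce sketch, again with hypothetical names and reusing the KV type from above. It assumes the intermediate pairs from every server have been gathered into one slice, and it needs "sort" added to the imports:

```go
// reduceCounts merges the intermediate pairs from every server, sums
// the counts per key, and returns the n keys with the highest totals.
func reduceCounts(intermediate [][]KV, n int) []KV {
	totals := make(map[string]int)
	for _, pairs := range intermediate {
		for _, kv := range pairs {
			totals[kv.Key] += kv.Count
		}
	}

	results := make([]KV, 0, len(totals))
	for key, count := range totals {
		results = append(results, KV{Key: key, Count: count})
	}
	// Highest counts first; "sort" must be added to the imports.
	sort.Slice(results, func(i, j int) bool {
		return results[i].Count > results[j].Count
	})
	if len(results) > n {
		results = results[:n]
	}
	return results
}
```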
Our top 5 most requested JSON files might look like this:
85733 /example.com/music/data/artist.json
71938 /example.com/music/data/songs.json
57837 /example.com/music/data/influencers.json
17500 /example.com/music-no-links/data/songs.json
17500 /example.com/music-no-links/data/influencers.json
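Putting the two sketches together, a driver on the coordinating node could look like the following. The file names are placeholders, and "fmt" and "log" join the imports:

```go
// A driver tying the two sketches above together.
func main() {
	// Hypothetical per-server log files gathered onto one node.
	files := []string{"server1.log", "server2.log", "server3.log"}

	var intermediate [][]KV
	for _, name := range files {
		pairs, err := mapLogFile(name) // map phase, once per server
		if err != nil {
			log.Fatal(err)
		}
		intermediate = append(intermediate, pairs)
	}

	// Reduce phase: merge the per-server pairs and print the top 5.
	for _, kv := range reduceCounts(intermediate, 5) {
		fmt.Printf("%6d %s\n", kv.Count, kv.Key)
	}
}
```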
What can we glean from this example? Good candidates for MapReduce appear to include use cases where:
- We have so much data that running it all sequentially on one server would take too long
- Our output from the map phase consists of a list of key-value pairs
- We can run each map or reduce function in isolation, knowing that the output of our function relies only on its input
But what else is going on here that might not be readily apparent?
What else makes this process of Map/Reduce work?
What FP patterns are lurking in the shadows? (Hint: We've already seen it, and it has to do with data types.)