{"id":1377,"date":"2015-01-19T19:10:03","date_gmt":"2015-01-19T19:10:03","guid":{"rendered":"http:\/\/www.phpmind.com\/blog\/?p=1377"},"modified":"2015-01-19T19:12:05","modified_gmt":"2015-01-19T19:12:05","slug":"what-is-mapreduce","status":"publish","type":"post","link":"https:\/\/www.phpmind.com\/blog\/2015\/01\/what-is-mapreduce\/","title":{"rendered":"What is MapReduce?"},"content":{"rendered":"<p><iframe loading=\"lazy\" width=\"560\" height=\"315\" src=\"\/\/www.youtube.com\/embed\/8wjvMyc01QY\" frameborder=\"0\" allowfullscreen><\/iframe><br \/>\nAbout MapReduce<br \/>\nMapReduce is the heart of Hadoop\u00ae. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions.<\/p>\n<p>For people new to this topic, it can be somewhat difficult to grasp, because it\u2019s not typically something people have been exposed to previously. If you\u2019re new to Hadoop\u2019s MapReduce jobs, don\u2019t worry: we\u2019re going to describe it in a way that gets you up to speed quickly.<br \/>\nThe term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key\/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.<\/p>\n<p>An example of MapReduce<br \/>\nLet\u2019s look at a simple example. Assume you have five files, and each file contains two columns (a key and a value in Hadoop terms) that represent a city and the corresponding temperature recorded in that city for the various measurement days. 
Of course, we\u2019ve made this example very simple so it\u2019s easy to follow. You can imagine that a real application won\u2019t be quite so simple, as it\u2019s likely to contain millions or even billions of rows, and they might not be neatly formatted rows at all. Still, no matter how big or small the amount of data you need to analyze, the key principles we\u2019re covering here remain the same. Either way, in this example, city is the key and temperature is the value.<br \/>\nToronto, 20<br \/>\nWhitby, 25<br \/>\nNew York, 22<br \/>\nRome, 32<br \/>\nToronto, 4<br \/>\nRome, 33<br \/>\nNew York, 18<br \/>\nOut of all the data we have collected, we want to find the maximum temperature for each city across all of the data files (note that each file might have the same city represented multiple times). Using the MapReduce framework, we can break this down into five map tasks, where each mapper works on one of the five files, goes through its data, and returns the maximum temperature for each city in that file. For example, the results produced from one mapper task for the data above would look like this:<br \/>\n(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)<br \/>\nLet\u2019s assume the other four mapper tasks (working on the other four files not shown here) produced the following intermediate results:<br \/>\n(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)<br \/>\n(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)<br \/>\n(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)<br \/>\n(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)<br \/>\nAll five of these output streams would be fed into the reduce tasks, which combine the input results and output a single value (the overall maximum) for each city, producing a final result set as follows:<br \/>\n(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)<br \/>\nAs an analogy, you can think of map and reduce tasks as the way a census was conducted in Roman times, where the census bureau would dispatch its people to each city in the empire. 
Each census taker in each city would be tasked to count the number of people in that city and then return their results to the capital city. There, the results from each city would be reduced to a single count (sum of all cities) to determine the overall population of the empire. This mapping of people to cities, in parallel, and then combining the results (reducing) is much more efficient than sending a single person to count every person in the empire in a serial fashion.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>About MapReduce MapReduce is the heart of Hadoop\u00ae. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions. For people new to this topic, it can [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[49],"tags":[],"class_list":["post-1377","post","type-post","status-publish","format-standard","hentry","category-mapreduce"],"_links":{"self":[{"href":"https:\/\/www.phpmind.com\/blog\/wp-json\/wp\/v2\/posts\/1377","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.phpmind.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.phpmind.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.phpmind.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.phpmind.com\/blog\/wp-json\/wp\/v2\/comments?post=1377"}],"version-history":[{"count":2,"href":"https:\/\/www.phpmind.com\/blog\/wp-json\/wp\/v2\/posts\/1377\/revisions"}],"predecessor-version":[{"id":1379,
"href":"https:\/\/www.phpmind.com\/blog\/wp-json\/wp\/v2\/posts\/1377\/revisions\/1379"}],"wp:attachment":[{"href":"https:\/\/www.phpmind.com\/blog\/wp-json\/wp\/v2\/media?parent=1377"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.phpmind.com\/blog\/wp-json\/wp\/v2\/categories?post=1377"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.phpmind.com\/blog\/wp-json\/wp\/v2\/tags?post=1377"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}