Skip to main content

Command Palette

Search for a command to run...

MapReduce For Join Operation

In Guidance of Professor Prakash Parmar - VIT, Mumbai.

Updated
5 min read
V

I am a technical writer, developer and building a Web3 community.

MapReduce Join operation is useful when combining two large datasets. It involves writing lots of code to perform the actual join operation. Joining two datasets begins by comparing the size of each dataset. If one dataset is smaller as compared to the other dataset then smaller dataset is distributed to every data node in the cluster.

Once a join in MapReduce is distributed, either Mapper or Reducer uses the smaller dataset to perform a lookup for matching records from the large dataset and then combine those records to form output records.

Importance of MapReduce

The advancement of many technological trends, such as smart devices, Internet of Things (IoT), cloud computing services, web-based services, and social networks, have contributed to the massive amount of data being generated every day at unprecedented rate. Such technologies have led to the emergence of the era of Big Data, the era of processing Gigabytes, Terabytes, or even Petabytes of data.

MapReduce is an efficient programming model for processing massive datasets. The simplicity of MapReduce framework lies in processing large-scale data as simple units of work, which are map and reduce functions. To create a MapReduce job, the programmer needs only to define the logic within map and reduce functions.

Process of MapReduce

a) Splitting and Starting Tasks

  • MapReduce divides input data into M fixed-size blocks called input splits.

  • Each split is replicated 3 times across the cluster for reliability.

  • These splits are then sent to different nodes in the cluster.

  • The MapReduce framework starts multiple copies of the program on these nodes.

b) Assigning Tasks

  • The master node assigns work to idle worker nodes — either a map task or a reduce task.

  • Map tasks must finish first before any reduce task starts.

  • The output of map tasks is used as input for reduce tasks.

c) Map Task Execution

  • Each map worker processes its input split record by record.

  • It reads each record as a key/value pair, applies a user-defined map function, and produces intermediate key/value pairs.

  • The output is first stored in the worker’s memory buffer.

d) Storing and Partitioning Map Output

  • The buffered output is regularly written to the worker’s local disk.

  • Data is divided into R partitions using a partitioning function.

  • The locations of these partitions are sent to the master node, which then informs reduce workers.

e) Data Transfer and Sorting (Reduce Phase Start)

  • Each reduce worker fetches its data partition from map workers’ disks using remote calls.

  • Once data is copied, the reduce worker sorts it by key to group same-key values together.

  • If data is too large for memory, an external sort (disk-based sort) is used.

f) Reduce Task Execution

  • The reduce worker processes each group of keys using a user-defined reduce function.

  • It combines all values of a key into a reduced result.

  • The final reduced key/value pairs are written to the final output file.

Types of Join

Depending upon the place where the actual join is performed, joins in Hadoop are classified into-

1. Map-side join – When the join is performed by the mapper, it is called as map-side join. In this type, the join is performed before data is actually consumed by the map function. It is mandatory that the input to each map is in the form of a partition and is in sorted order. Also, there must be an equal number of partitions and it must be sorted by the join key.

2. Reduce-side join – When the join is performed by the reducer, it is called as reduce-side join. There is no necessity in this join to have a dataset in a structured form (or partitioned).

Here, map side processing emits join key and corresponding tuples of both the tables. As an effect of this processing, all the tuples with same join key fall into the same reducer which then joins the records with same join key.

How to Join two DataSets: MapReduce Example

There are two Sets of Data in two Different Files (shown below). The Key Dept_ID is common in both files. The goal is to use MapReduce Join to combine these files.

File 1

File 1

MapReduce Example

File 2

Input: The input data set is a txt file, DeptName.txt & DepStrength.txt

Ensure you have Hadoop installed. Before you start with the MapReduce Join example actual process, change user to ‘hduser’ (id used while Hadoop configuration, you can switch to the userid used during your Hadoop config ).

su - hduser_

Step 1) Copy the zip file to the location of your choice

Step 2) Uncompress the Zip File

sudo tar -xvf MapReduceJoin.tar.gz

Step 3) Go to directory MapReduceJoin/

cd MapReduceJoin/

Step 4) Start Hadoop

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

Step 5) DeptStrength.txt and DeptName.txt are the input files used for this MapReduce Join example program.

These file needs to be copied to HDFS using below command -

$HADOOP_HOME/bin/hdfs dfs -copyFromLocal DeptStrength.txt DeptName.txt /

Step 6) Run the program using below command -

$HADOOP_HOME/bin/hadoop jar MapReduceJoin.jar MapReduceJoin/JoinDriver/DeptStrength.txt /DeptName.txt /output_mapreducejoin

Step 7) After execution, output file (named ‘part-00000’) will stored in the directory /output_mapreducejoin on HDFS

Results can be seen using the command line interface

$HADOOP_HOME/bin/hdfs dfs -cat /output_mapreducejoin/part-00000

Now select ‘Browse the filesystem’ and navigate upto /output_mapreducejoin

Open part-r-00000

Results are shown.

In Guidance of Professor Prakash Parmar - VIT, Mumbai.