MapReduce


MapReduce – Introduction

MapReduce is a programming model for writing applications that can process Big Data in parallel on multiple nodes. MapReduce provides analytical capabilities for analyzing huge volumes of complex data.

What is Big Data?

Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. For example, the volume of data Facebook or YouTube is required to collect and manage on a daily basis can fall under the category of Big Data. However, Big Data is not only about scale and volume, it also involves one or more of the following aspects − Velocity, Variety, Volume, and Complexity.

Why MapReduce?

Traditional Enterprise Systems normally have a centralized server to store and process data. The following illustration depicts a schematic view of a traditional enterprise system. A traditional model is certainly not suitable to process huge volumes of scalable data and cannot be accommodated by standard database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.

Traditional Enterprise System View

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected at one place and integrated to form the result dataset.

Centralized System

How MapReduce Works?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

  • The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).

  • The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples.

The Reduce task is always performed after the Map task.

Let us now take a close look at each of the phases and try to understand their significance.

Phases

  • Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.

  • Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.

  • Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.

  • Combiner − A combiner is a kind of local Reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-defined code to aggregate the values in the small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional.

  • Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.

  • Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it requires a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.

  • Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.

Let us try to understand the two tasks Map and Reduce with the help of a small diagram −

MapReduce Work

MapReduce – Example

Let us consider a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is nearly 3000 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.

MapReduce Example

As shown in the illustration, the MapReduce algorithm performs the following actions −

  • Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.

  • Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.

  • Count − Generates a token counter per word.

  • Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.

MapReduce – Algorithm

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

  • The map task is done by means of the Mapper Class
  • The reduce task is done by means of the Reducer Class.

The Mapper class takes the input, tokenizes it, maps and sorts it. The output of the Mapper class is used as input by the Reducer class, which in turn searches matching pairs and reduces them.

Mapper Reducer Class

MapReduce implements various mathematical algorithms to divide a task into small parts and assign them to multiple systems. In technical terms, the MapReduce algorithm helps in sending the Map and Reduce tasks to appropriate servers in a cluster.

These mathematical algorithms may include the following −

  • Sorting
  • Searching
  • Indexing
  • TF-IDF

Sorting

Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce implements a sorting algorithm to automatically sort the output key-value pairs from the mapper by their keys.

  • Sorting methods are implemented in the mapper class itself.

  • In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the Context class (a user-defined class) collects the matching valued keys as a collection.

  • To collect similar key-value pairs (intermediate keys), the Mapper class takes the help of the RawComparator class to sort the key-value pairs.

  • The set of intermediate key-value pairs for a given Reducer is automatically sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are presented to the Reducer; a minimal key-class sketch follows this list.
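To make the role of key sorting concrete, here is a minimal sketch of a custom key type. The class name YearKey and its single field are hypothetical; the only requirement Hadoop imposes is that the key implements WritableComparable, so the framework can serialize it and sort the intermediate pairs by key during Shuffle and Sort.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical key class: Hadoop sorts intermediate keys with the key
// type's compareTo(), so implementing WritableComparable is enough for
// the key to take part in the Shuffle and Sort phase.
public class YearKey implements WritableComparable<YearKey> {
   private int year;

   public YearKey() {}
   public YearKey(int year) { this.year = year; }

   @Override
   public void write(DataOutput out) throws IOException {
      out.writeInt(year);                          // serialize the key
   }

   @Override
   public void readFields(DataInput in) throws IOException {
      year = in.readInt();                         // deserialize the key
   }

   @Override
   public int compareTo(YearKey other) {
      return Integer.compare(year, other.year);    // sort order used by Hadoop
   }

   @Override
   public int hashCode() {
      return year;                                 // used by the default partitioner
   }
}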

Searching

Searching plays an important role in the MapReduce algorithm. It helps in the combiner phase (optional) and in the Reducer phase. Let us try to understand how Searching works with the help of an example.

Example

The following example shows how MapReduce employs the Searching algorithm to find out the details of the employee who draws the highest salary in a given employee dataset.

  • Let us assume we have employee data in four different files − A, B, C, and D. Let us also assume there are duplicate employee records in all four files because of importing the employee data from all database tables repeatedly. See the following illustration.

Map Reduce Illustration

  • The Map phase processes each input file and provides the employee data in key-value pairs (<k, v> : <emp name, salary>). See the following illustration.

Map Reduce Illustration

  • The combiner phase (searching technique) will accept the input from the Map phase as a key-value pair with employee name and salary. Using the searching technique, the combiner will check all the employee salaries to find the highest salaried employee in each file. See the following snippet.

<k: employee name, v: salary>
Max = the salary of the first employee. Treated as max salary.

if(v(second employee).salary > Max){
   Max = v(salary);
}

else{
   Continue checking;
}

The expected result is as follows −

<satish, 26000>

<gopal, 50000>

<kiran, 45000>

<manisha, 45000>

  • Reducer phase − From each file, you will find the highest salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used between the four <k, v> pairs, which are coming from the four input files. The final output should be as follows −

<gopal, 50000>
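As a rough illustration of this search in Hadoop Java code, the following sketch assumes the mapper already emits <Text employee name, IntWritable salary> pairs; the class name and wiring are illustrative, not part of the original example. It keeps the highest salary seen per key, so used as a combiner it yields the per-file maximum, and a reducer pass (or a single shared key) is still needed to pick the overall maximum.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Usable as both combiner and reducer: keeps only the highest salary
// observed for each key, mirroring the pseudocode above.
public class MaxSalaryReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
   @Override
   protected void reduce(Text name, Iterable<IntWritable> salaries, Context context)
         throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable salary : salaries) {
         max = Math.max(max, salary.get());        // search for the highest value
      }
      context.write(name, new IntWritable(max));
   }
}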

Indexing

Normally indexing is used to point to a particular data and its address. It performs batch indexing on the input files for a particular Mapper.

The indexing technique that is normally used in MapReduce is known as inverted index. Search engines like Google and Bing use the inverted indexing technique. Let us try to understand how Indexing works with the help of a simple example.

Example

The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names and their content is in double quotes.

T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"

After applying the Indexing algorithm, we get the following output −

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2} implies the term "is" appears in the files T[0], T[1], and T[2].
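The following is a minimal sketch of how such an inverted index could be built with Hadoop's Java API, assuming each mapper can take the name of the file it is reading from its input split; the class names are illustrative, and the output lists file names rather than the T[0], T[1], T[2] indices used above.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {
   // Emits <word, filename> for every word in the line.
   public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
         String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
         StringTokenizer itr = new StringTokenizer(value.toString());
         while (itr.hasMoreTokens()) {
            context.write(new Text(itr.nextToken()), new Text(fileName));
         }
      }
   }

   // Collects the distinct file names per word into one posting list.
   public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text word, Iterable<Text> files, Context context)
            throws IOException, InterruptedException {
         Set<String> postings = new HashSet<String>();
         for (Text file : files) {
            postings.add(file.toString());
         }
         context.write(word, new Text(postings.toString()));
      }
   }
}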

TF-IDF

TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse Document Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency' refers to the number of times a term appears in a document.

Term Frequency (TF)

It measures how frequently a particular term occurs in a document. It is calculated by dividing the number of times a word appears in a document by the total number of words in that document.

TF(the) = (Number of times the term ‘the’ appears in a document) / (Total number of terms in the document)

Inverse Document Frequency (IDF)

It measures the importance of a term. It is calculated by dividing the number of documents in the text database by the number of documents where a specific term appears.

While computing TF, all the terms are considered equally important. That means, TF counts the term frequency for common words like “is”, “a”, “what”, etc. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following −

IDF(the) = log_e(Total number of documents / Number of documents with term ‘the’ in it).

The algorithm is explained below with the help of a small example.

Example

Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF for hive is then (50 / 1000) = 0.05.

Now, assume we have 10 million documents and the word hive appears in 1000 of these. Then, the IDF is calculated as log(10,000,000 / 1,000) = 4.

The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
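The arithmetic above can be checked with a tiny stand-alone snippet. Note that the worked example implicitly uses a base-10 logarithm (log(10,000,000 / 1,000) = 4); the class and method names here are illustrative and not part of any Hadoop API.

public class TfIdfExample {
   static double tf(int termCount, int totalTerms) {
      return (double) termCount / totalTerms;
   }

   static double idf(long totalDocs, long docsWithTerm) {
      return Math.log10((double) totalDocs / docsWithTerm);   // base-10 log, as in the example
   }

   public static void main(String[] args) {
      double tf = tf(50, 1000);                  // 0.05
      double idf = idf(10000000L, 1000L);        // 4.0
      System.out.println("TF-IDF(hive) = " + (tf * idf));     // 0.2
   }
}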

MapReduce – Installation

MapReduce works only on Linux flavored operating systems and it comes inbuilt with the Hadoop Framework. We need to perform the following steps in order to install the Hadoop framework.

Verifying JAVA Installation

Java must be installed on your system before installing Hadoop. Use the following command to check whether you have Java installed on your system.

$ java -version

If Java is already installed on your system, you get to see the following response −

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

In case you don’t have Java installed on your system, then follow the steps given below.

Installing Java

Step 1

Download the latest version of Java from the following link −
this link.

After downloading, you can locate the file jdk-7u71-linux-x64.tar.gz in your Downloads folder.

Step 2

Use the following commands to extract the contents of jdk-7u71-linux-x64.gz.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz

Step 3

To make Java available to all the users, you have to move it to the location “/usr/local/”. Go to root and type the following commands −

$ su
password:
# mv jdk1.7.0_71 /usr/local/java
# exit

Step 4

For setting up the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.

export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin

Apply all the changes to the current running system.

$ source ~/.bashrc

Step 5

Use the following commands to configure Java alternatives −

# alternatives --install /usr/bin/java java /usr/local/java/bin/java 2

# alternatives --install /usr/bin/javac javac /usr/local/java/bin/javac 2

# alternatives --install /usr/bin/jar jar /usr/local/java/bin/jar 2

# alternatives --set java /usr/local/java/bin/java

# alternatives --set javac /usr/local/java/bin/javac

# alternatives --set jar /usr/local/java/bin/jar

Now verify the installation uperform the command java -version from the terminal.

Verifying Hadoop Installation

Hadoop must be installed on your system before installing MapReduce. Let us verify the Hadoop installation using the following command −

$ hadoop version

If Hadoop is already installed on your system, then you will get the following response −

Hadoop 2.4.1
--
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

If Hadoop is not installed on your system, then proceed with the following steps.

Downloading Hadoop

Download Hadoop 2.4.1 from the Apache Software Foundation and extract its contents using the following commands.

$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/
hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1/* hadoop/
# exit

Installing Hadoop in Pseudo Distributed mode

The following steps are used to install Hadoop 2.4.1 in pseudo distributed mode.

Step 1 − Setting up Hadoop

You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Apply all the changes to the current running system.

$ source ~/.bashrc

Step 2 − Hadoop Configuration

You can find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. You need to make suitable changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs using Java, you have to reset the Java environment variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

export JAVA_HOME=/usr/local/java

You have to edit the following files to configure Hadoop −

  • core-site.xml
  • hdfs-site.xml
  • yarn-site.xml
  • mapred-site.xml

core-site.xml

core-site.xml contains the following information −

  • Port number used for Hadoop instance
  • Memory allocated for the file system
  • Memory limit for storing the data
  • Size of Read/Write buffers

Open the core-site.xml and add the following properties in between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

hdfs-site.xml

hdfs-site.xml contains the following information −

  • Value of replication data
  • The namenode path
  • The datanode path of your local file systems (the place where you want to store the Hadoop infra)

Let us assume the following data.

dfs.replication (data replication value) = 1

(In the following path /hadoop/ is the user name.
hadoopinfra/hdfs/namenode is the directory created by the hdfs file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the hdfs file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties in between the <configuration>, </configuration> tags.

<configuration>

   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
   
</configuration>

Note − In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.

yarn-site.xml

This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the following properties in between the <configuration>, </configuration> tags.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

mapred-site.xml

This file is used to specify the MapReduce framework we are using. By default, Hadoop contains a template named mapred-site.xml.template. First of all, you need to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Open the mapred-site.xml file and add the following properties in between the <configuration>, </configuration> tags.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Verifying Hadoop Installation

The following steps are used to verify the Hadoop installation.

Step 1 − Name Node Setup

Set up the namenode using the command “hdfs namenode -format” as follows −

$ cd ~
$ hdfs namenode -format

The expected result is as follows −

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to
retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:

/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Step 2 − Verifying Hadoop dfs

Execute the following command to start your Hadoop file system.

$ start-dfs.sh

The expected output is as follows −

10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-
2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-
2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Step 3 − Verifying Yarn Script

The following command is used to start the yarn script. Executing this command will start your yarn daemons.

$ start-yarn.sh

The expected output is as follows −

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-
2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting node manager, logging to /home/hadoop/hadoop-
2.4.1/logs/yarn-hadoop-nodemanager-localhost.out

Step 4 − Accessing Hadoop on Browser

The default port number to access Hadoop is 50070. Use the following URL to get Hadoop services on your browser.

http://localhost:50070/

The following screenshot shows the Hadoop browser.

Hadoop Browser

Step 5 − Verify all Applications of a Cluster

The default port number to access all the applications of a cluster is 8088. Use the following URL to use this service.

http://localhost:8088/

The following screenshot shows a Hadoop cluster browser.

Hadoop Cluster Browser

MapReduce – API

In this chapter, we will take a close look at the classes and their methods that are involved in the operations of MapReduce programming. We will primarily keep our focus on the following −

  • JobContext Interface
  • Job Class
  • Mapper Class
  • Reducer Class

JobContext Interface

The JobContext interface is the super interface for all the classes, which defines different jobs in MapReduce. It gives you a read-only view of the job that is provided to the tasks while they are running.

The following are the sub-interfaces of the JobContext interface.

S.No. Subinterface Description
1. MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

Defines the context that is given to the Mapper.

2. ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

Defines the context that is passed to the Reducer.

The Job class is the main class that implements the JobContext interface.

Job Class

The Job class is the most important class in the MapReduce API. It allows the user to configure the job, submit it, control its execution, and query the state. The set methods only work until the job is submitted; afterwards they will throw an IllegalStateException.

Normally, the user creates the application, describes the various facets of the job, and then submits the job and monitors its progress.

Here is an example of how to submit a job −

// Create a new Job
Job job = new Job(new Configuration());
job.setJarByClass(MyJob.class);

// Specify various job-specific parameters
job.setJobName("myjob");
job.setInputPath(new Path("in"));
job.setOutputPath(new Path("out"));

job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);

// Submit the job, then poll for progress until the job is complete
job.waitForCompletion(true);

Constructors

Following is the constructor summary of the Job class.

S.No Constructor Summary
1 Job()
2 Job(Configuration conf)
3 Job(Configuration conf, String jobName)

Methods

Some of the important methods of the Job class are as follows −

S.No Method Description
1 getJobName()

User-specified job name.

2 getJobState()

Returns the current state of the Job.

3 isComplete()

Checks if the job is finished or not.

4 setInputFormatClass()

Sets the InputFormat for the job.

5 setJobName(String name)

Sets the user-specified job name.

6 setOutputFormatClass()

Sets the OutputFormat for the job.

7 setMapperClass(Class)

Sets the Mapper for the job.

8 setReducerClass(Class)

Sets the Reducer for the job.

9 setPartitionerClass(Class)

Sets the Partitioner for the job.

10 setCombinerClass(Class)

Sets the Combiner for the job.

Mapper Class

The Mapper class defines the Map job. It maps input key-value pairs to a set of intermediate key-value pairs. Maps are the individual tasks that transform the input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.

Method

map is the most prominent method of the Mapper class. The syntax is defined below −

map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context)

This method is called once for each key-value pair in the input split.
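A minimal sketch of a Mapper subclass is given below, assuming text input read line by line; the class name and the word-splitting logic are illustrative. The framework calls map() once for each key-value pair of the input split.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <word, 1> for every word in the input line.
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
   private static final IntWritable ONE = new IntWritable(1);

   @Override
   protected void map(LongWritable offset, Text line, Context context)
         throws IOException, InterruptedException {
      for (String word : line.toString().split("\\s+")) {
         if (!word.isEmpty()) {
            context.write(new Text(word), ONE);
         }
      }
   }
}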

Reducer Class

The Reducer class defines the Reduce job in MapReduce. It reduces a set of intermediate values that share a key to a smaller set of values. Reducer implementations can access the Configuration for a job via the JobContext.getConfiguration() method. A Reducer has three primary phases − Shuffle, Sort, and Reduce.

  • Shuffle − The Reducer copies the sorted output from each Mapper using HTTP across the network.

  • Sort − The framework merge-sorts the Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e., while outputs are being fetched, they are merged.

  • Reduce − In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.

Method

reduce is the most prominent method of the Reducer class. The syntax is defined below −

reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)

This method is called once for each key on the collection of key-value pairs.
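A matching Reducer sketch is shown below; it sums the counts emitted by the mapper sketch above. The class name is illustrative; reduce() is invoked once per distinct key with all of that key's values.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all counts for one key and writes <word, total>.
public class WordSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
   @Override
   protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
         throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
         sum += count.get();
      }
      context.write(word, new IntWritable(sum));
   }
}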

MapReduce – Hadoop Implementation

MapReduce is a framework that is used for writing applications to process huge volumes of data on large clusters of commodity hardware in a reliable manner. This chapter takes you through the operation of MapReduce in the Hadoop framework using Java.

MapReduce Algorithm

Generally the MapReduce paradigm is based on sending map-reduce programs to computers where the actual data resides.

  • During a MapReduce job, Hadoop sends the Map and Reduce tasks to appropriate servers in the cluster.

  • The framework manages all the details of data-passing like issuing tasks, verifying task completion, and copying data around the cluster between the nodes.

  • Most of the computing takes place on the nodes with data on local disks, which reduces the network traffic.

  • After completing a given task, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.

MapReduce Algorithm

Inputs and Outputs (Java Perspective)

The MapReduce framework operates on key-value pairs, that is, the framework views the input to the job as a set of key-value pairs and produces a set of key-value pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence, they are required to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

Both the input and output formats of a MapReduce job are in the form of key-value pairs −

(Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).

Input Output
Map <k1, v1> list (<k2, v2>)
Reduce <k2, list(v2)> list (<k3, v3>)
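As a hedged sketch of how these types line up in the Java generics of the new API, the concrete choices below (LongWritable, Text, IntWritable) are only one valid example and are not mandated by the framework.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map:    <k1, v1>       -> list(<k2, v2>)
// Reduce: <k2, list(v2)> -> list(<k3, v3>)
public class TypeSignatures {
   // k1 = LongWritable (byte offset), v1 = Text (line),
   // k2 = Text (word),                v2 = IntWritable (count)
   static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

   // k2/v2 must match the mapper's output types; k3/v3 are the job's output types.
   static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }
}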

MapReduce Implementation

The following table shows the data regarding the electrical consumption of an organization. The table includes the monthly electrical consumption and the annual average for five consecutive years.

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45

We need to write applications to process the input data in the given table to find the year of maximum usage, the year of minimum usage, and so on. This task is easy for programmers with a finite number of records, as they will simply write the logic to produce the required output, and pass the data to the written application.

Let us now raise the scale of the input data. Assume we have to analyze the electrical consumption of all the large-scale industries of a particular state. When we write applications to process such bulk data,

  • They will take a lot of time to execute.

  • There will be heavy network traffic when we move data from the source to the network server.

To solve these issues, we have the MapReduce framework.

Input Data

The above data is saved as sample.txt and given as input. The input file looks as shown below.

1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45

Example Program

The following program for the sample data uses the MapReduce framework.

package hadoop;

import java.util.*;
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ProcessUnits
{
   //Mapper class
   public static class E_EMapper extends MapReduceBase implements
   Mapper<LongWritable,    /*Input key Type */
   Text,                   /*Input value Type*/
   Text,                   /*Output key Type*/
   IntWritable>            /*Output value Type*/
   {
      //Map function
      public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
         String line = value.toString();
         String lasttoken = null;
         StringTokenizer s = new StringTokenizer(line, "\t");
         String year = s.nextToken();
         
         while(s.hasMoreTokens()){
            lasttoken = s.nextToken();
         }
         
         int avgprice = Integer.parseInt(lasttoken);
         output.collect(new Text(year), new IntWritable(avgprice));
      }
   }
   
   //Reducer class
	
   public static class E_EReduce extends MapReduceBase implements
   Reducer< Text, IntWritable, Text, IntWritable >
   {
      //Reduce function
      public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
         int maxavg = 30;
         int val = Integer.MIN_VALUE;
         while (values.hasNext())
         {
            if((val = values.next().get()) > maxavg)
            {
               output.collect(key, new IntWritable(val));
            }
         }
      }
   }
	
   //Main function
	
   public static void main(String args[]) throws Exception
   {
      JobConf conf = new JobConf(ProcessUnits.class);
		
      conf.setJobName("max_eletricityunits");
		
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
		
      conf.setMapperClass(E_EMapper.class);
      conf.setCombinerClass(E_EReduce.class);
      conf.setReducerClass(E_EReduce.class);
		
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);
		
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
		
      JobClient.runJob(conf);
   }
}

Save the above program into ProcessUnits.java. The compilation and execution of the program is given below.

Compilation and Execution of the ProcessUnits Program

Let us assume we are in the home directory of the Hadoop user (e.g. /home/hadoop).

Follow the steps given below to compile and execute the above program.

Step 1 − Use the following command to create a directory to store the compiled java classes.

$ mkdir units

Step 2 − Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. Download the jar from mvnrepository.com. Let us assume the download folder is /home/hadoop/.

Step 3 − The following commands are used to compile the ProcessUnits.java program and to create a jar for the program.

$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .

Step 4 − The following command is used to create an input directory in HDFS.

$HADOOP_HOME/bin/hadoop fs -mkdir input_dir

Step 5 − The following command is used to copy the input file named sample.txt in the input directory of HDFS.

$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir

Step 6 − The following command is used to verify the files in the input directory.

$HADOOP_HOME/bin/hadoop fs -ls input_dir/

Step 7 − The following command is used to run the Eleunit_max application by taking input files from the input directory.

$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir

Wait for a while till the file gets executed. After execution, the output contains a number of input splits, Map tasks, Reducer tasks, etc.

INFO mapreduce.Job: Job job_1414748220717_0002
completed successfully
14/10/31 06:02:52
INFO mapreduce.Job: Counters: 49

File System Counters
   
   FILE: Number of bytes read=61
   FILE: Number of bytes written=279400
   FILE: Number of read operations=0
   FILE: Number of large read operations=0
   FILE: Number of write operations=0

   HDFS: Number of bytes read=546
   HDFS: Number of bytes written=40
   HDFS: Number of read operations=9
   HDFS: Number of large read operations=0
   HDFS: Number of write operations=2

Job Counters
   
   Launched map tasks=2
   Launched reduce tasks=1
   Data-local map tasks=2
	
   Total time spent by all maps in occupied slots (ms)=146137
   Total time spent by all reduces in occupied slots (ms)=441
   Total time spent by all map tasks (ms)=14613
   Total time spent by all reduce tasks (ms)=44120
	
   Total vcore-seconds taken by all map tasks=146137
   Total vcore-seconds taken by all reduce tasks=44120
	
   Total megabyte-seconds taken by all map tasks=149644288
   Total megabyte-seconds taken by all reduce tasks=45178880

Map-Reduce Framework
   
   Map input records=5
	
   Map output records=5
   Map output bytes=45
   Map output materialized bytes=67
	
   Input split bytes=208
   Combine input records=5
   Combine output records=5
	
   Reduce input groups=5
   Reduce shuffle bytes=6
   Reduce input records=5
   Reduce output records=5
	
   Spilled Records=10
   Shuffled Maps =2
   Failed Shuffles=0
   Merged Map outputs=2
	
   GC time elapsed (ms)=948
   CPU time spent (ms)=5160
	
   Physical memory (bytes) snapshot=47749120
   Virtual memory (bytes) snapshot=2899349504
	
   Total committed heap usage (bytes)=277684224

File Output Format Counters

   Bytes Written=40

Step 8 − The following command is used to verify the resultant files in the output folder.

$HADOOP_HOME/bin/hadoop fs -ls output_dir/

Step 9 − The following command is used to see the output in the Part-00000 file. This file is generated by HDFS.

$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000

Following is the output generated by the MapReduce program −

1981 34

1984 40

1985 45

Step 10 − The following command is used to copy the output folder from HDFS to the local file system.

$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop

MapReduce – Partitioner

A partitioner works like a condition in processing an input dataset. The partition phase takes place after the Map phase and before the Reduce phase.

The number of partitioners is equal to the number of reducers. That means a partitioner will divide the data according to the number of reducers. Therefore, the data passed from a single partitioner is processed by a single Reducer.

Partitioner

A partitioner partitions the key-value pairs of intermediate Map outputs. It partitions the data using a user-defined condition, which works like a hash function. The total number of partitions is the same as the number of Reducer tasks for the job. Let us take an example to understand how the partitioner works.
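Before the worked example, here is a minimal sketch of such a hash-style partitioner; the class name is illustrative, and the modulo-of-hash logic is roughly what Hadoop's default HashPartitioner does.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate key to one of numReduceTasks partitions by
// hashing the key, so equal keys always land on the same Reducer.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
   @Override
   public int getPartition(Text key, IntWritable value, int numReduceTasks) {
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
   }
}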

MapReduce Partitioner Implementation

For the sake of convenience, let us assume we have a small table called Employee with the following data. We will use this sample data as our input dataset to demonstrate how the partitioner works.

Id Name Age Gender Salary
1201 gopal 45 Male 50,000
1202 manisha 40 Female 50,000
1203 khalil 34 Male 30,000
1204 prasanth 30 Male 30,000
1205 kiran 20 Male 40,000
1206 laxmi 25 Female 35,000
1207 bhavya 20 Female 15,000
1208 reshma 19 Female 15,000
1209 kranthi 22 Male 22,000
1210 Satish 24 Male 25,000
1211 Krishna 25 Male 25,000
1212 Arshad 28 Male 20,000
1213 lavanya 18 Female 8,000

We have to write an application to process the input dataset to find the highest salaried employee by gender in different age groups (for example, below 20, between 21 and 30, above 30).

Input Data

The above data is saved as input.txt in the “/home/hadoop/hadoopPartitioner” directory and given as input.

1201 gopal 45 Male 50000
1202 manisha 40 Female 51000
1203 khaleel 34 Male 30000
1204 prasanth 30 Male 31000
1205 kiran 20 Male 40000
1206 laxmi 25 Female 35000
1207 bhavya 20 Female 15000
1208 reshma 19 Female 14000
1209 kranthi 22 Male 22000
1210 Satish 24 Male 25000
1211 Krishna 25 Male 26000
1212 Arshad 28 Male 20000
1213 lavanya 18 Female 8000

Based on the given input, following is the algorithmic explanation of the program.

Map Tasks

The map task accepts the key-value pairs as input while we have the text data in a text file. The input for this map task is as follows −

Input − The key would be a pattern such as “any special key + filename + line number” (example: key = @input1) and the value would be the data in that line (example: value = 1201 \t gopal \t 45 \t Male \t 50000).

Method − The operation of this map task is as follows −

  • Read the value (record data), which comes as the input value from the argument list in a string.

  • Using the split function, separate the gender and store it in a string variable.

String[] str = value.toString().split("\t", -3);
String gender=str[3];
  • Send the gender information and the record data value as an output key-value pair from the map task to the partition task.

context.write(new Text(gender), new Text(value));
  • Repeat all the above steps for all the records in the text file.

Output − You will get the gender data and the record data value as key-value pairs.

Partitioner Task

The partitioner task accepts the key-value pairs from the map task as its input. Partition implies dividing the data into segments. According to the given conditional criteria of partitions, the input key-value paired data can be divided into three parts based on the age criteria.

Input − The whole data in a collection of key-value pairs.

key = Gender field value in the record.

value = Whole record data value of that gender.

Method − The process of partition logic runs as follows.

  • Read the age field value from the input key-value pair.
String[] str = value.toString().split("\t");
int age = Integer.parseInt(str[2]);
  • Check the age value with the following conditions.

    • Age less than or equal to 20
    • Age Greater than 20 and Less than or equal to 30.
    • Age Greater than 30.
if(age<=20)
{
   return 0;
}
else if(age>20 && age<=30)
{
   return 1 % numReduceTasks;
}
else
{
   return 2 % numReduceTasks;
}

Output − The whole data of key-value pairs is segmented into three collections of key-value pairs. The Reducer works individually on each collection.

Reduce Tasks

The number of partitioner tasks is equal to the number of reducer tasks. Here we have three partitioner tasks and hence we have three Reducer tasks to be executed.

Input − The Reducer will execute three times with a different collection of key-value pairs.

key = gender field value in the record.

value = the whole record data of that gender.

Method − The following logic will be applied on each collection.

  • Read the Salary field value of each record.
String [] str = val.toString().split("\t", -3);
Note: str[4] contains the salary field value.
  • Check the salary with the max variable. If str[4] is the max salary, then assign str[4] to max, otherwise skip the step.

if(Integer.parseInt(str[4])>max)
{
   max=Integer.parseInt(str[4]);
}
  • Repeat Steps 1 and 2 for each key collection (Male & Female are the key collections). After executing these three steps, you will find one max salary from the Male key collection and one max salary from the Female key collection.

context.write(new Text(key), new IntWritable(max));

Output − Finally, you will get a set of key-value pair data in three collections of different age groups. It contains the max salary from the Male collection and the max salary from the Female collection in each age group respectively.

After executing the Map, the Partitioner, and the Reduce tasks, the three collections of key-value pair data are stored in three different files as the output.

All the three tasks are treated as MapReduce jobs. The following requirements and specifications of these jobs should be specified in the Configurations −

  • Job name
  • Input and Output formats of keys and values
  • Individual classes for Map, Reduce, and Partitioner tasks
Configuration conf = getConf();

//Create Job
Job job = new Job(conf, "topsal");
job.setJarByClass(PartitionerExample.class);

// File Input and Output paths
FileInputFormat.setInputPaths(job, new Path(arg[0]));
FileOutputFormat.setOutputPath(job, new Path(arg[1]));

//Set Mapper class and Output format for key-value pair.
job.setMapperClass(MapClass.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

//set partitioner statement
job.setPartitionerClass(CaderPartitioner.class);

//Set Reducer class and Input/Output format for key-value pair.
job.setReducerClass(ReduceClass.class);

//Number of Reducer tasks.
job.setNumReduceTasks(3);

//Input and Output format for data
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

Example Program

The following program shows how to implement the partitioners for the given criteria in a MapReduce program.

package partitionerexample;

import java.io.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;

import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;

import org.apache.hadoop.util.*;

public class PartitionerExample extends Configured implements Tool
{
   //Map class
	
   public static class MapClass extends Mapper<LongWritable,Text,Text,Text>
   {
      public void map(LongWritable key, Text value, Context context)
      {
         try{
            String[] str = value.toString().split("\t", -3);
            String gender=str[3];
            context.write(new Text(gender), new Text(value));
         }
         catch(Exception e)
         {
            System.out.println(e.getMessage());
         }
      }
   }
   
   //Reducer class
	
   public static class ReduceClass extends Reducer<Text,Text,Text,IntWritable>
   {
      public int max = -1;
      public void reduce(Text key, Iterable <Text> values, Context context) throws IOException, InterruptedException
      {
         max = -1;
			
         for (Text val : values)
         {
            String [] str = val.toString().split("\t", -3);
            if(Integer.parseInt(str[4])>max)
            max=Integer.parseInt(str[4]);
         }
			
         context.write(new Text(key), new IntWritable(max));
      }
   }
   
   //Partitioner class
	
   public static class CaderPartitioner extends
   Partitioner < Text, Text >
   {
      @Override
      public int getPartition(Text key, Text value, int numReduceTasks)
      {
         String[] str = value.toString().split("\t");
         int age = Integer.parseInt(str[2]);
         
         if(numReduceTasks == 0)
         {
            return 0;
         }
         
         if(age<=20)
         {
            return 0;
         }
         else if(age>20 && age<=30)
         {
            return 1 % numReduceTasks;
         }
         else
         {
            return 2 % numReduceTasks;
         }
      }
   }
   
   @Override
   public int run(String[] arg) throws Exception
   {
      Configuration conf = getConf();
		
      Job job = new Job(conf, "topsal");
      job.setJarByClass(PartitionerExample.class);
		
      FileInputFormat.setInputPaths(job, new Path(arg[0]));
      FileOutputFormat.setOutputPath(job,new Path(arg[1]));
		
      job.setMapperClass(MapClass.class);
		
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(Text.class);
      
      //set partitioner statement
		
      job.setPartitionerClass(CaderPartitioner.class);
      job.setReducerClass(ReduceClass.class);
      job.setNumReduceTasks(3);
      job.setInputFormatClass(TextInputFormat.class);
		
      job.setOutputFormatClass(TextOutputFormat.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
		
      System.exit(job.waitForCompletion(true)? 0 : 1);
      return 0;
   }
   
   public static void main(String ar[]) throws Exception
   {
      int res = ToolRunner.run(new Configuration(), new PartitionerExample(),ar);
      System.exit(0);
   }
}

Save the above code as PartitionerExample.java in “/home/hadoop/hadoopPartitioner”. The compilation and execution of the program is given below.

Compilation and Execution

Let us assume we are in the home directory of the Hadoop user (for example, /home/hadoop).

Follow the steps given below to compile and execute the above program.

Step 1 − Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. You can download the jar from mvnrepository.com.

Let us assume the downloaded folder is “/home/hadoop/hadoopPartitioner”

Step 2 − The following commands are used for compiling the program PartitionerExample.java and creating a jar for the program.

$ javac -classpath hadoop-core-1.2.1.jar -d . PartitionerExample.java
$ jar -cvf PartitionerExample.jar -C . partitionerexample

Step 3 − Use the following command to create an input directory in HDFS.

$HADOOP_HOME/bin/hadoop fs -mkdir input_dir

Step 4 − Use the following command to copy the input file named input.txt in the input directory of HDFS.

$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/hadoopPartitioner/input.txt input_dir

Step 5 − Use the following command to verify the files in the input directory.

$HADOOP_HOME/bin/hadoop fs -ls input_dir/

Step 6 − Use the following command to run the Top salary application by taking input files from the input directory.

$HADOOP_HOME/bin/hadoop jar PartitionerExample.jar partitionerexample.PartitionerExample input_dir/input.txt output_dir

Wait for a while till the file gets executed. After execution, the output contains a number of input splits, map tasks, and Reducer tasks.

15/02/04 15:19:51 INFO mapreduce.Job: Job job_1423027269044_0021 completed successfully
15/02/04 15:19:52 INFO mapreduce.Job: Counters: 49

File System Counters

   FILE: Number of bytes read=467
   FILE: Number of bytes written=426777
   FILE: Number of read operations=0
   FILE: Number of large read operations=0
   FILE: Number of write operations=0
	
   HDFS: Number of bytes read=480
   HDFS: Number of bytes written=72
   HDFS: Number of read operations=12
   HDFS: Number of large read operations=0
   HDFS: Number of write operations=6
	
Job Counters

   Launched map tasks=1
   Launched reduce tasks=3
	
   Data-local map tasks=1
	
   Total time spent by all maps in occupied slots (ms)=8212
   Total time spent by all reduces in occupied slots (ms)=59858
   Total time spent by all map tasks (ms)=8212
   Total time spent by all reduce tasks (ms)=59858
	
   Total vcore-seconds taken by all map tasks=8212
   Total vcore-seconds taken by all reduce tasks=59858
	
   Total megabyte-seconds taken by all map tasks=8409088
   Total megabyte-seconds taken by all reduce tasks=61294592
	
Map-Reduce Framework

   Map input records=13
   Map output records=13
   Map output bytes=423
   Map output materialized bytes=467
	
   Input split bytes=119
	
   Combine input records=0
   Combine output records=0
	
   Reduce input groups=6
   Reduce shuffle bytes=467
   Reduce input records=13
   Reduce output records=6
	
   Spilled Records=26
   Shuffled Maps =3
   Failed Shuffles=0
   Merged Map outputs=3
   GC time elapsed (ms)=224
   CPU time spent (ms)=3690
	
   Physical memory (bytes) snapshot=553816064
   Virtual memory (bytes) snapshot=3441266688
	
   Total committed heap usage (bytes)=334102528
	
Shuffle Errors

   BAD_ID=0
   CONNECTION=0
   IO_ERROR=0
	
   WRONG_LENGTH=0
   WRONG_MAP=0
   WRONG_REDUCE=0
	
File Input Format Counters

   Bytes Read=361
	
File Output Format Counters

   Bytes Written=72

Step 7 − Use the following command to verify the resultant files in the output folder.

$HADOOP_HOME/bin/hadoop fs -ls output_dir/

You will find the output in three files because you are using three partitioners and three Reducers in your program.

Step 8 − Use the following command to see the output in the Part-00000 file. This file is generated by HDFS.

$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000

Output in Part-00000

Female   15000
Male     40000

Use the following command to see the output in the Part-00001 file.

$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00001

Output in Part-00001

Female   35000
Male    31000

Use the following command to see the output in the Part-00002 file.

$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00002

Output in Part-00002

Female  51000
Male   50000

MapReduce – Combiners

A Combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs from the Map class and thereafter passing the output key-value pairs to the Reducer class.

The main function of a Combiner is to summarize the map output records with the same key. The output (key-value collection) of the combiner will be sent over the network to the actual Reducer task as input.

Combiner

The Combiner class is used in between the Map class and the Reduce class to reduce the volume of data transfer between Map and Reduce. Usually, the output of the map task is large and the data transferred to the reduce task is high.

The following MapReduce task diagram shows the COMBINER PHASE.

Combiner

How Combiner Works?

Here is a brief summary on how the MapReduce Combiner works −

  • A combiner does not have a predefined interface and it must implement the Reducer interface’s reduce() method.

  • A combiner operates on each map output key. It must have the same output key-value types as the Reducer class.

  • A combiner can produce summary information from a large dataset because it replaces the original Map output.

Although the Combiner is optional, it helps segregate data into multiple groups for the Reduce phase, which makes it easier to process.

MapReduce Combiner Implementation

The following example provides a theoretical idea about combiners. Let us assume we have the following input text file named input.txt for MapReduce.

What do you mean by Object
What do you know about Java
What is Java Virtual Machine
How Java enabled High Performance

The important phases of the MapReduce program with Combiner are discussed below.

Record Reader

This is the first phase of MapReduce where the Record Reader reads every line from the input text file as text and yields output as key-value pairs.

Input − Line by line text from the input file.

Output − Forms the key-value pairs. The following is the set of expected key-value pairs.

<1, What do you mean by Object>
<2, What do you know about Java>
<3, What is Java Virtual Machine>
<4, How Java enabled High Performance>

Map Phase

The Map phase takes input from the Record Reader, processes it, and produces the output as another set of key-value pairs.

Input − The following key-value pair is the input taken from the Record Reader.

<1, What do you mean by Object>
<2, What do you know about Java>
<3, What is Java Virtual Machine>
<4, How Java enabled High Performance>

The Map phase reads each key-value pair, divides each word from the value using StringTokenizer, and treats each word as the key and the count of that word as the value. The following code snippet shows the Mapper class and the map function.

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
{
   private final static IntWritable one = new IntWritable(1);
   private Text word = new Text();
   
   public void map(Object key, Text value, Context context) throws IOException, InterruptedException 
   {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) 
      {
         word.set(itr.nextToken());
         context.write(word, one);
      }
   }
}

Output − The expected output is as follows −

<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>

Combiner Phase

The Combiner phase takes each key-value pair from the Map phase, processes it, and produces the output as key-value collection pairs.

Input − The following key-value pair is the input taken from the Map phase.

<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>

The Combiner phase reads every key-value pair, combines the common words as key and values as collection. Usually, the code and operation for a Combiner is similar to thead wear of a Reducer. Folloearng is the code snippet for Mapper, Combiner and Reducer clbum declaration.

job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);

Output − The expected output is as follows −

<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>

Reducer Phase

The Reducer phase takes every key-value collection pair from the Combiner phase, processes it, and passes the output as key-value pairs. Note that the Combiner functionality here is the same as that of the Reducer.

Input − The following key-value pairs are the input taken from the Combiner phase.

<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>

The Reducer phase reads every key-value pair and sums the values for each key. Following is the code snippet for the Reducer class, which is also used as the Combiner.

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> 
{
   private IntWritable result = new IntWritable();
   
   public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException 
   {
      int sum = 0;
      for (IntWritable val : values) 
      {
         sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
   }
}

Output − The expected output from the Reducer phase is as follows −

<What,3> <do,2> <you,2> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,3>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>

Record Writer

This is the final phase of MapReduce, where the Record Writer writes every key-value pair from the Reducer phase and emits the output as text.

Input − Each key-value pair from the Reducer phase, along with the Output format.

Output − It gives the key-value pairs in text format. Following is the expected output.

What           3
do             2
you            2
mean           1
by             1
Object         1
know           1
about          1
Java           3
is             1
Virtual        1
Machine        1
How            1
enabled        1
High           1
Performance    1
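
This text output is produced by the default TextOutputFormat, whose Record Writer separates each key and value with a tab character. As a minimal sketch, assuming the Hadoop 2.x property name mapreduce.output.textoutputformat.separator, the separator could be changed when the job is configured:

import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Assumed property name for the new (mapreduce) API; replaces the default tab separator.
Configuration conf = new Configuration();
conf.set("mapreduce.output.textoutputformat.separator", " = ");
Job job = Job.getInstance(conf, "word count");
job.setOutputFormatClass(TextOutputFormat.class);   // the default output format, set here only for clarity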

Example Program

The following code block counts the number of occurrences of each word in the input.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
   public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
   {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
      
      public void map(Object key, Text value, Context context) throws IOException, InterruptedException 
      {
         StringTokenizer itr = new StringTokenizer(value.toString());
         while (itr.hasMoreTokens()) 
         {
            word.set(itr.nextToken());
            context.write(word, one);
         }
      }
   }
   
   public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> 
   {
      private IntWritable result = new IntWritable();
      public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException 
      {
         int sum = 0;
         for (IntWritable val : values) 
         {
            sum += val.get();
         }
         result.set(sum);
         context.write(key, result);
      }
   }
   
   public static void main(String[] args) throws Exception 
   {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "word count");

      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);

      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);

      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
}

Save the above program as WordCount.java. The compilation and execution of the program are given below.

Compilation and Execution

Let us assume we are in the home directory of the Hadoop user (for example, /home/hadoop).

Follow the steps given below to compile and execute the above program.

Step 1 − Use the following command to create a directory to store the compiled Java classes.

$ mkdir units

Step 2 − Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. You can download the jar from mvnrepository.com.

Let us assume the downloaded folder is /home/hadoop/.

Step 3 − Use the following commands to compile the WordCount.java program and to create a jar for the program.

$ javac -classpath hadoop-core-1.2.1.jar -d units WordCount.java
$ jar -cvf units.jar -C units/ .

Step 4 − Use the following command to create an input directory in HDFS.

$HADOOP_HOME/bin/hadoop fs -mkdir input_dir

Step 5 − Use the following command to copy the input file named input.txt to the input directory of HDFS.

$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/input.txt input_dir

Step 6 − Use the following command to verify the files in the input directory.

$HADOOP_HOME/bin/hadoop fs -ls input_dir/

Step 7 − Use the following command to run the word count application by taking input files from the input directory.

$HADOOP_HOME/bin/hadoop jar units.jar WordCount input_dir output_dir

Wait for a while until the job finishes. After execution, the output reports a number of input splits, Map tasks, and Reducer tasks.
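
If the job is launched from the Java driver shown above rather than the command line, the same summary information is also available programmatically through the job counters. A minimal sketch, assuming the standard TaskCounter names, placed after waitForCompletion(true) returns:

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.TaskCounter;

// Reads two of the built-in task counters once the job has finished.
Counters counters = job.getCounters();
long mapInputRecords = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
long reduceOutputRecords = counters.findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();
System.out.println("Map input records: " + mapInputRecords);
System.out.println("Reduce output records: " + reduceOutputRecords);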

Step 8 − Use the following command to verify the resultant files in the output folder.

$HADOOP_HOME/bin/hadoop fs -ls output_dir/

Step 9 − Use the following command to see the output in the part-r-00000 file. This file is produced by the Reducer and stored in HDFS.

$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-r-00000

Following is the output generated by the MapReduce program.

What           3
do             2
you            2
mean           1
by             1
Object         1
know           1
about          1
Java           3
is             1
Virtual        1
Machine        1
How            1
enabled        1
High           1
Performance    1

MapReduce – Hadoop Administration

This chapter explains Hadoop administration, which includes both HDFS and MapReduce administration.

  • HDFS administration includes monitoring the HDFS file structure, locations, and the updated files.

  • MapReduce administration includes monitoring the list of applications, configuration of nodes, application status, etc.

HDFS Monitoring

HDFS (Hadoop Distributed File System) contains the user directories, input files, and output files. Use the HDFS shell commands put and get for storing and retrieving files.
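
As a minimal sketch (the paths below are illustrative), the same put and get operations can also be performed from Java through the FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Programmatic equivalents of "hadoop fs -put" and "hadoop fs -get".
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
fs.copyFromLocalFile(new Path("/home/hadoop/input.txt"), new Path("input_dir/input.txt"));    // put
fs.copyToLocalFile(new Path("output_dir/part-r-00000"), new Path("/home/hadoop/result.txt")); // get
fs.close();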

After starting the Hadoop framework (daemons) by executing the command “start-all.sh” in “$HADOOP_HOME/sbin”, open the following URL in a browser: “http://localhost:50070”. You should see the following screen in your browser.

The following screenshot shows how to browse the HDFS.

HDFS Monitoring

The following screenshot shows the file structure of HDFS. It shows the files in the “/user/hadoop” directory.

HDFS Files

The following screenshot shows the Datanode information in a cluster. Here you can find one node with its configuration and capacities.

Datanode Information

MapReduce Job Monitoring

A MapReduce application is a collection of jobs (Map job, Combiner, Partitioner, and Reduce job). It is mandatory to monitor and maintain the following −

  • Configuration of the datanodes where the application runs.
  • The number of datanodes and resources used per application.

To monitor all these things, we need a user interface. After starting the Hadoop framework by executing the command “start-all.sh” in “$HADOOP_HOME/sbin”, open the following URL in a browser: “http://localhost:8088” (the default ResourceManager web UI port). You should see the following screen in your browser.

Job Monitoring

In the above screenshot, the hand pointer is on the application ID. Just click on it to see the following screen in your browser. It describes the following −

  • The user under which the current application is running

  • The application name

  • The type of that application

  • Current status, Final status

  • Application started time, elapsed (completed) time, if it was complete at the time of monitoring

  • The history of this application, i.e., log information

  • And finally, the node information, i.e., the nodes that participated in running the application.
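
All of this information is also exposed through the YARN client API. The following is a minimal sketch, assuming org.apache.hadoop.yarn.client.api.YarnClient is on the classpath, that prints the same application list shown in the web UI:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListApplications {
   public static void main(String[] args) throws Exception {
      // Connect to the ResourceManager using the default configuration.
      YarnClient yarnClient = YarnClient.createYarnClient();
      yarnClient.init(new Configuration());
      yarnClient.start();
      
      // Print one line per application: ID, name, and current state.
      for (ApplicationReport report : yarnClient.getApplications()) {
         System.out.println(report.getApplicationId() + "\t" + report.getName() + "\t" + report.getYarnApplicationState());
      }
      yarnClient.stop();
   }
}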

The following screenshot shows the details of a particular application −

Application ID

The following screenshot shows the information about the currently running nodes. Here, the screenshot contains only one node. A hand pointer shows the localhost address of the running node.

All Nodes
