Hadoop


Hadoop – Big Data Overview

“90% of the world’s data was generated in the last few years.”

Due to the advent of new technologies, devices, and communication means like social networking sites, the amount of data produced by mankind is growing rapidly every year. The amount of data produced by us from the beginning of time till 2003 was 5 billion gigabytes. If you pile up the data in the form of disks, it may fill an entire football field. The same amount was created every two days in 2011, and every ten minutes in 2013. This rate is still growing enormously. Though all this information is meaningful and can be useful when processed, it is being neglected.

What is Big Data?

Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or a tool; rather, it involves many areas of business and technology.

What Comes Under Big Data?

Big data involves the data produced by various devices and applications. Given below are some of the fields that come under the umbrella of Big Data.

  • Black Box Data : It is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.

  • Social Media Data : Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.

  • Stock Exchange Data : The stock exchange data holds information about the ‘buy’ and ‘sell’ decisions made by customers on shares of various companies.

  • Power Grid Data : The power grid data holds information consumed by a particular node with respect to a base station.

  • Transport Data : Transport data includes the model, capacity, distance, and availability of a vehicle.

  • Search Engine Data : Search engines retrieve lots of data from various databases.


Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it will be of three types.

  • Structured data : Relational data.

  • Semi-structured data : XML data.

  • Unstructured data : Word, PDF, text, media logs.

Benefits of Big Data

  • Using the information kept in social networks like Facebook, marketing agencies are learning about the response to their campaigns, promotions, and other advertising mediums.

  • Using the information in social media, such as the preferences and product perception of their consumers, product companies and retail organizations are planning their production.

  • Using the data regarding the previous medical history of patients, hospitals are providing better and quicker service.

Big Data Technologies

Big data technologies are important in providing more accurate analysis, which may lead to more concrete decision-making, resulting in greater operational efficiency, cost reductions, and reduced risks for the business.

To harness the power of big data, you would require an infrastructure that can manage and process huge volumes of structured and unstructured data in real time and can protect data privacy and security.

There are various technologies in the market from different vendors, including Amazon, IBM, Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we examine the following two classes of technology:

Operational Big Data

These include systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.

NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational big data workloads much easier to manage, and cheaper and faster to implement.

Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.

Analytical Big Data

These include systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data.

MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled up from single servers to thousands of high- and low-end machines.

These two classes of technology are complementary and frequently deployed together.

Operational vs. Analytical Systems

                 Operational            Analytical
Latency          1 ms – 100 ms          1 min – 100 min
Concurrency      1000 – 100,000         1 – 10
Access Pattern   Writes and Reads       Reads
Queries          Selective              Unselective
Data Scope       Operational            Retrospective
End User         Customer               Data Scientist
Technology       NoSQL                  MapReduce, MPP Database

Big Data Challenges

The major challenges associated with big data are as follows:

  • Capturing data
  • Curation
  • Storage
  • Searching
  • Sharing
  • Transfer
  • Analysis
  • Presentation

To fulfill the above challenges, organizations normally take the help of enterprise servers.

Hadoop – Big Data Solutions

Traditional Enterprise Approach

In this approach, an enterprise will have a computer to store and process big data. For storage purposes, the programmers will take the help of their choice of database vendors such as Oracle, IBM, etc. In this approach, the user interacts with the application, which in turn handles the part of data storage and analysis.


Limitation

This approach works fine with those applications that process less voluminous data that can be accommodated by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to dealing with huge amounts of scalable data, it is a hectic task to process such data through a single database bottleneck.

Google’s Solution

Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts and assigns them to many computers, and collects the results from them, which, when integrated, form the result dataset.


Hadoop

Using the solution provided by Google, Doug Cutting and his team developed an open source project called HADOOP.

Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel with others. In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data.


Hadoop – Introduction to Hadoop

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop Architecture

At its core, Hadoop has two major layers, namely:

  • Processing/Computation layer (MapReduce), and
  • Storage layer (Hadoop Distributed File System).


MapReduce

MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce program runs on Hadoop, which is an Apache open-source framework.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications having large datasets.

Apart from the above-mentioned two core components, the Hadoop framework also includes the following two modules:

  • Hadoop Common : These are Java libraries and utilities required by other Hadoop modules.

  • Hadoop YARN : This is a framework for job scheduling and cluster resource management.

How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations that handle large-scale processing, but as an alternative, you can tie together many commodity computers with single CPUs as a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide a much higher throughput. Moreover, it is cheaper than one high-end server. So this is the first motivational factor behind using Hadoop: it runs across clustered, low-cost machines.

Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:

  • Data is initially divided into directories and files. Files are divided into uniform-sized blocks of 128M and 64M (preferably 128M).
  • These files are then distributed across various cluster nodes for further processing.
  • HDFS, being on top of the local file system, supervises the processing.
  • Blocks are replicated for handling hardware failure.
  • Checking that the code was executed successfully.
  • Performing the sort that takes place between the map and reduce stages.
  • Sending the sorted data to a certain computer.
  • Writing the debugging logs for each job.

Advantages of Hadoop

  • The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.

  • Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.

  • Servers can be added or removed from the cluster dynamically, and Hadoop continues to operate without interruption.

  • Another big advantage of Hadoop is that apart from being open source, it is compatible on all the platforms since it is Java based.

Hadoop – Environment Setup

Hadoop is supported by the GNU/Linux platform and its flavours. Therefore, we have to install a Linux operating system for setting up the Hadoop environment. In case you have an OS other than Linux, you can install VirtualBox software in it and have Linux inside the VirtualBox.

Pre-installation Setup

Before installing Hadoop into the Linux environment, we need to set up Linux using ssh (Secure Shell). Follow the steps given below for setting up the Linux environment.

Creating a User

At the beginning, it is recommended to create a separate user for Hadoop to isolate the Hadoop file system from the Unix file system. Follow the steps given below to create a user:

  • Open the root using the command “su”.
  • Create a user from the root account using the command “useradd username”.
  • Now you can open an existing user account using the command “su username”.

Open the Linux terminal and type the following commands to create a user.

$ su 
   password: 
# useradd hadoop 
# passwd hadoop 
   New passwd: 
   Retype new passwd 

SSH Setup and Key Generation

SSH setup is required to perform different operations on a cluster such as starting, stopping, and distributed daemon shell operations. To authenticate different users of Hadoop, it is required to provide a public/private key pair for a Hadoop user and share it with different users.

The following commands are used for generating a key-value pair using SSH. Copy the public key from id_rsa.pub to authorized_keys, and provide the owner with read and write permissions to the authorized_keys file respectively.

$ ssh-keygen -t rsa 
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 
$ chmod 0600 ~/.ssh/authorized_keys 

Installing Java

Java is the main prerequisite for Hadoop. First of all, you should verify the existence of Java in your system using the command “java -version”. The syntax of the java version command is given below.

$ java -version 

If everything is in order, it will give you the following output.

java version "1.7.0_71" 
Java(TM) SE Runtime Environment (build 1.7.0_71-b13) 
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)  

If Java is not installed in your system, then follow the steps given below for installing Java.

Step 1

Download Java (JDK <latest version> – X64.tar.gz) by visiting the following link: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html.

Then jdk-7u71-linux-x64.tar.gz will be downloaded into your system.

Step 2

Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.gz file using the following commands.

$ cd Downloads/ 
$ ls 
jdk-7u71-linux-x64.gz 
$ tar zxf jdk-7u71-linux-x64.gz 
$ ls 
jdk1.7.0_71   jdk-7u71-linux-x64.gz 

Step 3

To make Java available to all the users, you have to move it to the location “/usr/local/”. Open root, and type the following commands.

$ su 
password: 
# mv jdk1.7.0_71 /usr/local/ 
# exit 

Step 4

For setting up the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71 
export PATH=$PATH:$JAVA_HOME/bin 
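
To make these variables take effect in the current shell session, you would typically reload the file (a small, assumed step that mirrors what the pseudo-distributed setup later does explicitly):

$ source ~/.bashrc 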

Now verify the java -version command from the terminal as explained above.

Downloading Hadoop

Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following commands.

$ su 
password: 
# cd /usr/local 
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/ 
hadoop-2.4.1.tar.gz 
# tar xzf hadoop-2.4.1.tar.gz 
# mv hadoop-2.4.1/* hadoop/ 
# exit 

Hadoop Operation Modes

Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of the three supported modes:

  • Local/Standalone Mode : After downloading Hadoop onto your system, by default, it is configured in standalone mode and can be run as a single Java process.

  • Pseudo-Distributed Mode : It is a distributed simulation on a single machine. Each Hadoop daemon such as hdfs, yarn, MapReduce, etc., will run as a separate Java process. This mode is useful for development.

  • Fully Distributed Mode : This mode is fully distributed with a minimum of two or more machines as a cluster. We will come across this mode in detail in the coming chapters.

Installing Hadoop in Standalone Mode

Here we will discuss the installation of Hadoop 2.4.1 in standalone mode.

There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.

Setting Up Hadoop

You can set Hadoop environment variables by appending the following commands to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop 
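
The hadoop command used below is found only if Hadoop's bin directory is on your PATH. If it is not, you would typically also append a line like the following to ~/.bashrc (a minimal sketch, assuming the installation location above) and then run source ~/.bashrc:

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin 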

Before proceeding further, you need to make sure that Hadoop is working fine. Just issue the following command:

$ hadoop version 

If everything is fine with your setup, then you should see the following result:

Hadoop 2.4.1 
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768 
Compiled by hortonmu on 2013-10-07T06:28Z 
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4 

It means your Hadoop standalone mode setup is working fine. By default, Hadoop is configured to run in a non-distributed mode on a single machine.

Example

Let's check a simple example of Hadoop. The Hadoop installation delivers the following example MapReduce jar file, which provides basic MapReduce functionality and can be used for calculations such as the Pi value, word counts in a given list of files, etc.

$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar 

Let's have an input directory where we will push a few files, and our requirement is to count the total number of words in those files. To calculate the total number of words, we do not need to write our own MapReduce, provided the .jar file contains the implementation for word count. You can try other examples using the same .jar file; just issue the following command to check the supported MapReduce functional programs provided by the hadoop-mapreduce-examples-2.2.0.jar file.

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar 

Step 1

Create temporary content files in the input directory. You can create this input directory anywhere you would like to work.

$ mkdir input 
$ cp $HADOOP_HOME/*.txt input 
$ ls -l input 

It will give the following files in your input directory:

total 24 
-rw-r--r-- 1 root root 15164 Feb 21 10:14 LICENSE.txt 
-rw-r--r-- 1 root root   101 Feb 21 10:14 NOTICE.txt
-rw-r--r-- 1 root root  1366 Feb 21 10:14 README.txt 

These files have been copied from the Hadoop installation home directory. For your experiment, you can have different and large sets of files.

Step 2

Let's start the Hadoop process to count the total number of words in all the files available in the input directory, as follows:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar  wordcount input output 

Step 3

Step 2 will do the required processing and save the output in the output/part-r-00000 file, which you can check by using:

$ cat output/* 

It will list down all the words along with their total counts available in all the files in the input directory.

"AS      4 
"Contribution" 1 
"Contributor" 1 
"Derivative 1
"Legal 1
"License"      1
"License");     1 
"Licensor"      1
"NOTICE”        1 
"Not      1 
"Object"        1 
"Source”        1 
"Work”    1 
"You"     1 
"Your")   1 
"[]"      1 
"control"       1 
"printed        1 
"submitted"     1 
(50%)     1 
(BIS),    1 
(C)       1 
(Don't)   1 
(ECCN)    1 
(INCLUDING      2 
(INCLUDING,     2 
.............

Installing Hadoop in Pseudo Distributed Mode

Follow the steps given below to install Hadoop 2.4.1 in pseudo distributed mode.

Step 1: Setting Up Hadoop

You can set Hadoop environment variables by appending the following commands to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop 
export HADOOP_MAPRED_HOME=$HADOOP_HOME 
export HADOOP_COMMON_HOME=$HADOOP_HOME 
export HADOOP_HDFS_HOME=$HADOOP_HOME 
export YARN_HOME=$HADOOP_HOME 
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native 
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin 
export HADOOP_INSTALL=$HADOOP_HOME 

Now apply all the changes to the current running system.

$ source ~/.bashrc 

Step 2: Hadoop Configuration

You can find all the Hadoop configuration files at the location “$HADOOP_HOME/etc/hadoop”. It is required to make changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs in Java, you have to reset the Java environment variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java in your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

The following is the list of files that you have to edit to configure Hadoop.

core-site.xml

The core-site.xml file contains information such as the port number used for the Hadoop instance, memory allocated for the file system, memory limit for storing the data, and the size of Read/Write buffers.

Open the core-site.xml and add the following properties in between the <configuration>, </configuration> tags.

<configuration>

   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value> 
   </property>
 
</configuration>

hdfs-site.xml

The hdfs-site.xml file contains information such as the value of replication data, the namenode path, and the datanode paths of your local file systems. It means the place where you want to store the Hadoop infrastructure.

Let us assume the following data.

dfs.replication (data replication value) = 1 
(In the below given path, /hadoop/ is the user name. 
hadoopinfra/hdfs/namenode is the directory created by the hdfs file system.) 
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode 
(hadoopinfra/hdfs/datanode is the directory created by the hdfs file system.) 
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode 

Open this file and add the following properties in between the <configuration> </configuration> tags in this file.

<configuration>

   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
    
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
    
   <property>
      <name>dfs.data.dir</name> 
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value> 
   </property>
       
</configuration>

Note: In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.

yarn-site.xml

This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.

<configuration>
 
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value> 
   </property>
  
</configuration>

mapred-site.xml

This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of yarn-site.xml. First of all, it is required to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml 

Open the mapred-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.

<configuration>
 
   <property> 
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
   
</configuration>

Verifying Hadoop Installation

The following steps are used to verify the Hadoop installation.

Step 1: Name Node Setup

Set up the namenode using the command “hdfs namenode -format” as follows.

$ cd ~ 
$ hdfs namenode -format 

The expected result is as follows.

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************ 
STARTUP_MSG: Starting NameNode 
STARTUP_MSG:   host = localhost/192.168.1.11 
STARTUP_MSG:   args = [-format] 
STARTUP_MSG:   version = 2.4.1 
...
...
10/24/14 21:30:56 INFO common.Storage: Storage immediateory 
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted. 
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to 
retain 1 images with txid >= 0 
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0 
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************ 
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11 
************************************************************/

Step 2: Verifying Hadoop dfs

The following command is used to start dfs. Executing this command will start your Hadoop file system.

$ start-dfs.sh 

The expected output is as follows:

10/24/14 21:37:56 
Starting namenodes on [localhost] 
localhost: starting namenode, logging to /home/hadoop/hadoop
2.4.1/logs/hadoop-hadoop-namenode-localhost.out 
localhost: starting datanode, logging to /home/hadoop/hadoop
2.4.1/logs/hadoop-hadoop-datanode-localhost.out 
Starting secondary namenodes [0.0.0.0]

Step 3: Verifying Yarn Script

The following command is used to start the yarn script. Executing this command will start your yarn daemons.

$ start-yarn.sh 

The expected output is as follows:

starting yarn daemons 
starting resourcemanager, logging to /home/hadoop/hadoop
2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out 
localhost: starting nodemanager, logging to /home/hadoop/hadoop
2.4.1/logs/yarn-hadoop-nodemanager-localhost.out 

Step 4: Accessing Hadoop on Browser

The default port number to access Hadoop is 50070. Use the following URL to get Hadoop services in your browser.

http://localhost:50070/


Step 5: Verify All Applications for Cluster

The default port number to access all applications of a cluster is 8088. Use the following URL to visit this service.

http://localhost:8088/


Hadoop – HDFS Overview

The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.

HDFS holds a very large amount of data and provides easier access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data losses in case of failure. HDFS also makes applications available for parallel processing.

Features of HDFS

  • It is suitable for distributed storage and processing.
  • Hadoop provides a command interface to interact with HDFS.
  • The built-in servers of namenode and datanode help users to easily check the status of the cluster.
  • Streaming access to file system data.
  • HDFS provides file permissions and authentication.

HDFS Architecture

Given below is the architecture of a Hadoop File System.

(HDFS architecture diagram)

HDFS follows the master-slave architecture and it has the following elements.

Namenode

The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system having the namenode acts as the master server and it does the following tasks:

  • Manages the file system namespace.
  • Regulates client’s access to files.
  • It also executes file system operations such as renaming, closing, and opening files and directories.

Datanode

The datanode is a commodity hardware having the GNU/Linux operating system and datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.

  • Datanodes perform read-write operations on the file systems, as per client request.
  • They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.

Block

Generally the user data is stored in the files of HDFS. The file in a file system will be divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a Block. The default block size is 64MB, but it can be increased as per the need to change in the HDFS configuration.
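
As a side note (not part of the original text), the block size is normally changed through a property in hdfs-site.xml; in Hadoop 2.x this property is typically dfs.blocksize (older releases used dfs.block.size). A minimal sketch, assuming a 128MB block size:

<property>
   <name>dfs.blocksize</name>
   <value>134217728</value>   <!-- 128 MB expressed in bytes -->
</property>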

Goals of HDFS

  • Fault detection and recovery : Since HDFS includes a large number of commodity hardware, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.

  • Huge datasets : HDFS should have hundreds of nodes per cluster to manage the applications having huge datasets.

  • Hardware at data : A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, it reduces the network traffic and increases the throughput.

Hadoop – HDFS Operations

Starting HDFS

Initially you have to format the configured HDFS file system, open the namenode (HDFS server), and execute the following command.

$ hadoop namenode -format 

After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as the data nodes as a cluster.

$ start-dfs.sh 

Listing Files in HDFS

After loading the information in the server, we can find the list of files in a directory, or the status of a file, using ‘ls’. Given below is the syntax of ls; you can pass a directory or a filename as an argument.

$ $HADOOP_HOME/bin/hadoop fs -ls <args>

Inserting Data into HDFS

Assume we have data in a file called file.txt in the local system which ought to be saved in the HDFS file system. Follow the steps given below to insert the required file in the Hadoop file system.

Step 1

You have to create an input directory.

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input 

Step 2

Transfer and store a data file from the local system to the Hadoop file system using the put command.

$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input 

Step 3

You can verify the file using the ls command.

$ $HADOOP_HOME/bin/hadoop fs -ls /user/input 

Retrieving Data from HDFS

Assume we have a file in HDFS called outfile. Given below is a simple demonstration for retrieving the required file from the Hadoop file system.

Step 1

Initially, view the data from HDFS using the cat command.

$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile 

Step 2

Get the file from HDFS to the local file system using the get command.

$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/ 

Shutting Down the HDFS

You can shut down the HDFS by using the following command.

$ stop-dfs.sh 

Hadoop – Command Reference

HDFS Command Reference

There are many more commands in "$HADOOP_HOME/bin/hadoop fs" than are demonstrated here, although these basic operations will get you started. Running ./bin/hadoop dfs with no additional arguments will list all the commands that can be run with the FsShell system. Furthermore, $HADOOP_HOME/bin/hadoop fs -help commandName will display a short usage summary for the operation in question, if you are stuck.
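
For instance, the following prints the usage summary for the ls operation (a minimal example; any command name from the table below can be substituted):

$ $HADOOP_HOME/bin/hadoop fs -help ls 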

A table of all the operations is shown below. The following conventions are used for parameters:

"<path>" means any file or directory name. 
"<path>..." means one or more file or directory names. 
"<file>" means any filename. 
"<src>" and "<dest>" are path names in a directed operation. 
"<localSrc>" and "<localDest>" are paths as above, but on the local file system. 

All other files and path names refer to objects inside HDFS.

Command Description
-ls <path> Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.
-lsr <path> Behaves like -ls, but recursively displays entries in all subdirectories of path.
-du <path> Shows disk usage, in bytes, for all the files which match path; filenames are reported with the full HDFS protocol prefix.
-dus <path> Like -du, but prints a summary of disk usage of all files/directories in the path.
-mv <src> <dest> Moves the file or directory indicated by src to dest, within HDFS.
-cp <src> <dest> Copies the file or directory identified by src to dest, within HDFS.
-rm <path> Removes the file or empty directory identified by path.
-rmr <path> Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
-put <localSrc> <dest> Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
-copyFromLocal <localSrc> <dest> Identical to -put.
-moveFromLocal <localSrc> <dest> Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the local copy on success.
-get [-crc] <src> <localDest> Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
-getmerge <src> <localDest> Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.
-cat <filename> Displays the contents of filename on stdout.
-copyToLocal <src> <localDest> Identical to -get.
-moveToLocal <src> <localDest> Works like -get, but deletes the HDFS copy on success.
-mkdir <path> Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., mkdir -p in Linux).
-setrep [-R] [-w] rep <path> Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)
-touchz <path> Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
-test -[ezd] <path> Returns 1 if path exists, has zero length, or is a directory; 0 otherwise.
-stat [format] <path> Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
-tail [-f] <filename> Shows the last 1KB of the file on stdout.
-chmod [-R] mode,mode,... <path>... Changes the file permissions associated with one or more objects identified by path. Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply an umask.
-chown [-R] [owner][:[group]] <path>... Sets the owning user and/or group for files or directories identified by path. Sets owner recursively if -R is specified.
-chgrp [-R] group <path>... Sets the owning group for files or directories identified by path. Sets group recursively if -R is specified.
-help <cmd-name> Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.

Hadoop – MapReduce

MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.

What is MapReduce?

MapReduce is a processing technique and a program model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as an input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.

The Algorithm

  • Generally the MapReduce paradigm is based on sending the computer to where the data resides!

  • A MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce stage (a rough illustration of this flow follows the list).

    • Map stage : The map or mapper’s job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

    • Reduce stage : This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.

  • During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.

  • The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.

  • Most of the computing takes place on nodes with data on local disks, which reduces the network traffic.

  • After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
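
Purely as an illustration (a hypothetical word-count input, not part of the original tutorial), the key/value pairs might flow through the stages like this:

Input line to mapper :  "deer bear river deer" 
Map output           :  (deer,1) (bear,1) (river,1) (deer,1) 
After shuffle/sort   :  bear -> [1]   deer -> [1,1]   river -> [1] 
Reduce output        :  (bear,1) (deer,2) (river,1) 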


Inputs and Outputs (Java Perspective)

The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes should be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. The input and output types of a MapReduce job are: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output), as sketched after the table below.

         Input               Output
Map      <k1, v1>            list (<k2, v2>)
Reduce   <k2, list(v2)>      list (<k3, v3>)
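
As a minimal sketch of what these signatures look like in code (using the classic org.apache.hadoop.mapred API that the example program later in this chapter also uses; the class names SketchMapper and SketchReducer are illustrative, not part of the tutorial):

import java.io.IOException; 
import java.util.Iterator; 

import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapred.*; 

// Mapper<K1, V1, K2, V2> : consumes (k1, v1) records, emits intermediate (k2, v2) pairs. 
class SketchMapper extends MapReduceBase 
      implements Mapper<LongWritable, Text, Text, IntWritable> 
{ 
   public void map(LongWritable key, Text value, 
         OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException 
   { 
      // Emit one (k2, v2) pair per input record; the real logic is application-specific. 
      output.collect(new Text(value.toString()), new IntWritable(1)); 
   } 
} 

// Reducer<K2, V2, K3, V3> : consumes (k2, list(v2)) groups, emits final (k3, v3) pairs. 
class SketchReducer extends MapReduceBase 
      implements Reducer<Text, IntWritable, Text, IntWritable> 
{ 
   public void reduce(Text key, Iterator<IntWritable> values, 
         OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException 
   { 
      int sum = 0; 
      while (values.hasNext()) 
      { 
         sum += values.next().get();   // fold the list(v2) values into a single v3 
      } 
      output.collect(key, new IntWritable(sum)); 
   } 
} 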

Terminology

  • PayLoad – Applications implement the Map and the Reduce functions, and form the core of the job.

  • Mapper – Mapper maps the input key/value pairs to a set of intermediate key/value pairs.

  • NamedNode – Node that manages the Hadoop Distributed File System (HDFS).

  • DataNode – Node where data is presented in advance before any processing takes place.

  • MasterNode – Node where the JobTracker runs and which accepts job requests from clients.

  • SlaveNode – Node where the Map and Reduce program runs.

  • JobTracker – Schedules jobs and tracks the assigned jobs to the Task Tracker.

  • Task Tracker – Tracks the task and reports status to the JobTracker.

  • Job – A program is an execution of a Mapper and Reducer across a dataset.

  • Task – An execution of a Mapper or a Reducer on a slice of data.

  • Task Attempt – A particular instance of an attempt to execute a task on a SlaveNode.

Example Scenario

Given below is the data regarding the electrical consumption of an organization. It contains the monthly electrical consumption and the annual average for various years.

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45

If the above data is given as input, we have to write applications to process it and produce results such as finding the year of maximum usage, the year of minimum usage, and so on. This is a walkover for programmers with a finite number of records. They will simply write the logic to produce the required output, and pass the data to the application written.

But think of the data representing the electrical consumption of all the large-scale industries of a particular state, since its formation.

When we write applications to process such bulk data,

  • They will take a lot of time to execute.
  • There will be heavy network traffic when we move data from the source to the network server and so on.

To solve these problems, we have the MapReduce framework.

Input Data

The above data is saved as sample.txt and given as input. The input file looks as shown below.

1979   23   23   2   43   24   25   26   26   26   26   25   26  25 
1980   26   27   28  28   28   30   31   31   31   30   30   30  29 
1981   31   32   32  32   33   34   35   36   36   34   34   34  34 
1984   39   38   39  39   39   41   42   43   40   39   38   38  40 
1985   38   39   39  39   39   41   41   41   00   40   39   39  45 

Example Program

Given below is the program for the sample data using the MapReduce framework.

package hadoop; 

import java.util.*; 
import java.io.IOException; 

import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.conf.*; 
import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapred.*; 
import org.apache.hadoop.util.*; 

public class ProcessUnits 
{ 
   //Mapper class 
   public static class E_EMapper extends MapReduceBase implements 
   Mapper<LongWritable ,/*Input key Type */ 
   Text,                /*Input value Type*/ 
   Text,                /*Output key Type*/ 
   IntWritable>        /*Output value Type*/ 
   {  //Map function 
      public void map(LongWritable key, Text value, 
      OutputCollector<Text, IntWritable> output,   
      Reporter reporter) throws IOException 
      { 
         String line = value.toString(); 
         String lasttoken = null; 
         StringTokenizer s = new StringTokenizer(line,"\t"); 
         String year = s.nextToken(); 
         while(s.hasMoreTokens()){lasttoken=s.nextToken();} 
         int avgprice = Integer.parseInt(lasttoken); 
         output.collect(new Text(year), new IntWritable(avgprice)); 
      } 
   } 
   //Reducer class 
   public static class E_EReduce extends MapReduceBase implements 
   Reducer< Text, IntWritable, Text, IntWritable > 
   {  //Reduce function 
      public void reduce( Text key, Iterator <IntWritable> values, 
         OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException 
         { 
            int maxavg=30; 
            int val=Integer.MIN_VALUE; 
            while (values.hasNext()) 
            { 
               if((val=values.next().get())>maxavg) 
               { 
                  output.collect(key, new IntWritable(val)); 
               } 
            } 
 
         } 
   }  
   
   //Main function 
   public static void main(String args[])throws Exception 
   { 
      JobConf conf = new JobConf(ProcessUnits.class); 
      conf.setJobName("max_electricityunits"); 
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class); 
      conf.setMapperClass(E_EMapper.class); 
      conf.setCombinerClass(E_EReduce.class); 
      conf.setReducerClass(E_EReduce.class); 
      conf.setInputFormat(TextInputFormat.class); 
      conf.setOutputFormat(TextOutputFormat.class); 
      FileInputFormat.setInputPaths(conf, new Path(args[0])); 
      FileOutputFormat.setOutputPath(conf, new Path(args[1])); 
      JobClient.runJob(conf); 
   } 
} 

Save the above program as ProcessUnits.java. The compilation and execution of the program is explained below.

Compilation and Execution of the ProcessUnits Program

Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).

Follow the steps given below to compile and execute the above program.

Step 1

The following command is to create a directory to store the compiled Java classes.

$ mkdir units 

Step 2

Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. Visit the following link http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1 to download the jar. Let us assume the downloaded folder is /home/hadoop/.

Step 3

The following commands are used for compiling the ProcessUnits.java program and creating a jar for the program.

$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java 
$ jar -cvf units.jar -C units/ . 

Step 4

The following command is used to create an input directory in HDFS.

$HADOOP_HOME/bin/hadoop fs -mkdir input_dir 

Step 5

The following command is used to copy the input file named sample.txt into the input directory of HDFS.

$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir 

Step 6

The following command is used to verify the files in the input directory.

$HADOOP_HOME/bin/hadoop fs -ls input_dir/ 

Step 7

The following command is used to run the ProcessUnits application by taking the input files from the input directory.

$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir 

Wait for a while until the file is executed. After execution, as shown below, the output will contain the number of input splits, the number of Map tasks, the number of reducer tasks, etc.

INFO mapreduce.Job: Job job_1414748220717_0002 
completed successfully 
14/10/31 06:02:52 
INFO mapreduce.Job: Counters: 49 
File System Counters 
 
FILE: Number of bytes read=61 
FILE: Number of bytes written=279400 
FILE: Number of read operations=0 
FILE: Number of large read operations=0   
FILE: Number of write operations=0 
HDFS: Number of bytes read=546 
HDFS: Number of bytes written=40 
HDFS: Number of read operations=9 
HDFS: Number of large read operations=0 
HDFS: Number of write operations=2 Job Counters 


   Launched map tasks=2  
   Launched reduce tasks=1 
   Data-local map tasks=2  
   Total time spent by all maps in occupied slots (ms)=146137 
   Total time spent by all reduces in occupied slots (ms)=441   
   Total time spent by all map tasks (ms)=14613 
   Total time spent by all reduce tasks (ms)=44120 
   Total vcore-seconds taken by all map tasks=146137 
   
   Total vcore-seconds taken by all reduce tasks=44120 
   Total megabyte-seconds taken by all map tasks=149644288 
   Total megabyte-seconds taken by all reduce tasks=45178880 
   
Map-Reduce Framework 
 
Map input records=5  
   Map output records=5   
   Map output bytes=45  
   Map output materialized bytes=67  
   Input split bytes=208 
   Combine input records=5  
   Combine output records=5 
   Reduce input groups=5  
   Reduce shuffle bytes=6  
   Reduce input records=5  
   Reduce output records=5  
   Spilled Records=10  
   Shuffled Maps =2  
   Failed Shuffles=0  
   Merged Map outputs=2  
   GC time elapsed (ms)=948  
   CPU time spent (ms)=5160  
   Physical memory (bytes) snapshot=47749120  
   Virtual memory (bytes) snapshot=2899349504  
   Total committed heap usage (bytes)=277684224
     
File Output Format Counters 
 
   Bytes Written=40 

Step 8

The following command is used to verify the resultant files in the output folder.

$HADOOP_HOME/bin/hadoop fs -ls output_dir/ 

Step 9

The following command is used to see the output in the Part-00000 file. This file is generated by HDFS.

$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000 

Below is the output generated by the MapReduce program.

1981    34 
1984    40 
1985    45 

Step 10

The following command is used to copy the output folder from HDFS to the local file system for analysis.

$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop 

Important Commands

All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command. Running the Hadoop script without any arguments prints the description for all commands.

Usage : hadoop [--config confdir] COMMAND

The following table lists the options available and their description.

Options Description
namenode -format Formats the DFS filesystem.
secondarynamenode Runs the DFS secondary namenode.
namenode Runs the DFS namenode.
datanode Runs a DFS datanode.
dfsadmin Runs a DFS admin client.
mradmin Runs a Map-Reduce admin client.
fsck Runs a DFS filesystem checking utility.
fs Runs a generic filesystem user client.
balancer Runs a cluster balancing utility.
oiv Applies the offline fsimage viewer to an fsimage.
fetchdt Fetches a delegation token from the NameNode.
jobtracker Runs the MapReduce job Tracker node.
pipes Runs a Pipes job.
tasktracker Runs a MapReduce task Tracker node.
historyserver Runs job history servers as a standalone daemon.
job Manipulates the MapReduce jobs.
queue Gets information regarding JobQueues.
version Prints the version.
jar <jar> Runs a jar file.
distcp <srcurl> <desturl> Copies file or directories recursively.
distcp2 <srcurl> <desturl> DistCp version 2.
archive -archiveName NAME -p <parent path> <src>* <dest> Creates a hadoop archive.
classpath Prints the class path needed to get the Hadoop jar and the required libraries.
daemonlog Get/Set the log level for each daemon.

How to Interact with MapReduce Jobs

Usage: hadoop job [GENERIC_OPTIONS]

The following are the Generic Options available in a Hadoop job.

GENERIC_OPTIONS Description
-submit <job-file> Submits the job.
-status <job-id> Prints the map and reduce completion percentage and all job counters.
-counter <job-id> <group-name> <countername> Prints the counter value.
-kill <job-id> Kills the job.
-events <job-id> <fromevent-#> <#-of-events> Prints the events' details received by the jobtracker for the given range.
-history [all] <jobOutputDir> Prints job details, failed and killed tip details. More details about the job such as successful tasks and task attempts made for each task can be viewed by specifying the [all] option.
-list [all] Displays all jobs. -list displays only jobs which are yet to complete.
-kill-task <task-id> Kills the task. Killed tasks are NOT counted against failed attempts.
-fail-task <task-id> Fails the task. Failed tasks are counted against failed attempts.
-set-priority <job-id> <priority> Changes the priority of the job. Allowed priority values are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.

To see the status of a job

$ $HADOOP_HOME/bin/hadoop job -status <JOB-ID> 
e.g. 
$ $HADOOP_HOME/bin/hadoop job -status job_201310191043_0004 

To see the history of job output-dir

$ $HADOOP_HOME/bin/hadoop job -history <DIR-NAME> 
e.g. 
$ $HADOOP_HOME/bin/hadoop job -history /user/expert/output 

To kill the job

$ $HADOOP_HOME/bin/hadoop job -kill <JOB-ID> 
e.g. 
$ $HADOOP_HOME/bin/hadoop job -kill job_201310191043_0004 

Hadoop – Streaming

Hadoop streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Example Using Python

For Hadoop streaming, we are considering the word-count problem. Any job in Hadoop must have two phases: mapper and reducer. We have written code for the mapper and the reducer in Python script to run it under Hadoop. One can also write the same in Perl and Ruby.

Mapper Phase Code

#!/usr/bin/python

import sys

# Input takes from standard input
for myline in sys.stdin:
   # Remove whitespace either side
   myline = myline.strip()
   # Break the line into words
   words = myline.split()
   # Iterate the words list
   for myword in words:
      # Write the results to standard output
      print '%s\t%s' % (myword, 1)

Make sure this file has execution permission (chmod +x /home/expert/hadoop-1.2.1/mapper.py).

Reducer Phase Code

#!/usr/bin/python

from operator import itemgetter
import sys

current_word = ""
current_count = 0
word = ""

# Input takes from standard input
for myline in sys.stdin:
   # Remove whitespace either side
   myline = myline.strip()
   # Split the input we got from mapper.py
   word, count = myline.split('\t', 1)
   # Convert count variable to integer
   try:
      count = int(count)
   except ValueError:
      # Count was not a number, so silently ignore this line
      continue
   if current_word == word:
      current_count += count
   else:
      if current_word:
         # Write result to standard output
         print '%s\t%s' % (current_word, current_count)
      current_count = count
      current_word = word

# Do not forget to output the last word if needed!
if current_word == word:
   print '%s\t%s' % (current_word, current_count)

Save the mapper and reducer codes in mapper.py and reducer.py in the Hadoop home directory. Make sure these files have execution permission (chmod +x mapper.py and chmod +x reducer.py). As Python is indentation sensitive, the same code can be downloaded from the link below.

Execution of WordCount Program

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input input_dirs \
   -output output_dir \
   -mapper <path>/mapper.py \
   -reducer <path>/reducer.py

Where "" is used for range continuation for clear readpossible.

For Example,

./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -input myinput -output myoutput -mapper /home/expert/hadoop-1.2.1/mapper.py -reducer /home/expert/hadoop-1.2.1/reducer.py

How Streaming Works

In the above example, both the mapper and the reducer are Python scripts that read the input from standard input and emit the output to standard output. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.

When a script is specified for mappers, each mapper task will launch the script as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the mapper collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) will be the value. If there is no tab character in the line, then the entire line is considered as the key and the value is null. However, this can be customized, as per one's need.

When a script is specified for reducers, each reducer task will launch the script as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the reducer collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized as per specific requirements.
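Because both scripts simply read from standard input and write to standard output, you can test them locally before submitting the job to Hadoop. The sketch below is only an illustration, assuming the scripts were saved as /home/expert/hadoop-1.2.1/mapper.py and reducer.py as described above and that /usr/bin/python is Python 2 (which the print syntax requires); the sort command stands in for Hadoop's shuffle-and-sort phase.

$ echo "deer bear river car car river deer" | \
   /home/expert/hadoop-1.2.1/mapper.py | sort -k1,1 | \
   /home/expert/hadoop-1.2.1/reducer.py

If everything is wired up correctly, this prints one tab-separated line per word with its count (for example, car 2 and river 2), which is what the reducer will emit when running on the cluster.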

Important Commands

Parameters Options Description
-input directory/file-name Required Input location for the mapper.
-output directory-name Required Output location for the reducer.
-mapper executable or script or JavaClassName Required Mapper executable.
-reducer executable or script or JavaClassName Required Reducer executable.
-file file-name Optional Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName Optional Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default.
-outputformat JavaClassName Optional Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default.
-partitioner JavaClassName Optional Class that determines which reduce a key is sent to.
-combiner streamingCommand or JavaClassName Optional Combiner executable for map output.
-cmdenv name=value Optional Passes the environment variable to streaming commands.
-inputreader Optional For backwards-compatibility: specifies a record reader class (instead of an input format class).
-verbose Optional Verbose output.
-lazyOutput Optional Creates output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write).
-numReduceTasks Optional Specifies the number of reducers.
-mapdebug Optional Script to call when a map task fails.
-reducedebug Optional Script to call when a reduce task fails.
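To illustrate how a few of these options fit together, the sketch below extends the earlier word-count invocation; it is not part of the original example. The -file options ship the local scripts to the compute nodes so that the mapper and reducer can be referenced by file name, and -numReduceTasks requests two reducers. The paths follow the placeholder layout used above and should be adjusted to your installation.

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input input_dirs \
   -output output_dir \
   -mapper mapper.py \
   -reducer reducer.py \
   -file /home/expert/hadoop-1.2.1/mapper.py \
   -file /home/expert/hadoop-1.2.1/reducer.py \
   -numReduceTasks 2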

Hadoop – Multi Node Cluster

This chapter explains the setup of a Hadoop multi-node cluster in a distributed environment.

As the whole cluster cannot be demonstrated, we are explaining the Hadoop cluster environment using three systems (one master and two slaves); given below are their IP addresses.

  • Hadoop Master: 192.168.1.15 (hadoop-master)
  • Hadoop Slave: 192.168.1.16 (hadoop-slave-1)
  • Hadoop Slave: 192.168.1.17 (hadoop-slave-2)

Follow the steps given below to set up the Hadoop multi-node cluster.

Installing Java

Java is the main prerequisite for Hadoop. First of all, you should verify the existence of Java in your system using “java -version”. The syntax of the java version command is given below.

$ java -version

If everything works fine, it will give you the following output.

java version "1.7.0_71" 
Java(TM) SE Runtime Environment (build 1.7.0_71-b13) 
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If Java is not installed in your system, then follow the given steps for installing Java.

Step 1

Download Java (JDK – X64.tar.gz) by visiting the following link: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

Then jdk-7u71-linux-x64.tar.gz will be downloaded to your system.

Step 2

Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-Linux-x64.gz
$ tar zxf jdk-7u71-Linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-Linux-x64.gz

Step 3

To make Java available to all the users, you have to move it to the location “/usr/local/”. Open root, and type the following commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit

Step 4

For setting up the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
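Since ~/.bashrc is only read by new shells, you may also want to apply the changes to the current session. The two commands below are an optional convenience step, not part of the original listing; the second simply verifies that the variable is set.

$ source ~/.bashrc
$ echo $JAVA_HOME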

Now verify the java -version command from the terminal as explained above. Follow the above process and install Java on all your cluster nodes.

Creating User Account

Create a system user account on both master and slave systems to use the Hadoop installation.

# useradd hadoop 
# passwd hadoop

Mapping the nodes

You have to edit the hosts file in the /etc/ folder on all nodes and specify the IP address of each system followed by its host name.

# vi /etc/hosts
Enter the following lines in the /etc/hosts file.
192.168.1.109 hadoop-master 
192.168.1.145 hadoop-slave-1 
192.168.56.1 hadoop-slave-2

Configuring Key Based Login

Set up ssh on every node such that they can communicate with one another without any prompt for a password.

# su hadoop 
$ ssh-keygen -t rsa 
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master 
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1 
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2 
$ chmod 0600 ~/.ssh/authorized_keys 
$ exit

Installing Hadoop

On the master server, download and install Hadoop using the following commands.

# mkdir /opt/hadoop 
# cd /opt/hadoop/ 
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.1/hadoop-1.2.0.tar.gz 
# tar -xzf hadoop-1.2.0.tar.gz 
# mv hadoop-1.2.0 hadoop
# chown -R hadoop /opt/hadoop 
# cd /opt/hadoop/hadoop/

Configuring Hadoop

You have to configure the Hadoop server by making the changes given below.

core-site.xml

Open the core-site.xml file and edit it as shown below.

<configuration>
   <property> 
      <name>fs.default.name</name> 
      <value>hdfs://hadoop-master:9000/</value> 
   </property> 
   <property> 
      <name>dfs.permissions</name> 
      <value>false</value> 
   </property> 
</configuration>

hdfs-site.xml

Open the hdfs-site.xml file and edit it as shown below.

<configuration>
   <property> 
      <name>dfs.data.dir</name> 
      <value>/opt/hadoop/hadoop/dfs/name/data</value> 
      <final>true</final> 
   </property> 

   <property> 
      <name>dfs.name.dir</name> 
      <value>/opt/hadoop/hadoop/dfs/name</value> 
      <final>true</final> 
   </property> 

   <property> 
      <name>dfs.replication</name> 
      <value>1</value> 
   </property> 
</configuration>

mapred-site.xml

Open the mapred-site.xml file and edit it as shown below.

<configuration>
   <property> 
      <name>mapred.job.tracker</name> 
      <value>hadoop-master:9001</value> 
   </property> 
</configuration>

hadoop-env.sh

Open the hadoop-env.sh file and edit JAVA_HOME, HADOOP_CONF_DIR, and HADOOP_OPTS as shown below.

Note: Set the JAVA_HOME as per your system configuration.

export JAVA_HOME=/opt/jdk1.7.0_17
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf

Installing Hadoop on Slave Servers

Install Hadoop on all the slave servers by following the given commands.

# su hadoop 
$ cd /opt/hadoop 
$ scp -r hadoop hadoop-slave-1:/opt/hadoop 
$ scp -r hadoop hadoop-slave-2:/opt/hadoop

Configuring Hadoop on Master Server

Open the master server and configure it by following the given commands.

# su hadoop 
$ cd /opt/hadoop/hadoop

Configuring Master Node

$ vi etc/hadoop/masters
hadoop-master

Configuring Slave Node

$ vi etc/hadoop/slaves
hadoop-slave-1 
hadoop-slave-2

Format Name Node on Hadoop Master

# su hadoop 
$ cd /opt/hadoop/hadoop 
$ bin/hadoop namenode -format
11/10/14 10:58:07 INFO namenode.NameNode: STARTUP_MSG: /************************************************************ 
STARTUP_MSG: Starting NameNode 
STARTUP_MSG: host = hadoop-master/192.168.1.109 
STARTUP_MSG: args = [-format] 
STARTUP_MSG: version = 1.2.0 
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1479473; compiled by 'hortonfo' on Mon May 6 06:59:37 UTC 2013 
STARTUP_MSG: java = 1.7.0_71 ************************************************************/ 11/10/14 10:58:08 INFO util.GSet: Computing capacity for map BlocksMap editlog=/opt/hadoop/hadoop/dfs/name/current/edits
………………………………………………….
………………………………………………….
…………………………………………………. 11/10/14 10:58:08 INFO common.Storage: Storage directory /opt/hadoop/hadoop/dfs/name has been successfully formatted. 11/10/14 10:58:08 INFO namenode.NameNode: 
SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.1.15 ************************************************************/

Starting Hadoop Services

The following command is to start all the Hadoop services on the Hadoop-Master.

$ cd $HADOOP_HOME/sbin
$ start-all.sh
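As an optional sanity check (an addition, not part of the original tutorial), you can run jps on the master once the scripts finish. On a Hadoop 1.x master configured as above, you would expect NameNode, SecondaryNameNode, and JobTracker to be listed, while the slave machines run DataNode and TaskTracker.

$ jps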

Adding a New DataNode in the Hadoop Cluster

Given below are the steps to be followed for adding new nodes to a Hadoop cluster.

Networking

Add new nodes to an existing Hadoop cluster with some appropriate network configuration. Assume the following network configuration.

For the new node configuration:

IP address : 192.168.1.103 
netmask : 255.255.255.0
hostname : slave3.in

Adding User and SSH Access

Add a User

On the new node, add the "hadoop" user and set the password of the Hadoop user to "hadoop123" or anything you want by using the following commands.

useradd hadoop
passwd hadoop

Set up password-less connectivity from the master to the new slave.

Execute the following on the master

mkdir -p $HOME/.ssh 
chmod 700 $HOME/.ssh 
ssh-keygen -t rsa -P '' -f $HOME/.ssh/id_rsa 
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 
chmod 644 $HOME/.ssh/authorized_keys
Copy the public key to the new slave node in the hadoop user's $HOME directory
scp $HOME/.ssh/id_rsa.pub hadoop@192.168.1.103:/home/hadoop/

Execute the following on the slaves

Login as hadoop. If not, login to the hadoop user.

su hadoop or ssh -X hadoop@192.168.1.103

Copy the content of the public key into the file "$HOME/.ssh/authorized_keys" and then change the permission for the same by executing the following commands.

cd $HOME
mkdir -p $HOME/.ssh 
chmod 700 $HOME/.ssh
cat id_rsa.pub >>$HOME/.ssh/authorized_keys 
chmod 644 $HOME/.ssh/authorized_keys

Check the ssh login from the master machine. Now check if you can ssh to the new node without a password from the master.

ssh hadoop@192.168.1.103 or hadoop@slave3.in

Set Hostname of New Node

You can set the hostname in the file /etc/sysconfig/network.

On the new slave3 machine:
NETWORKING=yes 
HOSTNAME=slave3.in

To make the changes effective, either restart the machine or run the hostname command on the new machine with the respective hostname (restart is a good option).

On slave3 node machine:

hostname slave3.in

Update /etc/hosts on all machines of the cluster with the following lines:

192.168.1.102 slave3.in slave3

Now try to ping the machine with hostnames to check whether it is resolving to an IP or not.

On the new node machine:

ping master.in

Start the DataNode on New Node

Start the datanode daemon manually using the $HADOOP_HOME/bin/hadoop-daemon.sh script. It will automatically contact the master (NameNode) and join the cluster. We also need to add the new node to the conf/slaves file on the master server, as sketched below, so that the script-based commands will recognize the new node.
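A minimal way to do that registration is shown below; it assumes the configuration directory set earlier in HADOOP_CONF_DIR (/opt/hadoop/hadoop/conf) and the hostname chosen for the new node, so adjust both to your setup.

$ echo "slave3.in" >> /opt/hadoop/hadoop/conf/slaves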

Login to the new node

su hadoop or ssh -X hadoop@slave3.in

Start HDFS on the newly added slave node by using the following command.

./bin/hadoop-daemon.sh start datanode

Check the output of the jps command on the new node. It looks as follows.

$ jps
7141 DataNode
10312 Jps

Removing a DataNode from the Hadoop Cluster

We can remove a node from a cluster on the fly, while it is running, without any data loss. HDFS provides a decommissioning feature, which ensures that removing a node is performed safely. To use it, follow the steps given below:

Step 1

Login to master.

Login to the master machine user where Hadoop is installed.

$ su hadoop

Step 2

Change cluster configuration.

An exclude file must be configured before starting the cluster. Add a key named dfs.hosts.exclude to our $HADOOP_HOME/etc/hadoop/hdfs-site.xml file. The value associated with this key provides the full path to a file on the NameNode's local file system which contains a list of machines that are not permitted to connect to HDFS.

For example, add these lines to the etc/hadoop/hdfs-site.xml file.

<property> 
   <name>dfs.hosts.exclude</name> 
   <value>/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt</value> 
   <description>DFS exclude</description> 
</property>

Step 3

Determine hosts to decommission.

Each machine to be decommissioned should be added to the file identified by hdfs_exclude.txt, one domain name per line. This will prevent them from connecting to the NameNode. The content of the "/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt" file is shown below, if you want to remove DataNode2.

slave2.in

Step 4

Force configuration reload.

Run the command "$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes" without the quotes.

$ $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes

This will force the NameNode to re-read its configuration, including the newly updated ‘excludes’ file. It will decommission the nodes over a period of time, allowing time for each node's blocks to be replicated onto machines that are scheduled to remain active.

On slave2.in, check the jps command output. After some time, you will see that the DataNode process is shut down automatically.
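For reference, the check might look roughly like the following on slave2.in once decommissioning has finished; the process IDs are arbitrary, and the point is only that DataNode no longer appears in the list (the TaskTracker is unaffected by HDFS decommissioning, as noted later).

$ jps
7082 TaskTracker
11121 Jps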

Step 5

Shutdown nodes.

After the decommission process has been completed, the decommissioned hardware can be safely shut down for maintenance. Run the report command to dfsadmin to check the status of the decommission. The following command will describe the status of the decommissioned node and the nodes connected to the cluster.

$ $HADOOP_HOME/bin/hadoop dfsadmin -report

Step 6

Edit excludes file again.

Once the machines have been decommissioned, they can be removed from the ‘excludes’ file. Running "$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes" again will read the excludes file back into the NameNode, allowing the DataNodes to rejoin the cluster after the maintenance has been completed, or when additional capacity is needed in the cluster again.
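A minimal sketch of that clean-up, assuming the exclude file configured in Step 2, is to empty the file and refresh the NameNode once more; the lone ">" simply truncates the file to zero length.

$ > /home/hadoop/hadoop-1.2.1/hdfs_exclude.txt
$ $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes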

Special Note: If the above process is followed and the tasktracker process is still running on the node, it needs to be shut down. One way is to disconnect the machine as we did in the above steps. The master will recognize the process automatically and declare it as dead. There is no need to follow the same process for removing the tasktracker, because it is not as crucial as the DataNode. The DataNode contains the data that you want to remove safely without any loss of data.

The tasktracker can be stopped/started on the fly using the following commands at any point of time.

$ $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
$ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
