Apache Pig


Apache Pig – Overview

What is Apache Pig?

Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to analyze large sets of data representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.

To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators using which programmers can develop their own functions for reading, writing, and processing data.

To analyze data using Apache Pig, programmers need to write scripts using the Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
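For instance, a minimal script along the following lines (a sketch; the file name, delimiter, and field names are illustrative, not part of this tutorial) is translated by the Pig Engine into one or more MapReduce jobs:

people = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER people BY age >= 18;
DUMP adults;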

Why Do We Need Apache Pig?

Programmers who are not so good at Java normally used to struggle working with Hadoop, especially while performing any MapReduce tasks. Apache Pig is a boon for all such programmers.

  • Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex codes in Java.

  • Apache Pig uses multi-query approach, thereby reducing the length of codes. For example, an operation that would require you to type 200 lines of code (LoC) in Java can be easily done by typing as few as just 10 LoC in Apache Pig. Ultimately, Apache Pig reduces the development time by almost 16 times (see the word-count sketch after this list).

  • Pig Latin is a SQL-like language and it is easy to learn Apache Pig when you are familiar with SQL.

  • Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.

Features of Pig

Apache Pig comes with the following features −

  • Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.

  • Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL.

  • Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on the semantics of the language.

  • Extensibility − Using the existing operators, users can develop their own functions to read, process, and write data.

  • UDF’s − Pig provides the facility to create User-defined Functions in other programming languages such as Java and invoke or embed them in Pig Scripts.

  • Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well as unstructured. It stores the results in HDFS.

Apache Pig Vs MapReduce

Listed below are the major differences between Apache Pig and MapReduce.

Apache Pig MapReduce
Apache Pig is a data flow language. MapReduce is a data processing paradigm.
It is a high-level language. MapReduce is low level and rigid.
Performing a Join operation in Apache Pig is pretty easy. It is very difficult in MapReduce to perform a Join operation between datasets.
Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig. Exposure to Java is a must to work with MapReduce.
Apache Pig uses multi-query approach, thereby reducing the length of the codes to a great extent. MapReduce will require almost 20 times more lines to perform the same task.
There is no need for compilation. On execution, every Apache Pig operator is converted internally into a MapReduce job. MapReduce jobs have a long compilation process.

Apache Pig Vs SQL

Listed below are the major differences between Apache Pig and SQL.

Pig SQL
Pig Latin is a procedural language. SQL is a declarative language.
In Apache Pig, schema is optional. We can store data without designing a schema (values are stored as $0, $1 etc.). Schema is mandatory in SQL.
The data model in Apache Pig is nested relational. The data model used in SQL is flat relational.
Apache Pig provides limited opportunity for query optimization. There is more opportunity for query optimization in SQL.

In addition to the above differences, Apache Pig Latin −

  • Allows splits in the pipeline.
  • Allows developers to store data anywhere in the pipeline.
  • Declares execution plans.
  • Provides operators to perform ETL (Extract, Transform, and Load) functions.
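The procedural style is easiest to see side by side with SQL. The following sketch (the relation and field names are illustrative) builds the result step by step, whereas SQL would state it in one declarative query, such as SELECT id, SUM(amount) FROM orders WHERE amount > 100 GROUP BY id;

orders  = LOAD 'orders.txt' USING PigStorage(',') AS (id:int, amount:int);
big     = FILTER orders BY amount > 100;
grouped = GROUP big BY id;
totals  = FOREACH grouped GENERATE group, SUM(big.amount);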

Apache Pig Vs Hive

Both Apache Pig and Hive are used to create MapReduce jobs. And in some cases, Hive operates on HDFS in a similar way Apache Pig does. In the following table, we have listed a few significant points that set Apache Pig apart from Hive.

Apache Pig Hive
Apache Pig uses a language called Pig Latin. It was originally created at Yahoo. Hive uses a language called HiveQL. It was originally created at Facebook.
Pig Latin is a data flow language. HiveQL is a query processing language.
Pig Latin is a procedural language and it fits in the pipeline paradigm. HiveQL is a declarative language.
Apache Pig can handle structured, unstructured, and semi-structured data. Hive is mostly for structured data.

Applications of Apache Pig

Apache Pig is generally used by data scientists for performing tasks involving ad-hoc processing and quick prototyping. Apache Pig is used −

  • To process huge data sources such as web logs.
  • To perform data processing for search platforms.
  • To process time-sensitive data loads.

Apache Pig – History

In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and execute MapReduce jobs on every dataset. In 2007, Apache Pig was open sourced via Apache incubator. In 2008, the first release of Apache Pig came out. In 2010, Apache Pig graduated as an Apache top-level project.

Apache Pig – Architecture

The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.

To perform a particular task using Pig, programmers need to write a Pig script using the Pig Latin language, and execute it using any of the execution mechanisms (Grunt Shell, UDFs, Embedded). After execution, these scripts will go through a series of transformations applied by the Pig Framework, to produce the desired output.

Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it makes the programmer’s job easy. The architecture of Apache Pig is shown below.

Apache Pig Architecture

Apache Pig Components

As shown in the figure, there are various components in the Apache Pig framework. Let us take a look at the major components.

Parser

Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer

The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations such as projection and pushdown.

Compiler

The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine

Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, and these MapReduce jobs are executed on Hadoop, producing the desired results.

Pig Latin Data Model

The data model of Pig Latin is fully nested and it allows complex non-atomic datatypes such as map and tuple. Given below is the diagrammatical representation of Pig Latin’s data model.

Data Model

Atom

Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as string and number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.

Example − ‘raja’ or ‘30’

Tuple

A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of RDBMS.

Example − (Raja, 30)

Bag

A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not required that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as an inner bag.

Example − {Raja, 30, {9848022338, raja@gmail.com,}}

Map

A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type. It is represented by ‘[]’.

Example − [name#Raja, age#30]

Relation

A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
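All of these types can appear together in a schema. The following Grunt sketch (the file and field names are illustrative) declares a relation whose fields are an atom, a tuple, a bag, and a map:

grunt> details = LOAD 'details.txt' AS (name:chararray,
   address:tuple(street:chararray, city:chararray),
   phones:bag{t:tuple(phone:chararray)},
   info:map[]);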

Apache Pig – Installation

This chapter explains how to download, install, and set up Apache Pig in your system.

Prerequisites

It is essential that you have Hadoop and Java installed on your system before you go for Apache Pig. Therefore, prior to installing Apache Pig, install Hadoop and Java by following the steps given in the following link −

https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm

Download Apache Pig

First of all, download the latest version of Apache Pig from the following website − https://pig.apache.org/

Step 1

Open the homepage of the Apache Pig website. Under the section News, click on the link release page as shown in the following snapshot.

Home Page

Step 2

On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under the Download section, you will have two links, namely, Pig 0.8 and later and Pig 0.7 and before. Click on the link Pig 0.8 and later, then you will be redirected to the page having a set of mirrors.

Apache Pig Releases

Step 3

Choose and click any one of these mirrors as shown below.

Click Mirrors

Step 4

These mirrors will take you to the Pig Releases page. This page contains various versions of Apache Pig. Click the latest version among them.

Pig Release

Step 5

Within these folders, you will have the source and binary files of Apache Pig in various distributions. Download the tar files of the source and binary files of Apache Pig 0.15, pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.

Index

Install Apache Pig

After downloading the Apache Pig software, install it in your Linux environment by following the steps given below.

Step 1

Create a directory with the name Pig in the same directory where the installation directories of Hadoop, Java, and other software were installed. (In our tutorial, we have created the Pig directory in the user named Hadoop.)

$ mkdir Pig

Step 2

Extract the downloaded tar files as shown below.

$ cd Downloads/ 
$ tar zxvf pig-0.15.0-src.tar.gz 
$ tar zxvf pig-0.15.0.tar.gz 

Step 3

Move the content of pig-0.15.0-src.tar.gz file to the Pig directory created earlier as shown below.

$ mv pig-0.15.0-src.tar.gz/* /home/Hadoop/Pig/

Configure Apache Pig

After installing Apache Pig, we have to configure it. To configure, we need to edit two files − bashrc and pig.properties.

.bashrc file

In the .bashrc file, set the following variables −

  • PIG_HOME folder to the Apache Pig’s installation folder,

  • PATH environment variable to the bin folder, and

  • PIG_CLASSPATH environment variable to the etc (configuration) folder of your Hadoop installations (the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).

export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
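Then reload the file so the variables take effect in the current session, and verify them (a quick sanity check; the path shown assumes the installation location used above):

$ source ~/.bashrc
$ echo $PIG_HOME
/home/Hadoop/Pig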

pig.properties file

In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various parameters as given below.

pig -h properties

The following properties are supported −

Logging: verbose = true|false; default is false. This property is the same as -v
       switch brief=true|false; default is false. This property is the same
       as -b switch debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO.
       This property is the same as -d switch aggregate.warning = true|false; default is true.
       If true, prints count of warnings of every type rather than logging each warning.

Performance tuning: pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).
       Note that this memory is shared across all large bags used by the application.
       pig.skewedjoin.reduce.memusage=<mem fraction>; default is 0.3 (30% of all memory).
       Specifies the fraction of heap available for the reducer to perform the join.
       pig.exec.nocombiner = true|false; default is false.
           Only disable combiner as a temporary workaround for problems.
       opt.multiquery = true|false; multiquery is on by default.
           Only disable multiquery as a temporary workaround for problems.
       opt.fetch=true|false; fetch is on by default.
           Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
       pig.tmpfilecompression = true|false; compression is off by default.
           Determines whether output of intermediate jobs is compressed.
       pig.tmpfilecompression.codec = lzo|gzip; default is gzip.
           Used in conjunction with pig.tmpfilecompression. Defines compression type.
       pig.noSplitCombination = true|false. Split combination is on by default.
           Determines if multiple small files are combined into a single map.

       pig.exec.mapPartAgg = true|false. Default is false.
           Determines if partial aggregation is done within map phase, before records are sent to combiner.
       pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
           If the in-map partial aggregation does not reduce the output num records by this factor, it gets disabled.

Miscellaneous: exectype = mapreduce|tez|local; default is mapreduce. This property is the same as -x switch
       pig.additional.jars.uris=<comma separated list of jars>. Used in place of register command.
       udf.import.list=<comma separated list of imports>. Used to avoid package names in UDF.
       stop.on.failure = true|false; default is false. Set to true to terminate on the first error.
       pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
           Determines the timezone used to handle datetime datatype and UDFs.
Additionally, any Hadoop property can be specified.

Verifying the Installation

Verify the installation of Apache Pig by typing the version command. If the installation is successful, you will get the version of Apache Pig as shown below.

$ pig -version

Apache Pig version 0.15.0 (r1682971)
compiled Jun 01 2015, 11:44:35

Apache Pig – Execution

In the previous chapter, we explained how to install Apache Pig. In this chapter, we will discuss how to execute Apache Pig.

Apache Pig Execution Modes

You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.

Local Mode

In this mode, all the files are installed and run from your local host and local file system. There is no need of Hadoop or HDFS. This mode is generally used for testing purposes.

MapReduce Mode

MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.

Apache Pig Execution Mechanisms

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.

  • Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using the Dump operator).

  • Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin script in a single file with the .pig extension.

  • Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions (User Defined Functions) in programming languages such as Java, and using them in our script.

Invoking the Grunt Shell

You can invoke the Grunt shell in a desired mode (local/MapReduce) using the -x option as shown below.

Local mode MapReduce mode

Command −

$ ./pig -x local

Command −

$ ./pig -x mapreduce

Output

Local Mode Output

Output

MapReduce Mode Output

Either of these commands gives you the Grunt shell prompt as shown below.

grunt>

You can exit the Grunt shell using ‘ctrl + d’.

After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin statements in it.

grunt> customers = LOAD 'customers.txt' USING PigStorage(',');

Executing Apache Pig in Batch Mode

You can write an entire Pig Latin script in a file and execute it using the -x option. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.

Sample_script.pig

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
   PigStorage(',') as (id:int,name:chararray,city:chararray);

Dump student;

Now, you can execute the script in the above file as shown below.

Local mode MapReduce mode
$ pig -x local Sample_script.pig $ pig -x mapreduce Sample_script.pig

Note − We will discuss in detail how to run a Pig script in Batch mode and in embedded mode in subsequent chapters.

Apache Pig – Grunt Shell

After invoking the Grunt shell, you can run your Pig scripts in the shell. In addition to that, there are certain useful shell and utility commands provided by the Grunt shell. This chapter explains the shell and utility commands provided by the Grunt shell.

Note − In some portions of this chapter, commands like Load and Store are used. Refer to the respective chapters to get detailed information on them.

Shell Commands

The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. Prior to that, we can invoke any shell commands using sh and fs.

sh Command

Using the sh command, we can invoke any shell commands from the Grunt shell. Using the sh command from the Grunt shell, we cannot execute the commands that are a part of the shell environment (ex − cd).

Syntax

Given below is the syntax of sh command.

grunt> sh shell command parameters

Example

We can invoke the ls command of the Linux shell from the Grunt shell using the sh option as shown below. In this example, it lists out the files in the /pig/bin/ directory.

grunt> sh ls
   
pig 
pig_1444799121955.log 
pig.cmd 
pig.py

fs Command

Using the fs command, we can invoke any FsShell commands from the Grunt shell.

Syntax

Given below is the syntax of fs command.

grunt> fs File System command parameters

Example

We can invoke the ls command of HDFS from the Grunt shell using the fs command. In the following example, it lists the files in the HDFS root directory.

grunt> fs -ls
  
Found 3 items
drwxrwxrwx   - Hadoop supergroup          0 2015-09-08 14:13 Hbase
drwxr-xr-x   - Hadoop supergroup          0 2015-09-09 14:52 seqgen_data
drwxr-xr-x   - Hadoop supergroup          0 2015-09-08 11:30 twitter_data

In the same way, we can invoke all the other file system shell commands from the Grunt shell using the fs command.

Utility Commands

The Grunt shell provides a set of utility commands. These include utility commands such as clear, help, history, quit, and set; and commands such as exec, kill, and run to control Pig from the Grunt shell. Given below is the description of the utility commands provided by the Grunt shell.

clear Command

The clear command is used to clear the screen of the Grunt shell.

Syntax

You can clear the screen of the Grunt shell using the clear command as shown below.

grunt> clear

help Command

The help command gives you a list of Pig commands or Pig properties.

Usage

You can get a list of Pig commands using the help command as shown below.

grunt> help

Commands: <pig latin statement>; - See the PigLatin manual for details:
http://hadoop.apache.org/pig

File system commands:fs <fs arguments> - Equivalent to Hadoop dfs command:
http://hadoop.apache.org/common/docs/current/hdfs_shell.html

Diagnostic Commands:describe <alias>[::<alias>] - Show the schema for the alias.
Inner aliases can be described as A::B.
    explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml]
       [-param <param_name>=<param_value>]
       [-param_file <file_name>] [<alias>] -
       Show the execution plan to compute the alias or for entire script.
       -script - Explain the entire script.
       -out - Store the output into directory rather than print to stdout.
       -brief - Don't expand nested plans (presenting a smaller graph for overview).
       -dot - Generate the output in .dot format. Default is text format.
       -xml - Generate the output in .xml format. Default is text format.
       -param <param_name - See parameter substitution for details.
       -param_file <file_name> - See parameter substitution for details.
       alias - Alias to explain.
       dump <alias> - Compute the alias and writes the results to stdout.

Utility Commands: exec [-param <param_name>=param_value] [-param_file <file_name>] <script> -
       Execute the script with access to grunt environment including aliases.
       -param <param_name - See parameter substitution for details.
       -param_file <file_name> - See parameter substitution for details.
       script - Script to be executed.
    run [-param <param_name>=param_value] [-param_file <file_name>] <script> -
       Execute the script with access to grunt environment.
       -param <param_name - See parameter substitution for details.
       -param_file <file_name> - See parameter substitution for details.
       script - Script to be executed.
    sh  <shell command> - Invoke a shell command.
    kill <job_id> - Kill the hadoop job specified by the hadoop job id.
    set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.
       The following keys are supported:
       default_parallel - Script-level reduce parallelism. Basic input size heuristics used
       by default.
       debug - Set debug on or off. Default is off.
       job.name - Single-quoted name for jobs. Default is PigLatin:<script name>
       job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high.
       Default is normal stream.skippath - String that contains the path.
       This is used by streaming any hadoop property.
    help - Display this message.
    history [-n] - Display the list statements in cache.
       -n Hide line numbers.
    quit - Quit the grunt shell.

history Command

This command displays a list of statements executed so far since the Grunt shell was invoked.

Usage

Assume we have executed three statements since opening the Grunt shell.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');

grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');
 

Then, using the history command will produce the following output.

grunt> history

customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');

orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');
 

set Command

The set command is used to show/assign values to keys used in Pig.

Usage

Using this command, you can set values to the following keys.

Key Description and values
default_parallel You can set the number of reducers for a map job by passing any whole number as a value to this key.
debug You can turn the debugging feature in Pig off or on by passing on/off to this key.
job.name You can set the Job name to the required job by passing a string value to this key.
job.priority

You can set the job priority to a job by passing one of the following values to this key −

  • very_low
  • low
  • normal
  • high
  • very_high
stream.skippath For streaming, you can set the path from where the data is not to be transferred, by passing the desired path in the form of a string to this key (see the usage sketch after this table).
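For example, the following Grunt session (a sketch; the job name is illustrative) assigns values to three of these keys:

grunt> set default_parallel 10
grunt> set job.name 'my-pig-job'
grunt> set job.priority high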

quit Command

You can quit from the Grunt shell using this command.

Usage

Quit from the Grunt shell as shown below.

grunt> quit

Let us now take a look at the commands using which you can control Apache Pig from the Grunt shell.

exec Command

Using the exec command, we can execute Pig scripts from the Grunt shell.

Syntax

Given below is the syntax of the utility command exec.

grunt> exec [-param param_name = param_value] [-param_file file_name] [script]

Example

Let us assume there is a file named student.txt in the /pig_data/ directory of HDFS with the following content.

Student.txt

001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi

And, assume we have a script file named sample_script.pig in the /pig_data/ directory of HDFS with the following content.

Sample_script.pig

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',')
   as (id:int,name:chararray,city:chararray);

Dump student;

Now, let us execute the above script from the Grunt shell using the exec command as shown below.

grunt> exec /sample_script.pig

Output

The exec command executes the script in the sample_script.pig. As directed in the script, it loads the student.txt file into Pig and gives you the result of the Dump operator displaying the following content.

(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi) 

kill Command

You can kill a job from the Grunt shell using this command.

Syntax

Given below is the syntax of the kill command.

grunt> kill JobId

Example

Suppose there is a running Pig job having the id Id_0055; you can kill it from the Grunt shell using the kill command, as shown below.

grunt> kill Id_0055

run Command

You can run a Pig script from the Grunt shell using the run command.

Syntax

Given below is the syntax of the run command.

grunt> run [-param param_name = param_value] [-param_file file_name] script

Example

Let us assume there is a file named student.txt in the /pig_data/ directory of HDFS with the following content.

Student.txt

001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi

And, assume we have a script file named sample_script.pig in the local filesystem with the following content.

Sample_script.pig

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
   PigStorage(',') as (id:int,name:chararray,city:chararray);

Now, let us run the above script from the Grunt shell using the run command as shown below.

grunt> run /sample_script.pig

You can see the output of the script using the Dump operator as shown below.

grunt> Dump;

(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)

Note − The difference between exec and the run command is that if we use run, the statements from the script are available in the command history.
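To see the difference in practice (a sketch assuming the sample_script.pig shown above), run the script and then inspect the history; after exec, the script's statements would not appear:

grunt> run /sample_script.pig
grunt> history

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
   PigStorage(',') as (id:int,name:chararray,city:chararray);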

Pig Latin – Basics

Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this chapter, we are going to discuss the basics of Pig Latin such as Pig Latin statements, data types, general and relational operators, and Pig Latin UDF’s.

Pig Latin – Data Model

As discussed in the previous chapters, the data model of Pig is fully nested. A Relation is the outermost structure of the Pig Latin data model. And it is a bag where −

  • A bag is a collection of tuples.
  • A tuple is an ordered set of fields.
  • A field is a piece of data.

Pig Latin – Statements

While processing data using Pig Latin, statements are the basic constructs.

  • These statements work with relations. They include expressions and schemas.

  • Every statement ends with a semicolon (;).

  • We will perform various operations using operators provided by Pig Latin, through statements.

  • Except LOAD and STORE, while performing all other operations, Pig Latin statements take a relation as input and produce another relation as output.

  • As soon as you enter a Load statement in the Grunt shell, its semantic checking will be carried out. To see the contents of the schema, you need to use the Dump operator. Only after performing the dump operation will the MapReduce job for loading the data into the file system be carried out.

Example

Given below is a Pig Latin statement, which loads data to Apache Pig.

grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') as
   ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Pig Latin – Data types

The table given below describes the Pig Latin data types.

S.N. Data Type Description & Example
1 int

Represents a signed 32-bit integer.

Example : 8

2 long

Represents a signed 64-bit integer.

Example : 5L

3 float

Represents a signed 32-bit floating point.

Example : 5.5F

4 double

Represents a 64-bit floating point.

Example : 10.5

5 chararray

Represents a character array (string) in Unicode UTF-8 format.

Example : ‘tutorials point’

6 Bytearray

Represents a Byte array (blob).

7 Boolean

Represents a Boolean value.

Example : true/ false.

8 Datetime

Represents a date-time.

Example : 1970-01-01T00:00:00.000+00:00

9 Biginteger

Represents a Java BigInteger.

Example : 60708090709

10 Bigdecimal

Represents a Java BigDecimal

Example : 185.98376256272893883

Complex Types
11 Tuple

A tuple is an ordered set of fields.

Example : (raja, 30)

12 Bag

A bag is a collection of tuples.

Example : {(raju,30),(Mohhammad,45)}

13 Map

A Map is a set of key-value pairs.

Example : [ ‘name’#’Raju’, ‘age’#30]

Null Values

Values for all the above data types can be NULL. Apache Pig treats null values in a similar way as SQL does.

A null can be an unknown value or a non-existent value. It is used as a placeholder for optional values. These nulls can occur naturally or can be the result of an operation.
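Pig Latin lets you test for nulls explicitly in expressions. A minimal sketch (assuming a student relation with a phone field, as in the later chapters):

grunt> valid = FILTER student BY phone IS NOT NULL;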

Pig Latin – Arithmetic Operators

The following table describes the arithmetic operators of Pig Latin. Suppose a = 10 and b = 20.

Operator Description Example
+

Addition − Adds values on either side of the operator

a + b will give 30

−

Subtraction − Subtracts right hand operand from left hand operand

a − b will give −10
*

Multiplication − Multiplies values on either side of the operator

a * b will give 200
/

Division − Divides left hand operand by right hand operand

b / a will give 2
%

Modulus − Divides left hand operand by right hand operand and returns the remainder

b % a will give 0
? :

Bincond − Evaluates the Boolean operators. It has three operands as shown below (see the usage sketch after this table).

variable x = (expression) ? value1 if true : value2 if false.

b = (a == 1)? 20: 30;

if a=1 the value of b is 20.

if a!=1 the value of b is 30.

CASE

WHEN

THEN

ELSE END

Case − The case operator is equivalent to nested bincond operator.

CASE f2 % 2

WHEN 0 THEN 'even'

WHEN 1 THEN 'odd'

END
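Both constructs are used inside expressions. The following sketch (a hypothetical relation numbers with an int field n) applies the bincond operator and the equivalent CASE form in a FOREACH:

grunt> labels = FOREACH numbers GENERATE n, (n % 2 == 0 ? 'even' : 'odd');
grunt> labels_case = FOREACH numbers GENERATE n, (CASE n % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END);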

Pig Latin – Comparison Operators

The folloearng table describes the comparison operators of Pig Latin.

Operator Description Example
==

Equal − Checks if the values of two operands are equal or not; if yes, then the condition becomes true.

(a = b) is not true
!=

Not Equal − Checks if the values of two operands are equal or not. If the values are not equal, then the condition becomes true.

(a != b) is true.
>

Greater than − Checks if the value of the left operand is greater than the value of the right operand. If yes, then the condition becomes true.

(a > b) is not true.
<

Less than − Checks if the value of the left operand is less than the value of the right operand. If yes, then the condition becomes true.

(a < b) is true.
>=

Greater than or equal to − Checks if the value of the left operand is greater than or equal to the value of the right operand. If yes, then the condition becomes true.

(a >= b) is not true.
<=

Less than or equal to − Checks if the value of the left operand is less than or equal to the value of the right operand. If yes, then the condition becomes true.

(a <= b) is true.
matches

Pattern matching − Checks whether the string on the left-hand side matches the constant on the right-hand side.

f1 matches '.*tutorial.*'

Pig Latin – Type Construction Operators

The folloearng table describes the Type construction operators of Pig Latin.

Operator Description Example
()

Tuple constructor operator − This operator is used to construct a tuple.

(Raju, 30)
{}

Bag constructor operator − This operator is used to construct a bag.

{(Raju, 30), (Mohammad, 45)}
[]

Map constructor operator − This operator is used to construct a map.

[name#Raja, age#30]

Pig Latin – Relational Operations

The folloearng table describes the relational operators of Pig Latin.

Operator Description
Loading and Storing
LOAD To Load the data from the file system (local/HDFS) into a relation.
STORE To save a relation to the file system (local/HDFS).
Filtering
FILTER To remove unwanted rows from a relation.
DISTINCT To remove duplicate rows from a relation.
FOREACH, GENERATE To generate data transformations based on columns of data.
STREAM To transform a relation using an external program.
Grouping and Joining
JOIN To join two or more relations.
COGROUP To group the data in two or more relations.
GROUP To group the data in a single relation.
CROSS To create the cross product of two or more relations.
Sorting
ORDER To arrange a relation in a sorted order based on one or more fields (ascending or descending).
LIMIT To get a limited number of tuples from a relation.
Combining and Splitting
UNION To combine two or more relations into a single relation.
SPLIT To split a single relation into two or more relations.
Diagnostic Operators
DUMP To print the contents of a relation on the console.
DESCRIBE To describe the schema of a relation.
EXPLAIN To view the logical, physical, or MapReduce execution plans to compute a relation.
ILLUSTRATE To view the step-by-step execution of a series of statements.
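Several of these operators chain naturally. A short sketch (assuming the student_data.txt file introduced in the next chapter) that counts students per city and keeps the three largest groups:

grunt> student = LOAD 'student_data.txt' USING PigStorage(',')
   AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
grunt> by_city = GROUP student BY city;
grunt> counts = FOREACH by_city GENERATE group AS city, COUNT(student) AS total;
grunt> ordered = ORDER counts BY total DESC;
grunt> top3 = LIMIT ordered 3;
grunt> Dump top3;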

Apache Pig – Reading Data

In general, Apache Pig works on top of Hadoop. It is an analytical tool that analyzes large datasets that exist in the Hadoop File System. To analyze data using Apache Pig, we have to initially load the data into Apache Pig. This chapter explains how to load data to Apache Pig from HDFS.

Preparing HDFS

In MapReduce mode, Pig reads (loads) data from HDFS and stores the results back in HDFS. Therefore, let us start HDFS and create the following sample data in HDFS.

Student ID First Name Last Name Phone City
001 Rajiv Reddy 9848022337 Hyderabad
002 siddarth Battacharya 9848022338 Kolkata
003 Rajesh Khanna 9848022339 Delhi
004 Preethi Agarwal 9848022330 Pune
005 Trupthi Mohanthy 9848022336 Bhuwaneshwar
006 Archana Mishra 9848022335 Chennai

The above dataset contains personal details like id, first name, last name, phone number and city, of six students.

Step 1: Verifying Hadoop

First of all, verify the installation using the Hadoop version command, as shown below.

$ hadoop version

If your system contains Hadoop, and if you have set the PATH variable, then you will get the following output −

Hadoop 2.6.0 
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1 
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar

Step 2: Starting HDFS

Browse through the sbin directory of Hadoop and start yarn and Hadoop dfs (distributed file system) as shown below.

cd /$Hadoop_Home/sbin/ 
$ start-dfs.sh 
localhost: starting namenode, logging to /home/Hadoop/hadoop/logs/hadoopHadoop-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/Hadoop/hadoop/logs/hadoopHadoop-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoopsecondarynamenode-localhost.localdomain.out

$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoopresourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /home/Hadoop/hadoop/logs/yarnHadoop-nodemanager-localhost.localdomain.out

Step 3: Create a Directory in HDFS

In Hadoop DFS, you can create directories using the command mkdir. Create a new directory in HDFS with the name Pig_Data in the required path as shown below.

$cd /$Hadoop_Home/bin/ 
$ hdfs dfs -mkdir hdfs://localhost:9000/Pig_Data 

Step 4: Placing the data in HDFS

The input file of Pig contains each tuple/record in individual lines. And the entities of the record are separated by a delimiter (in our example we used “,”).

In the local file system, create an input file student_data.txt containing data as shown below.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.

Now, move the file from the local file system to HDFS using the put command as shown below. (You can use the copyFromLocal command as well.)

$ cd $HADOOP_HOME/bin 
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/student_data.txt hdfs://localhost:9000/pig_data/

Verifying the file

You can use the cat command to verify whether the file has been moved into the HDFS, as shown below.

$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat hdfs://localhost:9000/pig_data/student_data.txt

Output

You can see the content of the file as shown below.

15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
  
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

The Load Operator

You can load data into Apache Pig from the file system (HDFS/Local) using the LOAD operator of Pig Latin.

Syntax

The load statement consists of two parts divided by the “=” operator. On the left-hand side, we need to mention the name of the relation where we want to store the data, and on the right-hand side, we have to define how we store the data. Given below is the syntax of the Load operator.

Relation_name = LOAD 'Input file path' USING function as schema;

Where,

  • relation_name − We have to mention the relation in which we want to store the data.

  • Input file path − We have to mention the HDFS directory where the file is stored (in MapReduce mode).

  • function − We have to choose a function from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).

  • Schema − We have to define the schema of the data. We can define the required schema as follows −

(column1 : data type, column2 : data type, column3 : data type);

Note − We can also load the data without specifying the schema. In that case, the columns will be addressed as $0, $1, etc.
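A brief sketch of such a schema-less load (the projection is illustrative); positional references start at $0 for the first field:

grunt> student = LOAD 'student_data.txt' USING PigStorage(',');
grunt> names = FOREACH student GENERATE $1, $2;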

Example

As an example, let us load the data in student_data.txt in Pig under the schema named Student using the LOAD command.

Start the Pig Grunt Shell

First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown below.

$ pig -x mapreduce

It will start the Pig Grunt shell as shown below.

15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

2015-10-01 12:33:38,080 [main] INFO  org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-01 12:33:38,080 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/Hadoop/pig_1443683018078.log
2015-10-01 12:33:38,242 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/Hadoop/.pigbootup not found
  
2015-10-01 12:33:39,630 [main]
INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
 
grunt>

Execute the Load Statement

Now load the data from the file student_data.txt into Pig by executing the following Pig Latin statement in the Grunt shell.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
   city:chararray );

Following is the description of the above statement.

Relation name We have stored the data in the schema student.
Input file path We are reading data from the file student_data.txt, which is in the /pig_data/ directory of HDFS.
Storage function We have used the PigStorage() function. It loads and stores data as structured text files. It takes as a parameter the delimiter using which each entity of a tuple is separated. By default, it takes ‘\t’ as a parameter.
schema

We have stored the data using the following schema.

column id firstname lastname phone city
datatype int char array char array char array char array

Note − The load statement will simply load the data into the specified relation in Pig. To verify the execution of the Load statement, you have to use the Diagnostic Operators which are discussed in the next chapters.

Apache Pig – Storing Data

In the previous chapter, we learnt how to load data into Apache Pig. You can store the loaded data in the file system using the store operator. This chapter explains how to store data in Apache Pig using the Store operator.

Syntax

Given below is the syntax of the Store statement.

STORE Relation_name INTO ' required_directory_path ' [USING function];

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
   city:chararray );

Now, let us store the relation in the HDFS directory “/pig_Output/” as shown below.

grunt> STORE student INTO ' hdfs://localhost:9000/pig_Output/ ' USING PigStorage (',');

Output

After executing the store statement, you will get the following output. A directory is created with the specified name and the data will be stored in it.

2015-10-05 13:05:05,429 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.
MapReduceLauncher - 100% complete
2015-10-05 13:05:05,429 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats -
Script Statistics:

HadoopVersion    PigVersion    UserId    StartedAt             FinishedAt             Features
2.6.0            0.15.0        Hadoop    2015-10-05 13:03:03   2015-10-05 13:05:05    UNKNOWN
Success!
Job Stats (time in seconds):
JobId          Maps    Reduces    MaxMapTime    MinMapTime    AvgMapTime    MedianMapTime
job_14459_06    1        0           n/a           n/a           n/a           n/a
MaxReduceTime    MinReduceTime    AvgReduceTime    MedianReducetime    Alias    Feature
     0                 0                0                0             student  MAP_ONLY
OutPut folder
hdfs://localhost:9000/pig_Output/

Input(s): Successfully read 0 records from: "hdfs://localhost:9000/pig_data/student_data.txt"
Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/pig_Output"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG: job_1443519499159_0006

2015-10-05 13:06:06,192 [main] INFO  org.apache.pig.backend.hadoop.executionengine
.mapReduceLayer.MapReduceLauncher - Success!

Verification

You can verify the stored data as shown below.

Step 1

First of all, list out the files in the directory named pig_Output using the ls command as shown below.

hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'
Found 2 items
rw-r--r-   1 Hadoop supergroup          0 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/_SUCCESS
rw-r--r-   1 Hadoop supergroup        224 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/part-m-00000

You can observe that two files were created after executing the store statement.

Step 2

Using the cat command, list the contents of the file named part-m-00000 as shown below.

$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000' 
1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai 

Apache Pig – Diagnostic Operators

The load statement will simply load the data into the specified relation in Apache Pig. To verify the execution of the Load statement, you have to use the Diagnostic Operators. Pig Latin provides four different types of diagnostic operators −

  • Dump operator
  • Describe operator
  • Explanation operator
  • Illustration operator

In this chapter, we will discuss the Dump operators of Pig Latin.

Dump Operator

The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is generally used for debugging purposes.

Syntax

Given below is the syntax of the Dump operator.

grunt> Dump Relation_Name

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
   city:chararray );

Now, let us print the contents of the relation using the Dump operator as shown below.

grunt> Dump student

Once you execute the above Pig Latin statement, it will start a MapReduce job to read data from HDFS. It will produce the following output.

2015-10-01 15:05:27,642 [main]
INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher -
100% complete
2015-10-01 15:05:27,652 [main]
INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion  PigVersion  UserId    StartedAt             FinishedAt       Features
2.6.0          0.15.0      Hadoop  2015-10-01 15:03:11  2015-10-01 05:27     UNKNOWN

Success!
Job Stats (time in seconds):

JobId           job_14459_0004
Maps                 1
Reduces              0
MaxMapTime          n/a
MinMapTime          n/a
AvgMapTime          n/a
MedianMapTime       n/a
MaxReduceTime        0
MinReduceTime        0
AvgReduceTime        0
MedianReducetime     0
Alias             student
Feature           MAP_ONLY
Outputs           hdfs://localhost:9000/tmp/temp580182027/tmp757878456,

Input(s): Successfully read 0 records from: "hdfs://localhost:9000/pig_data/
student_data.txt"

Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/tmp/temp580182027/
tmp757878456"

Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager
spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0

Job DAG: job_1443519499159_0004

2015-10-01 15:06:28,403 [main]
INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-10-01 15:06:28,441 [main] INFO  org.apache.pig.data.SchemaTupleBackend -
Key [pig.schematuple] was not set... will not generate code.
2015-10-01 15:06:28,485 [main]
INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
to process : 1
2015-10-01 15:06:28,485 [main]
INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths
to process : 1

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)

Apache Pig – Describe Operator

The describe operator is used to view the schema of a relation.

Syntax

The syntax of the describe operator is as follows −

grunt> Describe Relation_name

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us describe the relation named student and verify the schema as shown below.

grunt> describe student;

Output

Once you execute the above Pig Latin statement, it will produce the following output.

grunt> student: { id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray }

Apache Pig – Explain Operator

The explain operator is used to display the logical, physical, and MapReduce execution plans of a relation.

Syntax

Given below is the syntax of the explain operator.

grunt> explain Relation_name;

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us explain the relation named student using the explain operator as shown below.

grunt> explain student;

Output

It will produce the following output.

$ explain student;

2015-10-05 11:32:43,660 [main]
2015-10-05 11:32:43,660 [main] INFO  org.apache.pig.newplan.logical.optimizer
.LogicalPlanOptimizer -
{RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator,
GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter,
MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer,
PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
#-----------------------------------------------
# New Logical Plan:
#-----------------------------------------------
student: (Name: LOStore Schema:
id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#
35:chararray)
|
|---student: (Name: LOForEach Schema:
id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#
35:chararray)
    |   |
    |   (Name: LOGenerate[false,false,false,false,false] Schema:
id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#
35:chararray)ColumnPrune:InputUids=[34, 35, 32, 33,
31]ColumnPrune:OutputUids=[34, 35, 32, 33, 31]
    |   |   |
    |   |   (Name: Cast Type: int Uid: 31)
    |   |   |     |   |   |---id:(Name: Project Type: bytearray Uid: 31 Input: 0 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 32)
    |   |   |
    |   |   |---firstname:(Name: Project Type: bytearray Uid: 32 Input: 1
Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 33)
    |   |   |
    |   |   |---lastname:(Name: Project Type: bytearray Uid: 33 Input: 2
	 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 34)
    |   |   |
    |   |   |---phone:(Name: Project Type: bytearray Uid: 34 Input: 3 Column:
(*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 35)
    |   |   |
    |   |   |---city:(Name: Project Type: bytearray Uid: 35 Input: 4 Column:
(*))
    |   |
    |   |---(Name: LOInnerLoad[0] Schema: id#31:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[1] Schema: firstname#32:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[2] Schema: lastname#33:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[3] Schema: phone#34:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[4] Schema: city#35:bytearray)
    |
    |---student: (Name: LOLoad Schema:
id#31:bytearray,firstname#32:bytearray,lastname#33:bytearray,phone#34:bytearray
,city#35:bytearray)RequiredFields:null
#-----------------------------------------------
# Physical Plan: #-----------------------------------------------
student: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---student: New For Each(false,false,false,false,false)[bag] - scope-35
    |   |
    |   Cast[int] - scope-21
    |   |
    |   |---Project[bytearray][0] - scope-20
    |   |
    |   Cast[chararray] - scope-24
    |   |
    |   |---Project[bytearray][1] - scope-23
    |   |
    |   Cast[chararray] - scope-27
    |   |
    |   |---Project[bytearray][2] - scope-26
    |   |
    |   Cast[chararray] - scope-30
    |   |
    |   |---Project[bytearray][3] - scope-29
    |   |
    |   Cast[chararray] - scope-33
    |   |
    |   |---Project[bytearray][4] - scope-32
    |
    |---student: Load(hdfs://localhost:9000/pig_data/student_data.txt:PigStorage(',')) - scope-19
2015-10-05 11:32:43,682 [main]
INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler -
File concatenation threshold: 100 optimistic? false
2015-10-05 11:32:43,684 [main]
INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer -
MR plan size before optimization: 1 2015-10-05 11:32:43,685 [main]
INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.
MultiQueryOptimizer - MR plan size after optimization: 1
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-37
Map Plan
student: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---student: New For Each(false,false,false,false,false)[bag] - scope-35
    |   |
    |   Cast[int] - scope-21
    |   |
    |   |---Project[bytearray][0] - scope-20
    |   |
    |   Cast[chararray] - scope-24
    |   |
    |   |---Project[bytearray][1] - scope-23
    |   |
    |   Cast[chararray] - scope-27
    |   |
    |   |---Project[bytearray][2] - scope-26
    |   |
    |   Cast[chararray] - scope-30
    |   |
    |   |---Project[bytearray][3] - scope-29
    |   |
    |   Cast[chararray] - scope-33
    |   |
    |   |---Project[bytearray][4] - scope-32
    |
    |---student:
Load(hdfs://localhost:9000/pig_data/student_data.txt:PigStorage(',')) - scope-19
-------- Global sort: false
 ----------------

Apache Pig – Illustrate Operator

The illustrate operator gives you the step-by-step execution of a sequence of statements.

Syntax

Given below is the syntax of the illustrate operator.

grunt> illustrate Relation_name;

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata 
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune 
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us illustrate the relation named student as shown below.

grunt> illustrate student;

Output

On executing the above statement, you will get the following output.

grunt> illustrate student;

INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases
being processed per job phase (AliasName[line,offset]): M: student[1,10] C:  R:
---------------------------------------------------------------------------------------------
|student | id:int | firstname:chararray | lastname:chararray | phone:chararray | city:chararray |
--------------------------------------------------------------------------------------------- 
|        | 002    | siddarth            | Battacharya        | 9848022338      | Kolkata        |
---------------------------------------------------------------------------------------------

Apache Pig – Group Operator

The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.

Syntax

Given below is the syntax of the group operator.

grunt> Group_data = GROUP Relation_name BY age;

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Apache Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Now, let us group the records/tuples in the relation by age as shown below.

grunt> group_data = GROUP student_details by age;

Verification

Verify the relation group_data using the DUMP operator as shown below.

grunt> Dump group_data;

Output

Then you will get output displaying the contents of the relation named group_data as shown below. Here you can observe that the resulting schema has two columns −

  • One is age, by which we have grouped the relation.

  • The other is a bag, which contains the group of tuples, the student records with the respective age.

(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})

You can see the schema of the table after grouping the data using the describe command as shown below.

grunt> Describe group_data;
  
group_data: {group: int,student_details: {(id: int,firstname: chararray,
               lastname: chararray,age: int,phone: chararray,city: chararray)}}

In the same way, you can get the sample illustration of the schema using the illustrate command as shown below.

$ Illustrate group_data;

It will produce the following output −

------------------------------------------------------------------------------------------------- 
|group_data|  group:int | student_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)}|
------------------------------------------------------------------------------------------------- 
|          |     21     | {(4, Preethi, Agarwal, 21, 9848022330, Pune), (1, Rajiv, Reddy, 21, 9848022337, Hyderabad)}| 
|          |     22     | {(2,siddarth,Battacharya,22,9848022338,Kolkata),(003,Rajesh,Khanna,22,9848022339,Delhi)}| 
-------------------------------------------------------------------------------------------------
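
A grouped bag is typically consumed with a FOREACH-GENERATE statement. As a minimal sketch (assuming the group_data relation created above; the alias age_count is hypothetical), the following counts the student records in each age group −

grunt> age_count = FOREACH group_data GENERATE group AS age, COUNT(student_details) AS total;
grunt> Dump age_count;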

Grouping by Multiple Columns

Let us group the relation by age and city as shown below.

grunt> group_multiple = GROUP student_details by (age, city);

You can verify the content of the relation named group_multiple using the Dump operator as shown below.

grunt> Dump group_multiple; 
  
((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})

Group All

You can group a relation by all the columns as shown below.

grunt> group_all = GROUP student_details All;

Now, verify the content of the relation group_all as shown below.

grunt> Dump group_all;  
  
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram), 
(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar), 
(4,Preethi,Agarwal,21,9848022330,Pune),(3,Rajesh,Khanna,22,9848022339,Delhi), 
(2,siddarth,Battacharya,22,9848022338,Kolkata),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
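
Grouping by All is commonly used to compute an aggregate over an entire relation. As a minimal sketch (assuming the group_all relation created above; the alias student_count is hypothetical), the following counts all the student records −

grunt> student_count = FOREACH group_all GENERATE COUNT(student_details);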

Apache Pig – Cogroup Operator

The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two operators is that the group operator is normally used with one relation, while the cogroup operator is used in statements involving two or more relations.

Grouping Two Relations using Cogroup

Assume that we have two files namely student_details.txt and employee_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

employee_details.txt

001,Robin,22,newyork 
002,BOB,23,Kolkata 
003,Maya,23,Tokyo 
004,Sara,25,London 
005,David,23,Bhuwaneshwar 
006,Maggy,22,Chennai

And we have loaded these files into Pig with the relation names student_details and employee_details respectively, as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); 
  
grunt> employee_details = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);

Now, let us group the records/tuples of the relations student_details and employee_details with the key age, as shown below.

grunt> cogroup_data = COGROUP student_details by age, employee_details by age;

Verification

Verify the relation cogroup_data using the DUMP operator as shown below.

grunt> Dump cogroup_data;

Output

It will produce the following output, displaying the contents of the relation named cogroup_data as shown below.

(21,{(4,Preethi,Agarwal,21,9848022330,Pune), (1,Rajiv,Reddy,21,9848022337,Hyderabad)}, 
   {    })  
(22,{ (3,Rajesh,Khanna,22,9848022339,Delhi), (2,siddarth,Battacharya,22,9848022338,Kolkata) },  
   { (6,Maggy,22,Chennai),(1,Robin,22,newyork) })  
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)}, 
   {(5,David,23,Bhuwaneshwar),(3,Maya,23,Tokyo),(2,BOB,23,Kolkata)}) 
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)}, 
   { })  
(25,{   }, 
   {(4,Sara,25,London)})

The cogroup operator groups the tuples from each relation according to age, where each group depicts a particular age value.

For example, if we consider the 1st tuple of the result, it is grouped by age 21. And it contains two bags −

  • the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and

  • the second bag contains all the tuples from the second relation (employee_details in this case) having age 21.

In case a relation doesn't have tuples having the age value 21, it returns an empty bag.
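
Since the output of COGROUP carries one bag per input relation, it can be processed with FOREACH just like the output of GROUP. As a minimal sketch (assuming the cogroup_data relation created above; the alias cogroup_counts is hypothetical), the following counts the tuples contributed by each relation for every age −

grunt> cogroup_counts = FOREACH cogroup_data GENERATE group AS age, COUNT(student_details) AS students, COUNT(employee_details) AS employees;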

Apache Pig – Join Operator

The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) tuple(s) from each relation as keys. When these keys match, the two particular tuples are matched, else the records are dropped. Joins can be of the following types −

  • Self-join
  • Inner-join
  • Outer-join − left join, right join, and full join

This chapter explains with examples how to use the join operator in Pig Latin. Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.

customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00 
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the relations customers and orders as shown below.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
  
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);

Let us now perform various Join operations on these two relations.

Self-join

Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation.

Generally, in Apache Pig, to perform self-join, we will load the same data multiple times, under different aliases (names). Therefore let us load the contents of the file customers.txt as two tables as shown below.

grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
  
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int); 

Syntax

Given below is the syntax of performing self-join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;

Example

Let us perform self-join operation on the relation customers, by joining the two relations customers1 and customers2 as shown below.

grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

Verification

Verify the relation customers3 using the DUMP operator as shown below.

grunt> Dump customers3;

Output

It will produce the following output, displaying the contents of the relation customers3.

(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join

Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when there is a match in both tables.

It creates a new relation by combining column values of two relations (say A and B) based upon the join-predicate. The query compares each row of A with each row of B to find all pairs of rows which satisfy the join-predicate. When the join-predicate is satisfied, the column values for each matched pair of rows of A and B are combined into a result row.

Syntax

Here is the syntax of performing inner join operation using the JOIN operator.

grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;

Example

Let us perform inner join operation on the two relations customers and orders as shown below.

grunt> customer_orders = JOIN customers BY id, orders BY customer_id;

Verification

Verify the relation customer_orders using the DUMP operator as shown below.

grunt> Dump customer_orders;

Output

You will get the following output, showing the contents of the relation named customer_orders.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
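
After a join, the columns of the two inputs keep their origin, and can be referred to with the :: disambiguation operator. As a minimal sketch (assuming the customer_orders relation created above; the alias order_summary is hypothetical), the following projects the customer name and the order amount −

grunt> order_summary = FOREACH customer_orders GENERATE customers::name, orders::amount;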

Note

Outer Join: Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways −

  • Left outer join
  • Right outer join
  • Full outer join

Left Outer Join

The left outer join operation returns all rows from the left table, even if there are no matches in the right relation.

Syntax

Given below is the syntax of performing left outer join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;

Example

Let us perform left outer join operation on the two relations customers and orders as shown below.

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

Verification

Verify the relation outer_left using the DUMP operator as shown below.

grunt> Dump outer_left;

Output

It will produce the following output, displaying the contents of the relation outer_left.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,) 

Right Outer Join

The right outer join operation returns all rows from the right table, even if there are no matches in the left table.

Syntax

Given below is the syntax of performing right outer join operation using the JOIN operator.

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Example

Let us perform right outer join operation on the two relations customers and orders as shown below.

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Verification

Verify the relation outer_right using the DUMP operator as shown below.

grunt> Dump outer_right;

Output

It will produce the following output, displaying the contents of the relation outer_right.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Full Outer Join

The full outer join operation returns rows when there is a match in one of the relations.

Syntax

Given below is the syntax of performing full outer join using the JOIN operator.

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Example

Let us perform full outer join operation on the two relations customers and orders as shown below.

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Verification

Verify the relation outer_full using the DUMP operator as shown below.

grunt> Dump outer_full; 

Output

It will produce the following output, displaying the contents of the relation outer_full.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Using Multiple Keys

We can perform a JOIN operation using multiple keys.

Syntax

Here is how you can perform a JOIN operation on two tables using multiple keys.

grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);

Assume that we have two files namely employee.txt and employee_contact.txt in the /pig_data/ directory of HDFS as shown below.

employee.txt

001,Rajiv,Reddy,21,programmer,003
002,siddarth,Battacharya,22,programmer,003
003,Rajesh,Khanna,22,programmer,003
004,Preethi,Agarwal,21,programmer,003
005,Trupthi,Mohanthy,23,programmer,003
006,Archana,Mishra,23,programmer,003
007,Komal,Nayak,24,teamlead,002
008,Bharathi,Nambiayar,24,manager,001

employee_contact.txt

001,9848022337,[email protected],Hyderabad,003
002,9848022338,[email protected],Kolkata,003
003,9848022339,[email protected],Delhi,003
004,9848022330,[email protected],Pune,003
005,9848022336,[email protected],Bhuwaneshwar,003
006,9848022335,[email protected],Chennai,003
007,9848022334,[email protected],trivendram,002
008,9848022333,[email protected],Chennai,001

And we have loaded these two files into Pig with the relations employee and employee_contact as shown below.

grunt> employee = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int);
  
grunt> employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',') 
   as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);

Now, let us join the contents of these two relations using the JOIN operator as shown below.

grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);

Verification

Verify the relation emp using the DUMP operator as shown below.

grunt> Dump emp; 

Output

It will produce the following output, displaying the contents of the relation named emp as shown below.

(1,Rajiv,Reddy,21,programmer,3,1,9848022337,[email protected],Hyderabad,3)
(2,siddarth,Battacharya,22,programmer,3,2,9848022338,[email protected],Kolkata,3)  
(3,Rajesh,Khanna,22,programmer,3,3,9848022339,[email protected],Delhi,3)  
(4,Preethi,Agarwal,21,programmer,3,4,9848022330,[email protected],Pune,3)  
(5,Trupthi,Mohanthy,23,programmer,3,5,9848022336,[email protected],Bhuwaneshwar,3)  
(6,Archana,Mishra,23,programmer,3,6,9848022335,[email protected],Chennai,3)  
(7,Komal,Nayak,24,teamlead,2,7,9848022334,[email protected],trivendram,2)  
(8,Bharathi,Nambiayar,24,manager,1,8,9848022333,[email protected],Chennai,1)

Apache Pig – Cross Operator

The CROSS operator computes the cross-product of two or more relations. This chapter explains with an example how to use the cross operator in Pig Latin.

Syntax

Given below is the syntax of the CROSS operator.

grunt> Relation3_name = CROSS Relation1_name, Relation2_name;

Example

Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.

customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the relations customers and orders as shown below.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
  
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);

Let us now get the cross-product of these two relations using the cross operator as shown below.

grunt> cross_data = CROSS customers, orders;

Verification

Verify the relation cross_data using the DUMP operator as shown below.

grunt> Dump cross_data;

Output

It will produce the following output, displaying the contents of the relation cross_data.

(7,Muffy,24,Indore,10000,103,2008-05-20 00:00:00,4,2060) 
(7,Muffy,24,Indore,10000,101,2009-11-20 00:00:00,2,1560) 
(7,Muffy,24,Indore,10000,100,2009-10-08 00:00:00,3,1500) 
(7,Muffy,24,Indore,10000,102,2009-10-08 00:00:00,3,3000) 
(6,Komal,22,MP,4500,103,2008-05-20 00:00:00,4,2060) 
(6,Komal,22,MP,4500,101,2009-11-20 00:00:00,2,1560) 
(6,Komal,22,MP,4500,100,2009-10-08 00:00:00,3,1500) 
(6,Komal,22,MP,4500,102,2009-10-08 00:00:00,3,3000) 
(5,Hardik,27,Bhopal,8500,103,2008-05-20 00:00:00,4,2060) 
(5,Hardik,27,Bhopal,8500,101,2009-11-20 00:00:00,2,1560) 
(5,Hardik,27,Bhopal,8500,100,2009-10-08 00:00:00,3,1500) 
(5,Hardik,27,Bhopal,8500,102,2009-10-08 00:00:00,3,3000) 
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060) 
(4,Chaitali,25,Mumbai,6500,101,2009-11-20 00:00:00,2,1560) 
(4,Chaitali,25,Mumbai,6500,100,2009-10-08 00:00:00,3,1500) 
(4,Chaitali,25,Mumbai,6500,102,2009-10-08 00:00:00,3,3000) 
(3,kaushik,23,Kota,2000,103,2008-05-20 00:00:00,4,2060) 
(3,kaushik,23,Kota,2000,101,2009-11-20 00:00:00,2,1560) 
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500) 
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000) 
(2,Khilan,25,Delhi,1500,103,2008-05-20 00:00:00,4,2060) 
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560) 
(2,Khilan,25,Delhi,1500,100,2009-10-08 00:00:00,3,1500)
(2,Khilan,25,Delhi,1500,102,2009-10-08 00:00:00,3,3000) 
(1,Ramesh,32,Ahmedabad,2000,103,2008-05-20 00:00:00,4,2060) 
(1,Ramesh,32,Ahmedabad,2000,101,2009-11-20 00:00:00,2,1560) 
(1,Ramesh,32,Ahmedabad,2000,100,2009-10-08 00:00:00,3,1500) 
(1,Ramesh,32,Ahmedabad,2000,102,2009-10-08 00:00:00,3,3000)  

Apache Pig – Union Operator

The UNION operator of Pig Latin is used to merge the content of two relations. To perform UNION operation on two relations, their columns and domains must be identical.

Syntax

Given below is the syntax of the UNION operator.

grunt> Relation_name3 = UNION Relation_name1, Relation_name2;

Example

Assume that we have two files namely student_data1.txt and student_data2.txt in the /pig_data/ directory of HDFS as shown below.

Student_data1.txt

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

Student_data2.txt

7,Komal,Nayak,9848022334,trivendram
8,Bharathi,Nambiayar,9848022333,Chennai

And we have loaded these two files into Pig with the relations student1 and student2 as shown below.

grunt> student1 = LOAD 'hdfs://localhost:9000/pig_data/student_data1.txt' USING PigStorage(',') 
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray); 
 
grunt> student2 = LOAD 'hdfs://localhost:9000/pig_data/student_data2.txt' USING PigStorage(',') 
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Let us now merge the contents of these two relations using the UNION operator as shown below.

grunt> student = UNION student1, student2;

Verification

Verify the relation student using the DUMP operator as shown below.

grunt> Dump student; 

Output

It will display the following output, displaying the contents of the relation student.

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune) 
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai) 
(7,Komal,Nayak,9848022334,trivendram) 
(8,Bharathi,Nambiayar,9848022333,Chennai)
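
Note that UNION does not remove duplicate tuples; if both inputs contain the same tuple, it appears twice in the result. As a minimal sketch (assuming the student relation created above; the alias student_unique is hypothetical), duplicates can be removed with a DISTINCT step −

grunt> student_unique = DISTINCT student;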

Apache Pig – Split Operator

The SPLIT operator is used to split a relation into two or more relations.

Syntax

Given below is the syntax of the SPLIT operator.

grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi 
004,Preethi,Agarwal,21,9848022330,Pune 
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 
006,Archana,Mishra,23,9848022335,Chennai 
007,Komal,Nayak,24,9848022334,trivendram 
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); 

Let us now split the relation into two: one listing the students of age less than 23, and the other listing the students having age between 22 and 25.

SPLIT student_details into student_details1 if age<23, student_details2 if (age>22 and age<25);

Verification

Verify the relations student_details1 and student_details2 using the DUMP operator as shown below.

grunt> Dump student_details1;  

grunt> Dump student_details2; 

Output

It will produce the following output, displaying the contents of the relations student_details1 and student_details2 respectively.

grunt> Dump student_details1; 
(1,Rajiv,Reddy,21,9848022337,Hyderabad) 
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi) 
(4,Preethi,Agarwal,21,9848022330,Pune)
  
grunt> Dump student_details2; 
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar) 
(6,Archana,Mishra,23,9848022335,Chennai) 
(7,Komal,Nayak,24,9848022334,trivendram) 
(8,Bharathi,Nambiayar,24,9848022333,Chennai)

Apache Pig – Filter Operator

The FILTER operator is used to select the required tuples from a relation based on a condition.

Syntax

Given below is the syntax of the FILTER operator.

grunt> Relation2_name = FILTER Relation1_name BY (condition);

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi 
004,Preethi,Agarwal,21,9848022330,Pune 
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 
006,Archana,Mishra,23,9848022335,Chennai 
007,Komal,Nayak,24,9848022334,trivendram 
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now use the Filter operator to get the details of the students who belong to the city Chennai.

filter_data = FILTER student_details BY city == 'Chennai';

Verification

Verify the relation filter_data using the DUMP operator as shown below.

grunt> Dump filter_data;

Output

It will produce the following output, displaying the contents of the relation filter_data as follows.

(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
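
Filter conditions can also be combined with the boolean operators AND, OR, and NOT. As a minimal sketch (assuming the student_details relation created above; the alias filter_multi is hypothetical), the following selects the students of Chennai who are older than 22 −

grunt> filter_multi = FILTER student_details BY (age > 22) AND (city == 'Chennai');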

Apache Pig – Distinct Operator

The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.

Syntax

Given below is the syntax of the DISTINCT operator.

grunt> Relation_name2 = DISTINCT Relation_name1;

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata 
002,siddarth,Battacharya,9848022338,Kolkata 
003,Rajesh,Khanna,9848022339,Delhi 
003,Rajesh,Khanna,9848022339,Delhi 
004,Preethi,Agarwal,9848022330,Pune 
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai 
006,Archana,Mishra,9848022335,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') 
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Let us now remove the redundant (duplicate) tuples from the relation named student_details using the DISTINCT operator, and store it as another relation named distinct_data as shown below.

grunt> distinct_data = DISTINCT student_details;

Verification

Verify the relation distinct_data using the DUMP operator as shown below.

grunt> Dump distinct_data;

Output

It will produce the following output, displaying the contents of the relation distinct_data as follows.

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata) 
(3,Rajesh,Khanna,9848022339,Delhi) 
(4,Preethi,Agarwal,9848022330,Pune) 
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)

Apache Pig – Foreach Operator

The FOREACH operator is used to generate specified data transformations based on the column data.

Syntax

Given below is the syntax of the FOREACH operator.

grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi 
004,Preethi,Agarwal,21,9848022330,Pune 
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 
006,Archana,Mishra,23,9848022335,Chennai 
007,Komal,Nayak,24,9848022334,trivendram 
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);

Let us now get the id, age, and city values of each student from the relation student_details and store them into another relation named foreach_data using the foreach operator as shown below.

grunt> foreach_data = FOREACH student_details GENERATE id,age,city;

Verification

Verify the relation foreach_data using the DUMP operator as shown below.

grunt> Dump foreach_data;

Output

It will produce the following output, displaying the contents of the relation foreach_data.

(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune) 
(5,23,Bhuwaneshwar)
(6,23,Chennai) 
(7,24,trivendram)
(8,24,Chennai) 

Apache Pig – Order By

The ORDER BY operator is used to display the contents of a relation in a sorted order based on one or more fields.

Syntax

Given below is the syntax of the ORDER BY operator.

grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi 
004,Preethi,Agarwal,21,9848022330,Pune 
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 
006,Archana,Mishra,23,9848022335,Chennai 
007,Komal,Nayak,24,9848022334,trivendram 
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);

Let us now sort the relation in a descending order based on the age of the student and store it into another relation named order_by_data using the ORDER BY operator as shown below.

grunt> order_by_data = ORDER student_details BY age DESC;

Verification

Verify the relation order_by_data using the DUMP operator as shown below.

grunt> Dump order_by_data; 

Output

It will produce the following output, displaying the contents of the relation order_by_data.

(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(6,Archana,Mishra,23,9848022335,Chennai) 
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(3,Rajesh,Khanna,22,9848022339,Delhi) 
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(4,Preethi,Agarwal,21,9848022330,Pune) 
(1,Rajiv,Reddy,21,9848022337,Hyderabad)

Apache Pig – Limit Operator

The LIMIT operator is used to get a limited number of tuples from a relation.

Syntax

Given below is the syntax of the LIMIT operator.

grunt> Result = LIMIT Relation_name required_number_of_tuples;

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi 
004,Preethi,Agarwal,21,9848022330,Pune 
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 
006,Archana,Mishra,23,9848022335,Chennai 
007,Komal,Nayak,24,9848022334,trivendram 
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);

Now, let us get the first four tuples of the relation and store them into another relation named limit_data using the LIMIT operator as shown below.

grunt> limit_data = LIMIT student_details 4; 

Verification

Verify the relation limit_data using the DUMP operator as shown below.

grunt> Dump limit_data; 

Output

It will produce the following output, displaying the contents of the relation limit_data as follows.

(1,Rajiv,Reddy,21,9848022337,Hyderabad) 
(2,siddarth,Battacharya,22,9848022338,Kolkata) 
(3,Rajesh,Khanna,22,9848022339,Delhi) 
(4,Preethi,Agarwal,21,9848022330,Pune) 
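
There is no guarantee which tuples LIMIT returns unless the input is ordered, so it is frequently combined with ORDER BY to fetch the top-N tuples. As a minimal sketch (assuming the student_details relation created above; the aliases ordered and oldest_four are hypothetical) −

grunt> ordered = ORDER student_details BY age DESC;
grunt> oldest_four = LIMIT ordered 4;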

Apache Pig – Eval Functions

Apache Pig provides various built-in functions namely eval, load, store, math, string, bag and tuple functions.

Eval Functions

Given below is the list of eval functions provided by Apache Pig.

S.N. Function & Description
1 AVG()

To compute the average of the numerical values within a bag.

2 BagToString()

To concatenate the elements of a bag into a string. While concatenating, we can place a delimiter between these values (optional).

3 CONCAT()

To concatenate two or more expressions of the same type.

4 COUNT()

To get the number of elements in a bag (it counts the number of tuples in the bag).

5 COUNT_STAR()

It is similar to the COUNT() function. It is used to get the number of elements in a bag.

6 DIFF()

To compare two bags (fields) in a tuple.

7 IsEmpty()

To check if a bag or map is empty.

8 MAX()

To calculate the highest value for a column (numeric values or chararrays) in a single-column bag.

9 MIN()

To get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column bag.

10 PluckTuple()

Using the Pig Latin PluckTuple() function, we can define a string prefix and filter the columns in a relation that begin with the given prefix.

11 SIZE()

To compute the number of elements based on any Pig data type.

12 SUBTRACT()

To subtract two bags. It takes two bags as inputs and returns a bag which contains the tuples of the first bag that are not in the second bag.

13 SUM()

To get the total of the numeric values of a column in a single-column bag.

14 TOKENIZE()

To split a string (which contains a group of words) in a single tuple and return a bag which contains the output of the split operation.
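
Eval functions are typically applied inside a FOREACH-GENERATE statement, usually over a grouped relation. As a minimal sketch (assuming the student_details relation with an age column, as used earlier; the aliases grouped and age_stats are hypothetical) −

grunt> grouped = GROUP student_details ALL;
grunt> age_stats = FOREACH grouped GENERATE AVG(student_details.age), MAX(student_details.age);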

Apache Pig – Load & Store Functions

The Load and Store functions in Apache Pig are used to determine how the data goes into and comes out of Pig. These functions are used with the load and store operators. Given below is the list of load and store functions available in Pig.

S.N. Function & Description
1 PigStorage()

To load and store structured files.

2 TextLoader()

To load unstructured data into Pig.

3 BinStorage()

To load and store data into Pig using a machine-readable format.

4 Handling Compression

In Pig Latin, we can load and store compressed data.
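
PigStorage accepts the field delimiter as an argument; the default delimiter is tab. As a minimal sketch (the paths shown are hypothetical), the following loads comma-separated data and stores it back tab-separated −

grunt> data = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',');
grunt> STORE data INTO 'hdfs://localhost:9000/pig_output/' USING PigStorage('\t');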

Apache Pig – Bag & Tuple Functions

Given below is the list of Bag and Tuple functions.

S.N. Function & Description
1 TOBAG()

To convert two or more expressions into a bag.

2 TOP()

To get the top N tuples of a relation.

3 TOTUPLE()

To convert one or more expressions into a tuple.

4 TOMAP()

To convert the key-value pairs into a Map.
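
These functions are handy inside FOREACH-GENERATE when the output needs to be packed into nested types. As a minimal sketch (assuming the student_details relation used earlier; the alias packed is hypothetical) −

grunt> packed = FOREACH student_details GENERATE TOTUPLE(firstname, lastname), TOBAG(phone, city);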

Apache Pig – String Functions

We have the following String functions in Apache Pig.

S.N. Functions & Description
1 ENDSWITH(string, checkAgainst)

To verify whether a given string ends with a particular substring.

2 STARTSWITH(string, substring)

Accepts two string parameters and verifies whether the first string starts with the second.

3 SUBSTRING(string, startIndex, stopIndex)

Returns a substring from a given string.

4 EqualsIgnoreCase(string1, string2)

To compare two strings ignoring the case.

5 INDEXOF(string, 'character', startIndex)

Returns the first occurrence of a character in a string, searching forward from a start index.

6 LAST_INDEX_OF(expression)

Returns the index of the last occurrence of a character in a string, searching backward from a start index.

7 LCFIRST(expression)

Converts the first character in a string to lower case.

8 UCFIRST(expression)

Returns a string with the first character converted to upper case.

9 UPPER(expression)

Returns a string converted to upper case.

10 LOWER(expression)

Converts all characters in a string to lower case.

11 REPLACE(string, 'oldChar', 'newChar');

To replace existing characters in a string with new characters.

12 STRSPLIT(string, regex, limit)

To split a string around matches of a given regular expression.

13 STRSPLITTOBAG(string, regex, limit)

Similar to the STRSPLIT() function, it splits the string by a given delimiter and returns the result in a bag.

14 TRIM(expression)

Returns a copy of a string with leading and trailing whitespaces removed.

15 LTRIM(expression)

Returns a copy of a string with leading whitespaces removed.

16 RTRIM(expression)

Returns a copy of a string with trailing whitespaces removed.
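
As a minimal sketch of applying the string functions inside FOREACH-GENERATE (assuming the student_details relation used earlier; the alias names is hypothetical), the following upper-cases the first name and takes the first three characters of the city −

grunt> names = FOREACH student_details GENERATE UPPER(firstname), SUBSTRING(city, 0, 3);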

Apache Pig – Date-time Functions

Apache Pig provides the following Date and Time functions −

S.N. Functions & Description
1 ToDate(milliseconds)

This function returns a date-time object according to the given parameters. The other alternatives for this function are ToDate(isostring), ToDate(userstring, format), ToDate(userstring, format, timezone).

2 CurrentTime()

Returns the date-time object of the current time.

3 GetDay(datetime)

Returns the day of a month from the date-time object.

4 GetHour(datetime)

Returns the hour of a day from the date-time object.

5 GetMilliSecond(datetime)

Returns the millisecond of a second from the date-time object.

6 GetMinute(datetime)

Returns the minute of an hour from the date-time object.

7 GetMonth(datetime)

Returns the month of a year from the date-time object.

8 GetSecond(datetime)

Returns the second of a minute from the date-time object.

9 GetWeek(datetime)

Returns the week of a year from the date-time object.

10 GetWeekYear(datetime)

Returns the week year from the date-time object.

11 GetYear(datetime)

Returns the year from the date-time object.

12 AddDuration(datetime, duration)

Returns the result of adding the duration object to the date-time object.

13 SubtractDuration(datetime, duration)

Subtracts the Duration object from the Date-Time object and returns the result.

14 DaysBetween(datetime1, datetime2)

Returns the number of days between the two date-time objects.

15 HoursBetween(datetime1, datetime2)

Returns the number of hours between two date-time objects.

16 MilliSecondsBetween(datetime1, datetime2)

Returns the number of milliseconds between two date-time objects.

17 MinutesBetween(datetime1, datetime2)

Returns the number of minutes between two date-time objects.

18 MonthsBetween(datetime1, datetime2)

Returns the number of months between two date-time objects.

19 SecondsBetween(datetime1, datetime2)

Returns the number of seconds between two date-time objects.

20 WeeksBetween(datetime1, datetime2)

Returns the number of weeks between two date-time objects.

21 YearsBetween(datetime1, datetime2)

Returns the number of years between two date-time objects.
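
As a minimal sketch of the date-time functions (assuming the orders relation used earlier, whose date field is a chararray like '2009-10-08 00:00:00'; the aliases order_dates and order_years are hypothetical) −

grunt> order_dates = FOREACH orders GENERATE ToDate(date, 'yyyy-MM-dd HH:mm:ss') AS dt;
grunt> order_years = FOREACH order_dates GENERATE GetYear(dt);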

Apache Pig – Math Functions

We have the following Math functions in Apache Pig −

S.N. Functions & Description
1 ABS(expression)

To get the absolute value of an expression.

2 ACOS(expression)

To get the arc cosine of an expression.

3 ASIN(expression)

To get the arc sine of an expression.

4 ATAN(expression)

This function is used to get the arc tangent of an expression.

5 CBRT(expression)

This function is used to get the cube root of an expression.

6 CEIL(expression)

This function is used to get the value of an expression rounded up to the nearest integer.

7 COS(expression)

This function is used to get the trigonometric cosine of an expression.

8 COSH(expression)

This function is used to get the hyperbolic cosine of an expression.

9 EXP(expression)

This function is used to get Euler's number e raised to the power of x.

10 FLOOR(expression)

To get the value of an expression rounded down to the nearest integer.

11 LOG(expression)

To get the natural logarithm (base e) of an expression.

12 LOG10(expression)

To get the base 10 logarithm of an expression.

13 RANDOM( )

To get a pseudo random number (type double) greater than or equal to 0.0 and less than 1.0.

14 ROUND(expression)

To get the value of an expression rounded to an integer (if the result type is float) or rounded to a long (if the result type is double).

15 SIN(expression)

To get the sine of an expression.

16 SINH(expression)

To get the hyperbolic sine of an expression.

17 SQRT(expression)

To get the positive square root of an expression.

18 TAN(expression)

To get the trigonometric tangent of an angle.

19 TANH(expression)

To get the hyperbolic tangent of an expression.
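
As a minimal sketch of the math functions inside FOREACH-GENERATE (assuming the customers relation used earlier; the alias revised_salary is hypothetical), the following rounds a 10% salary raise to the nearest whole number −

grunt> revised_salary = FOREACH customers GENERATE name, ROUND(salary * 1.1);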

Apache Pig – User Defined Functions

In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDF's). Using these UDF's, we can define our own functions and use them. The UDF support is provided in six programming languages, namely, Java, Jython, Python, JavaScript, Ruby and Groovy.

For writing UDF's, complete support is provided in Java and limited support is provided in all the remaining languages. Using Java, you can write UDF's involving all parts of the processing like data load/store, column transformation, and aggregation. Since Apache Pig has been written in Java, the UDF's written using the Java language work efficiently compared to other languages.

In Apache Pig, we also have a Java repository for UDF's named Piggybank. Using Piggybank, we can access Java UDF's written by other users, and contribute our own UDF's.

Types of UDF's in Java

While writing UDF's using Java, we can create and use the following three types of functions −

  • Filter Functions − The filter functions are used as conditions in filter statements. These functions accept a Pig value as input and return a Boolean value.

  • Eval Functions − The Eval functions are used in FOREACH-GENERATE statements. These functions accept a Pig value as input and return a Pig result.

  • Algebraic Functions − The Algebraic functions act on inner bags in a FOREACH-GENERATE statement. These functions are used to perform full MapReduce operations on an inner bag.

Writing UDF's using Java

To write a UDF using Java, we have to integrate the jar file Pig-0.15.0.jar. In this section, we discuss how to write a sample UDF using Eclipse. Before proceeding further, make sure you have installed Eclipse and Maven in your system.

Follow the steps given below to write a UDF function −

  • Open Eclipse and create a new project (say myproject).

  • Convert the newly created project into a Maven project.

  • Copy the following content in the pom.xml. This file contains the Maven dependencies for the Apache Pig and Hadoop-core jar files.

<project xmlns = "http://maven.apache.org/POM/4.0.0"
   xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation = "http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> 
	
   <modelVersion>4.0.0</modelVersion> 
   <groupId>Pig_Udf</groupId> 
   <artifactId>Pig_Udf</artifactId> 
   <version>0.0.1-SNAPSHOT</version>
	
   <build>    
      <sourceDirectory>src</sourceDirectory>    
      <plugins>      
         <plugin>        
            <artifactId>maven-compiler-plugin</artifactId>        
            <version>3.3</version>        
            <configuration>          
               <source>1.7</source>          
               <target>1.7</target>        
            </configuration>      
         </plugin>    
      </plugins>  
   </build>
	
   <dependencies> 
	
      <dependency>            
         <groupId>org.apache.pig</groupId>            
         <artifactId>pig</artifactId>            
         <version>0.15.0</version>     
      </dependency> 
		
      <dependency>        
         <groupId>org.apache.hadoop</groupId>            
         <artifactId>hadoop-core</artifactId>            
         <version>0.20.2</version>     
      </dependency> 
      
   </dependencies>  
	
</project>
  • Save the file and refresh it. In the Maven Dependencies section, you can find the downloaded jar files.

  • Create a new class file with name Sample_Eval and copy the following content in it.

import java.io.IOException; 
import org.apache.pig.EvalFunc; 
import org.apache.pig.data.Tuple; 

// A simple Eval UDF that converts the first field of each input tuple to upper case.
public class Sample_Eval extends EvalFunc<String>{ 

   public String exec(Tuple input) throws IOException {   
      // Return null for null or empty input tuples.
      if (input == null || input.size() == 0)      
         return null;      
      String str = (String)input.get(0);      
      return str.toUpperCase();  
   } 
}

While writing UDF's, it is mandatory to inherit the EvalFunc class and provide an implementation to the exec() function. Within this function, the code required for the UDF is written. In the above example, we have written the code to convert the contents of the given column to uppercase.

  • After compiling the class without errors, right-click on the Sample_Eval.java file. It gives you a menu. Select Export as shown in the following screenshot.

Select Export

  • On clicking Export, you will get the following window. Click on JAR file.

Click on Export

  • Proceed further by clicking the Next> button. You will get another window where you need to enter the path in the local file system, where you need to store the jar file.

jar export

  • Finally click the Finish button. In the specified folder, a Jar file sample_udf.jar is created. This jar file contains the UDF written in Java.

Using the UDF

After writing the UDF and generating the Jar file, follow the steps given below −

Step 1: Registering the Jar file

After writing the UDF (in Java) we have to register the Jar file that contains the UDF using the Register operator. By registering the Jar file, users can intimate the location of the UDF to Apache Pig.

Syntax

Given below is the syntax of the Register operator.

REGISTER path; 

Example

As an example let us register the sample_udf.jar created earlier in this chapter.

Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.

$ cd PIG_HOME/bin 
$ ./pig -x local 

REGISTER '/$PIG_HOME/sample_udf.jar'

Note − assume the Jar file is in the path − /$PIG_HOME/sample_udf.jar

Step 2: Defining an Alias

After registering the UDF we can define an alias to it using the Define operator.

Syntax

Given below is the syntax of the Define operator.

DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] }; 

Example

Define the alias for sample_eval as shown below.

DEFINE sample_eval Sample_Eval();

Step 3: Using the UDF

After defining the alias you can use the UDF the same as the built-in functions. Suppose there is a file named emp_data in the HDFS /Pig_Data/ directory with the following content.

001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London 
005,David,23,Bhuwaneshwar 
006,Maggy,22,Chennai
007,Robert,22,newyork
008,Syam,23,Kolkata
009,Mary,25,Tokyo
010,Saran,25,London 
011,Stacy,25,Bhuwaneshwar 
012,Kelly,22,Chennai

And assume we have loaded this file into Pig as shown below.

grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);

Let us now convert the names of the employees into upper case using the UDF sample_eval.

grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);

Verify the contents of the relation Upper_case as shown below.

grunt> Dump Upper_case;
  
(ROBIN)
(BOB)
(MAYA)
(SARA)
(DAVID)
(MAGGY)
(ROBERT)
(SYAM)
(MARY)
(SARAN)
(STACY)
(KELLY)

Apache Pig – Running Scripts

Here in this chapter, we will see how to run Apache Pig scripts in batch mode.

Comments in Pig Script

While writing a script in a file, we can include comments in it as shown below.

Multi-line comments

We will begin the multi-line comments with '/*', and end them with '*/'.

/* These are the multi-line comments 
  In the pig script */ 

Single-line comments

We will begin the single-line comments with '--'.

--we can write single line comments like this.

Executing Pig Script in Batch mode

While executing Apache Pig statements in batch mode, follow the steps given below.

Step 1

Write all the required Pig Latin statements in a single file. We can write all the Pig Latin statements and commands in a single file and save it as a .pig file.

Step 2

Execute the Apache Pig script. You can execute the Pig script from the shell (Linux) as shown below.

Local mode − $ pig -x local Sample_script.pig
MapReduce mode − $ pig -x mapreduce Sample_script.pig

You can execute it from the Grunt shell as well using the exec command as shown below.

grunt> exec /sample_script.pig
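
The Grunt shell also provides a run command; unlike exec, run executes the script in the context of the current shell, so the aliases defined in the script remain accessible afterwards (the script path below is the same one used above).

grunt> run /sample_script.pig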

Executing a Pig Script from HDFS

We can also execute a Pig script that resides in HDFS. Suppose there is a Pig script with the name Sample_script.pig in the HDFS directory named /pig_data/. We can execute it as shown below.

$ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig 

Example

Assume we have a file student_details.txt in HDFS with the following content.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad 
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi 
004,Preethi,Agarwal,21,9848022330,Pune 
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 
006,Archana,Mishra,23,9848022335,Chennai 
007,Komal,Nayak,24,9848022334,trivendram 
008,Bharathi,Nambiayar,24,9848022333,Chennai

We also have a sample script with the name sample_script.pig in the same HDFS directory. This file contains statements performing operations and transformations on the student relation, as shown below.

student = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
	
student_order = ORDER student BY age DESC;
  
student_limit = LIMIT student_order 4;
  
Dump student_limit;
  • The first statement of the script will load the data in the file named student_details.txt as a relation named student.

  • The second statement of the script will arrange the tuples of the relation in descending order, based on age, and store it as student_order.

  • The third statement of the script will store the first 4 tuples of student_order as student_limit.

  • Finally the fourth statement will dump the content of the relation student_limit.

Let us now execute the sample_script.pig as shown below.

$ ./pig -x mapreduce hdfs://localhost:9000/pig_data/sample_script.pig

Apache Pig gets executed and gives you the output with the following content.

(7,Komal,Nayak,24,9848022334,trivendram)
(8,Bharathi,Nambiayar,24,9848022333,Chennai) 
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar) 
(6,Archana,Mishra,23,9848022335,Chennai)
2015-10-19 10:31:27,446 [main] INFO  org.apache.pig.Main - Pig script completed in 12
minutes, 32 seconds and 751 milliseconds (752751 ms)