Apache Pig

Apache Pig – Overview

What is Apache Pig?

Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to analyze larger sets of data, representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.

To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators using which programmers can develop their own functions for reading, writing, and processing data.

To analyze data using Apache Pig, programmers need to write scripts using the Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
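
For instance, below is a minimal Pig Latin sketch of the kind of script the Pig Engine translates into MapReduce jobs; the file name, path, and schema are assumptions for illustration.

-- load a hypothetical log file from HDFS
logs = LOAD '/pig_data/access_log.txt' USING PigStorage(',')
   AS (uid:chararray, url:chararray);
-- group the records by user and count the hits per user;
-- each step contributes to the compiled MapReduce plan
grouped = GROUP logs BY uid;
counts = FOREACH grouped GENERATE group, COUNT(logs);
-- DUMP triggers the actual MapReduce execution
DUMP counts;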

Why Do We Need Apache Pig?

Programmers who are not so good at Java normally used to struggle working with Hadoop, especially while performing any MapReduce tasks. Apache Pig is a boon for all such programmers.

  • Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex codes in Java.

  • Apache Pig uses the multi-query approach, thereby reducing the length of codes. For example, an operation that would require you to type 200 lines of code (LoC) in Java can be easily done by typing as few as just 10 LoC in Apache Pig. Ultimately, Apache Pig reduces the development time by almost 16 times.

  • Pig Latin is a SQL-like language and it is easy to learn Apache Pig when you are familiar with SQL.

  • Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce (see the sketch below).
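
As a quick sketch of those built-in join and filter operators; the relation names, file names, and schemas here are assumptions for illustration.

customers = LOAD 'customers.txt' USING PigStorage(',')
   AS (id:int, name:chararray);
orders = LOAD 'orders.txt' USING PigStorage(',')
   AS (oid:int, cust_id:int, amount:int);
-- join the two relations on the customer id
cust_orders = JOIN customers BY id, orders BY cust_id;
-- keep only the larger orders, then sort them
big_orders = FILTER cust_orders BY orders::amount > 100;
sorted_orders = ORDER big_orders BY orders::amount DESC;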

Features of Pig

Apache Pig comes with the following features −

  • Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.

  • Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL.

  • Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on the semantics of the language.

  • Extensibility − Using the existing operators, users can develop their own functions to read, process, and write data.

  • UDFs − Pig provides the facility to create User Defined Functions in other programming languages such as Java, and invoke or embed them in Pig Scripts.

  • Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well as unstructured. It stores the results in HDFS.

Apache Pig Vs MapReduce

Listed below are the major differences between Apache Pig and MapReduce.

Apache Pig MapReduce
Apache Pig is a data flow language. MapReduce is a data processing paradigm.
It is a high-level language. MapReduce is low level and rigid.
Performing a Join operation in Apache Pig is quite easy. It is quite difficult in MapReduce to perform a Join operation between datasets.
Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig. Exposure to Java is a must to work with MapReduce.
Apache Pig uses the multi-query approach, thereby reducing the length of the codes to a great extent. MapReduce will require almost 20 times more lines to perform the same task.
There is no need for compilation. On execution, every Apache Pig operator is converted internally into a MapReduce job. MapReduce jobs have a lengthy compilation process.

Apache Pig Vs SQL

Listed below are the major differences between Apache Pig and SQL.

Pig SQL
Pig Latin is a procedural language. SQL is a declarative language.
In Apache Pig, schema is optional. We can store data without designing a schema (values are addressed as $0, $1, etc.). Schema is mandatory in SQL.
The data model in Apache Pig is nested relational. The data model used in SQL is flat relational.
Apache Pig provides limited opportunity for query optimization. There is more opportunity for query optimization in SQL.
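
To illustrate the procedural vs. declarative contrast, the single SQL statement SELECT city, COUNT(*) FROM customers GROUP BY city becomes a step-by-step pipeline in Pig Latin (file name and schema are assumed):

customers = LOAD 'customers.txt' USING PigStorage(',')
   AS (id:int, name:chararray, city:chararray);
-- group by city, then count the tuples in each group
by_city = GROUP customers BY city;
city_counts = FOREACH by_city GENERATE group AS city, COUNT(customers) AS total;
DUMP city_counts;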

In addition to above differences, Apache Pig Latin −

  • Allows splits in the pipeline.
  • Allows developers to store data anywhere in the pipeline.
  • Declares execution plans.
  • Provides operators to perform ETL (Extract, Transform, and Load) functions.

Apache Pig Vs Hive

Both Apache Pig and Hive are used to create MapReduce jobs. And in some cases, Hive operates on HDFS in a similar way Apache Pig does. In the following table, we have listed a few significant points that set Apache Pig apart from Hive.

Apache Pig Hive
Apache Pig uses a language called Pig Latin. It was originally created at Yahoo. Hive uses a language called HiveQL. It was originally created at Facebook.
Pig Latin is a data flow language. HiveQL is a query processing language.
Pig Latin is a procedural language and it fits in the pipeline paradigm. HiveQL is a declarative language.
Apache Pig can handle structured, unstructured, and semi-structured data. Hive is mostly for structured data.

Applications of Apache Pig

Apache Pig is generally used by data scientists for performing tasks involving ad-hoc processing and quick prototyping. Apache Pig is used −

  • To process huge data sources such as web logs.
  • To perform data processing for search platforms.
  • To process time sensitive data loads.

Apache Pig – History

In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and execute MapReduce jobs on every dataset. In 2007, Apache Pig was open sourced via Apache incubator. In 2008, the first release of Apache Pig came out. In 2010, Apache Pig graduated as an Apache top-level project.

Apache Pig – Architecture

The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level data processing language which provides a rich set of data types and operators to perform various operations on the data.

To perform a particular task using Pig, programmers need to write a Pig script using the Pig Latin language and execute it using any of the execution mechanisms (Grunt Shell, UDFs, Embedded). After execution, these scripts will go through a series of transformations applied by the Pig Framework, to produce the desired output.

Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it makes the programmer's job easy. The architecture of Apache Pig is shown below.

Apache Pig Architecture

Apache Pig Components

As shown in the figure, there are various components in the Apache Pig framework. Let us take a look at the major components.

Parser

Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer

The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations such as projection and pushdown.

Compiler

The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine

Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, where they are executed to produce the desired results.

Pig Latin Data Model

The data model of Pig Latin is fully nested and it allows complex non-atomic data types such as map and tuple. Given below is the diagrammatical representation of Pig Latin's data model.

Data Model

Atom

Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string and can be used as a string and a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.

Example − ‘raja’ or ‘30’

Tuple

A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of RDBMS.

Example − (Raja, 30)

Bag

A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as an inner bag.

Example − {Raja, 30, {9848022338, [email protected],}}

Map

A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value may be of any type. It is represented by ‘[]’.

Example − [name#Raja, age#30]

Relation

A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
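
These types can be combined in a schema. Here is a hedged sketch (the file name and fields are assumptions) of loading a relation whose fields include a nested tuple, a bag, and a map:

details = LOAD 'student_details.txt' AS (
   name:chararray,
   address:tuple(city:chararray, pin:int),   -- a nested tuple
   phones:bag{t:(num:chararray)},            -- an inner bag of tuples
   props:map[]                               -- an untyped map of key-value pairs
);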

Apache Pig – Installation

This chapter explains how to download, install, and set up Apache Pig in your system.

Prerequisites

It is essential that you have Hadoop and Java installed on your system before you go for Apache Pig. Therefore, prior to installing Apache Pig, install Hadoop and Java by following the steps given in the following link −

http://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm

Download Apache Pig

First of all, download the latest version of Apache Pig from the following website − https://pig.apache.org/

Step 1

Open the homepage of the Apache Pig website. Under the section News, click on the link release page as shown in the following snapshot.

Home Page

Step 2

On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under the Download section, you will have two links, namely, Pig 0.8 and later and Pig 0.7 and before. Click on the link Pig 0.8 and later, then you will be redirected to the page having a set of mirrors.

Apache Pig Releases

Step 3

Choose and click any one of these mirrors as shown below.

Click Mirrors

Step 4

These mirrors will take you to the Pig Releases page. This page contains various versions of Apache Pig. Click the latest version among them.

Pig Release

Step 5

Within these folders, you will have the source and binary files of Apache Pig in various distributions. Download the tar files of the source and binary files of Apache Pig 0.15, pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.

Index

Install Apache Pig

After downloading the Apache Pig software, install it in your Linux environment by following the steps given below.

Step 1

Create a directory with the name Pig in the same directory where the installation directories of Hadoop, Java, and other software were installed. (In our tutorial, we have created the Pig directory in the user named Hadoop.)

$ mkdir Pig

Step 2

Extract the downloaded tar files as shown below.

$ cd Downloads/ 
$ tar zxvf pig-0.15.0-src.tar.gz 
$ tar zxvf pig-0.15.0.tar.gz 

Step 3

Move the content of the extracted pig-0.15.0-src directory to the Pig directory created earlier as shown below.

$ mv pig-0.15.0-src/* /home/Hadoop/Pig/

Configure Apache Pig

After installing Apache Pig, we have to configure it. To configure it, we need to edit two files − bashrc and pig.properties.

.bashrc file

In the .bashrc file, set the following variables −

  • PIG_HOME folder to the Apache Pig's installation folder,

  • PATH environment variable to the bin folder, and

  • PIG_CLASSPATH environment variable to the etc (configuration) folder of your Hadoop installations (the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).

export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
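
After editing .bashrc, reload it in the current shell so the new variables take effect (a standard shell step, not specific to Pig):

$ source ~/.bashrc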

pig.properties file

In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various parameters as given below.

pig -h properties 

The following properties are supported −

Logging: verbose = true|false; default is false. This property is the same as -v
       switch brief=true|false; default is false. This property is the same 
       as -b switch debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO.             
       This property is the same as -d switch aggregate.warning = true|false; default is true. 
       If true, prints count of warnings of every type rather than logging every warning.		 
		 
Performance tuning: pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).
       Note that this memory is shared across all large bags used by the application.         
       pig.skewedjoin.reduce.memusage=<mem fraction>; default is 0.3 (30% of all memory).
       Specifies the fraction of heap available for the reducer to perform the join.
       pig.exec.nocombiner = true|false; default is false.
           Only disable combiner as a temporary workaround for issues.         
       opt.multiquery = true|false; multiquery is on by default.
           Only disable multiquery as a temporary workaround for issues.
       opt.fetch=true|false; fetch is on by default.
           Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.         
       pig.tmpfilecompression = true|false; compression is off by default.             
           Determines whether output of intermediate jobs is compressed.         
       pig.tmpfilecompression.codec = lzo|gzip; default is gzip.
           Used in conjunction with pig.tmpfilecompression. Defines compression type.         
       pig.noSplitCombination = true|false. Split combination is on by default.
           Determines if multiple small files are combined into a single map.         
			  
       pig.exec.mapPartAgg = true|false. Default is false.             
           Determines if partial aggregation is done within map phase, before records are sent to combiner.         
       pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.             
           If the in-map partial aggregation does not reduce the output num records by this factor, it gets disabled.
			  
Miscellaneous: exectype = mapreduce|tez|local; default is mapreduce. This property is the same as -x switch
       pig.additional.jars.uris=<comma separated list of jars>. Used in place of register command.
       udf.import.list=<comma separated list of imports>. Used to avoid package names in UDF.
       stop.on.failure = true|false; default is false. Set to true to terminate on the first error.         
       pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
           Determines the timezone used to handle datetime datatype and UDFs.
Additionally, any Hadoop property can be specified.
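
As an illustration, a minimal pig.properties could pin down a few of the options listed above; the particular values chosen here are only an example.

# run in local mode by default and stop at the first error
exectype=local
stop.on.failure=true
# compress the output of intermediate jobs with gzip
pig.tmpfilecompression=true
pig.tmpfilecompression.codec=gzip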

Verifying the Installation

Verify the installation of Apache Pig by typing the version command. If the installation is successful, you will get the version of Apache Pig as shown below.

$ pig -version 
 
Apache Pig version 0.15.0 (r1682971)  
compiled Jun 01 2015, 11:44:35

Apache Pig – Execution

In the previous chapter, we explained how to install Apache Pig. In this chapter, we will discuss how to execute Apache Pig.

Apache Pig Execution Modes

You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.

Local Mode

In this mode, all the files are installed and run from your local host and local file system. There is no need of Hadoop or HDFS. This mode is generally used for testing purposes.

MapReduce Mode

MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.

Apache Pig Execution Mechanisms

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.

  • Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using the Dump operator).

  • Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin script in a single file with the .pig extension.

  • Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions (User Defined Functions) in programming languages such as Java, and using them in our script.

Invoking the Grunt Shell

You can invoke the Grunt shell in a desired mode (local/MapReduce) using the -x option as shown below.

Local mode MapReduce mode

Command −

$ ./pig -x local

Command −

$ ./pig -x mapreduce

Output

Local Mode Output

Output

MapReduce Mode Output

Either of these commands gives you the Grunt shell prompt as shown below.

grunt>

You can exit the Grunt shell using ‘ctrl + d’.

After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin statements in it.

grunt> customers = LOAD 'customers.txt' USING PigStorage(',');

Executing Apache Pig in Batch Mode

You can write an entire Pig Latin script in a file and execute it using the -x option. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.

Sample_script.pig

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
   PigStorage(',') as (id:int,name:chararray,city:chararray);
  
Dump student;

Now, you can execute the script in the above file as shown below.

Local mode MapReduce mode
$ pig -x local Sample_script.pig $ pig -x mapreduce Sample_script.pig

Note − We will discuss in detail how to run a Pig script in Batch mode and in embedded mode in subsequent chapters.

Apache Pig – Grunt Shell

After invoking the Grunt shell, you can run your Pig scripts in the shell. In addition to that, there are certain useful shell and utility commands provided by the Grunt shell. This chapter explains the shell and utility commands provided by the Grunt shell.

Note − In some portions of this chapter, the commands like Load and Store are used. Refer to the respective chapters to get in-detail information on them.

Shell Commands

The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. Prior to that, we can invoke any shell commands using sh and fs.

sh Command

Using the sh command, we can invoke any shell commands from the Grunt shell. Using the sh command from the Grunt shell, we cannot execute the commands that are a part of the shell environment (ex − cd).

Syntax

Given below is the syntax of the sh command.

grunt> sh shell command parameters

Example

We can invoke the ls command of the Linux shell from the Grunt shell using the sh option as shown below. In this example, it lists out the files in the /pig/bin/ directory.

grunt> sh ls
   
pig 
pig_1444799121955.log 
pig.cmd 
pig.py

fs Command

Using the fs command, we can invoke any FsShell commands from the Grunt shell.

Syntax

Given below is the syntax of the fs command.

grunt> fs File System command parameters

Example

We can invoke the ls command of HDFS from the Grunt shell using the fs command. In the following example, it lists the files in the HDFS root directory.

grunt> fs -ls
  
Found 3 items
drwxrwxrwx   - Hadoop supergroup          0 2015-09-08 14:13 Hbase
drwxr-xr-x   - Hadoop supergroup          0 2015-09-09 14:52 seqgen_data
drwxr-xr-x   - Hadoop supergroup          0 2015-09-08 11:30 twitter_data

In the same way, we can invoke all the other file system shell commands from the Grunt shell using the fs command.

Utility Commands

The Grunt shell provides a set of utility commands. These include utility commands such as clear, help, history, quit, and set; and commands such as exec, kill, and run to control Pig from the Grunt shell. Given below is the description of the utility commands provided by the Grunt shell.

clear Command

The clear command is used to clear the screen of the Grunt shell.

Syntax

You can clear the screen of the grunt shell using the clear command as shown below.

grunt> clear

help Command

The help command gives you a list of Pig commands or Pig properties.

Usage

You can get a list of Pig commands using the help command as shown below.

grunt> help

Commands: <pig latin statement>; - See the PigLatin manual for details:
http://hadoop.apache.org/pig
  
File system commands:fs <fs arguments> - Equivalent to Hadoop dfs  command:
http://hadoop.apache.org/common/docs/current/hdfs_shell.html
	 
Diagnostic Commands:describe <alias>[::<alias>] - Show the schema for the alias.
Inner aliases can be described as A::B.
    explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml] 
       [-param <param_name>=<param_value>]
       [-param_file <file_name>] [<alias>] - 
       Show the execution plan to compute the alias or for entire script.
       -script - Explain the entire script.
       -out - Store the output into directory rather than print to stdout.
       -brief - Don't expand nested plans (presenting a smaller graph for overview).
       -dot - Generate the output in .dot format. Default is text format.
       -xml - Generate the output in .xml format. Default is text format.
       -param <param_name> - See parameter substitution for details.
       -param_file <file_name> - See parameter substitution for details.
       alias - Alias to explain.
       dump <alias> - Compute the alias and writes the results to stdout.

Utility Commands: exec [-param <param_name>=param_value] [-param_file <file_name>] <script> -
       Execute the script with access to grunt environment including aliases.
       -param <param_name> - See parameter substitution for details.
       -param_file <file_name> - See parameter substitution for details.
       script - Script to be executed.
    run [-param <param_name>=param_value] [-param_file <file_name>] <script> -
       Execute the script with access to grunt environment.
       -param <param_name> - See parameter substitution for details.         
       -param_file <file_name> - See parameter substitution for details.
       script - Script to be executed.
    sh  <shell command> - Invoke a shell command.
    kill <job_id> - Kill the hadoop job specified by the hadoop job id.
    set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.
       The following keys are supported:
       default_parallel - Script-level reduce parallelism. Basic input size heuristics used 
       by default.
       debug - Set debug on or off. Default is off.
       job.name - Single-quoted name for jobs. Default is PigLatin:<script name>     
       job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high.
       Default is normal stream.skippath - String that contains the path.
       This is used by streaming any hadoop property.
    help - Display this message.
    history [-n] - Display the list statements in cache.
       -n Hide line numbers.
    quit - Quit the grunt shell. 

history Command

This command displays a list of statements executed / used so far since the Grunt shell was invoked.

Usage

Assume we have executed three statements since opening the Grunt shell.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');
 
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');
 
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');
 

Then, using the history command will produce the following output.

grunt> history

customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(','); 
  
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');
   
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');
 

set Command

The set command is used to show/assign values to keys used in Pig.

Usage

Using this command, you can set values to the following keys.

Key Description and values
default_parallel You can set the number of reducers for a map job by passing any whole number as a value to this key.
debug You can turn off or turn on the debugging feature in Pig by passing on/off to this key.
job.name You can set the Job name to the required job by passing a string value to this key.
job.priority

You can set the job priority to a job by passing one of the following values to this key −

  • very_low
  • low
  • normal
  • high
  • very_high
stream.skippath For streaming, you can set the path from where the data is not to be transferred, by passing the desired path in the form of a string to this key.
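
A quick sketch of the set command in action; the values here are arbitrary examples.

grunt> set default_parallel 10
grunt> set debug on
grunt> set job.name 'my pig job'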

quit Command

You can quit from the Grunt shell using this command.

Usage

Quit from the Grunt shell as shown below.

grunt> quit

Let us now take a look at the commands using which you can control Apache Pig from the Grunt shell.

exec Command

Using the exec command, we can execute Pig scripts from the Grunt shell.

Syntax

Given below is the syntax of the utility command exec.

grunt> exec [-param param_name = param_value] [-param_file file_name] [script]

Example

Let us assume there is a file named student.txt in the /pig_data/ directory of HDFS with the following content.

Student.txt

001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi

And, assume we have a script file named sample_script.pig in the /pig_data/ directory of HDFS with the following content.

Sample_script.pig

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',') 
   as (id:int,name:chararray,city:chararray);
  
Dump student;

Now, let us execute the above script from the Grunt shell using the exec command as shown below.

grunt> exec /sample_script.pig

Output

The exec command executes the script in the sample_script.pig. As directed in the script, it loads the student.txt file into Pig and gives you the result of the Dump operator displaying the following content.

(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi) 

kill Command

You can kill a job from the Grunt shell using this command.

Syntax

Given below is the syntax of the kill command.

grunt> kill JobId

Example

Suppose there is a running Pig job having the id Id_0055; you can kill it from the Grunt shell using the kill command, as shown below.

grunt> kill Id_0055

run Command

You can run a Pig script from the Grunt shell using the run command.

Syntax

Given below is the syntax of the run command.

grunt> run [-param param_name = param_value] [-param_file file_name] script

Example

Let us assume there is a file named student.txt in the /pig_data/ directory of HDFS with the following content.

Student.txt

001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi

And, assume we have a script file named sample_script.pig in the local filesystem with the following content.

Sample_script.pig

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
   PigStorage(',') as (id:int,name:chararray,city:chararray);

Now, let us run the above script from the Grunt shell using the run command as shown below.

grunt> run /sample_script.pig

You can see the output of the script using the Dump operator as shown below.

grunt> Dump;

(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)

Note − The difference between exec and the run command is that if we use run, the statements from the script are available in the command history.

Pig Latin – Basics

Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this chapter, we are going to discuss the basics of Pig Latin such as Pig Latin statements, data types, general and relational operators, and Pig Latin UDFs.

Pig Latin – Data Model

As discussed in the previous chapters, the data model of Pig is fully nested. A Relation is the outermost structure of the Pig Latin data model. And it is a bag where −

  • A bag is a collection of tuples.
  • A tuple is an ordered set of fields.
  • A field is a piece of data.

Pig Latin – Statements

While processing data using Pig Latin, statements are the basic constructs.

  • These statements work with relations. They include expressions and schemas.

  • Every statement ends with a semicolon (;).

  • We will perform various operations using operators provided by Pig Latin, through statements.

  • Except LOAD and STORE, while performing all other operations, Pig Latin statements take a relation as input and produce another relation as output.

  • As soon as you enter a Load statement in the Grunt shell, its semantic checking will be carried out. To see the contents of the schema, you need to use the Dump operator. Only after performing the dump operation will the MapReduce job for loading the data into the file system be carried out.

Example

Given below is a Pig Latin statement, which loads data to Apache Pig.

grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') as 
   ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Pig Latin – Data types

The following table describes the Pig Latin data types.

S.N. Data Type Description & Example
1 int

Represents a signed 32-bit integer.

Example : 8

2 long

Represents a signed 64-bit integer.

Example : 5L

3 float

Represents a signed 32-bit floating point.

Example : 5.5F

4 double

Represents a 64-bit floating point.

Example : 10.5

5 chararray

Represents a character array (string) in Unicode UTF-8 format.

Example : ‘tutorials point’

6 Bytearray

Represents a Byte array (blob).

7 Boolean

Represents a Boolean value.

Example : true/ false.

8 Datetime

Represents a date-time.

Example : 1970-01-01T00:00:00.000+00:00

9 Biginteger

Represents a Java BigInteger.

Example : 60708090709

10 Bigdecimal

Represents a Java BigDecimal

Example : 185.98376256272893883

Complex Types
11 Tuple

A tuple is an ordered set of fields.

Example : (raja, 30)

12 Bag

A bag is a collection of tuples.

Example : {(raju,30),(Mohhammad,45)}

13 Map

A Map is a set of key-value pairs.

Example : [ ‘name’#’Raju’, ‘age’#30]
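
As a sketch, a LOAD schema touching several of these types; the file name and fields are assumptions for illustration.

emp = LOAD 'emp.txt' USING PigStorage(',') AS (
   id:int,            -- 32-bit integer
   salary:long,       -- 64-bit integer
   rating:float,      -- 32-bit floating point
   bonus:double,      -- 64-bit floating point
   name:chararray,    -- UTF-8 string
   active:boolean,    -- true/false
   joined:datetime    -- date-time value
);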

Null Values

Values for all the above data types can be NULL. Apache Pig treats null values in a similar way as SQL does.

A null can be an unknown value or a non-existent value. It is used as a placeholder for optional values. These nulls can occur naturally or can be the result of an operation.
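
For example, nulls can be tested with the is null / is not null operators (the relation and field here are assumed from the earlier examples):

with_phone = FILTER student BY phone is not null;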

Pig Latin – Arithmetic Operators

The following table describes the arithmetic operators of Pig Latin. Suppose a = 10 and b = 20.

Operator Description Example
+

Addition − Adds values on either side of the operator

a + b will give 30

−

Subtraction − Subtracts right hand operand from left hand operand

a − b will give −10
*

Multiplication − Multiplies values on either side of the operator

a * b will give 200
/

Division − Divides left hand operand by right hand operand

b / a will give 2
%

Modulus − Divides left hand operand by right hand operand and returns remainder

b % a will give 0
? :

Bincond − Evaluates the Boolean operators. It has three operands as shown below.

variable x = (expression) ? value1 if true : value2 if false.

b = (a == 1)? 20: 30;

if a=1 the value of b is 20.

if a!=1 the value of b is 30.

CASE

WHEN

THEN

ELSE END

Case − The case operator is equivalent to nested bincond operator.

CASE f2 % 2

WHEN 0 THEN 'even'

WHEN 1 THEN 'odd'

END
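
A small sketch applying these operators inside a FOREACH; the emp relation and its fields follow the assumed schema shown earlier.

calc = FOREACH emp GENERATE id,
   salary + bonus AS total,                      -- addition
   (salary % 2 == 0 ? 'even' : 'odd') AS parity; -- bincond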

Pig Latin – Comparison Operators

The following table describes the comparison operators of Pig Latin.

Operator Description Example
==

Equal − Checks if the values of two operands are equal or not; if yes, then the condition becomes true.

(a = b) is not true
!=

Not Equal − Checks if the values of two operands are equal or not. If the values are not equal, then the condition becomes true.

(a != b) is true.
>

Greater than − Checks if the value of the left operand is greater than the value of the right operand. If yes, then the condition becomes true.

(a > b) is not true.
<

Less than − Checks if the value of the left operand is less than the value of the right operand. If yes, then the condition becomes true.

(a < b) is true.
>=

Greater than or equal to − Checks if the value of the left operand is greater than or equal to the value of the right operand. If yes, then the condition becomes true.

(a >= b) is not true.
<=

Less than or equal to − Checks if the value of the left operand is less than or equal to the value of the right operand. If yes, then the condition becomes true.

(a <= b) is true.
matches

Pattern matching − Checks whether the string on the left-hand side matches the constant on the right-hand side.

f1 matches '.*tutorial.*'
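
These operators typically appear inside FILTER conditions, e.g. (relations and fields assumed from the earlier examples):

delhi_students = FILTER student BY city == 'Delhi';
high_paid = FILTER emp BY salary >= 50000 AND name matches 'R.*';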

Pig Latin – Type Construction Operators

The following table describes the Type construction operators of Pig Latin.

Operator Description Example
()

Tuple constructor operator − This operator is used to construct a tuple.

(Raju, 30)
{}

Bag constructor operator − This operator is used to construct a bag.

{(Raju, 30), (Mohammad, 45)}
[]

Map constructor operator − This operator is used to construct a map.

[name#Raja, age#30]

Pig Latin – Relational Operations

The following table describes the relational operators of Pig Latin.

Operator Description
Loading and Storing
LOAD To Load the data from the file system (local/HDFS) into a relation.
STORE To save a relation to the file system (local/HDFS).
Filtering
FILTER To remove unwanted rows from a relation.
DISTINCT To remove duplicate rows from a relation.
FOREACH, GENERATE To generate data transformations based on columns of data.
STREAM To transform a relation using an external program.
Grouping and Joining
JOIN To join two or more relations.
COGROUP To group the data in two or more relations.
GROUP To group the data in a single relation.
CROSS To create the cross product of two or more relations.
Sorting
ORDER To arrange a relation in a sorted order based on one or more fields (ascending or descending).
LIMIT To get a limited number of tuples from a relation.
Combining and Splitting
UNION To combine two or more relations into a single relation.
SPLIT To split a single relation into two or more relations.
Diagnostic Operators
DUMP To print the contents of a relation on the console.
DESCRIBE To describe the schema of a relation.
EXPLAIN To view the logical, physical, or MapReduce execution plans to compute a relation.
ILLUSTRATE To view the step-by-step execution of a series of statements.
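
To show how several of these operators chain together, here is a hedged end-to-end sketch; the file name, path, and schema are assumptions.

student = LOAD 'student.txt' USING PigStorage(',')
   AS (id:int, name:chararray, city:chararray);
-- group by city and count the students in each city
by_city = GROUP student BY city;
counts = FOREACH by_city GENERATE group AS city, COUNT(student) AS total;
-- sort the cities by count and keep the top three
ordered = ORDER counts BY total DESC;
top3 = LIMIT ordered 3;
DUMP top3;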

Apache Pig – Reading Data

In general, Apache Pig works on top of Hadoop. It is an analytical tool that analyzes large datasets that exist in the Hadoop File System. To analyze data using Apache Pig, we have to first load the data into Apache Pig. This chapter explains how to load data to Apache Pig from HDFS.

Preparing HDFS

In MapReduce mode, Pig reads (loads) data from HDFS and stores the results back in HDFS. Therefore, let us start HDFS and create the following sample data in HDFS.

Student ID First Name Last Name Phone City
001 Rajiv Reddy 9848022337 Hyderabad
002 siddarth Battacharya 9848022338 Kolkata
003 Rajesh Khanna 9848022339 Delhi
004 Preethi Agarwal 9848022330 Pune
005 Trupthi Mohanthy 9848022336 Bhuwaneshwar
006 Archana Mishra 9848022335 Chennai

The above dataset contains personal details like id, first name, last name, phone number and city, of six students.

Step 1: Verifying Hadoop

First of all, verify the installation using the Hadoop version command, as shown below.

$ hadoop version

If your system contains Hadoop, and if you have set the PATH variable, then you will get the following output −

Hadoop 2.6.0 
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1 
Compiled by jenkins on 2014-11-13T21:10Z 
Compiled with protoc 2.5.0 
From source with checksum 18e43357c8f927c0695f1e9522859d6a 
This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop
common-2.6.0.jar

Step 2: Starting HDFS

Browse through the sbin directory of Hadoop and start yarn and Hadoop dfs (distributed file system) as shown below.

cd /$Hadoop_Home/sbin/ 
$ start-dfs.sh 
localhost: starting namenode, logging to /home/Hadoop/hadoop/logs/hadoopHadoop-namenode-localhost.localdomain.out 
localhost: starting datanode, logging to /home/Hadoop/hadoop/logs/hadoopHadoop-datanode-localhost.localdomain.out 
Starting secondary namenodes [0.0.0.0] 
starting secondarynamenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoopsecondarynamenode-localhost.localdomain.out
 
$ start-yarn.sh 
starting yarn daemons 
starting resourcemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoopresourcemanager-localhost.localdomain.out 
localhost: starting nodemanager, logging to /home/Hadoop/hadoop/logs/yarnHadoop-nodemanager-localhost.localdomain.out

Step 3: Create a Directory in HDFS

In Hadoop DFS, you can create directories using the command mkdir. Create a new directory in HDFS with the name Pig_Data in the required path as shown below.

$cd /$Hadoop_Home/bin/ 
$ hdfs dfs -mkdir hdfs://localhost:9000/Pig_Data 

Step 4: Placing the data in HDFS

The input file of Pig contains each tuple/record in individual lines. And the entities of the record are separated by a delimiter (in our example we used “,”).

In the local file system, create an input file student_data.txt containing data as shown below.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

Now, move the file from the local file system to HDFS using the put command as shown below. (You can use the copyFromLocal command as well.)

$ cd $HADOOP_HOME/bin 
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/student_data.txt hdfs://localhost:9000/pig_data/

Verifying the file

You can use the cat command to verify whether the file has been moved into the HDFS, as shown below.

$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat hdfs://localhost:9000/pig_data/student_data.txt

Output

You can see the content of the file as shown end up beinglow.

15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
  
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

The Load Operator

You can load data into Apache Pig from the file system (HDFS/ Local) using the LOAD operator of Pig Latin.

Syntax

The load statement consists of two parts divided by the “=” operator. On the left-hand side, we need to mention the name of the relation where we want to store the data, and on the right-hand side, we have to define how we store the data. Given below is the syntax of the Load operator.

Relation_name = LOAD 'Input file path' USING function as schema;

Where,

  • relation_name − We have to mention the relation in which we want to store the data.

  • Input file path − We have to mention the HDFS directory where the file is stored. (In MapReduce mode)

  • function − We have to choose a function from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).

  • Schema − We have to define the schema of the data. We can define the required schema as follows −

(column1 : data type, column2 : data type, column3 : data type);

Note − We can also load the data without specifying the schema. In that case, the columns will be addressed as $0, $1, etc.
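
A short sketch of that schema-less case (the file is assumed from the earlier steps), projecting columns by position:

raw = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',');
-- $0 is the first column (id), $1 the second (first name)
names = FOREACH raw GENERATE $0, $1;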

Example

As an example, let us load the data in student_data.txt in Pig under the schema named Student using the LOAD command.

Start the Pig Grunt Shell

First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown below.

$ pig -x mapreduce

It will start the Pig Grunt shell as shown below.

15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType

2015-10-01 12:33:38,080 [main] INFO  org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-01 12:33:38,080 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/Hadoop/pig_1443683018078.log
2015-10-01 12:33:38,242 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/Hadoop/.pigbootup not found
  
2015-10-01 12:33:39,630 [main]
INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
 
grunt>

Execute the Load Statement

Now load the data from the file student_data.txt into Pig by executing the following Pig Latin statement in the Grunt shell.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' 
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, 
   city:chararray );

Following is the description of the above statement.

Relation name We have stored the data in the schema student.
Input file path We are reading data from the file student_data.txt, which is in the /pig_data/ directory of HDFS.
Storage function We have used the PigStorage() function. It loads and stores data as structured text files. It takes a delimiter using which each entity of a tuple is separated, as a parameter. By default, it takes ‘\t’ as a parameter.
schema

We have stored the data using the following schema.

column id firstname lastname phone city
datatype int char array char array char array char array

Note − The load statement will simply load the data into the specified relation in Pig. To verify the execution of the Load statement, you have to use the Diagnostic Operators which are discussed in the next chapters.

Apache Pig – Storing Data

In the previous chapter, we learnt how to load data into Apache Pig. You can store the loaded data in the file system using the store operator. This chapter explains how to store data in Apache Pig using the Store operator.

Syntax

Given below is the syntax of the Store statement.

STORE Relation_name INTO ' required_directory_path ' [USING function];

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' 
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, 
   city:chararray );

Now, let us store the relation in the HDFS directory “/pig_Output/” as shown below.

grunt> STORE student INTO ' hdfs://localhost:9000/pig_Output/ ' USING PigStorage (',');

Output

After executing the store statement, you will get the following output. A directory is created with the specified name and the data will be stored in it.

2015-10-05 13:05:05,429 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.
MapReduceLauncher - 100% complete
2015-10-05 13:05:05,429 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - 
Script Statistics:
   
HadoopVersion    PigVersion    UserId    StartedAt             FinishedAt             Features 
2.6.0            0.15.0        Hadoop    2015-10-0 13:03:03    2015-10-05 13:05:05    UNKNOWN  
Success!  
Job Stats (time in seconds): 
JobId          Maps    Reduces    MaxMapTime    MinMapTime    AvgMapTime    MedianMapTime    
job_14459_06    1        0           n/a           n/a           n/a           n/a
MaxReduceTime    MinReduceTime    AvgReduceTime    MedianReducetime    Alias    Feature   
     0                 0                0                0             student  MAP_ONLY 
OutPut folder
hdfs://localhost:9000/pig_Output/ 
 
Input(s): Successfully read 0 records from: "hdfs://localhost:9000/pig_data/student_data.txt"  
Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/pig_Output"  
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0 
Total bags proactively spilled: 0
Total records proactively spilled: 0
  
Job DAG: job_1443519499159_0006
  
2015-10-05 13:06:06,192 [main] INFO  org.apache.pig.backend.hadoop.executionengine
.mapReduceLayer.MapReduceLauncher - Success!

Verification

You can verify the stored data as shown below.

Step 1

First of all, list out the files in the directory named pig_Output using the ls command as shown below.

hdfs dfs -ls 'hdfs://localhost:9000/pig_Output/'
Found 2 items
rw-r--r-   1 Hadoop supergroup          0 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/_SUCCESS
rw-r--r-   1 Hadoop supergroup        224 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/part-m-00000

You can observe that two files were created after executing the store statement.

Step 2

Using the cat command, list the contents of the file named part-m-00000 as shown below.

$ hdfs dfs -cat 'hdfs://localhost:9000/pig_Output/part-m-00000' 
1,Rajiv,Reddy,9848022337,Hyderabad
2,siddarth,Battacharya,9848022338,Kolkata
3,Rajesh,Khanna,9848022339,Delhi
4,Preethi,Agarwal,9848022330,Pune
5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
6,Archana,Mishra,9848022335,Chennai 

Apache Pig – Diagnostic Operators

The load statement will simply load the data into the specified relation in Apache Pig. To verify the execution of the Load statement, you have to use the Diagnostic Operators. Pig Latin provides four different types of diagnostic operators −

  • Dump operator
  • Describe operator
  • Explanation operator
  • Illustration operator

In this chapter, we will discuss the Dump operators of Pig Latin.

Dump Operator

The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is generally used for debugging purposes.

Syntax

Given below is the syntax of the Dump operator.

grunt> Dump Relation_Name

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' 
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, 
   city:chararray );

Now, let us print the contents of the relation using the Dump operator as shown below.

grunt> Dump student

Once you execute the above Pig Latin statement, it will start a MapReduce job to read data from HDFS. It will produce the following output.

2015-10-01 15:05:27,642 [main]
INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 
100% complete
2015-10-01 15:05:27,652 [main]
INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:   
HadoopVersion  PigVersion  UserId    StartedAt             FinishedAt       Features             
2.6.0          0.15.0      Hadoop  2015-10-01 15:03:11  2015-10-01 05:27     UNKNOWN
                                                
Success!  
Job Stats (time in seconds):
  
JobId           job_14459_0004
Maps                 1  
Reduces              0  
MaxMapTime          n/a    
MinMapTime          n/a
AvgMapTime          n/a 
MedianMapTime       n/a
MaxReduceTime        0
MinReduceTime        0  
AvgReduceTime        0
MedianReducetime     0
Alias             student 
Feature           MAP_ONLY        
Outputs           hdfs://localhost:9000/tmp/temp580182027/tmp757878456,

Input(s): Successfully read 0 records from: "hdfs://localhost:9000/pig_data/
student_data.txt"
  
Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/tmp/temp580182027/
tmp757878456"  

Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager 
spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0  

Job DAG: job_1443519499159_0004
  
2015-10-01 15:06:28,403 [main]
INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-10-01 15:06:28,441 [main] INFO  org.apache.pig.data.SchemaTupleBackend - 
Key [pig.schematuple] was not set... will not generate code.
2015-10-01 15:06:28,485 [main]
INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths 
to process : 1
2015-10-01 15:06:28,485 [main]
INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths
to process : 1

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)

Apache Pig – Describe Operator

The describe operator is used to view the schema of a relation.

Syntax

The syntax of the describe operator is as follows −

grunt> Describe Relation_name

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us describe the relation named student and verify the schema as shown below.

grunt> describe student;

Output

Once you execute the above Pig Latin statement, it will produce the following output.

grunt> student: { id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray }

Apache Pig – Explain Operator

The explain operator is used to display the logical, physical, and MapReduce execution plans of a relation.

Syntax

Given below is the syntax of the explain operator.

grunt> explain Relation_name;

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us explain the relation named student using the explain operator as shown below.

grunt> explain student;

Output

It will produce the following output.

$ explain student;

2015-10-05 11:32:43,660 [main]
2015-10-05 11:32:43,660 [main] INFO  org.apache.pig.newplan.logical.optimizer
.LogicalPlanOptimizer -
{RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator,
GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, 
MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer,
PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}  
#-----------------------------------------------
# New Logical Plan: 
#-----------------------------------------------
student: (Name: LOStore Schema:
id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)
| 
|---student: (Name: LOForEach Schema:
id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)
    |   |
    |   (Name: LOGenerate[false,false,false,false,false] Schema:
id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city#35:chararray)
ColumnPrune:InputUids=[34, 35, 32, 33, 31]ColumnPrune:OutputUids=[34, 35, 32, 33, 31]
    |   |   | 
    |   |   (Name: Cast Type: int Uid: 31) 
    |   |   |
    |   |   |---id:(Name: Project Type: bytearray Uid: 31 Input: 0 Column: (*))
    |   |   |     
    |   |   (Name: Cast Type: chararray Uid: 32)
    |   |   | 
    |   |   |---firstname:(Name: Project Type: bytearray Uid: 32 Input: 1 Column: (*))
    |   |   |
    |   |   (Name: Cast Type: chararray Uid: 33)
    |   |   |
    |   |   |---lastname:(Name: Project Type: bytearray Uid: 33 Input: 2 Column: (*))
    |   |   | 
    |   |   (Name: Cast Type: chararray Uid: 34)
    |   |   |  
    |   |   |---phone:(Name: Project Type: bytearray Uid: 34 Input: 3 Column: (*))
    |   |   | 
    |   |   (Name: Cast Type: chararray Uid: 35)
    |   |   |  
    |   |   |---city:(Name: Project Type: bytearray Uid: 35 Input: 4 Column: (*))
    |   | 
    |   |---(Name: LOInnerLoad[0] Schema: id#31:bytearray)
    |   |  
    |   |---(Name: LOInnerLoad[1] Schema: firstname#32:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[2] Schema: lastname#33:bytearray)
    |   |
    |   |---(Name: LOInnerLoad[3] Schema: phone#34:bytearray)
    |   | 
    |   |---(Name: LOInnerLoad[4] Schema: city#35:bytearray)
    |
    |---student: (Name: LOLoad Schema: 
id#31:bytearray,firstname#32:bytearray,lastname#33:bytearray,phone#34:bytearray,city#35:bytearray)RequiredFields:null 
#-----------------------------------------------
# Physical Plan: 
#-----------------------------------------------
student: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
| 
|---student: New For Each(false,false,false,false,false)[bag] - scope-35
    |   |
    |   Cast[int] - scope-21
    |   |
    |   |---Project[bytearray][0] - scope-20
    |   |  
    |   Cast[chararray] - scope-24
    |   |
    |   |---Project[bytearray][1] - scope-23
    |   | 
    |   Cast[chararray] - scope-27
    |   |  
    |   |---Project[bytearray][2] - scope-26 
    |   |  
    |   Cast[chararray] - scope-30 
    |   |  
    |   |---Project[bytearray][3] - scope-29
    |   |
    |   Cast[chararray] - scope-33
    |   | 
    |   |---Project[bytearray][4] - scope-32
    | 
    |---student: Load(hdfs://localhost:9000/pig_data/student_data.txt:PigStorage(',')) - scope-19
2015-10-05 11:32:43,682 [main]
INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - 
File concatenation threshold: 100 optimistic? false
2015-10-05 11:32:43,684 [main]
INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - 
MR plan size before optimization: 1
2015-10-05 11:32:43,685 [main]
INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - 
MR plan size after optimization: 1 
#--------------------------------------------------
# Map Reduce Plan                                   
#--------------------------------------------------
MapReduce node scope-37
Map Plan
student: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36
|
|---student: New For Each(false,false,false,false,false)[bag] - scope-35
    |   |
    |   Cast[int] - scope-21 
    |   |
    |   |---Project[bytearray][0] - scope-20
    |   |
    |   Cast[chararray] - scope-24
    |   |
    |   |---Project[bytearray][1] - scope-23
    |   |
    |   Cast[chararray] - scope-27
    |   | 
    |   |---Project[bytearray][2] - scope-26 
    |   | 
    |   Cast[chararray] - scope-30 
    |   |  
    |   |---Project[bytearray][3] - scope-29 
    |   | 
    |   Cast[chararray] - scope-33
    |   | 
    |   |---Project[bytearray][4] - scope-32 
    |  
    |---student: Load(hdfs://localhost:9000/pig_data/student_data.txt:PigStorage(',')) - scope-19
--------
Global sort: false
----------------

Apache Pig – Illustrate Operator

The illustrate operator gives you the step-by-step execution of a sequence of statements.

Syntax

Given below is the syntax of the illustrate operator.

grunt> illustrate Relation_name;

Example

Assume we have a file student_data.txt in HDFS with the following content.

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata 
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune 
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Now, let us illustrate the relation named student as shown below.

grunt> illustrate student;

Output

On executing the above statement, you will get the following output.

grunt> illustrate student;

INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases
being processed per job phase (AliasName[line,offset]): M: student[1,10] C:  R:
---------------------------------------------------------------------------------------------
|student | id:int | firstname:chararray | lastname:chararray | phone:chararray | city:chararray |
--------------------------------------------------------------------------------------------- 
|        | 002    | siddarth            | Battacharya        | 9848022338      | Kolkata        |
---------------------------------------------------------------------------------------------

Apache Pig – Group Operator

The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.

Syntax

Given below is the syntax of the group operator.

grunt> Group_data = GROUP Relation_name BY age;

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Apache Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Now, let us group the records/tuples in the relation by age as shown below.

grunt> group_data = GROUP student_details by age;

Verification

Verify the relation group_data using the DUMP operator as shown below.

grunt> Dump group_data;

Output

Then you will get output displaying the contents of the relation named group_data as shown below. Here you can observe that the resulting schema has two columns −

  • One is age, by which we have grouped the relation.

  • The other is a bag, which contains the group of tuples − the student records with the respective age.

(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})

You can see the schema of the table after grouping the data using the describe command as shown below.

grunt> Describe group_data;
  
group_data: {group: int,student_details: {(id: int,firstname: chararray,
               lastname: chararray,age: int,phone: chararray,city: chararray)}}

In the same way, you can get the sample illustration of the schema using the illustrate command as shown below.

$ Illustrate group_data;

It will produce the following output −

------------------------------------------------------------------------------------------------- 
|group_data|  group:int | student_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)}|
------------------------------------------------------------------------------------------------- 
|          |     21     | {(4, Preethi, Agarwal, 21, 9848022330, Pune), (1, Rajiv, Reddy, 21, 9848022337, Hyderabad)}| 
|          |     22     | {(2,siddarth,Battacharya,22,9848022338,Kolkata),(3,Rajesh,Khanna,22,9848022339,Delhi)}| 
-------------------------------------------------------------------------------------------------
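
Grouping is usually followed by an aggregation over each bag. As a minimal sketch (age_count is a hypothetical alias; the other names follow the example above), you can count the students in each age group with FOREACH and the built-in COUNT function −

grunt> age_count = FOREACH group_data GENERATE group AS age, COUNT(student_details) AS total;
grunt> Dump age_count;

With the sample data above, each age group holds two tuples, so every row of age_count should come out as (age,2).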

Grouping by Multiple Columns

Let us group the relation by age and city as shown below.

grunt> group_multiple = GROUP student_details by (age, city);

You can verify the content of the relation named group_multiple using the Dump operator as shown below.

grunt> Dump group_multiple; 
  
((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})

Group All

You can group a relation by all the columns as shown below.

grunt> group_all = GROUP student_details All;

Now, verify the content of the relation group_all as shown below.

grunt> Dump group_all;  
  
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram), 
(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar), 
(4,Preethi,Agarwal,21,9848022330,Pune),(3,Rajesh,Khanna,22,9848022339,Delhi), 
(2,siddarth,Battacharya,22,9848022338,Kolkata),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})

Apache Pig – Cogroup Operator

The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two operators is that the group operator is normally used with one relation, while the cogroup operator is used in statements involving two or more relations.

Grouping Two Relations using Cogroup

Assume that we have two files namely student_details.txt and employee_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

employee_details.txt

001,Robin,22,newyork 
002,BOB,23,Kolkata 
003,Maya,23,Tokyo 
004,Sara,25,London 
005,David,23,Bhuwaneshwar 
006,Maggy,22,Chennai

And we have loaded these files into Pig with the relation names student_details and employee_details respectively, as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); 
  
grunt> employee_details = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);

Now, let us group the records/tuples of the relations student_details and employee_details with the key age, as shown below.

grunt> cogroup_data = COGROUP student_details by age, employee_details by age;

Verification

Verify the relation cogroup_data using the DUMP operator as shown below.

grunt> Dump cogroup_data;

Output

It will produce the following output, displaying the contents of the relation named cogroup_data as shown below.

(21,{(4,Preethi,Agarwal,21,9848022330,Pune), (1,Rajiv,Reddy,21,9848022337,Hyderabad)}, 
   {    })  
(22,{ (3,Rajesh,Khanna,22,9848022339,Delhi), (2,siddarth,Battacharya,22,9848022338,Kolkata) },  
   { (6,Maggy,22,Chennai),(1,Robin,22,newyork) })  
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)}, 
   {(5,David,23,Bhuwaneshwar),(3,Maya,23,Tokyo),(2,BOB,23,Kolkata)}) 
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)}, 
   { })  
(25,{   }, 
   {(4,Sara,25,London)})

The cogroup operator groups the tuples from each relation according to age, where each group depicts a particular age value.

For example, if we consider the 1st tuple of the result, it is grouped by age 21. And it contains two bags −

  • the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and

  • the second bag contains all the tuples from the second relation (employee_details in this case) having age 21.

In case a relation doesn't have tuples having the age value 21, it returns an empty bag.
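
Since each output tuple of cogroup_data carries one bag per input relation, both sides can be aggregated in a single pass. A minimal sketch (cogroup_counts is a hypothetical alias) −

grunt> cogroup_counts = FOREACH cogroup_data GENERATE group AS age, COUNT(student_details) AS students, COUNT(employee_details) AS employees;

An empty bag counts as 0, so ages that exist in only one relation still produce a row.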

Apache Pig – Join Operator

The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) tuple(s) from each relation as keys. When these keys match, the two particular tuples are matched, else the records are dropped. Joins can be of the following types −

  • Self-join
  • Inner-join
  • Outer-join − left join, right join, and full join

This chapter explains with examples how to use the join operator in Pig Latin. Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.

customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00 
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the relations customers and orders as shown below.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
  
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);

Let us now perform various Join operations on these two relations.

Self-join

Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation.

Generally, in Apache Pig, to perform self-join, we load the same data multiple times, under different aliases (names). Therefore let us load the contents of the file customers.txt as two tables as shown below.

grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
  
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int); 

Syntax

Given below is the syntax of performing self-join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;

Example

Let us perform self-join operation on the relation customers, by joining the two relations customers1 and customers2 as shown below.

grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

Verification

Verify the relation customers3 using the DUMP operator as shown below.

grunt> Dump customers3;

Output

It will produce the following output, displaying the contents of the relation customers3.

(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join

Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when there is a match in both tables.

It creates a new relation by combining column values of two relations (say A and B) based upon the join-predicate. The query compares each row of A with each row of B to find all pairs of rows which satisfy the join-predicate. When the join-predicate is satisfied, the column values for each matched pair of rows of A and B are combined into a result row.

Syntax

Here is the syntax of performing inner join operation using the JOIN operator.

grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;

Example

Let us perform inner join operation on the two relations customers and orders as shown below.

grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;

Verification

Verify the relation coustomer_orders using the DUMP operator as shown below.

grunt> Dump coustomer_orders;

Output

You will get the following output, displaying the contents of the relation named coustomer_orders.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
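
The joined relation carries every column of both inputs, and columns that share a name on both sides (such as id) must be disambiguated with the :: operator. A minimal sketch projecting a few columns from the join result (order_details is a hypothetical alias) −

grunt> order_details = FOREACH coustomer_orders GENERATE customers::id, customers::name, orders::amount;
grunt> Dump order_details;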

Note

Outer Join: Unlike inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways −

  • Left outer join
  • Right outer join
  • Full outer join

Left Outer Join

The left outer join operation returns all rows from the left table, even if there are no matches in the right relation.

Syntax

Given below is the syntax of performing left outer join operation using the JOIN operator.

grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;

Example

Let us perform left outer join operation on the two relations customers and orders as shown below.

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

Verification

Verify the relation outer_left using the DUMP operator as shown below.

grunt> Dump outer_left;

Output

It will produce the following output, displaying the contents of the relation outer_left.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,) 
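
Because unmatched rows come back with nulls on the orders side, a left outer join followed by a null test is a common way to find customers with no orders. A minimal sketch (no_orders is a hypothetical alias) −

grunt> no_orders = FILTER outer_left BY orders::oid is null;
grunt> Dump no_orders;

With the data above, this should keep the rows for Ramesh, Hardik, Komal, and Muffy.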

Right Outer Join

The right outer join operation returns all rows from the right table, even if there are no matches in the left table.

Syntax

Given below is the syntax of performing right outer join operation using the JOIN operator.

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Example

Let us perform right outer join operation on the two relations customers and orders as shown below.

grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Verification

Verify the relation outer_right using the DUMP operator as shown below.

grunt> Dump outer_right;

Output

It will produce the following output, displaying the contents of the relation outer_right.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Full Outer Join

The full outer join operation returns rows when there is a match in either relation.

Syntax

Given below is the syntax of performing full outer join using the JOIN operator.

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Example

Let us perform full outer join operation on the two relations customers and orders as shown below.

grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Verification

Verify the relation outer_full using the DUMP operator as shown below.

grunt> Dump outer_full; 

Output

It will produce the following output, displaying the contents of the relation outer_full.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Using Multiple Keys

We can perform a JOIN operation using multiple keys.

Syntax

Here is how you can perform a JOIN operation on two tables using multiple keys.

grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);

Assume that we have two files namely employee.txt and employee_contact.txt in the /pig_data/ directory of HDFS as shown below.

employee.txt

001,Rajiv,Reddy,21,programmer,003
002,siddarth,Battacharya,22,programmer,003
003,Rajesh,Khanna,22,programmer,003
004,Preethi,Agarwal,21,programmer,003
005,Trupthi,Mohanthy,23,programmer,003
006,Archana,Mishra,23,programmer,003
007,Komal,Nayak,24,teamlead,002
008,Bharathi,Nambiayar,24,manager,001

employee_contact.txt

001,9848022337,[email protected],Hyderabad,003
002,9848022338,[email protected],Kolkata,003
003,9848022339,[email protected],Delhi,003
004,9848022330,[email protected],Pune,003
005,9848022336,[email protected],Bhuwaneshwar,003
006,9848022335,[email protected],Chennai,003
007,9848022334,[email protected],trivendram,002
008,9848022333,[email protected],Chennai,001

And we have loaded these two files into Pig with the relations employee and employee_contact as shown below.

grunt> employee = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int);
  
grunt> employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',') 
   as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);

Now, let us join the contents of these two relations using the JOIN operator as shown below.

grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);

Verification

Verify the relation emp using the DUMP operator as shown below.

grunt> Dump emp; 

Output

It will produce the following output, displaying the contents of the relation named emp as shown below.

(1,Rajiv,Reddy,21,programmer,113,1,9848022337,[email protected],Hyderabad,113)
(2,siddarth,Battacharya,22,programmer,113,2,9848022338,[email protected],Kolkata,113)  
(3,Rajesh,Khanna,22,programmer,113,3,9848022339,[email protected],Delhi,113)  
(4,Preethi,Agarwal,21,programmer,113,4,9848022330,[email protected],Pune,113)  
(5,Trupthi,Mohanthy,23,programmer,113,5,9848022336,[email protected],Bhuwaneshwar,113)  
(6,Archana,Mishra,23,programmer,113,6,9848022335,[email protected],Chennai,113)  
(7,Komal,Nayak,24,teamlead,112,7,9848022334,[email protected],trivendram,112)  
(8,Bharathi,Nambiayar,24,manager,111,8,9848022333,[email protected],Chennai,111)

Apache Pig – Cross Operator

The CROSS operator computes the cross-product of two or more relations. This chapter explains with example how to use the cross operator in Pig Latin.

Syntax

Given below is the syntax of the CROSS operator.

grunt> Relation3_name = CROSS Relation1_name, Relation2_name;

Example

Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.

customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the relations customers and orders as shown below.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
  
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);

Let us now get the cross-product of these two relations using the cross operator as shown below.

grunt> cross_data = CROSS customers, orders;

Verification

Verify the relation cross_data using the DUMP operator as shown below.

grunt> Dump cross_data;

Output

It will produce the following output, displaying the contents of the relation cross_data.

(7,Muffy,24,Indore,10000,103,2008-05-20 00:00:00,4,2060) 
(7,Muffy,24,Indore,10000,101,2009-11-20 00:00:00,2,1560) 
(7,Muffy,24,Indore,10000,100,2009-10-08 00:00:00,3,1500) 
(7,Muffy,24,Indore,10000,102,2009-10-08 00:00:00,3,3000) 
(6,Komal,22,MP,4500,103,2008-05-20 00:00:00,4,2060) 
(6,Komal,22,MP,4500,101,2009-11-20 00:00:00,2,1560) 
(6,Komal,22,MP,4500,100,2009-10-08 00:00:00,3,1500) 
(6,Komal,22,MP,4500,102,2009-10-08 00:00:00,3,3000) 
(5,Hardik,27,Bhopal,8500,103,2008-05-20 00:00:00,4,2060) 
(5,Hardik,27,Bhopal,8500,101,2009-11-20 00:00:00,2,1560) 
(5,Hardik,27,Bhopal,8500,100,2009-10-08 00:00:00,3,1500) 
(5,Hardik,27,Bhopal,8500,102,2009-10-08 00:00:00,3,3000) 
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060) 
(4,Chaitali,25,Mumbai,6500,101,2009-11-20 00:00:00,2,1560) 
(4,Chaitali,25,Mumbai,6500,100,2009-10-08 00:00:00,3,1500) 
(4,Chaitali,25,Mumbai,6500,102,2009-10-08 00:00:00,3,3000) 
(3,kaushik,23,Kota,2000,103,2008-05-20 00:00:00,4,2060) 
(3,kaushik,23,Kota,2000,101,2009-11-20 00:00:00,2,1560) 
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500) 
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000) 
(2,Khilan,25,Delhi,1500,103,2008-05-20 00:00:00,4,2060) 
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560) 
(2,Khilan,25,Delhi,1500,100,2009-10-08 00:00:00,3,1500)
(2,Khilan,25,Delhi,1500,102,2009-10-08 00:00:00,3,3000) 
(1,Ramesh,32,Ahmedabad,2000,103,2008-05-20 00:00:00,4,2060) 
(1,Ramesh,32,Ahmedabad,2000,101,2009-11-20 00:00:00,2,1560) 
(1,Ramesh,32,Ahmedabad,2000,100,2009-10-08 00:00:00,3,1500) 
(1,Ramesh,32,Ahmedabad,2000,102,2009-10-08 00:00:00,3,3000)  
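
Note that CROSS produces one output tuple for every pair of input tuples (here 7 × 4 = 28 rows), so its output grows multiplicatively and it should be used with care on large relations. One common pattern is to follow it with a FILTER to express a join condition that is not a simple equality; a minimal sketch (big_orders is a hypothetical alias) −

grunt> big_orders = FILTER cross_data BY orders::amount > customers::salary;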

Apache Pig – Union Operator

The UNION operator of Pig Latin is used to merge the content of two relations. To perform a UNION operation on two relations, their columns and domains must be identical.

Syntax

Given below is the syntax of the UNION operator.

grunt> Relation_name3 = UNION Relation_name1, Relation_name2;

Example

Assume that we have two files namely student_data1.txt and student_data2.txt in the /pig_data/ directory of HDFS as shown below.

student_data1.txt

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

student_data2.txt

7,Komal,Nayak,9848022334,trivendram
8,Bharathi,Nambiayar,9848022333,Chennai

And we have loaded these two files into Pig with the relations student1 and student2 as shown below.

grunt> student1 = LOAD 'hdfs://localhost:9000/pig_data/student_data1.txt' USING PigStorage(',') 
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray); 
 
grunt> student2 = LOAD 'hdfs://localhost:9000/pig_data/student_data2.txt' USING PigStorage(',') 
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Let us now merge the contents of these two relations using the UNION operator as shown below.

grunt> student = UNION student1, student2;

Verification

Verify the relation student using the DUMP operator as shown below.

grunt> Dump student; 

Output

It will display the following output, displaying the contents of the relation student.

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune) 
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai) 
(7,Komal,Nayak,9848022334,trivendram) 
(8,Bharathi,Nambiayar,9848022333,Chennai)
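
Note that UNION neither removes duplicate tuples nor guarantees the order of the result. If you need duplicates removed, you can follow the union with DISTINCT; a minimal sketch (unique_students is a hypothetical alias) −

grunt> unique_students = DISTINCT student;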

Apache Pig – Split Operator

The SPLIT operator is used to split a relation into two or more relations.

Syntax

Given below is the syntax of the SPLIT operator.

grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi 
004,Preethi,Agarwal,21,9848022330,Pune 
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 
006,Archana,Mishra,23,9848022335,Chennai 
007,Komal,Nayak,24,9848022334,trivendram 
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); 

Let us now split the relation into two, one listing the students of age less than 23, and the other listing the students having an age between 22 and 25.

grunt> SPLIT student_details into student_details1 if age<23, student_details2 if (age>22 and age<25);

Verification

Verify the relations student_details1 and student_details2 using the DUMP operator as shown below.

grunt> Dump student_details1;  

grunt> Dump student_details2; 

Output

It will produce the following output, displaying the contents of the relations student_details1 and student_details2 respectively.

grunt> Dump student_details1; 
(1,Rajiv,Reddy,21,9848022337,Hyderabad) 
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi) 
(4,Preethi,Agarwal,21,9848022330,Pune)
  
grunt> Dump student_details2; 
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar) 
(6,Archana,Mishra,23,9848022335,Chennai) 
(7,Komal,Nayak,24,9848022334,trivendram) 
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
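
Recent Pig versions (0.10 and later) also accept an OTHERWISE clause that catches every tuple not matched by the preceding conditions. A minimal sketch under that assumption (minors and others are hypothetical aliases) −

grunt> SPLIT student_details INTO minors IF age < 23, others OTHERWISE;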

Apache Pig – Filter Operator

The FILTER operator is used to select the required tuples from a relation based on a condition.

Syntax

Given below is the syntax of the FILTER operator.

grunt> Relation2_name = FILTER Relation1_name BY (condition);

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi 
004,Preethi,Agarwal,21,9848022330,Pune 
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 
006,Archana,Mishra,23,9848022335,Chennai 
007,Komal,Nayak,24,9848022334,trivendram 
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now use the Filter operator to get the details of the students who belong to the city Chennai.

grunt> filter_data = FILTER student_details BY city == 'Chennai';

Verification

Verify the relation filter_data using the DUMP operator as shown below.

grunt> Dump filter_data;

Output

It will produce the following output, displaying the contents of the relation filter_data as follows.

(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
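
Conditions can be combined with the boolean operators and, or, and not. A minimal sketch selecting students from Chennai who are older than 22 (senior_chennai is a hypothetical alias) −

grunt> senior_chennai = FILTER student_details BY (city == 'Chennai') and (age > 22);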

Apache Pig – Distinct Operator

The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.

Syntax

Given below is the syntax of the DISTINCT operator.

grunt> Relation_name2 = DISTINCT Relation_name1;

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata 
002,siddarth,Battacharya,9848022338,Kolkata 
003,Rajesh,Khanna,9848022339,Delhi 
003,Rajesh,Khanna,9848022339,Delhi 
004,Preethi,Agarwal,9848022330,Pune 
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai 
006,Archana,Mishra,9848022335,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') 
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Let us now remove the redundant (duplicate) tuples from the relation named student_details using the DISTINCT operator, and store the result as another relation named distinct_data as shown below.

grunt> distinct_data = DISTINCT student_details;

Verification

Verify the relation distinct_data using the DUMP operator as shown below.

grunt> Dump distinct_data;

Output

It will produce the following output, displaying the contents of the relation distinct_data as follows.

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata) 
(3,Rajesh,Khanna,9848022339,Delhi) 
(4,Preethi,Agarwal,9848022330,Pune) 
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)

Apache Pig – Foreach Operator

The FOREACH operator is used to generate specified data transformations based on the column data.

Syntax

Given below is the syntax of the FOREACH operator.

grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi 
004,Preethi,Agarwal,21,9848022330,Pune 
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 
006,Archana,Mishra,23,9848022335,Chennai 
007,Komal,Nayak,24,9848022334,trivendram 
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now get the id, age, and city values of each student from the relation student_details and store them into another relation named foreach_data using the foreach operator as shown below.

grunt> foreach_data = FOREACH student_details GENERATE id,age,city;

Verification

Verify the relation foreach_data using the DUMP operator as shown below.

grunt> Dump foreach_data;

Output

It will produce the following output, displaying the contents of the relation foreach_data.

(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune) 
(5,23,Bhuwaneshwar)
(6,23,Chennai) 
(7,24,trivendram)
(8,24,Chennai) 
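
GENERATE can also compute derived values, not just project columns. A minimal sketch using the bincond operator (condition ? value1 : value2) to tag each student (tagged_data and level are hypothetical names) −

grunt> tagged_data = FOREACH student_details GENERATE id, (age >= 23 ? 'senior' : 'junior') AS level;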

Apache Pig – Order By

The ORDER BY operator is used to display the contents of a relation in sorted order based on one or more fields.

Syntax

Given below is the syntax of the ORDER BY operator.

grunt> Relation_name2 = ORDER Relation_name1 BY column_name (ASC|DESC);

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi 
004,Preethi,Agarwal,21,9848022330,Pune 
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 
006,Archana,Mishra,23,9848022335,Chennai 
007,Komal,Nayak,24,9848022334,trivendram 
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now sort the relation in descending order based on the age of the students and store the result into another relation named order_by_data using the ORDER BY operator as shown below.

grunt> order_by_data = ORDER student_details BY age DESC;

Verification

Verify the relation order_by_data using the DUMP operator as shown below.

grunt> Dump order_by_data; 

Output

It will produce the following output, displaying the contents of the relation order_by_data.

(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(6,Archana,Mishra,23,9848022335,Chennai) 
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(3,Rajesh,Khanna,22,9848022339,Delhi) 
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(4,Preethi,Agarwal,21,9848022330,Pune) 
(1,Rajiv,Reddy,21,9848022337,Hyderabad)

Apache Pig – Limit Operator

The LIMIT operator is used to get a limited number of tuples from a relation.

Syntax

Given below is the syntax of the LIMIT operator.

grunt> Result = LIMIT Relation_name number_of_tuples;

Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi 
004,Preethi,Agarwal,21,9848022330,Pune 
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 
006,Archana,Mishra,23,9848022335,Chennai 
007,Komal,Nayak,24,9848022334,trivendram 
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Now, let us get the first four tuples of the relation and store them into another relation named limit_data using the LIMIT operator as shown below.

grunt> limit_data = LIMIT student_details 4; 

Verification

Verify the relation limit_data using the DUMP operator as shown below.

grunt> Dump limit_data; 

Output

It will produce the following output, displaying the contents of the relation limit_data as follows.

(1,Rajiv,Reddy,21,9848022337,Hyderabad) 
(2,siddarth,Battacharya,22,9848022338,Kolkata) 
(3,Rajesh,Khanna,22,9848022339,Delhi) 
(4,Preethi,Agarwal,21,9848022330,Pune) 
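
LIMIT by itself returns an arbitrary subset; combining it with ORDER BY gives a deterministic top-N query. A minimal sketch that fetches the three oldest students (sorted and oldest3 are hypothetical aliases) −

grunt> sorted = ORDER student_details BY age DESC;
grunt> oldest3 = LIMIT sorted 3;
grunt> Dump oldest3;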

Apache Pig – Eval Functions

Apache Pig provides various built-in functions, namely eval, load, store, math, string, bag, and tuple functions.

Eval Functions

Given below is the list of eval functions provided by Apache Pig.

S.N. Function & Description
1 AVG()

To compute the average of the numerical values within a bag.

2 BagToString()

To concatenate the elements of a bag into a string. While concatenating, we can place a delimiter between these values (optional).

3 CONCAT()

To concatenate two or more expressions of the same type.

4 COUNT()

To get the number of elements in a bag, i.e., the number of tuples in the bag.

5 COUNT_STAR()

It is similar to the COUNT() function. It is used to get the number of elements in a bag, including null values.

6 DIFF()

To compare two bags (fields) in a tuple.

7 IsEmpty()

To check if a bag or map is empty.

8 MAX()

To calculate the highest value for a column (numeric values or chararrays) in a single-column bag.

9 MIN()

To get the minimum (lowest) value (numeric or chararray) for a particular column in a single-column bag.

10 PluckTuple()

Using the Pig Latin PluckTuple() function, we can define a string prefix and filter the columns in a relation that begin with the given prefix.

11 SIZE()

To compute the number of elements based on any Pig data type.

12 SUBTRACT()

To subtract two bags. It takes two bags as inputs and returns a bag containing the tuples of the first bag that are not in the second bag.

13 SUM()

To get the total of the numeric values of a column in a single-column bag.

14 TOKENIZE()

To split a string (which contains a group of words) in a single tuple and return a bag which contains the output of the split operation.
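
Most eval functions operate on a bag, so they are typically applied to the result of a GROUP. A minimal sketch computing summary statistics over the student_details relation used earlier (all_students and stats are hypothetical aliases) −

grunt> all_students = GROUP student_details ALL;
grunt> stats = FOREACH all_students GENERATE COUNT(student_details) AS total, AVG(student_details.age) AS avg_age, MAX(student_details.age) AS max_age;
grunt> Dump stats;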

Apache Pig – Load & Store Functions

The Load and Store functions in Apache Pig are used to determine how data goes into and comes out of Pig. These functions are used with the LOAD and STORE operators. Given below is the list of load and store functions available in Pig.

S.N. Function & Description
1 PigStorage()

To load and store structured files.

2 TextLoader()

To load unstructured data into Pig.

3 BinStorage()

To load and store data into Pig using a machine-readable format.

4 Handling Compression

In Pig Latin, we can load and store compressed data.
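
PigStorage takes the field delimiter as an argument, for both loading and storing. A minimal sketch that writes the student relation used earlier back to HDFS with a pipe delimiter (the output path is hypothetical) −

grunt> STORE student INTO 'hdfs://localhost:9000/pig_output/student_out' USING PigStorage('|');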

Apache Pig – Bag & Tuple Functions

Given below is the list of Bag and Tuple functions.

S.N. Function & Description
1 TOBAG()

To convert two or more expressions into a bag.

2 TOP()

To get the top N tuples of a relation.

3 TOTUPLE()

To convert one or more expressions into a tuple.

4 TOMAP()

To convert the key-value pairs into a Map.

Apache Pig – String Functions

We have the following String functions in Apache Pig.

S.N. Functions & Description
1 ENDSWITH(string, checkAgainst)

To verify whether a given string ends with a particular substring.

2 STARTSWITH(string, substring)

Accepts two string parameters and verifies whether the first string starts with the second.

3 SUBSTRING(string, startIndex, stopIndex)

Returns a substring from a given string.

4 EqualsIgnoreCase(string1, string2)

To compare two strings ignoring the case.

5 INDEXOF(string, 'character', startIndex)

Returns the first occurrence of a character in a string, searching forward from a start index.

6 LAST_INDEX_OF(expression)

Returns the index of the last occurrence of a character in a string, searching backward from a start index.

7 LCFIRST(expression)

Converts the first character in a string to lower case.

8 UCFIRST(expression)

Returns a string with the first character converted to upper case.

9 UPPER(expression)

Returns a string converted to upper case.

10 LOWER(expression)

Converts all characters in a string to lower case.

11 REPLACE(string, 'oldChar', 'newChar')

To replace existing characters in a string with new characters.

12 STRSPLIT(string, regex, limit)

To split a string around matches of a given regular expression.

13 STRSPLITTOBAG(string, regex, limit)

Similar to the STRSPLIT() function, it splits the string by the given delimiter and returns the result in a bag.

14 TRIM(expression)

Returns a copy of a string with leading and trailing whitespaces removed.

15 LTRIM(expression)

Returns a copy of a string with leading whitespaces removed.

16 RTRIM(expression)

Returns a copy of a string with trailing whitespaces removed.
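
String functions are normally applied inside a FOREACH...GENERATE. A minimal sketch over the student_details relation used earlier (name_data is a hypothetical alias) −

grunt> name_data = FOREACH student_details GENERATE id, UPPER(lastname) AS lastname_upper, SUBSTRING(phone, 0, 4) AS phone_prefix;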

Apache Pig – Date-time Functions

Apache Pig provides the following Date and Time functions −

S.N. Functions & Description
1 ToDate(milliseconds)

This function returns a date-time object according to the given parameters. The other alternatives for this function are ToDate(isostring), ToDate(userstring, format), and ToDate(userstring, format, timezone).

2 CurrentTime()

Returns the date-time object of the current time.

3 GetDay(datetime)

Returns the day of a month from the date-time object.

4 GetHour(datetime)

Returns the hour of a day from the date-time object.

5 GetMilliSecond(datetime)

Returns the millisecond of a second from the date-time object.

6 GetMinute(datetime)

Returns the minute of an hour from the date-time object.

7 GetMonth(datetime)

Returns the month of a year from the date-time object.

8 GetSecond(datetime)

Returns the second of a minute from the date-time object.

9 GetWeek(datetime)

Returns the week of a year from the date-time object.

10 GetWeekYear(datetime)

Returns the week year from the date-time object.

11 GetYear(datetime)

Returns the year from the date-time object.

12 AddDuration(datetime, duration)

Returns the result of adding the duration object to the date-time object.

13 SubtractDuration(datetime, duration)

Subtracts the Duration object from the Date-Time object and returns the result.

14 DaysBetween(datetime1, datetime2)

Returns the number of days between the two date-time objects.

15 HoursBetween(datetime1, datetime2)

Returns the number of hours between two date-time objects.

16 MilliSecondsBetween(datetime1, datetime2)

Returns the number of milliseconds between two date-time objects.

17 MinutesBetween(datetime1, datetime2)

Returns the number of minutes between two date-time objects.

18 MonthsBetween(datetime1, datetime2)

Returns the number of months between two date-time objects.

19 SecondsBetween(datetime1, datetime2)

Returns the number of seconds between two date-time objects.

20 WeeksBetween(datetime1, datetime2)

Returns the number of weeks between two date-time objects.

21 YearsBetween(datetime1, datetime2)

Returns the number of years between two date-time objects.
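
These functions operate on datetime values, so a chararray field is usually first converted with ToDate(userstring, format). A minimal sketch over the orders relation used earlier, whose date field follows the pattern yyyy-MM-dd HH:mm:ss (order_years is a hypothetical alias) −

grunt> order_years = FOREACH orders GENERATE oid, GetYear(ToDate(date, 'yyyy-MM-dd HH:mm:ss')) AS order_year;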

Apache Pig – Math Functions

We have the following Math functions in Apache Pig −

S.N. Functions & Description
1 ABS(expression)

To get the absolute value of an expression.

2 ACOS(expression)

To get the arc cosine of an expression.

3 ASIN(expression)

To get the arc sine of an expression.

4 ATAN(expression)

This function is used to get the arc tangent of an expression.

5 CBRT(expression)

This function is used to get the cube root of an expression.

6 CEIL(expression)

This function is used to get the value of an expression rounded up to the nearest integer.

7 COS(expression)

This function is used to get the trigonometric cosine of an expression.

8 COSH(expression)

This function is used to get the hyperbolic cosine of an expression.

9 EXP(expression)

This function is used to get Euler's number e raised to the power of x.

10 FLOOR(expression)

To get the value of an expression rounded down to the nearest integer.

11 LOG(expression)

To get the natural logarithm (base e) of an expression.

12 LOG10(expression)

To get the base 10 logarithm of an expression.

13 RANDOM()

To get a pseudo-random number (type double) greater than or equal to 0.0 and less than 1.0.

14 ROUND(expression)

To get the value of an expression rounded to an integer (if the result type is float) or rounded to a long (if the result type is double).

15 SIN(expression)

To get the sine of an expression.

16 SINH(expression)

To get the hyperbolic sine of an expression.

17 SQRT(expression)

To get the positive square root of an expression.

18 TAN(expression)

To get the trigonometric tangent of an angle.

19 TANH(expression)

To get the hyperbolic tangent of an expression.
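
Math functions are likewise used inside FOREACH...GENERATE. A minimal sketch over the customers relation used earlier (salary_calc is a hypothetical alias; the 10% raise is only an illustration) −

grunt> salary_calc = FOREACH customers GENERATE id, ROUND(salary * 1.1) AS raised_salary;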

Apache Pig – User Defined Functions

In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDFs). Using these UDFs, we can define our own functions and use them. UDF support is provided in six programming languages, namely, Java, Jython, Python, JavaScript, Ruby, and Groovy.

For writing UDFs, complete support is provided in Java and limited support is provided in all the remaining languages. Using Java, you can write UDFs involving all parts of the processing like data load/store, column transformation, and aggregation. Since Apache Pig has been written in Java, UDFs written in Java work more efficiently compared to other languages.

In Apache Pig, we also have a Java repository for UDFs named Piggybank. Using Piggybank, we can access Java UDFs written by other users, and contribute our own UDFs.

Types of UDFs in Java

While writing UDFs using Java, we can create and use the following three types of functions −

  • Filter Functions − The filter functions are used as conditions in filter statements. These functions accept a Pig value as input and return a Boolean value.

  • Eval Functions − The Eval functions are used in FOREACH-GENERATE statements. These functions accept a Pig value as input and return a Pig result.

  • Algebraic Functions − The Algebraic functions act on inner bags in a FOREACH-GENERATE statement. These functions are used to perform full MapReduce operations on an inner bag.

Writing UDFs using Java

To write a UDF using Java, we have to integrate the jar file Pig-0.15.0.jar. In this section, we discuss how to write a sample UDF using Eclipse. Before proceeding further, make sure you have installed Eclipse and Maven on your system.

Follow the steps given below to write a UDF function −

  • Open Eclipse and create a new project (say myproject).

  • Convert the newly created project into a Maven project.

  • Copy the following content in the pom.xml. This file contains the Maven dependencies for the Apache Pig and Hadoop-core jar files.

<project xmlns = "http://maven.apache.org/POM/4.0.0"
   xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation = "http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> 
	
   <modelVersion>4.0.0</modelVersion> 
   <groupId>Pig_Udf</groupId> 
   <artifactId>Pig_Udf</artifactId> 
   <version>0.0.1-SNAPSHOT</version>
	
   <build>    
      <sourceDirectory>src</sourceDirectory>    
      <plugins>      
         <plugin>        
            <artifactId>maven-compiler-plugin</artifactId>        
            <version>3.3</version>        
            <configuration>          
               <source>1.7</source>          
               <target>1.7</target>        
            </configuration>      
         </plugin>    
      </plugins>  
   </build>
	
   <dependencies> 
	
      <dependency>            
         <groupId>org.apache.pig</groupId>            
         <artifactId>pig</artifactId>            
         <version>0.15.0</version>     
      </dependency> 
		
      <dependency>        
         <groupId>org.apache.hadoop</groupId>            
         <artifactId>hadoop-core</artifactId>            
         <version>0.20.2</version>     
      </dependency> 
      
   </dependencies>  
	
</project>
  • Save the file and refresh it. In the Maven Dependencies section, you can find the downloaded jar files.

  • Create a new class file named Sample_Eval and copy the following content into it.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Sample_Eval extends EvalFunc<String> {

   // Returns the first field of the input tuple converted to upper case.
   public String exec(Tuple input) throws IOException {
      if (input == null || input.size() == 0)
         return null;
      String str = (String) input.get(0);
      return str.toUpperCase();
   }
}

While writing UDFs, it is mandatory to inherit the EvalFunc class and provide an implementation of the exec() function. The code required for the UDF is written within this function. In the above example, we have written code that converts the contents of the given column to upper case.

  • After compiling the class without errors, right-click on the Sample_Eval.java file. It gives you a menu. Select Export as shown in the following screenshot.

Select Export

  • On clicking Export, you will get the following window. Click on JAR file.

Click on Export

  • Proceed further by clicking the Next> button. You will get another window where you need to enter the path in the local file system where the jar file is to be stored.

jar export

  • Finally, click the Finish button. In the specified folder, a Jar file sample_udf.jar is created. This jar file contains the UDF written in Java. A command-line alternative is sketched below.
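
If you prefer to avoid Eclipse, an equivalent jar can be built from the command line. This is a rough sketch, assuming Sample_Eval.java and the Pig jar sit in the current directory.

$ javac -cp pig-0.15.0.jar Sample_Eval.java
$ jar -cf sample_udf.jar Sample_Eval.class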

Using the UDF

After writing the UDF and generating the Jar file, follow the steps given below −

Step 1: Registering the Jar file

After writing a UDF (in Java), we have to register the Jar file that contains the UDF using the Register operator. By registering the Jar file, users make the location of the UDF known to Apache Pig.

Syntax

Given below is the syntax of the Register operator.

REGISTER path;

Example

As an example, let us register the sample_udf.jar created earlier in this chapter.

Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.

$ cd PIG_HOME/bin
$ ./pig -x local

REGISTER '/$PIG_HOME/sample_udf.jar'

Note − assume the Jar file is in the path /$PIG_HOME/sample_udf.jar

Step 2: Defining Alias

After registering the UDF, we can define an alias for it using the Define operator.

Syntax

Given below is the syntax of the Define operator.

DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };

Example

Define the alias for sample_eval as shown below.

DEFINE sample_eval Sample_Eval();

Step 3: Using the UDF

After defining the alias, you can use the UDF in the same way as the built-in functions. Suppose there is a file named emp_data.txt in the HDFS directory /pig_data/ with the following content.

001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London 
005,David,23,Bhuwaneshwar 
006,Maggy,22,Chennai
007,Robert,22,newyork
008,Syam,23,Kolkata
009,Mary,25,Tokyo
010,Saran,25,London 
011,Stacy,25,Bhuwaneshwar 
012,Kelly,22,Chennai

And assume we have loaded this file into Pig as shown below.

grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp_data.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);

Let us now convert the names of the employees into upper case using the UDF sample_eval.

grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);

Verify the contents of the relation Upper_case as shown below.

grunt> Dump Upper_case;
  
(ROBIN)
(BOB)
(MAYA)
(SARA)
(DAVID)
(MAGGY)
(ROBERT)
(SYAM)
(MARY)
(SARAN)
(STACY)
(KELLY)

Apache Pig – Running Scripts

In this chapter, we will see how to run Apache Pig scripts in batch mode.

Comments in Pig Script

While writing a script in a file, we can include comments in it as shown below.

Multi-line comments

We begin multi-line comments with '/*' and end them with '*/'.

/* These are multi-line comments
  in the pig script */

Single-line comments

We begin single-line comments with '--'.

--we can write single-line comments like this.
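
Both styles can be mixed freely in one script. A small illustrative fragment (the file and relation names are hypothetical):

/* sample_comments.pig
   loads a data file and prints it */
emp = LOAD 'emp_data.txt' USING PigStorage(','); -- load the data
Dump emp;                                        -- print the relation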

Executing Pig Script in Batch mode

While executing Apache Pig statements in batch mode, follow the steps given below.

Step 1

Write all the required Pig Latin statements and commands in a single file and save it as a .pig file.

Step 2

Execute the Apache Pig script. You can execute the Pig script from the shell (Linux) as shown below.

Local mode −
$ pig -x local Sample_script.pig

MapReduce mode −
$ pig -x mapreduce Sample_script.pig

You can also execute it from the Grunt shell using the exec command as shown below.

grunt> exec /sample_script.pig
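
Grunt also provides the run command. Unlike exec, which executes the script in a separate context, run executes the script as if its statements were typed at the current prompt, so the aliases it defines remain available in the session afterwards.

grunt> run /sample_script.pig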

Executing a Pig Script from HDFS

We can also execute a Pig script that resides in HDFS. Suppose there is a Pig script with the name Sample_script.pig in the HDFS directory named /pig_data/. We can execute it as shown below.

$ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig

Example

Assume we have a file student_details.txt in HDFS with the following content.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad 
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi 
004,Preethi,Agarwal,21,9848022330,Pune 
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 
006,Archana,Mishra,23,9848022335,Chennai 
007,Komal,Nayak,24,9848022334,trivendram 
008,Bharathi,Nambiayar,24,9848022333,Chennai

We also have a sample script named sample_script.pig in the same HDFS directory. This file contains statements that perform operations and transformations on the student relation, as shown below.

student = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
	
student_order = ORDER student BY age DESC;
  
student_limit = LIMIT student_order 4;
  
Dump student_limit;

  • The first statement of the script will load the data in the file named student_details.txt as a relation named student.

  • The second statement of the script will arrange the tuples of the relation in descending order of age, and store the result as student_order.

  • The third statement of the script will store the first 4 tuples of student_order as student_limit.

  • Finally, the fourth statement will dump the content of the relation student_limit.

Let us now execute sample_script.pig as shown below.

$ ./pig -x mapreduce hdfs://localhost:9000/pig_data/sample_script.pig

Apache Pig gets executed and gives you the output with the following content.

(7,Komal,Nayak,24,9848022334,trivendram)
(8,Bharathi,Nambiayar,24,9848022333,Chennai) 
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar) 
(6,Archana,Mishra,23,9848022335,Chennai)
2015-10-19 10:31:27,446 [main] INFO  org.apache.pig.Main - Pig script completed in 12
minutes, 32 seconds and 751 milliseconds (752751 ms)
