HCatalog


HCatalog – Introduction

What is HCatalog?

HCatalog is a table storage management tool for Hadoop. It exposes the tabular data of the Hive metastore to other Hadoop applications. It enables users with different data processing tools (Pig, MapReduce) to easily write data onto a grid. It ensures that users don't have to worry about where or in what format their data is stored.

HCatalog works like a key component of Hive and it enables the users to store their data in any format and any structure.

Why HCatalog?

Enabling the right tool for the right job

The Hadoop ecosystem contains different tools for data processing such as Hive, Pig, and MapReduce. Although these tools do not require metadata, they can still benefit from it when it is present. Sharing a metadata store also enables users across tools to share data more easily. A workflow where data is loaded and normalized using MapReduce or Pig and then analyzed via Hive is very common. If all these tools share one metastore, then the users of each tool have immediate access to data created with another tool. No loading or transfer steps are required.

Capture processing states to enable sharing

HCatalog can publish your analytics results. So the other programmer can access your analytics platform via "REST". The schemas which are published by you are also useful to other data scientists. The other data scientists use your discoveries as inputs into a subsequent discovery.

Integrate Hadoop with everything

Hadoop as a processing and storage environment opens up a lot of opportunity for the enterprise; however, to fuel adoption, it must work with and augment existing tools. Hadoop should serve as input into your analytics platform or integrate with your operational data stores and web applications. The organization should enjoy the value of Hadoop without having to learn an entirely new toolset. REST services open up the platform to the enterprise with a familiar API and SQL-like language. Enterprise data management systems use HCatalog to more deeply integrate with the Hadoop platform.

HCatalog Architecture

The following illustration shows the general architecture of HCatalog.

Architecture

HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written. By default, HCatalog supports RCFile, CSV, JSON, SequenceFile, and ORC file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.

HCatalog is built on top of the Hive metastore and incorporates Hive's DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive's command line interface for issuing data definition and metadata exploration commands.

HCatalog – Installation

All Hadoop sub-projects such as Hive, Pig, and HBase support the Linux operating system. Therefore, you need to install a Linux flavor on your system. HCatalog was merged with the Hive installation on March 26, 2013. From version Hive-0.11.0 onwards, HCatalog comes with the Hive installation. Therefore, follow the steps given below to install Hive, which in turn will automatically install HCatalog on your system.

Step 1: Verifying JAVA Installation

Java must be installed on your system before installing Hive. You can use the following command to check whether you have Java already installed on your system −

$ java -version

If Java is already installed on your system, you get to see the following response −

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If you don't have Java installed on your system, then you need to follow the steps given below.

Step 2: Installing Java

Download Java (JDK <latest version> – X64.tar.gz) by visiting the following link http://www.oracle.com/

Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system.

Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz

$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz

To make Java available to all the users, you have to move it to the location "/usr/local/". Open root, and type the following commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit

For setting up the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin

Now verify the installation using the command java -version from the terminal as explained above.

Step 3: Verifying Hadoop Installation

Hadoop must be installed on your system before installing Hive. Let us verify the Hadoop installation using the following command −

$ hadoop version

If Hadoop is already installed on your system, then you will get the following response −

Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

If Hadoop is not installed on your system, then proceed with the following steps −

Step 4: Downloading Hadoop

Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following commands.

$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/
hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1/* hadoop/
# exit

Step 5: Installing Hadoop in Pseudo Distributed Mode

The following steps are used to install Hadoop 2.4.1 in pseudo distributed mode.

Setting up Hadoop

You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME 
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now apply all the changes to the current running system.

$ source ~/.bashrc

Hadoop Configuration

You can find all the Hadoop configuration files in the location "$HADOOP_HOME/etc/hadoop". You need to make suitable changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs using Java, you have to reset the Java environment variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

Given below is the list of files that you have to edit to configure Hadoop.

core-site.xml

The core-site.xml file contains information such as the port number used for the Hadoop instance, memory allocated for the file system, memory limit for storing the data, and the size of Read/Write buffers.

Open core-site.xml and add the following properties in between the <configuration> and </configuration> tags.

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

hdfs-site.xml

The hdfs-site.xml file contains information such as the value of replication data, the namenode path, and the datanode path of your local file systems. It means the place where you want to store the Hadoop infrastructure.

Let us assume the following data.

dfs.replication (data replication value) = 1

(In the following path /hadoop/ is the user name.
hadoopinfra/hdfs/namenode is the directory created by the hdfs file system.)

namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the hdfs file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties in between the <configuration>, </configuration> tags in this file.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>

   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>

   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>

Note − In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.

yarn-site.xml

This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

mapred-site.xml

This file is used to specify which MapReduce framework we are using. By default, Hadoop contains only a template of mapred-site.xml. First of all, you need to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Open the mapred-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Step 6: Verifying Hadoop Installation

The following steps are used to verify the Hadoop installation.

Namenode Setup

Set up the namenode using the command "hdfs namenode -format" as follows −

$ cd ~
$ hdfs namenode -format

The expected result is as follows −

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1
images with txid >= 0 10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Verifying Hadoop DFS

The following command is used to start the DFS. Executing this command will start your Hadoop file system.

$ start-dfs.sh

The expected output is as follows −

10/24/14 21:37:56 Starting namenodes on [localhost]
localhost: starting namenode, logging to
/home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out localhost:
starting datanode, logging to
   /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Verifying Yarn Script

The following command is used to start the Yarn script. Executing this command will start your Yarn daemons.

$ start-yarn.sh

The expected output is as follows −

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/
yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to
   /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out

Accessing Hadoop on Browser

The default port number to access Hadoop is 50070. Use the following URL to get Hadoop services on your browser.

http://localhost:50070/

Accessing Hadoop

Verify all applications for cluster

The default port number to access all applications of the cluster is 8088. Use the following URL to visit this service.

http://localhost:8088/

Cluster

Once you are done with the installation of Hadoop, proceed to the next step and install Hive on your system.

Step 7: Downloading Hive

We use hive-0.14.0 in this tutorial. You can download it by visiting the following link http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets downloaded onto the /Downloads directory. Here, we download the Hive archive named "apache-hive-0.14.0-bin.tar.gz" for this tutorial. The following command is used to verify the download −

$ cd Downloads
$ ls

On successful download, you get to see the following response −

apache-hive-0.14.0-bin.tar.gz

Step 8: Installing Hive

The following steps are required for installing Hive on your system. Let us assume the Hive archive is downloaded onto the /Downloads directory.

Extracting and Verifying Hive Archive

The following command is used to verify the download and extract the Hive archive −

$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls

On successful download, you get to see the following response −

apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz

Copying files to /usr/local/hive directory

We need to copy the files from the superuser "su -". The following commands are used to copy the files from the extracted directory to the /usr/local/hive directory.

$ su -
passwd:
# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit

Setting up the environment for Hive

You can set up the Hive environment by appending the following lines to the ~/.bashrc file −

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.

The following command is used to execute the ~/.bashrc file.

$ source ~/.bashrc

Step 9: Configuring Hive

To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in the $HIVE_HOME/conf directory. The following commands redirect to the Hive config folder and copy the template file −

$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh

Edit the hive-env.sh file by appending the following line −

export HADOOP_HOME=/usr/local/hadoop

With this, the Hive installation is complete. Now you require an external database server to configure the Metastore. We use the Apache Derby database.

Step 10: Downloading and Installing Apache Derby

Follow the steps given below to download and install Apache Derby −

Downloading Apache Derby

The following command is used to download Apache Derby. It takes some time to download.

$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz

The following command is used to verify the download −

$ ls

On successful download, you get to see the following response −

db-derby-10.4.2.0-bin.tar.gz

Extracting and Verifying Derby Archive

The following commands are used for extracting and verifying the Derby archive −

$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls

On successful download, you get to see the following response −

db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz

Copying Files to /usr/local/derby Directory

We need to copy from the superuser "su -". The following commands are used to copy the files from the extracted directory to the /usr/local/derby directory −

$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit

Setting up the Environment for Derby

You can set up the Derby environment by appending the following lines to the ~/.bashrc file −

export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar

The following command is used to execute the ~/.bashrc file −

$ source ~/.bashrc

Create a Directory for Metastore

Create a directory named data in the $DERBY_HOME directory to store Metastore data.

$ mkdir $DERBY_HOME/data

Derby installation and environmental setup is now complete.

Step 11: Configuring the Hive Metastore

Configuring the Metastore means specifying to Hive where the database is stored. You can do this by editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of all, copy the template file using the following command −

$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml

Edit hive-site.xml and append the following lines between the <configuration> and </configuration> tags −

<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
   <description>JDBC connect string for a JDBC metastore</description>
</property>

Create a file named jpox.properties and add the following lines into it −

javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl

org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false

org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed

javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine

Step 12: Verifying Hive Installation

Before running Hive, you need to create the /tmp folder and a separate Hive folder in HDFS. Here, we use the /user/hive/warehouse folder. You need to set write permission for these newly created folders as shown below −

chmod g+w

Now set them in HDFS before verifying Hive. Use the following commands −

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

The following commands are used to verify the Hive installation −

$ cd $HIVE_HOME
$ bin/hive

On successful installation of Hive, you get to see the following response −

Logging initialized using configuration in 
   jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-0.9.0.jar!/
hive-log4j.properties Hive history
   file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….
hive>

You can execute the following sample command to display all the tables −

hive> show tables;
OK Time taken: 2.798 seconds
hive>

Step 13: Verify HCatalog Installation

Use the following command to set the system variable HCAT_HOME for the HCatalog home.

export HCAT_HOME=$HIVE_HOME/hcatalog

Use the following command to verify the HCatalog installation.

cd $HCAT_HOME/bin
./hcat

If the installation is successful, you will get to see the following output −

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
usage: hcat { -e "<query>" | -f "<filepath>" } 
   [ -g "<group>" ] [ -p "<perms>" ] 
   [ -D"<name> = <value>" ]

-D <property = value>    use hadoop value for given property
-e <exec>                hcat command given from command line
-f <file>                hcat commands in file
-g <group>               group for the db/table specified in CREATE statement
-h,--help                Print help information
-p <perms>               permissions for the db/table specified in CREATE statement

HCatalog – CLI

The HCatalog Command Line Interface (CLI) can be invoked from the command $HIVE_HOME/hcatalog/bin/hcat, where $HIVE_HOME is the home directory of Hive. hcat is a command used to initialize the HCatalog server.

Use the following command to initialize the HCatalog command line.

cd $HCAT_HOME/bin
./hcat

If the installation has been done correctly, then you will get the following output −

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
usage: hcat { -e "<query>" | -f "<filepath>" } 
   [ -g "<group>" ] [ -p "<perms>" ] 
   [ -D"<name> = <value>" ]

-D <property = value>    use hadoop value for given property
-e <exec>                hcat command given from command line
-f <file>                hcat commands in file
-g <group>               group for the db/table specified in CREATE statement
-h,--help                Print help information
-p <perms>               permissions for the db/table specified in CREATE statement

The HCatalog CLI supports these command line options −

Sr.No Option Example & Description
1 -g

hcat -g mygroup …

The table to be created must have the group "mygroup".

2 -p

hcat -p rwxr-xr-x …

The table to be created must have read, write, and execute permissions.

3 -f

hcat -f myscript.HCatalog …

myscript.HCatalog is a script file containing DDL commands to execute.

4 -e

hcat -e 'create table mytable(a int);' …

Treat the following string as a DDL command and execute it.

5 -D

hcat -Dkey = value …

Passes the key-value pair to HCatalog as a Java system property.

6

hcat

Prints a usage message.

Note −

  • The -g and -p options are not mandatory.

  • At one time, either the -e or the -f option can be provided, not both.

  • The order of options is immaterial; you can specify the options in any order.

Sr.No DDL Command & Description
1

CREATE TABLE

Create a table using HCatalog. If you create a table with a CLUSTERED BY clause, you will not be able to write to it with Pig or MapReduce.

2

ALTER TABLE

Supported except for the REBUILD and CONCATENATE options. Its behavior remains the same as in Hive.

3

DROP TABLE

Supported. Behavior the same as Hive (drops the complete table and structure).

4

CREATE/ALTER/DROP VIEW

Supported. Behavior the same as Hive.

Note − Pig and MapReduce cannot read from or write to views.

5

SHOW TABLES

Display a list of tables.

6

SHOW PARTITIONS

Display a list of partitions.

7

Create/Drop Index

CREATE and DROP FUNCTION operations are supported, but the created functions must still be registered in Pig and placed in CLASSPATH for MapReduce.

8

DESCRIBE

Supported. Behavior the same as Hive. Describes the structure.

Some of the commands from the above table are explained in subsequent chapters.

HCatalog – Create Table

This chapter explains how to create a table and how to insert data into it. The conventions for creating a table in HCatalog are very similar to creating a table using Hive.

Create Table Statement

Create Table is a statement used to create a table in the Hive metastore using HCatalog. Its syntax and example are as follows −

Syntax

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

Example

Let us assume you need to create a table named employee using the CREATE TABLE statement. The following table lists the fields and their data types in the employee table −

Sr.No Field Name Data Type
1 Eid int
2 Name String
3 Salary Float
4 Designation string

The following data defines the supported fields such as Comment, Row formatted fields such as Field terminator and Lines terminator, and the Stored File type.

COMMENT 'Employee details'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED IN TEXT FILE

The following query creates a table named employee using the above data.

./hcat -e "CREATE TABLE IF NOT EXISTS employee ( eid int, name String, 
   salary String, designation String) 
COMMENT 'Employee details' 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE;"

If you add the option IF NOT EXISTS, HCatalog ignores the statement in case the table already exists.

On successful creation of the table, you get to see the following response −

OK
Time taken: 5.905 seconds

Load Data Statement

Generally, after creating a table in SQL, we can insert data using the Insert statement. But in HCatalog, we insert data using the LOAD DATA statement.

While inserting data into HCatalog, it is better to use LOAD DATA to store bulk records. There are two ways to load data: one is from the local file system and the second is from the Hadoop file system.

Syntax

The syntax for LOAD DATA is as follows −

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
  • LOCAL is the identifier to specify the local path. It is optional.
  • OVERWRITE is optional to overwrite the data in the table.
  • PARTITION is optional.

Example

We will insert the following data into the table. It is a text file named sample.txt in the /home/user directory.

1201  Gopal        45000    Technical manager
1202  Manisha      45000    Proof reader
1203  Masthanvali  40000    Technical writer
1204  Kiran        40000    Hr Admin
1205  Kranthi      30000    Op Admin

The following query loads the given text into the table.

./hcat -e "LOAD DATA LOCAL INPATH '/home/user/sample.txt'
OVERWRITE INTO TABLE employee;"

On successful execution, you get to see the following response −

OK
Time taken: 15.905 seconds

HCatalog – Alter Table

This chapter explains how to alter the attributes of a table, such as changing its table name, changing column names, adding columns, and deleting or replacing columns.

Alter Table Statement

You can use the ALTER TABLE statement to alter a table in Hive.

Syntax

The statement takes any of the following syntaxes based on what attributes we wish to modify in a table.

ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

Some of the scenarios are explained below.

Rename To… Statement

The following query renames a table from employee to emp.

./hcat -e "ALTER TABLE employee RENAME TO emp;"

Change Statement

The following table contains the fields of the employee table and it shows the fields to be changed (in bold).

Field Name Convert from Data Type Change Field Name Convert to Data Type
eid int eid int
name String ename String
salary Float salary Double
designation String designation String

The following queries rename the column name and the column data type using the above data −

./hcat -e "ALTER TABLE employee CHANGE name ename String;"
./hcat -e "ALTER TABLE employee CHANGE salary salary Double;"

Add Columns Statement

The following query adds a column named dept to the employee table.

./hcat -e "ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');"

Replace Statement

The following query deletes all the columns from the employee table and replaces them with the empid and name columns −

./hcat -e "ALTER TABLE employee REPLACE COLUMNS (empid INT, name STRING);"

Drop Table Statement

This chapter describes how to drop a table in HCatalog. When you drop a table from the metastore, it removes the table/column data and their metadata. It can be a normal table (stored in the metastore) or an external table (stored in the local file system); HCatalog treats both in the same manner, irrespective of their types.

The syntax is as follows −

DROP TABLE [IF EXISTS] table_name;

The following query drops a table named employee −

./hcat -e "DROP TABLE IF EXISTS employee;"

On successful execution of the query, you get to see the following response −

OK
Time taken: 5.3 seconds

HCatalog – View

This chapter describes how to create and manage a view in HCatalog. Database views are created using the CREATE VIEW statement. Views can be created from a single table, multiple tables, or another view.

To create a view, a user must have appropriate system privileges according to the specific implementation.

Create View Statement

CREATE VIEW creates a view with the given name. An error is thrown if a table or view with the same name already exists. You can use IF NOT EXISTS to skip the error.

If no column names are supplied, the names of the view's columns will be derived automatically from the defining SELECT expression.

Note − If the SELECT contains un-aliased scalar expressions such as x+y, the resulting view column names will be generated in the form _C0, _C1, etc.

When renaming columns, column comments can also be supplied. Comments are not automatically inherited from the underlying columns.

A CREATE VIEW statement will fail if the view's defining SELECT expression is invalid.

Syntax

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...;

Example

The following is the employee table data. Now let us see how to create a view named Emp_Deg_View containing the fields id, name, Designation, and salary of employees having a salary greater than 35,000.

+------+-------------+--------+-------------------+-------+
|  ID  |    Name     | Salary |    Designation    | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 |    Gopal    | 45000  | Technical manager |  TP   |
| 1202 |   Manisha   | 45000  | Proofreader       |  PR   |
| 1203 | Masthanvali | 30000  | Technical writer  |  TP   |
| 1204 |    Kiran    | 40000  | Hr Admin          |  HR   |
| 1205 |   Kranthi   | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+

The following is the command to create a view based on the above given data.

./hcat -e "CREATE VIEW Emp_Deg_View (id, name, salary COMMENT 'salary more than 35,000', designation)
   AS SELECT id, name, salary, designation FROM employee WHERE salary >= 35000;"

Output

OK
Time taken: 5.3 seconds

Drop View Statement

DROP VIEW removes metadata for the specified view. When dropping a view referenced by other views, no warning is given (the dependent views are left dangling as invalid and must be dropped or recreated by the user).

Syntax

DROP VIEW [IF EXISTS] view_name;

Example

The following command is used to drop a view named Emp_Deg_View.

DROP VIEW Emp_Deg_View;

HCatalog – Show Tables

You often want to list all the tables in a database or list all the columns in a table. Obviously, every database has its own syntax to list the tables and columns.

The Show Tables statement displays the names of all tables. By default, it lists tables from the current database, or with the IN clause, from a specified database.

This chapter describes how to list out all the tables from the current database in HCatalog.

Show Tables Statement

The syntax of SHOW TABLES is as follows −

SHOW TABLES [IN database_name] ['identifier_with_wildcards'];

The following query displays a list of tables −

./hcat -e "Show tables;"

On successful execution of the query, you get to see the following response −

OK
emp
employee
Time taken: 5.3 seconds

HCatalog – Show Partitions

A partition is a condition for tabular data which is used for creating a separate table or view. SHOW PARTITIONS lists all the existing partitions for a given base table. Partitions are listed in alphabetical order. After Hive 0.6, it is also possible to specify parts of a partition specification to filter the resulting list.

You can use the SHOW PARTITIONS command to see the partitions that exist in a particular table. This chapter describes how to list out the partitions of a particular table in HCatalog.

Show Partitions Statement

The syntax is as follows −

SHOW PARTITIONS table_name;

The following query shows the partitions of the employee table −

./hcat -e "Show partitions employee;"

On successful execution of the query, you get to see the following response −

OK
Designation = IT
Time taken: 5.3 seconds

Dynamic Partition

HCatalog organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date, city, and department. Using partitions, it is easy to query a portion of the data.

For example, a table named Tab1 contains employee data such as id, name, dept, and yoj (i.e., year of joining). Suppose you need to retrieve the details of all employees who joined in 2012. A query searches the whole table for the required information. However, if you partition the employee data by the year and store it in a separate file, it reduces the query processing time. The following example shows how to partition a file and its data −

The following file contains the employeedata table.

/tab1/employeedata/file1

id, name,   dept, yoj
1,  gopal,   TP, 2012
2,  kiran,   HR, 2012
3,  kaleel,  SC, 2013
4, Prasanth, SC, 2013

The above data is partitioned into two files using the year.

/tab1/employeedata/2012/file2

1, gopal, TP, 2012
2, kiran, HR, 2012

/tab1/employeedata/2013/file3

3, kaleel,   SC, 2013
4, Prasanth, SC, 2013

Adding a Partition

We can add partitions to a table by altering the table. Let us assume we have a table called employee with fields such as Id, Name, Salary, Designation, Dept, and yoj.

Syntax

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;
partition_spec:
: (p_column = p_col_value, p_column = p_col_value, ...)

The following query is used to add a partition to the employee table.

./hcat -e "ALTER TABLE employee ADD PARTITION (year = '2013') location '/2013/part2013';"

Renaming a Partition

You can use the RENAME-TO command to rename a partition. Its syntax is as follows −

./hcat -e "ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;"

The following query is used to rename a partition −

./hcat -e "ALTER TABLE employee PARTITION (year='1203') RENAME TO PARTITION (Yoj='1203');"

Dropping a Partition

The syntax of the command that is used to drop a partition is as follows −

./hcat -e "ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec,
   PARTITION partition_spec,...;"

The following query is used to drop a partition −

./hcat -e "ALTER TABLE employee DROP [IF EXISTS] PARTITION (year='1203');"

HCatalog – Indexes

Creating an Index

An index is nothing but a pointer on a particular column of a table. Creating an index means creating a pointer on a particular column of a table. Its syntax is as follows −

CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name = property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)][
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]

Example

Let us take an example to understand the concept of an index. Use the same employee table that we have used earlier with the fields Id, Name, Salary, Designation, and Dept. Create an index named index_salary on the salary column of the employee table.

The following query creates an index −

./hcat -e "CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';"

It is a pointer to the salary column. If the column is modified, the changes are stored using an index value.

Dropping an Index

The following syntax is used to drop an index −

DROP INDEX <index_name> ON <table_name>

The following query drops the index index_salary −

./hcat -e "DROP INDEX index_salary ON employee;"

HCatalog – Reader Writer

HCatalog contains a data transfer API for parallel input and output without using MapReduce. This API uses a basic storage abstraction of tables and rows to read data from a Hadoop cluster and write data into it.

The Data Transfer API contains mainly three classes; those are −

  • HCatReader − Reads data from a Hadoop cluster.

  • HCatWriter − Writes data into a Hadoop cluster.

  • DataTransferFactory − Generates reader and writer instances.

This API is suitable for a master-slave node setup. Let us discuss more on HCatReader and HCatWriter.

HCatReader

HCatReader is an abstract class internal to HCatalog and abstracts away the complexities of the underlying system from which the records are retrieved.

S. No. Method Name & Description
1

Public abstract ReaderContext prepareRead() throws HCatException

This should be called at the master node to obtain a ReaderContext, which should then be serialized and sent to the slave nodes.

2

Public abstract Iterator <HCatRecord> read() throws HCatException

This should be called at the slave nodes to read HCatRecords.

3

Public Configuration getConf()

It will return the configuration class object.

The HCatReader class is used to read the data from HDFS. Reading is a two-step process in which the first step occurs on the master node of an external system. The second step is carried out in parallel on multiple slave nodes.

Reads are done on a ReadEntity. Before you start to read, you need to define a ReadEntity from which to read. This can be done through ReadEntity.Builder. You can specify a database name, table name, partition, and filter string. For example −

ReadEntity.Builder builder = new ReadEntity.Builder();
ReadEntity entity = builder.withDatabase("mydb").withTable("mytbl").build();

The above code snippet defines a ReadEntity object ("entity"), comprising a table named mytbl in a database named mydb, which can be used to read all the rows of this table. Note that this table must exist in HCatalog prior to the start of this operation.

After defining a ReadEntity, you obtain an instance of HCatReader using the ReadEntity and the cluster configuration −

HCatReader reader = DataTransferFactory.getHCatReader(entity, config);

The next step is to obtain a ReaderContext from the reader as follows −

ReaderContext cntxt = reader.prepareRead();
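
The following is a minimal sketch of the slave-side step that completes the read. It assumes the serialized ReaderContext has already been shipped to the slave; the helper class name, the slave number argument, and the printing of each record are illustrative only. The TestReaderWriter program later in this chapter shows the same calls inside a full test.

import java.util.Iterator;

import org.apache.hive.hcatalog.common.HCatException;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatReader;
import org.apache.hive.hcatalog.data.transfer.ReaderContext;

public class SlaveSideRead {
   // Each slave rebuilds a reader from the ReaderContext and its own split number.
   public static void readSplit(ReaderContext cntxt, int slaveNum) throws HCatException {
      HCatReader reader = DataTransferFactory.getHCatReader(cntxt, slaveNum);
      Iterator<HCatRecord> itr = reader.read();

      while (itr.hasNext()) {
         HCatRecord record = itr.next();
         System.out.println(record);   // process each record as required
      }
   }
}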

HCatWriter

This abstraction is internal to HCatalog. It exists to facilitate writing to HCatalog from external systems. Don't try to instantiate this directly. Instead, use DataTransferFactory.

Sr.No. Method Name & Description
1

Public abstract WriterContext prepareWrite() throws HCatException

The external system should invoke this method exactly once from a master node. It returns a WriterContext. This should be serialized and sent to the slave nodes to construct an HCatWriter there.

2

Public abstract void write(Iterator<HCatRecord> recordItr) throws HCatException

This method should be used at slave nodes to perform writes. The recordItr is an iterator object that contains the collection of records to be written into HCatalog.

3

Public abstract void abort(WriterContext cntxt) throws HCatException

This method should be called at the master node. The primary purpose of this method is to do cleanups in case of failures.

4

public abstract void commit(WriterContext cntxt) throws HCatException

This method should be called at the master node. The purpose of this method is to do the metadata commit.

Similar to reading, writing is also a two-step process in which the first step occurs on the master node. Subsequently, the second step occurs in parallel on slave nodes.

Writes are done on a WriteEntity which can be constructed in a fashion similar to reads −

WriteEntity.Builder builder = new WriteEntity.Builder();
WriteEntity entity = builder.withDatabase("mydb").withTable("mytbl").build();

The above code creates a WriteEntity object entity which can be used to write into a table named mytbl in the database mydb.

After creating a WriteEntity, the next step is to obtain a WriterContext −

HCatWriter writer = DataTransferFactory.getHCatWriter(entity, config);
WriterContext info = writer.prepareWrite();

All of the above steps occur on the master node. The master node then serializes the WriterContext object and makes it available to all the slaves.

On slave nodes, you need to obtain an HCatWriter using the WriterContext as follows −

HCatWriter writer = DataTransferFactory.getHCatWriter(context);

Then, the writer takes an iterator as the argument for the write method −

writer.write(hCatRecordItr);

The writer then calls next() on this iterator in a loop and writes out all the records attached to the iterator.
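
Once every slave has finished writing, the master finalizes the operation. The following is a minimal sketch of that last step; the helper class and the success flag are illustrative, and the only HCatalog calls involved are the commit() and abort() methods from the table above.

import java.io.IOException;

import org.apache.hive.hcatalog.data.transfer.HCatWriter;
import org.apache.hive.hcatalog.data.transfer.WriterContext;

public class MasterSideCommit {
   public static void finish(HCatWriter writer, WriterContext context,
         boolean allSlavesSucceeded) throws IOException {
      if (allSlavesSucceeded) {
         writer.commit(context);   // publish the written data with a metadata commit
      } else {
         writer.abort(context);    // clean up partial output after a failure
      }
   }
}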

The TestReaderWriter.java file is used to test the HCatReader and HCatWriter classes. The following program demonstrates how to use the HCatReader and HCatWriter API to write records into a table and subsequently read them back.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.ql.CommandNeedRetryException;
import org.apache.hadoop.mapreduce.InputSplit;

import org.apache.hive.hcatalog.common.HCatException;
import org.apache.hive.hcatalog.data.DefaultHCatRecord;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatReader;
import org.apache.hive.hcatalog.data.transfer.HCatWriter;
import org.apache.hive.hcatalog.data.transfer.ReadEntity;
import org.apache.hive.hcatalog.data.transfer.ReaderContext;
import org.apache.hive.hcatalog.data.transfer.WriteEntity;
import org.apache.hive.hcatalog.data.transfer.WriterContext;
import org.apache.hive.hcatalog.mapreduce.HCatBaseTest;

import org.junit.Assert;
import org.junit.Test;

public class TestReaderWriter extends HCatBaseTest {
   @Test
   public void test() throws MetaException, CommandNeedRetryException,
      IOException, ClassNotFoundException {

      driver.run("drop table mytbl");
      driver.run("create table mytbl (a string, b int)");

      Iterator<Entry<String, String>> itr = hiveConf.iterator();
      Map<String, String> map = new HashMap<String, String>();

      while (itr.hasNext()) {
         Entry<String, String> kv = itr.next();
         map.put(kv.getKey(), kv.getValue());
      }

      WriterContext cntxt = runsInMaster(map);
      File writeCntxtFile = File.createTempFile("hcat-write", "temp");
      writeCntxtFile.deleteOnExit();

      // Serialize context.
      ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(writeCntxtFile));
      oos.writeObject(cntxt);
      oos.flush();
      oos.close();

      // Now, deserialize it.
      ObjectInputStream ois = new ObjectInputStream(new FileInputStream(writeCntxtFile));
      cntxt = (WriterContext) ois.readObject();
      ois.close();
      runsInSlave(cntxt);
      commit(map, true, cntxt);

      ReaderContext readCntxt = runsInMaster(map, false);
      File readCntxtFile = File.createTempFile("hcat-read", "temp");
      readCntxtFile.deleteOnExit();
      oos = new ObjectOutputStream(new FileOutputStream(readCntxtFile));
      oos.writeObject(readCntxt);
      oos.flush();
      oos.close();

      ois = new ObjectInputStream(new FileInputStream(readCntxtFile));
      readCntxt = (ReaderContext) ois.readObject();
      ois.close();

      for (int i = 0; i < readCntxt.numSplits(); i++) {
         runsInSlave(readCntxt, i);
      }
   }

   private WriterContext runsInMaster(Map<String, String> config) throws HCatException {
      WriteEntity.Builder builder = new WriteEntity.Builder();
      WriteEntity entity = builder.withTable("mytbl").build();

      HCatWriter writer = DataTransferFactory.getHCatWriter(entity, config);
      WriterContext info = writer.prepareWrite();
      return info;
   }

   private ReaderContext runsInMaster(Map<String, String> config, 
      boolean bogus) throws HCatException {
      ReadEntity entity = new ReadEntity.Builder().withTable("mytbl").build();
      HCatReader reader = DataTransferFactory.getHCatReader(entity, config);
      ReaderContext cntxt = reader.prepareRead();
      return cntxt;
   }

   private void runsInSlave(ReaderContext cntxt, int slaveNum) throws HCatException {
      HCatReader reader = DataTransferFactory.getHCatReader(cntxt, slaveNum);
      Iterator<HCatRecord> itr = reader.read();
      int i = 1;

      while (itr.hasNext()) {
         HCatRecord read = itr.next();
         HCatRecord written = getRecord(i++);

         // HCatRecord does not implement equals(); compare the fields directly.
         Assert.assertTrue("Read: " + read.get(0) + "Written: " + written.get(0),
         written.get(0).equals(read.get(0)));

         Assert.assertTrue("Read: " + read.get(1) + "Written: " + written.get(1),
         written.get(1).equals(read.get(1)));

         Assert.assertEquals(2, read.size());
      }

      //Assert.assertFalse(itr.hasNext());
   }

   private void runsInSlave(WriterContext context) throws HCatException {
      HCatWriter writer = DataTransferFactory.getHCatWriter(context);
      writer.write(new HCatRecordItr());
   }

   private void commit(Map<String, String> config, boolean status,
      WriterContext context) throws IOException {
      WriteEntity.Builder builder = new WriteEntity.Builder();
      WriteEntity entity = builder.withTable("mytbl").build();
      HCatWriter writer = DataTransferFactory.getHCatWriter(entity, config);

      if (status) {
         writer.commit(context);
      } else {
         writer.abort(context);
      }
   }

   private static HCatRecord getRecord(int i) {
      List<Object> list = new ArrayList<Object>(2);
      list.add("Row #: " + i);
      list.add(i);
      return new DefaultHCatRecord(list);
   }

   private static class HCatRecordItr implements Iterator<HCatRecord> {
      int i = 0;

      @Override
      public boolean hasNext() {
         return i++ < 100 ? true : false;
      }

      @Override
      public HCatRecord next() {
         return getRecord(i);
      }

      @Override
      public void remove() {
         throw new RuntimeException();
      }
   }
}

The above program writes records into the table mytbl and then reads them back, verifying that each record read matches the record that was written.

HCatalog – Input Output Format

The HCatInputFormat and HCatOutputFormat interfaces are used to read data from HDFS and, after processing, write the resultant data into HDFS using a MapReduce job. Let us elaborate on the input and output format interfaces.

HCatInputFormat

HCatInputFormat is used with MapReduce jobs to read data from HCatalog-managed tables. HCatInputFormat exposes a Hadoop 0.20 MapReduce API for reading data as if it had been published to a table.

Sr.No. Method Name & Description
1

public static HCatInputFormat setInput(Job job, String dbName, String tableName) throws IOException

Set inputs to use for the job. It queries the metastore with the given input specification and serializes matching partitions into the job configuration for MapReduce tasks.

2

public static HCatInputFormat setInput(Configuration conf, String dbName, String tableName) throws IOException

Set inputs to use for the job. It queries the metastore with the given input specification and serializes matching partitions into the job configuration for MapReduce tasks.

3

public HCatInputFormat setFilter(String filter) throws IOException

Set a filter on the input table.

4

public HCatInputFormat setProperties(Properties properties) throws IOException

Set properties for the input format.

The HCatInputFormat API includes the following methods −

  • setInput
  • setOutputSchema
  • getTableSchema

To use HCatInputFormat to read data, first instantiate an InputJobInfo with the required information from the table being read and then call setInput with the InputJobInfo, as shown in the sketch below.

You can use the setOutputSchema method to include a projection schema, to specify the output fields. If a schema is not specified, all the columns in the table will be returned. You can use the getTableSchema method to determine the table schema for a specified input table.
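
The following is a minimal sketch of wiring HCatInputFormat into a job driver, modeled on the GroupByAge example in the next section. The database name "mydb", the table name "mytbl", and the null filter are illustrative, and the exact signatures of setInput and InputJobInfo.create vary between HCatalog releases.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class HCatInputSetup {
   public static Job configure(Configuration conf) throws Exception {
      Job job = new Job(conf, "read-from-hcatalog");

      // Point the job at the HCatalog-managed table; a null filter reads all partitions.
      HCatInputFormat.setInput(job, InputJobInfo.create("mydb", "mytbl", null));
      job.setInputFormatClass(HCatInputFormat.class);

      // Optionally restrict the columns handed to the mappers with a projection schema.
      HCatSchema tableSchema = HCatInputFormat.getTableSchema(job);
      HCatInputFormat.setOutputSchema(job, tableSchema);

      return job;
   }
}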

HCatOutputFormat

HCatOutputFormat is used with MapReduce jobs to write data to HCatalog-managed tables. HCatOutputFormat exposes a Hadoop 0.20 MapReduce API for writing data to a table. When a MapReduce job uses HCatOutputFormat to write output, the default OutputFormat configured for the table is used and the new partition is published to the table after the job completes.

Sr.No. Method Name & Description
1

public static void setOutput (Configuration conf, Credentials credentials, OutputJobInfo outputJobInfo) throws IOException

Set the information about the output to write for the job. It queries the metadata server to find the StorageHandler to use for the table. It throws an error if the partition is already published.

2

public static void setSchema (Configuration conf, HCatSchema schema) throws IOException

Set the schema for the data being written out to the partition. The table schema is used by default for the partition if this is not called.

3

public RecordWriter <WritableComparable<?>, HCatRecord > getRecordWriter (TaskAttemptContext context) throws IOException, InterruptedException

Get the record writer for the job. It uses the StorageHandler's default OutputFormat to get the record writer.

4

public OutputCommitter getOutputCommitter (TaskAttemptContext context) throws IOException, InterruptedException

Get the output committer for this output format. It ensures that the output is committed correctly.

The HCatOutputFormat API includes the following methods −

  • setOutput
  • setSchema
  • getTableSchema

The first call on the HCatOutputFormat must be setOutput; any other call will throw an exception saying the output format is not initialized.

The schema for the data being written out is specified by the setSchema method. You must call this method, providing the schema of the data you are writing. If your data has the same schema as the table schema, you can use HCatOutputFormat.getTableSchema() to get the table schema and then pass that along to setSchema(), as in the sketch below.
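
The following is a minimal sketch of the call order described above, again using the older Job-based variants that the GroupByAge example below relies on. The database name "mydb" and the table name "mytbl" are illustrative, and a null partition-value map publishes an unpartitioned (or dynamically partitioned) output.

import org.apache.hadoop.mapreduce.Job;

import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class HCatOutputSetup {
   public static void configure(Job job) throws Exception {
      // 1. setOutput must come before any other HCatOutputFormat call.
      HCatOutputFormat.setOutput(job, OutputJobInfo.create("mydb", "mytbl", null));

      // 2. Reuse the table schema when the records being written match it exactly.
      HCatSchema schema = HCatOutputFormat.getTableSchema(job);
      HCatOutputFormat.setSchema(job, schema);

      job.setOutputFormatClass(HCatOutputFormat.class);
   }
}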

Example

The following MapReduce program reads data from one table which it assumes to have an integer in the second column ("column 1"), and counts how many instances of each distinct value it finds. That is, it does the equivalent of "select col1, count(*) from $table group by col1;".

For example, if the values in the second column are {1, 1, 1, 3, 3, 5}, then the program will produce the following output of values and counts −

1, 3
3, 2
5, 1

Let us now take a look at the program code −

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import org.apache.hcatalog.common.HCatConstants;
import org.apache.hcatalog.data.DefaultHCatRecord;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.schema.HCatSchema;

import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class GroupByAge extends Configured implements Tool {

   public static class Map extends Mapper<WritableComparable, 
      HCatRecord, IntWritable, IntWritable> {
      int age;
		
      @Override
      protected void map(
         WritableComparable key, HCatRecord value,
         org.apache.hadoop.mapreduce.Mapper<WritableComparable,
         HCatRecord, IntWritable, IntWritable>.Context context
      ) throws IOException, InterruptedException {
         age = (Integer) value.get(1);
         context.write(new IntWritable(age), new IntWritable(1));
      }
   }
	
   public static class Reduce extends Reducer<IntWritable, IntWritable,
      WritableComparable, HCatRecord> {
      @Override
      protected void reduce(
         IntWritable key, java.lang.Iterable<IntWritable> values,
         org.apache.hadoop.mapreduce.Reducer<IntWritable, IntWritable,
         WritableComparable, HCatRecord>.Context context
      ) throws IOException, InterruptedException {
         int sum = 0;
         Iterator<IntWritable> iter = values.iterator();
			
         while (iter.hasNext()) {
            sum++;
            iter.next();
         }
			
         HCatRecord record = new DefaultHCatRecord(2);
         record.set(0, key.get());
         record.set(1, sum);
         context.write(null, record);
      }
   }
	
   public int run(String[] args) throws Exception {
      Configuration conf = getConf();
      args = new GenericOptionsParser(conf, args).getRemainingArgs();
		
      String serverUri = args[0];
      String inputTableName = args[1];
      String outputTableName = args[2];
      String dbName = null;
      String principalID = System
         .getProperty(HCatConstants.HCAT_METASTORE_PRINCIPAL);
      if (principalID != null)
      conf.set(HCatConstants.HCAT_METASTORE_PRINCIPAL, principalID);
      Job job = new Job(conf, "GroupByAge");
      HCatInputFormat.setInput(job, InputJobInfo.create(dbName, inputTableName, null));

      // initialize HCatOutputFormat
      job.setInputFormatClass(HCatInputFormat.class);
      job.setJarByClass(GroupByAge.class);
      job.setMapperClass(Map.class);
      job.setReducerClass(Reduce.class);
		
      job.setMapOutputKeyClass(IntWritable.class);
      job.setMapOutputValueClass(IntWritable.class);
      job.setOutputKeyClass(WritableComparable.class);
      job.setOutputValueClass(DefaultHCatRecord.class);
		
      HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));
      HCatSchema s = HCatOutputFormat.getTableSchema(job);
      System.err.println("INFO: output schema explicitly set for writing:" + s);
      HCatOutputFormat.setSchema(job, s);
      job.setOutputFormatClass(HCatOutputFormat.class);
      return (job.waitForCompletion(true) ? 0 : 1);
   }
	
   public static void main(String[] args) throws Exception {
      int exitCode = ToolRunner.run(new GroupByAge(), args);
      System.exit(exitCode);
   }
}

Before compiling the above program, you have to download a few jars and add them to the classpath for this application. You need to download all the Hive jars and HCatalog jars (hcatalog-core-0.5.0.jar, hive-metastore-0.10.0.jar, libthrift-0.7.0.jar, hive-exec-0.10.0.jar, libfb303-0.7.0.jar, jdo2-api-2.3-ec.jar, slf4j-api-1.6.1.jar).

Use the following commands to copy those jar files from the local file system to HDFS and add them to the classpath.

bin/hadoop fs -copyFromLocal $HCAT_HOME/share/hcatalog/hcatalog-core-0.5.0.jar /tmp
bin/hadoop fs -copyFromLocal $HIVE_HOME/lib/hive-metastore-0.10.0.jar /tmp
bin/hadoop fs -copyFromLocal $HIVE_HOME/lib/libthrift-0.7.0.jar /tmp
bin/hadoop fs -copyFromLocal $HIVE_HOME/lib/hive-exec-0.10.0.jar /tmp
bin/hadoop fs -copyFromLocal $HIVE_HOME/lib/libfb303-0.7.0.jar /tmp
bin/hadoop fs -copyFromLocal $HIVE_HOME/lib/jdo2-api-2.3-ec.jar /tmp
bin/hadoop fs -copyFromLocal $HIVE_HOME/lib/slf4j-api-1.6.1.jar /tmp

export LIB_JARS=hdfs:///tmp/hcatalog-core-0.5.0.jar,\
hdfs:///tmp/hive-metastore-0.10.0.jar,\
hdfs:///tmp/libthrift-0.7.0.jar,\
hdfs:///tmp/hive-exec-0.10.0.jar,\
hdfs:///tmp/libfb303-0.7.0.jar,\
hdfs:///tmp/jdo2-api-2.3-ec.jar,\
hdfs:///tmp/slf4j-api-1.6.1.jar

After compiling the program and packaging it into a jar (the name GroupByAge.jar is assumed here), run it with a command of the following form. The three arguments are the metastore server URI and the input and output table names, as read by the run() method shown above.

$HADOOP_HOME/bin/hadoop jar GroupByAge.jar GroupByAge -libjars $LIB_JARS <serverUri> <input_table> <output_table>

Now, check your output directory (hdfs: user/tmp/hive) for the output files (part_0000, part_0001).

HCatalog – Loader & Storer

The HCatLoader and HCatStorer APIs are used with Pig scripts to read and write data in HCatalog-managed tables. No HCatalog-specific setup is required for these interfaces.

It is better to have some knowledge of Apache Pig scripts to understand this chapter. For further reference, please go through our Apache Pig tutorial.

HCatLoader

HCatLoader is used with Pig scripts to read data from HCatalog-managed tables. Use the following syntax to load data from an HCatalog-managed table using HCatLoader.

A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();

You must specify the table name in single quotes: LOAD 'tablename'. If you are using a non-default database, then you must specify your input as 'dbname.tablename'.

The Hive metastore lets you create tables without specifying a database. If you created tables this way, then the database name is 'default' and it is not required when specifying the table for HCatLoader.
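
For illustration, the two cases look like this (salesdb and web_logs are assumed names):

-- table created in a named, non-default database
A = LOAD 'salesdb.web_logs' USING org.apache.hcatalog.pig.HCatLoader();

-- table created without specifying a database lives in 'default' and can be loaded by name alone
B = LOAD 'web_logs' USING org.apache.hcatalog.pig.HCatLoader();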

The following table lists the important methods of the HCatLoader class along with their descriptions.

Sr.No.   Method Name & Description
1        public InputFormat<?,?> getInputFormat() throws IOException
         Returns the input format used to read the data being loaded.
2        public String relativeToAbsolutePath(String location, Path curDir) throws IOException
         Returns the absolute path as a String.
3        public void setLocation(String location, Job job) throws IOException
         Sets the location of the data to be loaded for the job.
4        public Tuple getNext() throws IOException
         Returns the next tuple from the data being read.

HCatStorer

HCatStorer is used with Pig scripts to write data to HCatalog-managed tables. Use the following syntax for the store operation.

A = LOAD ...
B = FOREACH A ...
...
...
my_processed_data = ...

STORE my_processed_data INTO 'tablename' USING org.apache.hcatalog.pig.HCatStorer();

You must specify the table name in single quotes: STORE ... INTO 'tablename'. Both the database and the table must be created prior to running your Pig script. If you are using a non-default database, then you must specify your output table as 'dbname.tablename'.

The Hive metastore lets you create tables without specifying a database. If you created tables this way, then the database name is 'default' and you do not need to specify the database name in the store statement.

For the USING clause, you can have a string argument that represents key/value pairs for partitions. This is a mandatory argument when you are writing to a partitioned table and the partition column is not among the output columns. The values for the partition keys should NOT be quoted.
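
For example, a store into a partitioned table might look like the following sketch. The table name and the partition column datestamp are assumptions; note that the partition value inside the spec string is not quoted:

-- write the relation into a single partition of an assumed partitioned table
STORE my_processed_data INTO 'web_logs_partitioned'
   USING org.apache.hcatalog.pig.HCatStorer('datestamp=20230101');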

The following table lists the important methods of the HCatStorer class along with their descriptions.

Sr.No.   Method Name & Description
1        public OutputFormat getOutputFormat() throws IOException
         Returns the output format used to write the stored data.
2        public void setStoreLocation(String location, Job job) throws IOException
         Sets the location where the data is to be stored.
3        public void storeSchema(ResourceSchema schema, String arg1, Job job) throws IOException
         Stores the schema of the data being written.
4        public void prepareToWrite(RecordWriter writer) throws IOException
         Prepares to write data using the given RecordWriter.
5        public void putNext(Tuple tuple) throws IOException
         Writes the tuple data into the file.

Running Pig with HCatalog

Pig does not automatically pick up the HCatalog jars. To bring in the required jars, you can either use a flag in the Pig command or set the environment variables PIG_CLASSPATH and PIG_OPTS as described below.

To bring in the appropriate jars for working with HCatalog, simply include the following flag −

pig -useHCatalog <sample Pig script file>

Setting the CLASSPATH for Execution

Use the following CLASSPATH setting to synchronize HCatalog with Apache Pig.

export HADOOP_HOME=<path_to_hadoop_install>
export HIVE_HOME=<path_to_hive_install>
export HCAT_HOME=<path_to_hcat_install>

export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-core*.jar:\
$HCAT_HOME/share/hcatalog/hcatalog-pig-adapter*.jar:\
$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:\
$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_HOME/lib/libfb303-*.jar:\
$HIVE_HOME/lib/jdo2-api-*-ec.jar:$HIVE_HOME/conf:$HADOOP_HOME/conf:\
$HIVE_HOME/lib/slf4j-api-*.jar

Example

Assume we have a file student_details.txt in HDFS with the following content.

student_details.txt

001, Rajiv,    Reddy,       21, 9848022337, Hyderabad
002, siddarth, Battacharya, 22, 9848022338, Kolkata
003, Rajesh,   Khanna,      22, 9848022339, Delhi
004, Preethi,  Agarwal,     21, 9848022330, Pune
005, Trupthi,  Mohanthy,    23, 9848022336, Bhuwaneshwar
006, Archana,  Mishra,      23, 9848022335, Chennai
007, Komal,    Nayak,       24, 9848022334, trivendram
008, Bharathi, Nambiayar,   24, 9848022333, Chennai

We also have a sample script with the name sample_script.pig in the same HDFS directory. This file contains statements performing operations and transformations on the student relation, as shown below.

student = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
   PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,
   age:int, phone:chararray, city:chararray);

student_order = ORDER student BY age DESC;
STORE student_order INTO 'student_order_table' USING org.apache.hcatalog.pig.HCatStorer();
student_limit = LIMIT student_order 4;
DUMP student_limit;
  • The first statement of the script loads the data in the file named student_details.txt as a relation named student.

  • The second statement arranges the tuples of the relation in descending order based on age and stores the result as student_order.

  • The third statement stores the processed data in student_order into a separate HCatalog-managed table named student_order_table.

  • The fourth statement stores the first four tuples of student_order as student_limit.

  • Finally, the fifth statement dumps the content of the relation student_limit.
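
Recall from the HCatStorer section that student_order_table must already exist in the metastore before the script is run. Once the script has executed, the stored results can be read back through HCatLoader, for example (a minimal sketch):

-- read the table populated by the script and print its contents
check = LOAD 'student_order_table' USING org.apache.hcatalog.pig.HCatLoader();
DUMP check;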

Let us now execute sample_script.pig as shown below.

$./pig -useHCatalog hdfs://localhost:9000/pig_data/sample_script.pig

Now check the output: the ordered data is written to the HCatalog-managed table student_order_table, and the first four tuples of student_limit are dumped to the console.
