AVRO


AVRO – Overview

Data serialization is a mechanism to translate data in a computer environment (like memory buffers, data structures, or object state) into binary or textual form that can be transported over a network or stored in some persistent storage media.

Java and Hadoop provide serialization APIs, which are Java-based, but Avro is not only language independent, it is also schema-based. We shall explore the differences among them in the coming chapters.

What is Avro?

Apache Avro is a language-neutral data serialization system. It was created by Doug Cutting, the father of Hadoop. Since Hadoop writable classes lack language portability, Avro becomes quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a preferred tool to serialize data in Hadoop.

Avro has a schema-based system. A language-independent schema is associated with its read and write operations. Avro serializes the data, which has a built-in schema, into a compact binary format that can be deserialized by any application.

Avro uses JSON format to declare the data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.

Avro Schemas

Avro depends heavily on its schema. It allows all data to be written with no prior knowledge of the schema. It serializes fast and the resulting serialized data is smaller in size. The schema is stored along with the Avro data in a file for any further processing.

In RPC, the client and the server exchange schemas during the connection. This exchange helps in the communication between same-named fields, missing fields, extra fields, etc.

Avro schemas are defined with JSON, which simplifies their implementation in languages with JSON libraries.

Like Avro, there are other serialization mechanisms in Hadoop such as Sequence Files, Protocol Buffers, and Thrift.

Thrift & Protocol Buffers vs. Avro

Thrift and Protocol Buffers are the most competent libraries competing with Avro. Avro differs from these frameworks in the following ways −

  • Avro supports both dynamic and static types as per the requirement. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types. These IDLs are used to generate code for serialization and deserialization.

  • Avro is built in the Hadoop ecosystem. Thrift and Protocol Buffers are not built in the Hadoop ecosystem.

Unlike Thrift and Protocol Buffers, Avro's schema definition is in JSON and not in any proprietary IDL.

Property                 Avro   Thrift & Protocol Buffers
Dynamic schema           Yes    No
Built into Hadoop        Yes    No
Schema in JSON           Yes    No
No need to compile       Yes    No
No need to declare IDs   Yes    No
Bleeding edge            Yes    No

Features of Avro

Listed below are some of the prominent features of Avro −

  • Avro is a language-neutral data serialization system.

  • It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).

  • Avro creates a binary structured format that is both compressible and splittable. Hence it can be efficiently used as the input to Hadoop MapReduce jobs.

  • Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record. These data types can be created in any language, can be processed in Hadoop, and the results can be fed to a third language.

  • Avro schemas, defined in JSON, facilitate implementation in the languages that already have JSON libraries.

  • Avro creates a self-describing file named Avro Data File, in which it stores data along with its schema in the metadata section.

  • Avro is also used in Remote Procedure Calls (RPCs). During RPC, the client and server exchange schemas in the connection handshake.

How to use Avro?

To use Avro, you need to follow the given workflow −

  • Step 1 − Create schemas. Here you need to design an Avro schema according to your data.

  • Step 2 − Read the schemas into your program. This is done in two ways −

    • By Generating a Class Corresponding to the Schema − Compile the schema using Avro. This generates a class file corresponding to the schema.

    • By Using the Parsers Library − You can directly read the schema using the parsers library.

  • Step 3 − Serialize the data using the serialization API provided for Avro, which is found in the package org.apache.avro.specific.

  • Step 4 − Deserialize the data using the deserialization API provided for Avro, which is found in the package org.apache.avro.specific.

AVRO – Serialization

What is Serialization?

Serialization is the process of translating data structures or object state into binary or textual form to transport the data over a network or to store it in some persistent storage. Once the data is transported over the network or retrieved from persistent storage, it needs to be deserialized again. Serialization is termed marshalling and deserialization is termed unmarshalling.

Serialization in Java

Java provides a mechanism, called object serialization, where an object can be represented as a sequence of bytes that includes the object's data as well as information about the object's type and the types of data stored in the object.

After a serialized object is written into a file, it can be read from the file and deserialized. That is, the type information and the bytes that represent the object and its data can be used to recreate the object in memory.

The ObjectOutputStream and ObjectInputStream classes are used to serialize and deserialize an object respectively in Java.
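
A minimal sketch of this mechanism is given below, assuming a hypothetical Serializable Employee class (the class and the file name employee.ser are illustrative, not part of this tutorial's code):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

//Hypothetical Serializable class used only for this illustration
class Employee implements Serializable {
   private static final long serialVersionUID = 1L;
   String name;
   int id;

   Employee(String name, int id) {
      this.name = name;
      this.id = id;
   }
}

public class JavaSerializationExample {
   public static void main(String args[]) throws Exception {
      Employee e = new Employee("omar", 1);

      //Serializing the object into a file
      ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("employee.ser"));
      out.writeObject(e);
      out.close();

      //Deserializing the object back into memory
      ObjectInputStream in = new ObjectInputStream(new FileInputStream("employee.ser"));
      Employee copy = (Employee) in.readObject();
      in.close();

      System.out.println(copy.name + " " + copy.id);
   }
}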

Serialization in Hadoop

Generally, in distributed systems like Hadoop, the concept of serialization is used for Interprocess Communication and Persistent Storage.

Interprocess Communication

  • To establish interprocess communication between the nodes connected in a network, the RPC technique is used.

  • RPC uses internal serialization to convert the message into binary format before sending it to the remote node via the network. At the other end, the remote system deserializes the binary stream into the original message.

  • The RPC serialization format is required to be as follows −

    • Compact − To make the best use of network bandwidth, which is the most scarce resource in a data center.

    • Fast − Since communication between the nodes is crucial in distributed systems, the serialization and deserialization process should be fast, producing less overhead.

    • Extensible − Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers.

    • Interoperable − The message format should support nodes that are written in various languages.

Persistent Storage

Persistent storage is a digital storage facility that does not lose its data with the loss of power supply, for example, magnetic disks and hard disk drives.

Writable Interface

This is the interface in Hadoop which provides methods for serialization and deserialization. The following table describes the methods −

S.No. Methods and Description
1

void readFields(DataInput in)

This method is used to deserialize the fields of the given object.

2

void write(DataOutput out)

This method is used to serialize the fields of the given object.

WritableComparable Interface

It is the combination of the Writable and Comparable interfaces. This interface inherits the Writable interface of Hadoop as well as the Comparable interface of Java. Therefore it provides methods for data serialization, deserialization, and comparison.

S.No. Methods and Description
1

int compareTo(Object obj)

This method compares the current object with the given object obj.
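
For illustration, a minimal custom key type implementing WritableComparable might look as sketched below (EmployeeId is a hypothetical class written for this example, not a Hadoop type):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

//Hypothetical custom key type; a sketch, not part of Hadoop
public class EmployeeId implements WritableComparable<EmployeeId> {
   private int id;

   public void set(int id) { this.id = id; }
   public int get() { return id; }

   //Serializes the field (from Writable)
   public void write(DataOutput out) throws IOException {
      out.writeInt(id);
   }

   //Deserializes the field (from Writable)
   public void readFields(DataInput in) throws IOException {
      id = in.readInt();
   }

   //Compares two keys (from Comparable)
   public int compareTo(EmployeeId other) {
      return Integer.compare(id, other.id);
   }
}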

In addition to these interfaces, Hadoop supports a number of wrapper classes that implement the WritableComparable interface. Each class wraps a Java primitive type. The class hierarchy of Hadoop serialization is given below −

Hadoop Serialization Hierarchy

These classes are useful to serialize various types of data in Hadoop. For instance, let us consider the IntWritable class and see how it is used to serialize and deserialize data in Hadoop.

IntWritable Class

This class implements the Writable, Comparable, and WritableComparable interfaces. It wraps an integer data type in it. This class provides methods used to serialize and deserialize integer type data.

Constructors

S.No. Summary
1 IntWritable()
2 IntWritable(int value)

Methods

S.No. Summary
1

int get()

Using this method, you can get the integer value present in the current object.

2

void readFields(DataInput in)

This method is used to deserialize the data in the given DataInput object.

3

void set(int value)

This method is used to set the value of the current IntWritable object.

4

void write(DataOutput out)

This method is used to serialize the data in the current object to the given DataOutput object.

Serializing the Data in Hadoop

The procedure to serialize integer type data is discussed below.

  • Instantiate the IntWritable class by wrapping an integer value in it.

  • Instantiate the ByteArrayOutputStream class.

  • Instantiate the DataOutputStream class and pass the object of the ByteArrayOutputStream class to it.

  • Serialize the integer value in the IntWritable object using the write() method. This method needs an object of the DataOutputStream class.

  • The serialized data will be stored in the byte array object which is passed as a parameter to the DataOutputStream class at the time of instantiation. Convert the data in the object to a byte array.

Example

The following example shows how to serialize data of integer type in Hadoop −

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class Serialization {
   public byte[] serialize() throws IOException{
		
      //Instantiating the IntWritable object
      IntWritable intwritable = new IntWritable(12);
   
      //Instantiating the ByteArrayOutputStream object
      ByteArrayOutputStream byteOutputStream = new ByteArrayOutputStream();
   
      //Instantiating the DataOutputStream object
      DataOutputStream dataOutputStream = new
      DataOutputStream(byteOutputStream);
   
      //Serializing the data
      intwritable.write(dataOutputStream);
   
      //Storing the serialized object in a byte array
      byte[] byteArray = byteOutputStream.toByteArray();
   
      //Closing the OutputStream
      dataOutputStream.close();
      return(byteArray);
   }
	
   public static void main(String args[]) throws IOException{
      Serialization serialization = new Serialization();
      serialization.serialize();
      System.out.println("data serialized");
   }
}

Deserializing the Data in Hadoop

The procedure to deserialize integer type data is discussed below −

  • Instantiate the IntWritable class.

  • Instantiate the ByteArrayInputStream class.

  • Instantiate the DataInputStream class and pass the object of the ByteArrayInputStream class to it.

  • Deserialize the data in the object of DataInputStream using the readFields() method of the IntWritable class.

  • The deserialized data will be stored in the object of the IntWritable class. You can retrieve this data using the get() method of this class.

Example

The following example shows how to deserialize data of integer type in Hadoop −

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;

import org.apache.hadoop.io.IntWritable;

public class Deserialization {

   public void deserialize(byte[] byteArray) throws Exception{
   
      //Instantiating the IntWritable class
      IntWritable intwritable = new IntWritable();
      
      //Instantiating the ByteArrayInputStream object
      ByteArrayInputStream inputStream = new ByteArrayInputStream(byteArray);
      
      //Instantiating the DataInputStream object
      DataInputStream datainputstream = new DataInputStream(inputStream);
      
      //Deserializing the data in the DataInputStream
      intwritable.readFields(datainputstream);
      
      //Printing the deserialized data
      System.out.println(intwritable.get());
   }
   
   public static void main(String args[]) throws Exception {
      Deserialization dese = new Deserialization();
      dese.deserialize(new Serialization().serialize());
   }
}

Advantage of Hadoop over Java Serialization

Hadoop's Writable-based serialization is capable of reducing the object-creation overhead by reusing the Writable objects, which is not possible with Java's native serialization framework.
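
For instance, a single IntWritable instance can be refilled with new values through its set() method, so one object serves many records; a small sketch:

import org.apache.hadoop.io.IntWritable;

public class ReuseExample {
   public static void main(String args[]) {
      //One Writable instance reused for many values --
      //no new object is allocated inside the loop
      IntWritable value = new IntWritable();

      for (int i = 0; i < 5; i++) {
         value.set(i);
         System.out.println(value.get());
      }
   }
}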

Disadvantages of Hadoop Serialization

To serialize Hadoop data, there are two ways −

  • You can use the Writable classes, provided by Hadoop's native library.

  • You can also use Sequence Files which store the data in binary format.

The main drawback of these two mechanisms is that Writables and SequenceFiles have only a Java API; they cannot be written or read in any other language.

Therefore, any of the files created in Hadoop with the above two mechanisms cannot be read by any other (third) language, which makes Hadoop a limited container. To address this drawback, Doug Cutting created Avro, which is a language-independent data structure.

AVRO – Environment Setup

The Apache Software Foundation provides Avro with various releases. You can download the required release from the Apache mirrors. Let us see how to set up the environment to work with Avro −

Downloading Avro

To download Apache Avro, proceed with the following −

  • Open the Apache Avro web page. You will see the homepage of Apache Avro as shown below −

Avro Homepage

  • Click on project → releases. You will get a list of releases.

  • Select the latest release, which leads you to a download link.

  • mirror.nexcess is one of the links where you can find the list of all libraries of the various languages that Avro supports, as shown below −

Avro Languages Support

You can select and download the library for any of the languages provided. In this tutorial, we use Java. Hence download the jar files avro-1.7.7.jar and avro-tools-1.7.7.jar.

Avro with Eclipse

To use Avro in the Eclipse environment, you need to follow the steps given below −

  • Step 1. Open Eclipse.

  • Step 2. Create a project.

  • Step 3. Right-click on the project name. You will get a shortcut menu.

  • Step 4. Click on Build Path. It leads you to another shortcut menu.

  • Step 5. Click on Configure Build Path… You can see the Properties window of your project as shown below −

Properties of Avro

  • Step 6. Under the Libraries tab, click on the Add External JARs… button.

  • Step 7. Select the jar file avro-1.7.7.jar you have downloaded.

  • Step 8. Click on OK.

Avro with Maven

You can also get the Avro library into your project using Maven. Given below is the pom.xml file for Avro.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

   <modelVersion>4.0.0</modelVersion>
   <groupId>Test</groupId>
   <artifactId>Test</artifactId>
   <version>0.0.1-SNAPSHOT</version>

   <build>
      <sourceDirectory>src</sourceDirectory>
      <plugins>
         <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
		
            <configuration>
               <source>1.7</source>
               <target>1.7</target>
            </configuration>
		
         </plugin>
      </plugins>
   </build>

   <dependencies>
      <dependency>
         <groupId>org.apache.avro</groupId>
         <artifactId>avro</artifactId>
         <version>1.7.7</version>
      </dependency>
	
      <dependency>
         <groupId>org.apache.avro</groupId>
         <artifactId>avro-tools</artifactId>
         <version>1.7.7</version>
      </dependency>
	
      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-api</artifactId>
         <version>2.0-beta9</version>
      </dependency>
	
      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-core</artifactId>
         <version>2.0-beta9</version>
      </dependency>
	
   </dependencies>

</project>

Setting Classpath

To work with Avro in a Linux environment, download the following jar files −

  • avro-1.7.7.jar
  • avro-tools-1.7.7.jar
  • log4j-api-2.0-beta9.jar
  • log4j-core-2.0-beta9.jar

Copy these files into a folder and set the classpath to that folder in the .bashrc file as shown below.

#classpath for Avro
export CLASSPATH=$CLASSPATH:/home/Hadoop/Avro_Work/jars/*

Setting CLASSPATH

AVRO – Schemas

Avro, being a schema-based serialization utility, accepts schemas as input. In spite of various schemas being available, Avro follows its own standards of defining schemas. These schemas describe the following details −

  • type of file (record by default)
  • location of record
  • name of the record
  • fields in the record with their corresponding data types

Using these schemas, you can store serialized values in binary format using less space. These values are stored without any metadata. A short sketch of such raw binary encoding is given below.
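
To see this, the generic API (covered in the coming chapters) can encode a record as raw bytes with no schema or metadata written alongside; a minimal sketch, with the inline Employee schema used purely for illustration:

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class RawBinaryExample {
   public static void main(String args[]) throws Exception {
      //Illustrative schema declared inline
      Schema schema = new Schema.Parser().parse(
         "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":[" +
         "{\"name\":\"name\",\"type\":\"string\"}," +
         "{\"name\":\"age\",\"type\":\"int\"}]}");

      GenericRecord record = new GenericData.Record(schema);
      record.put("name", "omar");
      record.put("age", 21);

      //Only the field values are encoded; the schema itself is not written
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
      new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
      encoder.flush();

      System.out.println("Encoded size in bytes: " + out.size());
   }
}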

Creating Avro Schemas

The Avro schema is created in JavaScript Object Notation (JSON) document format, which is a lightweight text-based data interchange format. It is created in one of the following ways −

  • A JSON string
  • A JSON object
  • A JSON array

Example − The given schema defines a record-type document within the "Tutorialspoint" namespace. The name of the document is "Employee", and it contains two fields → Name and age.

{
   "type" : "record",
   "namespace" : "Tutorialspoint",
   "name" : "Employee",
   "fields" : [
      { "name" : "Name" , "type" : "string" },
      { "name" : "age" , "type" : "int" }
   ]
}

We observed that the schema contains four attributes, which are briefly described below −

  • type − Describes the document type, in this case a "record".

  • namespace − Describes the name of the namespace in which the object resides.

  • name − Describes the schema name.

  • fields − This is an attribute array which contains the following −

    • name − Describes the name of the field

    • type − Describes the data type of the field

Primitive Data Types of Avro

An Avro schema can have primitive data types as well as complex data types. The following table describes the primitive data types of Avro −

Data type   Description
null        Null is a type having no value.
int         32-bit signed integer.
long        64-bit signed integer.
float       single precision (32-bit) IEEE 754 floating-point number.
double      double precision (64-bit) IEEE 754 floating-point number.
bytes       sequence of 8-bit unsigned bytes.
string      Unicode character sequence.

Complex Data Types of Avro

Along with primitive data types, Avro provides six complex data types, namely Records, Enums, Arrays, Maps, Unions, and Fixed.

Record

As we already know by now, a record data type in Avro is a collection of multiple attributes. It supports the following attributes (an example record is given after the list) −

  • name

  • namespace

  • type

  • fields
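
For example, the Employee schema shown earlier is a record that uses all four attributes:

{
   "type" : "record",
   "namespace" : "Tutorialspoint",
   "name" : "Employee",
   "fields" : [
      { "name" : "Name" , "type" : "string" },
      { "name" : "age" , "type" : "int" }
   ]
}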

Enum

An enumeration is a list of items in a collection. Avro enumeration supports the following attributes −

  • name − The value of this field holds the name of the enumeration.

  • namespace − The value of this field contains the string that qualifies the name of the enumeration.

  • symbols − The value of this field holds the enum's symbols as an array of names.

Example

Given below is an example of an enumeration.

{
   "type" : "enum",
   "name" : "Numbers",
   "namespace" : "data",
   "symbols" : [ "ONE", "TWO", "THREE", "FOUR" ]
}

Arrays

This data type defines an array field having a single attribute items. This items attribute specifies the type of items in the array.

Example

{ " kind " : " array ", " items " : " int " }

Maps

The map data type is an array of key-value pairs. The values attribute holds the data type of the content of the map. Avro map keys are implicitly taken as strings. The below example shows a map from string to int.

Example

{"kind" : "chart", "values" : "int"}

Unions

A union datatype is used whenever a field has one or more datatypes. Unions are represented as JSON arrays. For example, if a field could be either an int or null, then the union is represented as ["int", "null"].

Example

Given below is an example document using unions −

{
   "type" : "record",
   "namespace" : "tutorialspoint",
   "name" : "empdetails",
   "fields" :
   [
      { "name" : "experience", "type": ["int", "null"] },
      { "name" : "age", "type": "int" }
   ]
}

Fixed

This data type is used to declare a fixed-sized field that can be used for storing binary data. It has name and size as attributes. Name holds the name of the field, and size holds the size of the field.

Example

{ " kind " : " fixed " , " name " : " bdata ", " dimension " : 1048576}

AVRO – Reference API

In the previous chapter, we described the input type of Avro, i.e., Avro schemas. In this chapter, we will explain the classes and methods used in the serialization and deserialization of Avro schemas.

SpecificDatumWriter Class

This class belongs to the package org.apache.avro.specific. It implements the DatumWriter interface, which converts Java objects into an in-memory serialized format.

Constructor

S.No. Description
1 SpecificDatumWriter(Schema schema)

Method

S.No. Description
1

SpecificData getSpecificData()

Returns the SpecificData implementation used by this writer.

SpecificDatumReader Class

This class belongs to the package org.apache.avro.specific. It implements the DatumReader interface, which reads the data of a schema and determines the in-memory data representation. SpecificDatumReader is the class which supports generated Java classes.

Constructor

S.No. Description
1

SpecificDatumReader(Schema schema)

Construct where the writer's and reader's schemas are the same.

Methods

S.No. Description
1

SpecificData getSpecificData()

Returns the contained SpecificData.

2

void setSchema(Schema actual)

This method is used to set the writer's schema.

DataFileWriter

This class writes a sequence of serialized records of data conforming to a schema, along with the schema itself, in a file.

Constructor

S.No. Description
1 DataFileWriter(DatumWriter<D> dout)

Methods

S.No Description
1

void append(D datum)

Appends a datum to a file.

2

DataFileWriter<D> appendTo(File file)

This method is used to open a writer appending to an existing file.

DataFileReader

This class provides random access to files written with DataFileWriter. It inherits the class DataFileStream.

Constructor

S.No. Description
1 DataFileReader(File file, DatumReader<D> reader)

Methods

S.No. Description
1

next()

Reads the next datum in the file.

2

boolean hasNext()

Returns true if more entries remain in this file.

Class Schema.Parser

This class is a parser for JSON-format schemas. It contains methods to parse the schema. It belongs to the org.apache.avro package.

Constructor

S.No. Description
1 Schema.Parser()

Methods

S.No. Description
1

parse(File file)

Parses the schema provided in the given file.

2

parse(InputStream in)

Parses the schema provided in the given InputStream.

3

parse(String s)

Parses the schema provided in the given String.

Interface GenericRecord

This interface provides methods to access the fields by name as well as by index.

Methods

S.No. Description
1

Object get(String key)

Returns the value of the given field.

2

void put(String key, Object v)

Sets the value of a field given its name.

Class GenericData.Record

Constructor

S.No. Description
1 GenericData.Record(Schema schema)

Methods

S.No. Description
1

Object get(String key)

Returns the value of a field of the given name.

2

Schema getSchema()

Returns the schema of this instance.

3

void put(int i, Object v)

Sets the value of a field given its position in the schema.

4

void put(String key, Object value)

Sets the value of a field given its name.
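
Putting these classes together, the sketch below parses a schema, writes one record to a data file, and reads it back with the generic API (it assumes the emp.avsc schema defined in a later chapter; the file paths are illustrative):

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class ReferenceApiSketch {
   public static void main(String args[]) throws Exception {
      //Parsing the schema (Schema.Parser)
      Schema schema = new Schema.Parser().parse(new File("emp.avsc"));

      //Building a record (GenericData.Record) and setting its fields
      GenericRecord record = new GenericData.Record(schema);
      record.put("name", "omar");
      record.put("id", 1);
      record.put("salary", 30000);
      record.put("age", 21);
      record.put("address", "Hyderabad");

      //Writing the record, along with the schema, to a file (DataFileWriter)
      DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
         new GenericDatumWriter<GenericRecord>(schema));
      writer.create(schema, new File("emp.avro"));
      writer.append(record);
      writer.close();

      //Reading the record back (DataFileReader)
      DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
         new File("emp.avro"), new GenericDatumReader<GenericRecord>(schema));
      while (reader.hasNext()) {
         System.out.println(reader.next());
      }
      reader.close();
   }
}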

AVRO – Serialization by Generating a Class

One can read an Avro schema into a program either by generating a class corresponding to the schema or by using the parsers library. This chapter describes how to read the schema by generating a class and serialize the data using Avro.

The following is a depiction of serializing the data with Avro by generating a class. Here, emp.avsc is the schema file which we pass as input to the Avro utility.

Avro With Code Serializing

The output of the Avro utility is a Java file.

Serialization by Generating a Class

To serialize the data using Avro, follow the steps given below −

  • Define an Avro schema.
  • Compile the schema using the Avro utility. You get the Java code corresponding to that schema.
  • Populate the schema with the data.
  • Serialize it using the Avro library.

Defining a Schema

Suppose you want a schema with the following details −

Field   Name     id    age   salary   address
type    String   int   int   int      string

Create an Avro schema as shown below and save it as emp.avsc.

{
   "namespace": "tutorialspoint.com",
   "type": "record",
   "name": "emp",
   "fields": [
      {"name": "name", "type": "string"},
      {"name": "id", "type": "int"},
      {"name": "salary", "type": "int"},
      {"name": "age", "type": "int"},
      {"name": "address", "type": "string"}
   ]
}

Compiling the Schema

After creating the Avro schema, we need to compile it using Avro tools, which are located in the avro-tools-1.7.7.jar file. We need to provide the avro-tools-1.7.7.jar file path at compilation.

Syntax to Compile an Avro Schema

java -jar <path/to/avro-tools-1.7.7.jar> compile schema <path/to/schema-file> <destination-folder>

Open the terminal in the home folder. Create a new directory to work with Avro as shown below −

$ mkdir Avro_Work

In the newly created directory, create three sub-directories −

  • First named schema, to place the schema.

  • Second named with_code_gen, to place the generated code.

  • Third named jars, to place the jar files.

$ mkdir schema
$ mkdir with_code_gen
$ mkdir jars

The following screenshot shows how your Avro_work folder should look after creating all the directories.

Avro Work

  • Now /home/Hadoop/Avro_work/jars/avro-tools-1.7.7.jar is the path for the directory where you have downloaded the avro-tools-1.7.7.jar file.

  • /home/Hadoop/Avro_work/schema/ is the path for the directory where your schema file emp.avsc is stored.

  • /home/Hadoop/Avro_work/with_code_gen is the directory where you want the generated class files to be stored.

Compile the schema as shown below −

$ java -jar /home/Hadoop/Avro_work/jars/avro-tools-1.7.7.jar compile schema /home/Hadoop/Avro_work/schema/emp.avsc /home/Hadoop/Avro_work/with_code_gen

After this compilation, a package is created in the destination directory, with the name mentioned as namespace in the schema file. Within this package, the Java source file with the schema name is created. The generated file contains Java code corresponding to the schema, which can be directly accessed by an application.

In our example, a package/folder named tutorialspoint is created, which contains another folder named com (since the namespace is tutorialspoint.com), and within it resides the generated file emp.java. The following snapshot shows emp.java −

Snapshot of Sample Program

This Java file is useful to create data according to the schema.

The generated class contains (a builder sketch follows the list) −

  • Default constructor, and parameterized constructor which accepts all the variables of the schema.
  • The setter and getter methods for all variables in the schema.
  • The get() method, which returns the schema.
  • Builder methods.
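
For instance, the builder methods allow a record to be created in one expression; a sketch, assuming the method names that Avro's code generation usually produces for the emp schema above:

//A sketch of the builder API produced by Avro code generation
emp e1 = emp.newBuilder()
   .setName("omar")
   .setId(1)
   .setSalary(30000)
   .setAge(21)
   .setAddress("Hyderabad")
   .build();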

Creating and Serializing the Data

First of all, copy the generated Java file used in this project into the current directory or import it from where it is located.

Now we can write a new Java file and instantiate the class in the generated file (emp) to add employee data to the schema.

Let us see the procedure to create data according to the schema using Apache Avro.

Step 1

Instantiate the generated emp class.

emp e1 = new emp();

Step 2

Using the setter methods, insert the data of the first employee. For example, we have created the details of the employee named Omar.

e1.setName("omar");
e1.setAge(21);
e1.setSalary(30000);
e1.setAdgown("Hyderabad");
e1.setId(001);

Similarly, fill in all employee details using the setter methods.

Step 3

Create an object of the DatumWriter interface using the SpecificDatumWriter class. This converts Java objects into an in-memory serialized format. The following example instantiates a SpecificDatumWriter class object for the emp class.

DatumWriter<emp> empDatumWriter = new SpecificDatumWriter<emp>(emp.class);

Step 4

Instantiate DataFileWriter for the emp class. This class writes serialized records of data conforming to a schema, along with the schema itself, in a file. This class requires the DatumWriter object as a parameter to the constructor.

DataFileWriter<emp> empFileWriter = new DataFileWriter<emp>(empDatumWriter);

Step 5

Open a new file to store the data matching the given schema using the create() method. This method requires the schema, and the path of the file where the data is to be stored, as parameters.

In the following example, the schema is passed using the getSchema() method, and the data file is stored in the path /home/Hadoop/Avro/serialized_file/emp.avro.

empFileWriter.create(e1.getSchema(), new File("/home/Hadoop/Avro/serialized_file/emp.avro"));

Step 6

Add all the created records to the file using the append() method as shown below −

empFileWriter.append(e1);
empFileWriter.append(e2);
empFileWriter.append(e3);

Example – Serialization by Generating a Class

The following complete program shows how to serialize data into a file using Apache Avro −

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;

public class Serialize {
   public static void main(String args[]) throws IOException{
	
      //Instantiating the generated emp class
      emp e1 = new emp();
	
      //Creating values according to the schema
      e1.setName("omar");
      e1.setAge(21);
      e1.setSalary(30000);
      e1.setAddress("Hyderabad");
      e1.setId(001);
	
      emp e2 = new emp();
	
      e2.setName("ram");
      e2.setAge(30);
      e2.setSalary(40000);
      e2.setAddress("Hyderabad");
      e2.setId(002);
	
      emp e3 = new emp();
	
      e3.setName("robbin");
      e3.setAge(25);
      e3.setSalary(35000);
      e3.setAddress("Hyderabad");
      e3.setId(003);
	
      //Instantiating the DatumWriter class
      DatumWriter<emp> empDatumWriter = new SpecificDatumWriter<emp>(emp.class);
      DataFileWriter<emp> empFileWriter = new DataFileWriter<emp>(empDatumWriter);
	
      empFileWriter.create(e1.getSchema(), new File("/home/Hadoop/Avro_Work/with_code_gen/emp.avro"));
	
      empFileWriter.append(e1);
      empFileWriter.append(e2);
      empFileWriter.append(e3);
	
      empFileWriter.close();
	
      System.out.println("data successfully serialized");
   }
}

Browse through the directory where the generated code is placed. In this case, at home/Hadoop/Avro_work/with_code_gen.

In Terminal −

$ cd home/Hadoop/Avro_work/with_code_gen/

In GUI −

Generated Code

Now copy and save the above program in the file named Serialize.java and compile and execute it as shown below −

$ javac Serialize.java
$ java Serialize

Output

data successfully serialized

If you verify the path given in the program, you can find the generated serialized file as shown below.

Generated Serialized File

AVRO – Deserialization by Generating a Class

As described earlier, one can read an Avro schema into a program either by generating a class corresponding to the schema or by using the parsers library. This chapter describes how to read the schema by generating a class and deserialize the data using Avro.

Deserialization by Generating a Class

In our previous example, the serialized data was stored in the file emp.avro. We shall now see how to deserialize it and read it using Avro. The procedure is as follows −

Step 1

Create an object of the DatumReader interface using the SpecificDatumReader class.

DatumReader<emp> empDatumReader = new SpecificDatumReader<emp>(emp.class);

Step 2

Instantiate the DataFileReader class. This class reads serialized data from a file. It requires the DatumReader object, and the path of the file (emp.avro) where the serialized data exists, as parameters to the constructor.

DataFileReader<emp> dataFileReader = new DataFileReader<emp>(new File("/path/to/emp.avro"), empDatumReader);

Step 3

Print the deserialized data, using the methods of DataFileReader.

  • The hasNext() method returns true if there are any more elements in the Reader.

  • The next() method of DataFileReader returns the data in the Reader.

while(dataFileReader.hasNext()){

   em = dataFileReader.next(em);
   System.out.println(em);
}

Example – Deserialization by Generating a Class

The following complete program shows how to deserialize the data in a file using Avro.

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;

public class Deserialize {
   public static void main(String args[]) throws IOException{
	
      //Deserializing the objects
      DatumReader<emp> empDatumReader = new SpecificDatumReader<emp>(emp.class);
		
      //Instantiating the DataFileReader
      DataFileReader<emp> dataFileReader = new DataFileReader<emp>(new
         File("/home/Hadoop/Avro_Work/with_code_gen/emp.avro"), empDatumReader);
      emp em = null;
		
      while(dataFileReader.hasNext()){
      
         em = dataFileReader.next(em);
         System.out.println(em);
      }
   }
}

Browse into the directory where the generated code is placed. In this case, at home/Hadoop/Avro_work/with_code_gen.

$ cd home/Hadoop/Avro_work/with_code_gen/

Now, copy and save the above program in the file named Deserialize.java. Compile and execute it as shown below −

$ javac Deserialize.java
$ java Deserialize

Output

{"name": "omar", "id": 1, "salary": 30000, "age": 21, "adgown": "Hyderabad"}
{"name": "ram", "id": 2, "salary": 40000, "age": 30, "adgown": "Hyderabad"}
{"name": "robbin", "id": 3, "salary": 35000, "age": 25, "adgown": "Hyderabad"}

AVRO – Serialization Using Parsers

One can read an Avro schema into a program either by generating a class corresponding to the schema or by using the parsers library. In Avro, data is always stored with its corresponding schema. Therefore, we can always read a schema without code generation.

This chapter describes how to read the schema by using the parsers library and serialize the data using Avro.

The following is a depiction of serializing the data with Avro using parser libraries. Here, emp.avsc is the schema file which we pass as input to the Avro utility.

Avro Without Code Serialize

Serialization Using Parsers Library

To serialize the data, we need to read the schema, create data according to the schema, and serialize it using the Avro API. The following procedure serializes the data without generating any code −

Step 1

First of all, read the schema from the file. To do so, use the Schema.Parser class. This class provides methods to parse the schema in various formats.

Instantiate the Schema.Parser class by passing the file path where the schema is stored.

Schema schema = new Schema.Parser().parse(new File("/path/to/emp.avsc"));

Step 2

Create the object of the GenericRecord interface by instantiating the GenericData.Record class. This constructor accepts a parameter of type Schema. Pass the schema object created in Step 1 to its constructor as shown below −

GenericRecord e1 = new GenericData.Record(schema);

Step 3

Insert the values in the schema using the put() method of the GenericData.Record class.

e1.place("name", "ramu");
e1.place("id", 001);
e1.place("salary",30000);
e1.place("age", 25);
e1.place("adgown", "chennai");

Step 4

Create an object of the DatumWriter interface using the GenericDatumWriter class. It converts Java objects into an in-memory serialized format. The following example instantiates a GenericDatumWriter class object for GenericRecord −

DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);

Step 5

Instantiate DataFileWriter for GenericRecord. This class writes serialized records of data conforming to a schema, along with the schema itself, in a file. This class requires the DatumWriter object as a parameter to the constructor.

DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);

Step 6

Open a new file to store the data matching the given schema using the create() method. This method requires two parameters −

  • the schema,
  • the path of the file where the data is to be stored.

In the example given below, the schema is passed using the getSchema() method and the serialized data is stored in emp.avro.

dataFileWriter.create(e1.getSchema(), new File("/path/to/emp.avro"));

Step 7

Add all the created records to the file using the append() method as shown below.

dataFileWriter.append(e1);
dataFileWriter.append(e2);
dataFileWriter.append(e3);

Example – Serialization Using Parsers

The following complete program shows how to serialize the data using parsers −

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;

import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import org.apache.avro.io.DatumWriter;

public class Seriali {
   public static void main(String args[]) throws IOException{
	
      //Instantiating the Schema.Parser class
      Schema schema = new Schema.Parser().parse(new File("/home/Hadoop/Avro/schema/emp.avsc"));
		
      //Instantiating the GenericData.Record class
      GenericRecord e1 = new GenericData.Record(schema);
		
      //Inserting data according to the schema
      e1.put("name", "ramu");
      e1.put("id", 001);
      e1.put("salary", 30000);
      e1.put("age", 25);
      e1.put("address", "chennai");
		
      GenericRecord e2 = new GenericData.Record(schema);
		
      e2.put("name", "rahman");
      e2.put("id", 002);
      e2.put("salary", 35000);
      e2.put("age", 30);
      e2.put("address", "Delhi");
		
      DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
		
      DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
      dataFileWriter.create(schema, new File("/home/Hadoop/Avro_work/without_code_gen/mydata.txt"));
		
      dataFileWriter.append(e1);
      dataFileWriter.append(e2);
      dataFileWriter.close();
		
      System.out.println("data successfully serialized");
   }
}

Browse into the directory where the serialized data is stored. In this case, at home/Hadoop/Avro_work/without_code_gen.

$ cd home/Hadoop/Avro_work/without_code_gen/

Without Code Gen

Now copy and save the above program in the file named Serialize.java. Compile and execute it as shown below −

$ javac Serialize.java
$ java Serialize

Output

data successfully serialized

If you verify the path given in the program, you can find the generated serialized file as shown below.

Without Code Gen 1

AVRO – Deserialization Using Parsers

As described earlier, one can read an Avro schema into a program either by generating a class corresponding to the schema or by using the parsers library. This chapter describes how to read the schema by using the parsers library and deserialize the data using Avro.

Deserialization Using Parsers Library

In our previous example, the serialized data was stored in the file mydata.txt. We shall now see how to deserialize it and read it using Avro. The procedure is as follows −

Step 1

First of all, read the schema from the file. To do so, use the Schema.Parser class. This class provides methods to parse the schema in various formats.

Instantiate the Schema.Parser class by passing the file path where the schema is stored.

Schema schema = new Schema.Parser().parse(new File("/path/to/emp.avsc"));

Step 2

Create an object of the DatumReader interface using the GenericDatumReader class.

DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);

Step 3

Instantiate the DataFileReader class. This class reads serialized data from a file. It requires the DatumReader object, and the path of the file where the serialized data exists, as parameters to the constructor.

DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(new File("/path/to/mydata.txt"), datumReader);

Step 4

Print the deserialized data, using the methods of DataFileReader.

  • The hasNext() method returns true if there are any more elements in the Reader.

  • The next() method of DataFileReader returns the data in the Reader.

while(dataFileReader.hasNext()){

   em = dataFileReader.next(em);
   System.out.println(em);
}

Example – Deserialization Using Parsers Library

The following complete program shows how to deserialize the serialized data using the parsers library −

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class Deserialize {
   public static void main(String args[]) throws Exception{
	
      //Instantiating the Schema.Parser class
      Schema schema = new Schema.Parser().parse(new File("/home/Hadoop/Avro/schema/emp.avsc"));
      DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
      DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(new File("/home/Hadoop/Avro_Work/without_code_gen/mydata.txt"), datumReader);
      GenericRecord emp = null;
		
      while (dataFileReader.hasNext()) {
         emp = dataFileReader.next(emp);
         System.out.println(emp);
      }
   }
}

Browse into the directory where the serialized data is stored. In this case, it is at home/Hadoop/Avro_work/without_code_gen.

$ cd home/Hadoop/Avro_work/without_code_gen/

Now copy and save the above program in the file named Deserialize.java. Compile and execute it as shown below −

$ javac Deserialize.java
$ java Deserialize

Output

{"name": "ramu", "id": 1, "salary": 30000, "age": 25, "adgown": "chennai"}
{"name": "rahman", "id": 2, "salary": 35000, "age": 30, "adgown": "Delhi"}