Monday, 22 June 2015

Insight about HCatalog

Data analysts use multiple tools for processing big data, such as MapReduce (Java), Pig, and Hive. In real-world environments, MapReduce output can be processed by Pig and Hive, or vice versa. Before processing, the analyst should know the data location (file path), format and schema. Analysts also use different data formats, such as CSV, JSON, Avro, ORC, plain HDFS files and Hive tables, as part of data processing.

Hive uses its metastore to look up the data location, format and schema. Pig defines them as part of the script, and MapReduce encodes them as part of the application code. For MapReduce and Pig applications, it is very difficult to maintain this metadata. HCatalog aims to provide a solution for this scenario.

HCatalog is a metadata and table management system for the Hadoop environment. It is a sub-module of Hive.

Let us see how it works.

Upload an employee file to HDFS:

>> hdfs dfs -put employee /data
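
For illustration, assume employee is a comma-separated text file with one record per line, matching the schema we are about to create (the sample values below are hypothetical):

```
Arun,45000,Chennai
Meena,52000,Bangalore
Ravi,38000,Hyderabad
```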


Let's create a schema for this input file in Hive. Save the DDL below in a file named employee.hcatalog (the comma delimiter is an assumption about the input file's format):

>> create external table employee(Name string, Salary int, Location string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/employee';

Add the data set to HCatalog by running the command below.

>>  hcat -f employee.hcatalog

Once it is added, we can verify it as below.

>> hcat -e "describe employee"
name        string      None
salary      int         None
location    string      None

Now HCatalog maintains our "employee" data set's location, schema and format, so the data set can be accessed through the HCatalog interfaces from Pig and MapReduce applications.

HCatalog Interface for Pig:

>> result = LOAD 'employee' USING org.apache.hcatalog.pig.HCatLoader();
>> DUMP result;

HCatalog can also be used as part of a Pig script execution, as below:

>> pig -useHCatalog Test.pig

There are two Pig interfaces available:

1. HCatLoader - To read data from a data set.
2. HCatStorer - To write data to a data set.

>> result = LOAD 'employee' USING HCatLoader();
>> STORE result INTO 'emp_result' USING HCatStorer('date=20150622');

HCatStorer also lets us specify partition keys, as shown above. It is possible to write to a single partition or to multiple partitions.
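
Putting the two interfaces together, a Test.pig script run with -useHCatalog might look like the sketch below. The filter condition and the emp_result table are hypothetical, and emp_result must already exist in HCatalog, partitioned by date:

```
-- Test.pig: read the employee data set through HCatalog,
-- keep only the rows with salary above 40000,
-- and store the result into a date partition of emp_result.
emp  = LOAD 'employee' USING org.apache.hcatalog.pig.HCatLoader();
high = FILTER emp BY salary > 40000;
STORE high INTO 'emp_result' USING org.apache.hcatalog.pig.HCatStorer('date=20150622');
```

Note that neither the LOAD nor the STORE statement mentions a file path or delimiter; HCatalog supplies the location, format and schema for both data sets.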

HCatalog Interface for MapReduce:

It consists of the two interfaces mentioned below.

1. HCatInputFormat - Reads data from a data set.
2. HCatOutputFormat - Writes data to a data set.
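
A minimal MapReduce driver wiring up these two interfaces might look like the sketch below (a sketch only, assuming the HCatalog jars from org.apache.hcatalog.mapreduce are on the classpath; the mapper and reducer logic is omitted, and the "default" database and emp_result table are assumptions):

```
// Sketch: read the "employee" data set via HCatInputFormat and
// write to "emp_result" via HCatOutputFormat.
Job job = new Job(conf, "hcat-example");

// Read the HCatalog-managed data set (database, table).
HCatInputFormat.setInput(job, "default", "employee");
job.setInputFormatClass(HCatInputFormat.class);

// Write to another HCatalog-managed data set.
job.setOutputFormatClass(HCatOutputFormat.class);
HCatOutputFormat.setOutput(job,
    OutputJobInfo.create("default", "emp_result", null));

// Mapper and Reducer classes (not shown) exchange HCatRecord values.
job.waitForCompletion(true);
```

As with Pig, the application never hard-codes the file path or format; HCatalog resolves them from the table definition.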

HCatalog Interface for Hive:

There are no Hive-specific interfaces for HCatalog, as it is a sub-module of Hive.
Hive can read information from HCatalog directly.

HCatalog Notifications:

HCatalog provides notifications, consumed via Oozie or custom Java code, through which we can process a data set as soon as it becomes available.

WebHCat Server:

It provides a REST-like web API for HCatalog. Clients send requests to get information about data sets, and can also submit Pig or Hive jobs.

>> curl -s 'http://hostname:port/templeton/v1/ddl/database/db_name/table/table-name?'