Data analysts use multiple tools for processing big data, such as Map Reduce (Java), PIG and HIVE. In a real-world environment, Map Reduce output may be processed by PIG or HIVE, and vice versa. Before processing, an analyst must know the data's location (file path), format and schema. Analysts also work with different data formats such as CSV, JSON, AVRO and ORC, and with both HDFS files and HIVE tables, as part of data processing.
HIVE uses a metastore to look up the data location, format and schema. PIG defines them as part of the script, and Map Reduce encodes them inside the application code. For Map Reduce and PIG applications, it is very difficult to maintain this metadata. HCatalog aims to provide a solution for this scenario.
HCatalog is a metadata and table management system for the HADOOP environment. It is a sub-module of HIVE.
Let us see how it works.
Upload an employee file to HDFS. The file holds comma-separated records of name, salary and location:
>> hdfs dfs -put employee /data
swetha,250000,Chennai
anamika,200000,Kanyakumari
tarun,300000,Pondi
anita,250000,Salem
Let's create a schema for this input file in HIVE:
>> create external table employee(Name string, Salary int, Location string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/employee';
Save the above DDL in a file named employee.hcatalog, then add the data set to HCatalog by running the command below.
>> hcat -f employee.hcatalog
Once it is added, we can verify it as below.
>> hcat -e "describe employee"
OK
Name string None
Salary int None
Location string None
Now HCatalog maintains our "employee" data set's location, schema and format, so the data set can be accessed through the HCatalog interfaces from PIG and Map Reduce applications.
HCatalog Interface for PIG:
>> result = LOAD 'employee' USING org.apache.hcatalog.pig.HCatLoader();
>> DUMP result;
It can also be used as part of a PIG script execution, as below:
>> pig -useHCatalog Test.pig
There are two PIG interfaces available:
1. HCatLoader - To read data from a data set.
2. HCatStorer - To write data to a data set.
>> result = LOAD 'employee' USING HCatLoader();
>> STORE result INTO 'emp_result' USING HCatStorer('date=20150622');
HCatStorer also accepts partition keys, as shown above. It is possible to write to a single partition or to multiple partitions; a sketch of the multi-partition case follows.
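When no partition value is given in the constructor, HCatStorer takes the partition keys from columns of the relation itself, which lets one STORE write to multiple partitions at once. A minimal sketch, assuming emp_result is partitioned by date and the relation result carries a date column:
>> STORE result INTO 'emp_result' USING HCatStorer();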
HCatalog Interface for Map Reduce:
It consists of the two interfaces mentioned below; a minimal job sketch follows the list.
1. HCatInputFormat - To read data from a data set.
2. HCatOutputFormat - To write data to a data set.
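The sketch below reads the employee data set registered earlier, counts employees per location, and writes the result into a hypothetical pre-created HCatalog table emp_loc_count (location string, cnt int). It uses the older org.apache.hcatalog package namespace to match the HCatLoader example above; exact signatures differ between HCatalog releases, so treat this as a sketch rather than a drop-in program.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hcatalog.data.DefaultHCatRecord;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

public class EmployeeCountByLocation {

    // Each input record is one row of the "employee" data set;
    // columns arrive in schema order (Name, Salary, Location).
    public static class Map
            extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(WritableComparable key, HCatRecord value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text((String) value.get(2)), ONE); // column 2 = Location
        }
    }

    // Emits one HCatRecord per location into the output data set.
    public static class Reduce
            extends Reducer<Text, IntWritable, WritableComparable, HCatRecord> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable v : values) {
                count += v.get();
            }
            HCatRecord record = new DefaultHCatRecord(2);
            record.set(0, key.toString()); // location string
            record.set(1, count);          // cnt int
            ctx.write(null, record);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "employee-count-by-location");
        job.setJarByClass(EmployeeCountByLocation.class);

        // HCatInputFormat: read the "employee" data set registered above
        // (database "default", no partition filter).
        HCatInputFormat.setInput(job, InputJobInfo.create("default", "employee", null));
        job.setInputFormatClass(HCatInputFormat.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // HCatOutputFormat: write to the (hypothetical) emp_loc_count table,
        // reusing the schema HCatalog already holds for it.
        HCatOutputFormat.setOutput(job, OutputJobInfo.create("default", "emp_loc_count", null));
        HCatSchema schema = HCatOutputFormat.getTableSchema(job);
        HCatOutputFormat.setSchema(job, schema);
        job.setOutputFormatClass(HCatOutputFormat.class);
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(DefaultHCatRecord.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}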
HCatalog Interface for HIVE:
There are no HIVE-specific interfaces for HCatalog, as HCatalog is a sub-module of HIVE.
Hive can read information from HCatalog directly.
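For example, the employee data set registered earlier can be queried from HIVE straight away, with no extra loading step:
>> hive -e "select name, location from employee where salary > 200000;"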
HCatalog Notifications:
HCatalog provides notifications, via Oozie or custom Java code, through which we can process a data set as soon as it becomes available.
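As a sketch of the custom Java route: HCatalog's NotificationListener publishes a JMS message (typically through an ActiveMQ broker) when new data, such as a partition, is added to a data set, and the JMS topic for a table is advertised in its hcat.msgbus.topic.name property. The broker URL and topic name below are assumptions for illustration.

import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.Session;
import javax.jms.Topic;

import org.apache.activemq.ActiveMQConnectionFactory;

public class EmployeeDataListener implements MessageListener {

    @Override
    public void onMessage(Message message) {
        // Fires as soon as new data is registered for the data set;
        // kick off downstream processing here.
        System.out.println("New employee data available: " + message);
    }

    public static void main(String[] args) throws Exception {
        // Assumed broker URL; the table's hcat.msgbus.topic.name
        // property gives the real topic name.
        Connection conn = new ActiveMQConnectionFactory("tcp://broker-host:61616")
                .createConnection();
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = session.createTopic("hcat.default.employee");
        session.createConsumer(topic).setMessageListener(new EmployeeDataListener());
        conn.start();
        Thread.sleep(Long.MAX_VALUE); // keep the listener alive
    }
}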
WebHCat Server:
It provides a REST-like web API for HCatalog. Clients send it requests to get information about data sets and also to run PIG or HIVE scripts.
>> curl -s 'http://hostname:port/templeton/v1/ddl/database/db_name/table/table-name?user.name=hcatuser'
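WebHCat can also submit a HIVE query for execution; the sketch below follows the /templeton/v1/hive endpoint, with the host, port and user name as placeholders:
>> curl -s -d user.name=hcatuser -d execute="select+*+from+employee;" -d statusdir="employee.output" 'http://hostname:port/templeton/v1/hive'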