HCatalog graduated from the Apache incubator and merged with the Hive project on March 26, 2013.
Joe in data acquisition uses distcp to get data onto the grid.
hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
hcat -e "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"
Without HCatalog, Joe must tell Sally manually when the data is available, or Sally must poll HDFS for it.
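The polling alternative can be sketched as a small shell function. This is only an illustration and not part of Hadoop or HCatalog: the name wait_for_path, the attempt count, and the sleep interval are assumptions. The check command is passed in as arguments, so "hadoop fs -test -e" (which exits with status 0 when the path exists) can be replaced by any other existence test.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop or HCatalog): poll until a path exists.
# Usage: wait_for_path PATH ATTEMPTS INTERVAL_SECONDS CHECK_COMMAND...
# CHECK_COMMAND is run with PATH appended; exit status 0 means "path exists".
wait_for_path() {
    path=$1
    attempts=$2
    interval=$3
    shift 3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" "$path"; then
            return 0        # the data has arrived
        fi
        i=$((i + 1))
        sleep "$interval"   # wait before the next check
    done
    return 1                # gave up after ATTEMPTS checks
}
```

Sally could then gate her job on it, for example: wait_for_path /data/rawevents/20100819/data 60 60 hadoop fs -test -e && pig cleanse.pig (the script name cleanse.pig is a placeholder). The cost of this approach is repeated load on HDFS and a delay of up to one interval, which is what the notification described next avoids.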
A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, ...);
B = filter A by bot_finder(zeta) == 0;
...
store Z into 'data/processedevents/20100819/data';
With HCatalog, a JMS message is sent when the data is available, and the Pig job can then be started.
A = load 'rawevents' using org.apache.hive.hcatalog.pig.HCatLoader();
B = filter A by date == '20100819' and bot_finder(zeta) == 0;
...
store Z into 'processedevents' using org.apache.hive.hcatalog.pig.HCatStorer('date=20100819');
Without HCatalog, Robert must alter the table to add the required partition.
alter table processedevents add partition (date='20100819') location 'hdfs://data/processedevents/20100819/data';
select advertiser_id, count(clicks) from processedevents where date = '20100819' group by advertiser_id;
With HCatalog, Robert does not need to modify the table structure.
select advertiser_id, count(clicks) from processedevents where date = '20100819' group by advertiser_id;
WebHCat is a REST API for HCatalog. (REST stands for "representational state transfer", a style of API based on HTTP verbs.) WebHCat was originally named Templeton. For more information, see the WebHCat manual.
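As a rough illustration of what a WebHCat request looks like, the sketch below builds the URL for listing a table's partitions. The function name webhcat_partitions_url and the host, database, table, and user values are placeholders. The port 50111 and the templeton/v1 path prefix are the WebHCat defaults and may differ on a given installation.

```shell
#!/bin/sh
# Hypothetical helper: build the WebHCat URL that lists the partitions of a table.
# Usage: webhcat_partitions_url HOST DATABASE TABLE USER [PORT]
# PORT defaults to 50111, the default WebHCat port (an assumption about the setup).
webhcat_partitions_url() {
    host=$1
    db=$2
    table=$3
    user=$4
    port=${5:-50111}
    printf 'http://%s:%s/templeton/v1/ddl/database/%s/table/%s/partition?user.name=%s\n' \
        "$host" "$port" "$db" "$table" "$user"
}
```

A GET request against the resulting URL, for example curl -s "$(webhcat_partitions_url localhost default rawevents joe)", should return a JSON description of the partitions when a WebHCat server is running at that address.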