
Hive supports both partitioned and unpartitioned external tables. In both cases, when a new table/partition is being added, the location is also specified for the new table/partition. Let us consider a specific example:

create table T (key string, value string) partitioned by (ds string, hr string);
insert overwrite table T partition (ds='1', hr='1') ...;
..
insert overwrite table T partition (ds='1', hr='24') ...;

T is a table partitioned by date (ds) and hour (hr), and Tsignal is an external table that conceptually serves as the signal table.

create external table Tsignal (key string, value string) partitioned by (ds string);

When all the hourly partitions have been created for a day (ds='1'), the corresponding partition can be added to Tsignal:

alter table Tsignal add partition (ds='1') location '<location of T>/ds=1';

There is an implicit dependency between Tsignal@ds=1 and T@ds=1/hr=1, T@ds=1/hr=2, ..., T@ds=1/hr=24, but that dependency is not captured anywhere in the metastore. It would be useful to be able to create that dependency explicitly. The dependency can then be used for all kinds of auditing purposes. For example, when the following query is performed:

select .. from Tsignal where ds = '1';

the inputs contain only Tsignal@ds=1, but they should also contain T@ds=1/hr=1, T@ds=1/hr=2, ..., T@ds=1/hr=24.
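The expected expansion of the query inputs can be sketched in Python. This is a toy model, not Hive code: `DEPENDENCIES` and `expand_inputs` are hypothetical names, and the `table@partition` strings simply follow the notation used above.

```python
# Toy model of query-input expansion; not part of Hive.
# DEPENDENCIES is a hypothetical stand-in for dependency data kept in the metastore.
DEPENDENCIES = {
    "Tsignal@ds=1": [f"T@ds=1/hr={h}" for h in range(1, 25)],
}


def expand_inputs(inputs, dependencies):
    """Return the inputs followed by every partition they depend on, without duplicates."""
    expanded = []
    seen = set()
    for name in inputs:
        for candidate in [name, *dependencies.get(name, [])]:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


# The query reads only Tsignal@ds=1, but the audited inputs also list T's hourly partitions.
audited_inputs = expand_inputs(["Tsignal@ds=1"], DEPENDENCIES)
```

With the dependency recorded, an audit of the query above would see Tsignal@ds=1 together with the hourly partitions of T it is built on.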

This dependency should be captured by the metastore. For simplicity, let us assume we create a new notion of dependent tables (instead of overloading external tables).

create dependency table Tdependent (key string, value string) partitioned by (ds string);

This is like an external table, but it also captures the dependency (external tables could also be enhanced to do the same).

alter table Tdependent add partition (ds='1') location '/T/ds=1' dependent partitions table T partitions (ds='1');
The "dependent partitions" clause specifies a partial partition spec for the dependent partitions.
Note that each table can point to a different location, so Hive needs to ensure that all the dependent partitions are under the location '/T/ds=1'.
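The two checks described here, matching a partial partition spec and verifying that every matched partition lies under the given location, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the dictionary layout of a partition are hypothetical, and locations are compared as plain POSIX-style paths rather than through any Hive or Hadoop API.

```python
import posixpath


def matches_partial_spec(partition_spec, partial_spec):
    """True if the partition agrees with every key given in the partial spec."""
    return all(partition_spec.get(key) == value for key, value in partial_spec.items())


def is_under(location, root):
    """True if location equals root or lies below it (plain path comparison)."""
    location = posixpath.normpath(location)
    root = posixpath.normpath(root)
    return location == root or location.startswith(root.rstrip("/") + "/")


def resolve_dependent_partitions(partitions, partial_spec, root):
    """Select partitions matching the partial spec; reject any located outside root."""
    selected = [p for p in partitions if matches_partial_spec(p["spec"], partial_spec)]
    outside = [p["location"] for p in selected if not is_under(p["location"], root)]
    if outside:
        raise ValueError(f"partitions outside {root}: {outside}")
    return selected


# Hypothetical catalog entries for T: 24 hourly partitions for ds='1', one for ds='2'.
t_partitions = [
    {"spec": {"ds": "1", "hr": str(h)}, "location": f"/T/ds=1/hr={h}"}
    for h in range(1, 25)
] + [{"spec": {"ds": "2", "hr": "1"}, "location": "/T/ds=2/hr=1"}]

# Partial spec (ds='1') leaves hr unspecified, so all hours of that day match.
dependents = resolve_dependent_partitions(t_partitions, {"ds": "1"}, "/T/ds=1")
```

Comparing against the root with a trailing slash keeps a sibling directory such as `/T/ds=10` from being mistaken for a child of `/T/ds=1`.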

The metastore can store the dependencies completely or partially.

  • Materialize the dependencies both-ways
    Tdependent@ds=1 depends on T@ds=1/hr=1 to T@ds=1/hr=24
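Materializing the dependencies both ways can be pictured as keeping a forward and a reverse index. The sketch below is an in-memory illustration only; `DependencyStore` and its attribute names are hypothetical and say nothing about how the metastore schema would actually represent this.

```python
from collections import defaultdict


class DependencyStore:
    """Hypothetical in-memory store that records each dependency in both directions."""

    def __init__(self):
        self.depends_on = defaultdict(set)   # dependent partition -> partitions it is built on
        self.depended_by = defaultdict(set)  # source partition -> partitions built on it

    def add(self, dependent, source):
        """Record that `dependent` depends on `source`, updating both indexes."""
        self.depends_on[dependent].add(source)
        self.depended_by[source].add(dependent)


store = DependencyStore()
for hour in range(1, 25):
    store.add("Tdependent@ds=1", f"T@ds=1/hr={hour}")
```

The forward index answers "what does Tdependent@ds=1 depend on", while the reverse index answers "what depends on T@ds=1/hr=7" without scanning every dependent partition.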