Design Proposal of Kite Connector

Background

Kite SDK is an open source set of libraries for building data-oriented systems and applications. With the Kite dataset API, you can perform tasks such as reading a dataset, defining and reading views of a dataset and using MapReduce to process a dataset.

Sqoop 1 recently added support for the Parquet file format on HDFS/Hive using the Kite SDK (SQOOP-1366). SQOOP-1529 proposes a Kite dataset connector for Sqoop 2 that can access HDFS and Hive datasets. Its behavior is expected to be similar to that of the Kite CLI.

Requirements

  1. Ability to write a new HDFS/Hive dataset via the Kite connector in several file storage formats (Avro, Parquet and, experimentally, CSV) and compression codecs (Uncompressed, Snappy, Deflate, etc.).
  2. Ability to read an entire HDFS/Hive dataset via the Kite connector.
  3. Ability to specify a partition strategy.
  4. Ability to perform delta writes to an HDFS/Hive dataset.
  5. Ability to read a subset of an HDFS/Hive dataset using constraints.

Design

  1. Config objects:
    • ToJobConfig includes the arguments that the Kite CLI provides for import.
      1. Dataset URI is mandatory.
      2. Output storage format (enum: Avro, Parquet or, experimentally, CSV) is mandatory.
      3. Compression codec (enum: Default, Avro or Deflate) is optional (no JIRA yet).
      4. Path to a JSON file that defines the partition strategy is optional (no JIRA yet).
      5. User input validation will happen in place.
    • FromJobConfig includes the arguments that the Kite CLI provides for export.
      1. Dataset URI is mandatory.
      2. User input validation will happen in place.
    • LinkConfig is intended to store credential properties.
      1. E.g. the host and port of the NameNode, and the host and port of the Hive metastore. Imagine we build role-based access control: a user is able to access a particular ToJobConfig and FromJobConfig, but only an admin is able to access the LinkConfig. The admin does not want users to know or change the NameNode address, so LinkConfig is the right place to put credential properties.
      2. SQOOP-1751 contains some discussion about this.
  2. Write data into a new dataset:
    • The job will fail if the target dataset already exists.
    • Every KiteDatasetLoader will create a temporary dataset and write data into it. The name of each temporary dataset is expected to be unique and new.
    • If the job completes successfully, all temporary datasets will be merged into one.
    • If the job fails, all temporary datasets will be removed.
    • As Kite uses Avro, data records will be converted from Sqoop objects (FixedPoint, Text, etc.) to Avro objects. (See also future work #3.) A minimal Kite SDK sketch of the write and read flows is shown after this list.
  3. Read data from a dataset:
    • The job will fail if the target dataset does not exist or is not accessible.
    • Every KiteDatasetPartition should contain partition strategy information. If none is specified, there will be only one partition.
    • Every KiteDatasetExtractor will read data from its partition.
    • If an error occurs during reading, a SqoopException will be thrown.
    • As Kite uses Avro, data records will be converted from Avro objects to Sqoop objects (FixedPoint, Text, etc.). (See also future work #3.)
  4. Partition strategy handling:
  5. Incremental Import:
    • A ToJobConfig property "bool: AppendMode" is required.
    • If the target dataset does not exist, the job will fail.
    • If the target dataset exists, the implementation will defensively check the dataset metadata (e.g. schema, partition strategy).
    • It will only append records to the existing dataset. Failures caused by duplicate records are not handled.
    • Most of the implementation should follow section 2.
  6. Read data from a dataset with constraints:
    • A FromJobConfig property "str: Constraint" is required.
    • Build a view query to read the data (also illustrated in the sketch after this list).
    • Most of the implementation should follow section 3.
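
To make items 2, 3 and 6 above more concrete, below is a minimal, hedged sketch of the Kite SDK calls that a KiteDatasetLoader/KiteDatasetExtractor might wrap. The dataset URI, schema, field names and constraint value are invented for illustration; the real connector derives them from the link and job configs, and the temporary-dataset bookkeeping and error handling described above are omitted.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.generic.GenericRecordBuilder;
    import org.kitesdk.data.CompressionType;
    import org.kitesdk.data.Dataset;
    import org.kitesdk.data.DatasetDescriptor;
    import org.kitesdk.data.DatasetReader;
    import org.kitesdk.data.DatasetWriter;
    import org.kitesdk.data.Datasets;
    import org.kitesdk.data.Formats;
    import org.kitesdk.data.PartitionStrategy;

    public class KiteConnectorSketch {

      public static void main(String[] args) {
        // Hypothetical dataset URI; the connector would assemble it from LinkConfig/ToJobConfig.
        String uri = "dataset:hdfs://namenode:8020/tmp/users";

        // Avro schema of the records being moved (illustrative only).
        Schema schema = SchemaBuilder.record("users").fields()
            .requiredLong("id")
            .requiredString("name")
            .endRecord();

        // Optional partition strategy (config item 1.4); Kite can also parse one from a JSON file.
        PartitionStrategy strategy = new PartitionStrategy.Builder()
            .hash("id", 4)
            .build();

        // Item 2: create a new dataset with the requested format/codec and write Avro records.
        DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
            .schema(schema)
            .format(Formats.PARQUET)
            .compressionType(CompressionType.Snappy)
            .partitionStrategy(strategy)
            .build();
        Dataset<GenericRecord> created = Datasets.create(uri, descriptor, GenericRecord.class);

        DatasetWriter<GenericRecord> writer = created.newWriter();
        try {
          GenericRecord record = new GenericRecordBuilder(schema)
              .set("id", 1L)
              .set("name", "alice")
              .build();
          writer.write(record);
        } finally {
          writer.close();
        }

        // Items 3 and 6: load the dataset and read it back, optionally restricted by a
        // constraint expressed as a Kite view (here, records whose "name" equals "alice").
        Dataset<GenericRecord> loaded = Datasets.load(uri, GenericRecord.class);
        DatasetReader<GenericRecord> reader = loaded.with("name", "alice").newReader();
        try {
          while (reader.hasNext()) {
            GenericRecord r = reader.next();
            System.out.println(r.get("id") + "\t" + r.get("name"));
          }
        } finally {
          reader.close();
        }
      }
    }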

Testing

  1. Unit testing to ensure the correctness of utils.
  2. Integration testing to ensure data can be moved from JDBC to HDFS/Hive.

Future Work

  1. As Sqoop 2 does not allow specifying an InputFormat and OutputFormat, data reading can be inefficient because we cannot create concurrent data readers, especially for an un-partitioned dataset. This still needs some investigation with the Kite team to find a solution.
  2. HBase support (SQOOP-1744) will be an individual improvement to the original design.
  3. The current implementation uses the default IDF class (CSVIDF) for data conversion. We have recently introduced AvroIDF. As Kite uses Avro internally, it makes sense to use AvroIDF instead of CSVIDF. This will involve two things:
    1. Clean up AvroTypeUtil and KiteDataTypeUtil.
    2. AvroIDF will be responsible for converting every Sqoop data type (FixedPoint, Text, etc.) to the corresponding Avro representation (a rough sketch follows this list).
  4. (VB): Complex types (array, map, enum) are not supported in the current design/implementation.
  5. The CSV format for HDFS writes via the Kite connector only supports primitive types, since CSV is only experimentally supported in the Kite SDK.
  6. The design details of delta write for Kite-HDFS are not included in this wiki; another design wiki will be added for SQOOP-1999.
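
As a rough illustration of the AvroIDF conversion in future work item 3 (the exact type mapping is not decided here; the FixedPoint-to-long and Text-to-string choices below are assumptions made for the example, with a nullable column shown as a union with null), the converter might build Avro records like this:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class AvroMappingSketch {
      public static void main(String[] args) {
        // Assumed mapping: Sqoop FixedPoint -> Avro long, Sqoop Text -> Avro string;
        // a nullable column becomes a union of null and the base type.
        Schema schema = SchemaBuilder.record("row").fields()
            .requiredLong("id")        // FixedPoint
            .optionalString("name")    // Text, nullable -> union {null, string}
            .endRecord();

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 42L);
        record.put("name", "alice");
        System.out.println(record);
      }
    }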

3 Comments

  1. Finally we have a doc.  

    Here are some things I am hoping to get more details on. Please add these to the wiki so it is more detailed and complete, since we already have the implementation in place.

    • We can certainly add more details on how the Avro-to-Sqoop conversion is done. Why did we choose CSVIDF, and are there plans to change this connector to use AvroIDF? I think the performance impact of using CSV is very high, since every record goes through Avro-to-text conversion back and forth in both the FROM and TO parts.
    • Sqoop objects can mean text or an object array; which are we using in Kite?
    • Why do we call it hdfsport? Won't this be relevant to Hive as well?
    • The current code does not handle complex types in the Sqoop IDF, so add that as a limitation, or if there is an intention to address it, please say so explicitly and note whether it will be addressed in a ticket. If we are keeping CSVIDF for version 1 and doing the back-and-forth between Avro and Sqoop types, please create a table showing how each of them is mapped. See the AvroIDF wiki for an example of how this mapping is depicted.
    • In particular, the UNION handling seems one-off.
    • Please elaborate more on which formats the Kite connector supports; a table like this would be helpful:

                    | TEXT | AVRO | Parquet
          HDFS FROM |  ?   |  ?   |
          HDFS TO   |      |      |
          HIVE FROM |      |      |
          HIVE TO   |  ?   |      |
    • I would provide more details on when and how we are supporting partitioning in the Kite connector. Is this going to be a config in the FromJob and ToJob? I presume so, since SQOOP-1744 will be doing the same.
    • >>> The fault handling is not an obligation of the Kite connector in read mode.
      Whose obligation is it, if not the Kite connector's? Also, please explain what fault handling means here; an example describing a failure scenario during reading would be good to have.
    • I was hoping to see more concrete details on how the Hive functionality will be supported in the existing Sqoop 2 code for the Kite connector. How will the user know whether it is Hive or HDFS when using this connector?
    • There is some code in KiteDataTypeUtil and some code in AvroTypeUtil; can we clean this up and merge it into one KiteDataTypeUtil, since CSV-to-Avro conversion is custom to Kite at this point? I don't think there will be a need for this in the common connector-sdk, but even if there is, please combine it into one.
    • What is the baseline for performance testing? Is there a ticket for this if you intend to provide results?

    I would not list HBase support as a limitation: it was not even part of the requirements, so it is odd to call it a limitation.

    Regarding delta writing:

    • Again, more details in the design are welcome. What do the configs for delta read/write look like?
    • I only see delta write and no delta read; does this apply to both HDFS and Hive?
    • What does the write strategy look like? The description says to use section 2 if the dataset does not exist, so does this mean we will create a new dataset? Is this expected of a delta write, or should we fail hard saying the dataset does not exist?
    • A lot of the design details read like requirements ("should handle ...") rather than details on how you propose to handle them.
    • Schema validation is a good proposal, but I wonder why it is required. Wouldn't the matcher code take care of this? I am unclear on what counts as an invalid schema versus making a best effort at schema matching.

  2. One more question about converting the Avro UNION type to a Sqoop type: is this the best way to handle it? Shouldn't the alternative be to add such a type in Sqoop?

    How do we assume there are only a first and a second type? Can't there be more than that in an Avro record?

    private static Column avroTypeToSchemaType(Schema.Field field) {
        Schema.Type schemaType = field.schema().getType();
        if (schemaType == Schema.Type.UNION) {
          List<Schema> unionSchema = field.schema().getTypes();
          if (unionSchema.size() == 2) {
            Schema.Type first = unionSchema.get(0).getType();
            Schema.Type second = unionSchema.get(1).getType();
            if ((first == Schema.Type.NULL && second != Schema.Type.NULL) ||
                (first != Schema.Type.NULL && second == Schema.Type.NULL)) {
              return avroPrimitiveTypeToSchemaType(field.name(),
                  first != Schema.Type.NULL ? first : second);
            }
          }
          // This is an unsupported complex data type
          return new Unknown(field.name());
        }
  3. NOTE: Kite SDK support for formats

    Qian, please fill in the blanks when the Hive implementation is ready. We want to state what the current limitations of the Kite connector are and what they are not.

    Format support | Text/CSV                                            | Avro supported | Parquet supported
    HDFS read      | Yes, all types                                      | Yes, all types | Yes, all types
    HDFS write     | Only primitive types (such as int, float, string?)  | Yes, all types | Yes, all types
    HIVE read      | ??                                                  | ??             | ??
    HIVE write     | ??                                                  | ??             | ??