Design Proposal of Kafka Connector (From side)

Background

Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.

 Sqoop 2 will support a Kafka connector (SQOOP-1851). Currently, the Kafka connector (To side) is supported (SQOOP-1852). This design doc is for the Kafka connector (From side) (SQOOP-1853).

Requirements

  1. Ability to read from a Kafka server as a consumer.
  2. Support partition reading (SQOOP-2434). Since Kafka topics are already partitioned, Kafka partitions can be mapped directly to Sqoop partitions.
  3. Support incremental reading (SQOOP-2435). The offset starts at 0 on the first run; the user can set the offset in the job config. (See the consumer sketch after this list.)
  4. Multi-topic support will be a future improvement.
  5. Support the CSV format. Support for other data formats (Avro/Parquet) will be a future improvement.
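
The partition and offset requirements above map directly onto Kafka's consumer API. The following sketch only illustrates reading a single Kafka partition from a given offset using the standard KafkaConsumer client; the broker address, topic name, and offset values are hypothetical, and this is not the connector's actual implementation.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class KafkaPartitionReadSketch {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
          // One Sqoop partition would correspond to one Kafka topic-partition.
          TopicPartition tp = new TopicPartition("sqoop-topic", 0);  // hypothetical topic
          consumer.assign(Collections.singletonList(tp));
          consumer.seek(tp, 0L);  // 0 on the first run; a saved offset on later runs

          ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
          for (ConsumerRecord<String, String> record : records) {
            // The extractor would hand each value to Sqoop here.
            System.out.println(record.offset() + ": " + record.value());
          }
        }
      }
    }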

Design

  1. Basic function:
    • FromJobConfig includes the following arguments (see the class sketch after this list):
      1. From topic, which is mandatory.
      2. Offset, which is optional and is used for incremental reading.
    • KafkaPartition includes the following arguments:
      1. Partition, which is mandatory. Sqoop partitions are split according to Kafka partitions. Once multi-topic support is added, topic+partition is mandatory.
  2. Write data into a new dataset:
    • The job will fail if the target dataset already exists.
    • Every KiteDatasetLoader will create a temporary dataset and write data into it. The name of each temporary dataset is expected to be unique and new.
    • If the job succeeds, all temporary datasets will be merged into one.
    • If the job fails, all temporary datasets will be removed.
    • As Kite uses Avro, data records will be converted from Sqoop objects (FixedPoint, Text, etc.) to Avro objects. (See also Future Work #3.)
  3. Read data from a dataset:
    • The job will fail if the target dataset does not exist or is not accessible.
    • Every KiteDatasetPartition should contain partition strategy information. If none is specified, there will be only one partition.
    • Every KiteDatasetExtractor will read data from its partition.
    • If an error occurs during reading, a SqoopException will be thrown.
    • As Kite uses Avro, data records will be converted from Avro objects to Sqoop objects (FixedPoint, Text, etc.). (See also Future Work #3.)
  4. Partition strategy handling:
  5. Incremental Import:
    • A ToJobConfig property "bool: AppendMode" is required.
    • If the target dataset does not exist, the job will fail.
    • If the target dataset exists, the implementation will defensively check the dataset metadata (e.g. schema, partition strategy).
    • Records will only be appended to the existing dataset. Failures caused by duplicates are not handled.
    • Most of the implementation should follow section 2.
  6. Read data from a dataset with constraints:
    • A FromJobConfig property "str: Constraint" is required.
    • Build a view query to read the data.
    • Most of the implementation should follow section 3.
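
Below is a minimal sketch of the FromJobConfig and KafkaPartition data holders described in item 1. The field names, types, and the write/readFields serialization hooks are assumptions based on the description above, not the final connector API.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    // Hypothetical job-level configuration: topic is mandatory, offset is optional
    // and only used for incremental reads (the first run defaults to offset 0).
    class FromJobConfig {
      String topic;   // mandatory
      Long offset;    // optional; null means "start from 0"
    }

    // One Sqoop partition per Kafka partition (topic + partition id once
    // multi-topic support is added). The write/readFields pair mirrors the
    // serialization hooks a Sqoop 2 partition implementation needs.
    class KafkaPartition {
      String topic;
      int partitionId;
      long startOffset;   // inclusive; from FromJobConfig or previously saved state
      long lastOffset;    // exclusive upper bound captured when the job starts

      public void write(DataOutput out) throws IOException {
        out.writeUTF(topic);
        out.writeInt(partitionId);
        out.writeLong(startOffset);
        out.writeLong(lastOffset);
      }

      public void readFields(DataInput in) throws IOException {
        topic = in.readUTF();
        partitionId = in.readInt();
        startOffset = in.readLong();
        lastOffset = in.readLong();
      }

      @Override
      public String toString() {
        return topic + ":" + partitionId + "[" + startOffset + "," + lastOffset + ")";
      }
    }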

Testing

  1. Unit testing to ensure the correctness of utils.
  2. Integration testing to ensure data can be moved from JDBC to HDFS/Hive.

Future Work

  1. As Sqoop 2 does not allow specifying an InputFormat and OutputFormat, data reading can be inefficient because we cannot create concurrent data readers, especially for an un-partitioned dataset. This still needs some investigation with the Kite team to find a solution.
  2. HBase support (SQOOP-1744) will be an individual improvement to the original design.
  3. The current implementation uses the default IDF class (CSVIDF) for data conversion. AvroIDF has recently been introduced, and since Kite uses Avro internally, it makes sense to use AvroIDF instead of CSVIDF. This will involve two things:
    1. Clean up AvroTypeUtil and KiteDataTypeUtil.
    2. AvroIDF will be responsible for converting every Sqoop data type (FixedPoint, Text, etc.) to its corresponding Avro representation (see the sketch after this list).
  4. (VB): The complex types array/map/enum are not supported in the current design/implementation.
  5. CSV format for HDFS-write via KiteConnector only supports "primitive types", since CSV is only experimentally supported in the Kite SDK.
  6. The design details of Delta Write in Kite-HDFS are not included in this wiki; another design wiki will be added for SQOOP-1999.
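
To illustrate the kind of mapping AvroIDF would perform (Future Work #3), the snippet below builds an Avro GenericRecord from values that would come from Sqoop's FixedPoint and Text types. The schema and field names are hypothetical and only illustrate the type mapping; they are not part of an actual AvroIDF implementation.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class AvroMappingSketch {
      public static void main(String[] args) {
        // Hypothetical schema standing in for the one AvroIDF would derive from
        // the Sqoop schema; the field names are illustrative only.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"row\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"         // Sqoop FixedPoint -> Avro long
            + "{\"name\":\"name\",\"type\":\"string\"}]}");  // Sqoop Text -> Avro string

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 42L);
        record.put("name", "example");
        System.out.println(record);  // {"id": 42, "name": "example"}
      }
    }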