Design Proposal of Kite Connector
Kite SDK is an open source set of libraries for building data-oriented systems and applications. With the Kite dataset API, you can perform tasks such as reading a dataset, defining and reading views of a dataset and using MapReduce to process a dataset.
Sqoop 1 recently added support for the Parquet file format on HDFS/Hive using the Kite SDK (SQOOP-1366). This JIRA (SQOOP-1529) proposes a Kite dataset connector for Sqoop 2 that can access HDFS and Hive datasets. Its behavior is expected to be similar to that of the Kite CLI.
- Ability to write to a new HDFS/Hive dataset through the Kite connector, in different file storage formats (Avro, Parquet and, experimentally, CSV) and with different compression codecs (Uncompressed, Snappy, Deflate, etc.).
- Ability to read an entire HDFS/Hive dataset through the Kite connector.
- Ability to specify the partition strategy.
- Ability to support delta writes to an HDFS/Hive dataset.
- Ability to read part of an HDFS/Hive dataset by applying constraints.
- Config objects:
- ToJobConfig includes arguments that Kite CLI provides for import.
- Dataset URI is mandatory.
- Output storage format (Enum: Avro, Parquet or, experimentally, CSV) is mandatory.
- Compression codec (Enum: Default, Snappy or Deflate) is optional (no JIRA yet).
- Path to a JSON file that defines the partition strategy is optional (no JIRA yet).
- User input validation will happen in-place.
- FromJobConfig includes arguments that Kite CLI provides for export.
- Dataset URI is mandatory.
- User input validation will happen in-place.
- LinkConfig is intended to store credential properties.
- E.g. the host and port of the NameNode, or the host and port of the Hive metastore. Imagine we build role-based access control: a user is able to access particular ToJobConfig and FromJobConfig objects, but only an admin is able to access the LinkConfig. The admin does not want users to know or change the address of the NameNode, so LinkConfig is the right place to put credential properties.
- SQOOP-1751 has some discussion about this.
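The mandatory/optional split and the in-place validation described for ToJobConfig can be sketched as below. This is only an illustration, not the connector's actual class: `ToJobConfigSketch`, `FileFormatSketch`, the field names and the error messages are all hypothetical, and the check that the URI starts with `dataset:` is an assumption based on Kite's dataset URI convention.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical enum mirroring the storage formats listed in this proposal.
enum FileFormatSketch { AVRO, PARQUET, CSV }

// Hypothetical stand-in for ToJobConfig; field names are illustrative only.
class ToJobConfigSketch {
    String uri;                    // mandatory
    FileFormatSketch fileFormat;   // mandatory
    String compressionCodec;       // optional
    String partitionStrategyPath;  // optional

    // In-place validation: returns every problem found, or an empty list if the config is valid.
    List<String> validate() {
        List<String> errors = new ArrayList<>();
        if (uri == null || uri.trim().isEmpty()) {
            errors.add("Dataset URI is mandatory");
        } else if (!uri.startsWith("dataset:")) {
            // Assumption: target datasets are addressed with Kite's "dataset:" scheme.
            errors.add("Dataset URI must start with 'dataset:'");
        }
        if (fileFormat == null) {
            errors.add("Output storage format is mandatory");
        }
        return errors;
    }
}
```

Collecting all errors rather than failing on the first one lets a user fix every invalid input in a single pass; the optional fields are deliberately left unchecked in this sketch.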
- Write data into a new dataset:
- The job will fail if the target dataset already exists.
- Every KiteDatasetLoader will create a temporary dataset and write data into it. The name of each temporary dataset is expected to be unique and new.
- If the job succeeds, all temporary datasets will be merged into one.
- If the job fails, all temporary datasets will be removed.
- As Kite uses Avro, data records will be converted from Sqoop objects (FixedPoint, Text, etc.) to Avro objects. (See also future work #3)
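The temporary-dataset lifecycle above (create per loader, merge on success, remove on failure) can be sketched with an in-memory map standing in for the dataset repository. Everything here is hypothetical: the class name, the `_temp_` naming scheme and the use of a UUID for uniqueness are assumptions for illustration, and real code would go through the Kite dataset API instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// In-memory stand-in for a dataset repository (dataset name -> records).
class TemporaryDatasetSketch {
    static final String TEMP_MARKER = "_temp_";
    final Map<String, List<String>> datasets = new LinkedHashMap<>();

    // Called once per loader: fails if the target exists, otherwise
    // creates a uniquely named temporary dataset and returns its name.
    String createTemporary(String target) {
        if (datasets.containsKey(target)) {
            throw new IllegalStateException("Target dataset already exists: " + target);
        }
        String name = target + TEMP_MARKER + UUID.randomUUID().toString().replace("-", "");
        datasets.put(name, new ArrayList<>());
        return name;
    }

    void write(String dataset, String record) {
        datasets.get(dataset).add(record);
    }

    // Job succeeded: merge every temporary dataset into the target.
    void commit(String target) {
        List<String> merged = new ArrayList<>();
        for (String name : temporaryNames(target)) {
            merged.addAll(datasets.remove(name));
        }
        datasets.put(target, merged);
    }

    // Job failed: drop every temporary dataset.
    void abort(String target) {
        for (String name : temporaryNames(target)) {
            datasets.remove(name);
        }
    }

    // Returns a snapshot so callers can remove entries while iterating over it.
    private List<String> temporaryNames(String target) {
        List<String> names = new ArrayList<>();
        for (String name : datasets.keySet()) {
            if (name.startsWith(target + TEMP_MARKER)) {
                names.add(name);
            }
        }
        return names;
    }
}
```

Because the target only appears in `commit`, a failed job never leaves a partially written target dataset behind in this model.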
- Read data from a dataset:
- The job will fail if the target dataset does not exist or is not accessible.
- Every KiteDatasetPartition should contain partition strategy information. If it is not specified, there will be only one partition.
- Every KiteDatasetExtractor will read data from its partition.
- If an error occurs during reading, a SqoopException will be thrown.
- As Kite uses Avro, data records will be converted from Avro to Sqoop objects (FixedPoint, Text, etc.) (See also future work #3)
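Both the write path and the read path above convert between Sqoop and Avro types. A minimal sketch of such a two-way mapping follows. The enums are simplified, hypothetical subsets of the real type systems, and the pairings (for example FixedPoint to an Avro long regardless of byte size) are assumptions chosen for illustration, not the connector's actual conversion rules.

```java
import java.util.EnumMap;
import java.util.Map;

// Simplified, hypothetical subsets of the Sqoop and Avro type systems.
enum SqoopTypeSketch { FIXED_POINT, FLOATING_POINT, TEXT, BINARY, BIT }
enum AvroTypeSketch { LONG, DOUBLE, STRING, BYTES, BOOLEAN }

class TypeMappingSketch {
    private static final Map<SqoopTypeSketch, AvroTypeSketch> TO_AVRO =
        new EnumMap<>(SqoopTypeSketch.class);
    private static final Map<AvroTypeSketch, SqoopTypeSketch> TO_SQOOP =
        new EnumMap<>(AvroTypeSketch.class);

    // Registers one pairing in both directions so the two maps stay consistent.
    private static void register(SqoopTypeSketch sqoop, AvroTypeSketch avro) {
        TO_AVRO.put(sqoop, avro);
        TO_SQOOP.put(avro, sqoop);
    }

    static {
        // Assumed pairings; a real mapping would also consider sizes and nullability.
        register(SqoopTypeSketch.FIXED_POINT, AvroTypeSketch.LONG);
        register(SqoopTypeSketch.FLOATING_POINT, AvroTypeSketch.DOUBLE);
        register(SqoopTypeSketch.TEXT, AvroTypeSketch.STRING);
        register(SqoopTypeSketch.BINARY, AvroTypeSketch.BYTES);
        register(SqoopTypeSketch.BIT, AvroTypeSketch.BOOLEAN);
    }

    // Used when writing: Sqoop column type -> Avro field type.
    static AvroTypeSketch toAvro(SqoopTypeSketch type) {
        AvroTypeSketch avro = TO_AVRO.get(type);
        if (avro == null) {
            throw new IllegalArgumentException("Unsupported Sqoop type: " + type);
        }
        return avro;
    }

    // Used when reading: Avro field type -> Sqoop column type.
    static SqoopTypeSketch toSqoop(AvroTypeSketch type) {
        SqoopTypeSketch sqoop = TO_SQOOP.get(type);
        if (sqoop == null) {
            throw new IllegalArgumentException("Unsupported Avro type: " + type);
        }
        return sqoop;
    }
}
```

Registering each pair in both directions keeps the mapping round-trippable, which matters when the same dataset is written by one job and read by another.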
- Partition strategy handling:
- When writing data, if no partition strategy is specified, the dataset will be unpartitioned.
- When reading data, if the given dataset has a partition strategy, it should be used.
- Reference:
- Incremental Import:
- A ToJobConfig property "bool: AppendMode" is required.
- If the target dataset does not exist, the job will fail.
- If the target dataset exists, the implementation will defensively check the dataset metadata (e.g. schema, partition strategy).
- The job will only append records to the existing dataset. If it fails because of a duplicate, that failure is not handled.
- Most of the implementation should follow section 2.
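The defensive metadata check for append mode can be sketched as follows. The class names are hypothetical, and representing the schema and partition strategy as plain strings is a simplification made for this example; the actual comparison rules (for instance whether compatible but non-identical schemas are accepted) are not specified in this proposal.

```java
import java.util.Objects;

// Hypothetical, simplified view of a dataset's metadata.
class DatasetMetadataSketch {
    final String schema;
    final String partitionStrategy; // null means unpartitioned

    DatasetMetadataSketch(String schema, String partitionStrategy) {
        this.schema = schema;
        this.partitionStrategy = partitionStrategy;
    }
}

class AppendModeSketch {
    // Throws if appending the incoming data to the existing dataset is not safe.
    static void checkAppendable(DatasetMetadataSketch existing, DatasetMetadataSketch incoming) {
        if (existing == null) {
            // Append mode requires an existing target dataset.
            throw new IllegalStateException("Target dataset does not exist");
        }
        if (!Objects.equals(existing.schema, incoming.schema)) {
            throw new IllegalArgumentException("Schema mismatch");
        }
        if (!Objects.equals(existing.partitionStrategy, incoming.partitionStrategy)) {
            throw new IllegalArgumentException("Partition strategy mismatch");
        }
    }
}
```

This sketch uses strict equality, which is the most conservative choice; duplicate records are intentionally not checked, matching the statement above that duplicates are not handled.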
- Read data from a dataset with constraints:
- A FromJobConfig property "str: Constraint" is required.
- Build a view query to read data.
- Most of the implementation should follow section 3.
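One way to build a view from the Constraint property is to turn the dataset URI into a view URI. The sketch below assumes that the constraint is written as `key=value` pairs joined by `&` (the proposal does not define the syntax) and relies on Kite's convention that a view URI is a dataset URI with the `dataset:` prefix replaced by `view:` and constraints appended as query arguments. The class and method names are hypothetical, and URL encoding is omitted for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ViewUriSketch {
    // Parses an assumed "key=value&key=value" constraint string, keeping the given order.
    static Map<String, String> parseConstraint(String constraint) {
        if (constraint == null || constraint.trim().isEmpty()) {
            throw new IllegalArgumentException("Constraint is required");
        }
        Map<String, String> pairs = new LinkedHashMap<>();
        for (String part : constraint.split("&")) {
            String[] keyValue = part.split("=", 2);
            if (keyValue.length != 2 || keyValue[0].isEmpty()) {
                throw new IllegalArgumentException("Malformed constraint: " + part);
            }
            pairs.put(keyValue[0], keyValue[1]);
        }
        return pairs;
    }

    // Rewrites "dataset:..." as "view:...?k1=v1&k2=v2".
    static String toViewUri(String datasetUri, String constraint) {
        String prefix = "dataset:";
        if (!datasetUri.startsWith(prefix)) {
            throw new IllegalArgumentException("Not a dataset URI: " + datasetUri);
        }
        StringBuilder view = new StringBuilder("view:")
            .append(datasetUri.substring(prefix.length()));
        char separator = '?';
        for (Map.Entry<String, String> entry : parseConstraint(constraint).entrySet()) {
            view.append(separator).append(entry.getKey()).append('=').append(entry.getValue());
            separator = '&';
        }
        return view.toString();
    }
}
```

For example, `ViewUriSketch.toViewUri("dataset:hdfs:/tmp/data/ratings", "rating=5&year=2014")` returns `view:hdfs:/tmp/data/ratings?rating=5&year=2014`.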
- Unit testing to ensure the correctness of utils.
- Integration testing to ensure data can be moved from JDBC to HDFS/Hive.
Future Work
- As Sqoop 2 does not allow specifying an InputFormat and OutputFormat, data reading can be inefficient because we cannot create concurrent data readers, especially for an unpartitioned dataset. This still needs some investigation with the Kite team to find a solution.
- HBase support (SQOOP-1744) will be an individual improvement to the original design.
- The current implementation uses the default IDF class (CSVIDF) for data conversion. Recently we have introduced AvroIDF. As Kite uses Avro internally, it makes sense to use AvroIDF instead of CSVIDF. This will involve two things:
- Clean up AvroTypeUtil and KiteDataTypeUtil.
- AvroIDF will be responsible for converting every Sqoop data type (FixedPoint, Text, etc.) to the corresponding Avro representation.
- (VB): The complex types array/map/enum are not supported in the current design/implementation.
- The CSV format for HDFS writes via KiteConnector only supports "primitive types", since CSV is only experimentally supported in the Kite SDK.
- The design details of Delta Write in Kite-HDFS are not included in this wiki. Another design wiki will be added for SQOOP-1999.
Veena Basavaraj
Finally we have a doc.
Here are some things I am hoping to get more details on. Please add these to the wiki so that it is more detailed and complete, since we already have the implementation in place.
>>>The fault handling is not an obligation of Kite connector in read mode.
I would not list the HBase support as a limitation. It was not even part of the requirements, so it is odd to call it a limitation.
Regarding Delta Writing :
Veena Basavaraj
One more question: for converting the Avro UNION type to a Sqoop type, is this the best way to handle it? Shouldn't an alternative be to add such a type to Sqoop?
How can we assume there are only a first and a second type? Can't there be more than that in an Avro record?
Veena Basavaraj
NOTE: Kite SDK support for formats
Qian, please fill in the blanks when the Hive implementation is ready. We want to state what is and what is not a current limitation of the Kite connector.
Yes all types
only primitive types are supported
( such as int, float, string?)