KIP 10: Refactor Hive and Hadoop dependency

Reference issue: KYLIN-5069

01 Background

At present, Kylin 4.0 still obtains Hive metadata through a Hive client when loading a Hive table. To do so, the hive_dependency jars under $HIVE_HOME/lib have to be loaded into the Kylin environment. User feedback shows that, because users run different versions of Hive, the hive_dependency differs as well, and class conflicts often occur when loading Hive tables.

In addition, Kylin loads all classes on the Hadoop classpath into its environment and then, at runtime, uses SparkClassLoader to filter and load the dependencies in the Kylin environment. The filtered classpath is used as the environment for starting Sparder. This process can be simplified: load only the required classes into the Kylin environment and remove the SparkClassLoader class-loading step.

To solve these problems, we plan to manage dependency loading uniformly through Spark:

1. Remove the Hive dependency from Kylin 4.0 and use SparkSession to obtain Hive metadata.
2. Sort out the Hadoop classpath, load only the Hadoop-related jar packages that Kylin 4.0 actually needs into the Kylin 4.0 environment, and remove SparkClassLoader.

02 Dev Design

What needs to be done is as follows:

1. Remove the loading of Hive dependencies from the Kylin startup script kylin.sh;

2. Instead of adding every jar package on the hadoop classpath to the classpath in kylin.sh, sort and filter the jar packages under the Hadoop lib directory and copy only the required ones to the ${SPARK_HOME}/jars directory (only when ${SPARK_HOME} is ${KYLIN_HOME}/spark);

3. Modify the classpath that kylin.sh loads for Kylin. The previous classpath includes the Kylin server classpath, ${KYLIN_HOME}/conf, ${KYLIN_HOME}/lib/*, ${KYLIN_HOME}/ext/*, the hadoop classpath and the hive classpath. The modified classpath includes only the Kylin server classpath, ${KYLIN_HOME}/conf, ${KYLIN_HOME}/lib/*, ${KYLIN_HOME}/ext/*, ${KYLIN_HOME}/hadoop_conf/* and ${SPARK_HOME}/jars/*. In other words, the previous hadoop classpath and hive classpath are replaced by ${SPARK_HOME}/jars/*;

4. Add a SparkHiveClient class that implements the IHiveClient interface, using SparkSession to implement its methods;

5. Replace the original CLIHiveClient and BeelineHiveClient classes in Kylin 4.0 with the SparkHiveClient class;

6. Clean up the code that is no longer used.
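
Steps 2 and 3 could be sketched in shell roughly as follows. This is a minimal illustration under stated assumptions, not the actual contents of kylin.sh: the function names copy_required_hadoop_jars and build_kylin_classpath are hypothetical, and the jar-name prefixes are supplied by the caller because the list of Hadoop jars that Kylin really needs is not given here.

```shell
# Hypothetical helpers illustrating steps 2 and 3; names are not taken from kylin.sh.

# Step 2: copy only the jars whose file names start with one of the given
# prefixes from a Hadoop lib directory into ${SPARK_HOME}/jars.
# The copy is skipped unless Spark is the one located at ${KYLIN_HOME}/spark.
copy_required_hadoop_jars() {
    hadoop_lib_dir="$1"
    shift
    if [ "${SPARK_HOME}" != "${KYLIN_HOME}/spark" ]; then
        return 0
    fi
    mkdir -p "${SPARK_HOME}/jars"
    for prefix in "$@"; do
        for jar in "${hadoop_lib_dir}/${prefix}"*.jar; do
            # An unmatched glob stays literal, so check the file exists.
            if [ -f "${jar}" ]; then
                cp "${jar}" "${SPARK_HOME}/jars/"
            fi
        done
    done
    return 0
}

# Step 3: assemble the modified classpath. The hadoop classpath and the
# hive classpath are no longer appended; ${SPARK_HOME}/jars/* replaces them.
# The wildcards are kept literal so that the JVM expands them.
build_kylin_classpath() {
    server_classpath="$1"
    printf '%s\n' "${server_classpath}:${KYLIN_HOME}/conf:${KYLIN_HOME}/lib/*:${KYLIN_HOME}/ext/*:${KYLIN_HOME}/hadoop_conf/*:${SPARK_HOME}/jars/*"
}
```

A caller would first run copy_required_hadoop_jars with the Hadoop lib directory and its chosen prefixes, then pass the output of build_kylin_classpath to the JVM with -cp.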

03 Configuration Change

kylin.source.hive.client: The original default value is "cli", and the supported values were "cli" and "beeline". After the change, the default value is "spark_catalog". Users who previously used "cli" or "beeline" should switch to "spark_catalog" to access Hive metadata.
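
For a deployment that set this property explicitly, the change would look roughly like the fragment below. It assumes the property lives in ${KYLIN_HOME}/conf/kylin.properties; since "spark_catalog" becomes the default, the line can presumably also be left out.

```properties
# Before: kylin.source.hive.client=cli (or beeline)
kylin.source.hive.client=spark_catalog
```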

04 Test

After the code was completed, compatibility tests were carried out in the environments supported by Kylin 4, mainly covering building, querying, and loading Hive tables. The tests passed in the following environments:

Hadoop Distribution	Spark	Hadoop	Hive	Cluster Manager	Distributed Filesystem	Verified	Comment
CDH 5.7	2.4.7/3.1.1	2.6.0-cdh5.7.6	1.1.0-cdh5.7.6	YARN	HDFS	verified
HDP 2.4	2.4.7/3.1.1	2.7.1.2.4.0.0-16	1.2.1000.2.4.0.0-16	YARN	HDFS	verified
AWS EMR 5.33.0	2.4.7/3.1.1	2.10.1-amzn-1	2.3.7-amzn-4	YARN	HDFS/S3	verified
CDH 6.2.0	2.4.7/3.1.1	3.0.0-cdh6.2.0	2.1.1-cdh6.2.0	YARN	HDFS	verified	Jar packages must be prepared and placed in the specified directory; see "Deploy Kylin 4 on CDH 6"
AWS EMR 6.3.0	3.1.1	3.2.1-amzn-3	3.1.2-amzn-4	YARN	HDFS/S3	verified
Apache	3.1.1	3.2.0	2.3.9	YARN, Standalone	S3	verified	http://kylin.apache.org/docs40/install/deploy_without_hadoop.html
