Apache Kylin : Analytical Data Warehouse for Big Data
Welcome to Kylin Wiki.
Page History
Date | Author | Comment
---|---|---
2021-01-06 | xxyu@apache.org | Create for Kylin 4.0.0-beta.
NOTES:
If you are using Apache Kylin 4.0.1 and your $SPARK_HOME points to $KYLIN_HOME/spark, you can skip this document:
Kylin 4.0.1 performs the jar replacement described below automatically at startup, so these steps are not required.
However, the automatic replacement may fail. If you encounter problems such as a ClassNotFoundException while
running Kylin 4.0.1, refer back to this document and replace the affected jars manually.
Kylin on EMR 5.31
Create an EMR cluster
# Create an EMR cluster
$ aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Pig Name=Spark Name=Sqoop Name=Tez Name=ZooKeeper \
    --release-label emr-5.31.0 \
    --ec2-attributes '{"KeyName":"XiaoxiangYu","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-XXX","EmrManagedSlaveSecurityGroup":"XXX","EmrManagedMasterSecurityGroup":"XXX"}' \
    --log-uri 's3n://aws-logs-XXX/elasticmapreduce/xiaoxiangyu' \
    --instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":100,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"MASTER","InstanceType":"m5.xlarge","Configurations":[{"Classification":"hive-site","Properties":{"hive.optimize.sort.dynamic.partition":"false"}}],"Name":"Master Node"},{"InstanceCount":2,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":50,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"CORE","InstanceType":"m5.xlarge","Configurations":[{"Classification":"hive-site","Properties":{"hive.optimize.sort.dynamic.partition":"false"}}],"Name":"Worker Node"}]' \
    --configurations '[{"Classification":"mapred-site","Properties":{"mapreduce.map.memory.mb":"3072","mapreduce.reduce.memory.mb":"6144","mapreduce.map.java.opts":"-Xmx2458m","mapreduce.reduce.java.opts":"-Xmx4916m"}},{"Classification":"yarn-site","Properties":{"yarn.nodemanager.resource.cpu-vcores":"4","yarn.nodemanager.resource.memory-mb":"12288","yarn.scheduler.maximum-allocation-mb":"12288","yarn.app.mapreduce.am.resource.mb":"6144"}},{"Classification":"emrfs-site","Properties":{"fs.s3.consistent":"false"}}]' \
    --auto-scaling-role EMR_AutoScaling_DefaultRole \
    --ebs-root-volume-size 50 --service-role EMR_DefaultRole --enable-debugging --scale-down-behavior TERMINATE_AT_TASK_COMPLETION \
    --name 'OSS-Dev-Cluster' \
    --region XXX

# Log in to the master node
$ ssh -i ~/XXX.pem hadoop@ec2-XXX.compute.amazonaws.com.cn
Check Hadoop version and download Kylin and Spark
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# hadoop version
Hadoop 2.10.0-amzn-0
Subversion git@aws157git.com:/pkg/Aws157BigTop -r d1e860a34cc1aea3d600c57c5c0270ea41579e8c
Compiled by ec2-user on 2020-09-19T02:05Z
Compiled with protoc 2.5.0
From source with checksum 61f0bc74ab37bcbfbc09b3846ee32b
This command was run using /usr/lib/hadoop/hadoop-common-2.10.0-amzn-0.jar

[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# hive --version
Hive 2.3.7-amzn-1
Git git://ip-10-0-0-57/workspace/workspace/bigtop.release-rpm-5.31.0/build/hive/rpm/BUILD/apache-hive-2.3.7-amzn-1-src -r d1e860a34cc1aea3d600c57c5c0270ea41579e8c
Compiled by ec2-user on Sat Sep 19 02:48:49 UTC 2020
From source with checksum b7d9cc83f78a0b3e0f2b22c78e54aae1

[root@ip-172-31-1-253 hadoop]# aws s3 cp s3://XXX/xxyu_upload/apache-kylin-4.0.0-SNAPSHOT-bin.tar-b08c1be22eb51796fb58c3694e86e60f948337f6.gz .
download: s3://xiaoxiang-yu/xxyu_upload/apache-kylin-4.0.0-SNAPSHOT-bin.tar-b08c1be22eb51796fb58c3694e86e60f948337f6.gz to ./apache-kylin-4.0.0-SNAPSHOT-bin.tar-b08c1be22eb51796fb58c3694e86e60f948337f6.gz
[root@ip-172-31-1-253 hadoop]# tar zxf apache-kylin-4.0.0-SNAPSHOT-bin.tar-b08c1be22eb51796fb58c3694e86e60f948337f6.gz
[root@ip-172-31-1-253 hadoop]# cd apache-kylin-4.0.0-SNAPSHOT-bin/
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# aws s3 cp s3://XXX/xxyu_upload/spark-2.4.6-bin-hadoop2.7.tgz .
download: s3://XXX/xxyu_upload/spark-2.4.6-bin-hadoop2.7.tgz to ./spark-2.4.6-bin-hadoop2.7.tgz
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# tar zxf spark-2.4.6-bin-hadoop2.7.tgz
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# mv spark-2.4.6-bin-hadoop2.7 spark
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# cat commit_SHA1
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
b08c1be22eb51796fb58c3694e86e60f948337f6
Prepare kylin.properties
Edit $KYLIN_HOME/conf/kylin.properties and add the following content.
kylin.metadata.url=kylin_default_instance@jdbc,url=jdbc:mysql://${MASTER_HOST}:3306/kylin,driverClassName=org.mariadb.jdbc.Driver,username=${USER_NAME},password=${PASSWORD}
kylin.env.zookeeper-connect-string=${MASTER_HOST}
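The `${MASTER_HOST}`, `${USER_NAME}`, and `${PASSWORD}` placeholders must be filled in with your cluster's values. A minimal sketch of doing this from the shell, using hypothetical example values (not from your cluster):

```shell
# Hypothetical values -- substitute your real EMR master host and MySQL credentials.
MASTER_HOST="ip-172-31-1-253.cn-northwest-1.compute.internal"
USER_NAME="kylin"
PASSWORD="changeit"

# Append the two settings to kylin.properties (KYLIN_HOME falls back to . here).
mkdir -p "${KYLIN_HOME:-.}/conf"
cat >> "${KYLIN_HOME:-.}/conf/kylin.properties" <<EOF
kylin.metadata.url=kylin_default_instance@jdbc,url=jdbc:mysql://${MASTER_HOST}:3306/kylin,driverClassName=org.mariadb.jdbc.Driver,username=${USER_NAME},password=${PASSWORD}
kylin.env.zookeeper-connect-string=${MASTER_HOST}
EOF
```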
Prepare Metastore (Optional, only for test purpose)
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 70
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> CREATE USER '${USER_NAME}'@'${MASTER_HOST}' IDENTIFIED BY '${PASSWORD}';
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> GRANT ALL PRIVILEGES ON *.* TO '${USER_NAME}'@'${MASTER_HOST}' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> Bye

[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# mysql -u ${USER_NAME} -h ${MASTER_HOST} -p
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 73
Server version: 5.5.68-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> create database kylin;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hive               |
| hue                |
| kylin              |
| mysql              |
| performance_schema |
+--------------------+
6 rows in set (0.00 sec)

MariaDB [(none)]> Bye
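The interactive session above can also be scripted. A hedged sketch that simply emits the same SQL so it can be piped into `mysql`; the `emit_kylin_ddl` helper and all values are hypothetical, and `IF NOT EXISTS` is added to make re-runs safe:

```shell
# Emit the DDL from the session above; pipe it to mysql on the master node, e.g.:
#   emit_kylin_ddl kylin ip-172-31-1-253 changeit | mysql
emit_kylin_ddl() {
  user="$1"; host="$2"; pass="$3"
  cat <<EOF
CREATE USER '${user}'@'${host}' IDENTIFIED BY '${pass}';
GRANT ALL PRIVILEGES ON *.* TO '${user}'@'${host}' WITH GRANT OPTION;
FLUSH PRIVILEGES;
CREATE DATABASE IF NOT EXISTS kylin;
EOF
}
```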
Replace jars under $KYLIN_HOME/spark/jars
The Spark distribution we downloaded is built for Apache Hadoop 2.7, so its Hadoop-related jars must be replaced with the EMR-provided ones.
// Step 1 : Add the JDBC driver (if you use mariadb)
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# find /usr/lib -name "*mariadb*"
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# mkdir ext
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# cp /usr/lib/hive/lib/mariadb-connector-java.jar ext

// Step 2 : Replace Hadoop-related jars
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# rm -rf $SPARK_HOME/jars/hadoop-*.jar
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# cp /usr/lib/spark/jars/hadoop-*.jar $SPARK_HOME/jars/
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# cp /usr/lib/spark/jars/emr-spark-goodies.jar $SPARK_HOME/jars/
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# cp /usr/lib/spark/jars/htrace-core4-4.1.0-incubating.jar $SPARK_HOME/jars
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# cp /usr/lib/hadoop-lzo/lib/hadoop-lzo-0.4.19.jar $SPARK_HOME/jars
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# cp /usr/lib/hadoop/lib/woodstox-core-5.0.3.jar $SPARK_HOME/jars/
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# cp /usr/lib/hadoop/lib/stax2-api-3.1.4.jar $SPARK_HOME/jars/
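Exact jar file names vary across EMR releases, so the copies above can silently miss files. A defensive sketch using a hypothetical `copy_jar` helper (not part of Kylin) that only copies what actually exists and warns otherwise:

```shell
# Hypothetical helper: copy every file matching a glob, or warn when nothing matches.
copy_jar() {
  # $1: source glob (quoted by the caller), $2: destination directory
  for f in $1; do
    if [ -e "$f" ]; then
      cp "$f" "$2"
    else
      echo "WARN: no file matches $1; locate it with 'find /usr/lib -name <jar-name>'" >&2
    fi
  done
}

# Usage on the master node (mirrors Step 2 above):
#   rm -f "$SPARK_HOME"/jars/hadoop-*.jar
#   copy_jar '/usr/lib/spark/jars/hadoop-*.jar'          "$SPARK_HOME/jars/"
#   copy_jar '/usr/lib/spark/jars/emr-spark-goodies.jar' "$SPARK_HOME/jars/"
#   copy_jar '/usr/lib/hadoop-lzo/lib/hadoop-lzo-*.jar'  "$SPARK_HOME/jars/"
```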
Start Kylin
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# sh bin/kylin.sh start
Retrieving hadoop conf dir...
KYLIN_HOME is set to /home/hadoop/apache-kylin-4.0.0-SNAPSHOT-bin
Retrieving hive dependency...
Retrieving hadoop conf dir...
Retrieving Spark dependency...
Start replacing hadoop jars under /home/hadoop/apache-kylin-4.0.0-SNAPSHOT-bin/spark/jars.
find: '/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/../hadoop/': No such file or directory
2.10.0-amzn-0.jar
Find platform specific jars: , will replace with these jars under /home/hadoop/apache-kylin-4.0.0-SNAPSHOT-bin/spark/jars.
Please confirm that the corresponding hadoop jars have been replaced. The automatic replacement program cannot be executed correctly.
Done hadoop jars replacement under /home/hadoop/apache-kylin-4.0.0-SNAPSHOT-bin/spark/jars.
Start to check whether we need to migrate acl tables
Not HBase metadata. Skip check.
A new Kylin instance is started by root. To stop it, run 'kylin.sh stop'
Check the log at /home/hadoop/apache-kylin-4.0.0-SNAPSHOT-bin/logs/kylin.log
Web UI is at http://ip-172-31-1-253.cn-northwest-1.compute.internal:7070/kylin
Modify $KYLIN_HOME/hadoop_conf/hive-site.xml (after Kylin instance started)
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# ll hadoop_conf/
total 0
lrwxrwxrwx 1 root root 30 Jan  6 11:46 core-site.xml -> /etc/hadoop/conf/core-site.xml
lrwxrwxrwx 1 root root 30 Jan  6 11:46 hadoop-env.sh -> /etc/hadoop/conf/hadoop-env.sh
lrwxrwxrwx 1 root root 30 Jan  6 11:46 hdfs-site.xml -> /etc/hadoop/conf/hdfs-site.xml
lrwxrwxrwx 1 root root 28 Jan  6 11:46 hive-site.xml -> /etc/hive/conf/hive-site.xml
lrwxrwxrwx 1 root root 32 Jan  6 11:46 mapred-site.xml -> /etc/hadoop/conf/mapred-site.xml
lrwxrwxrwx 1 root root 31 Jan  6 11:46 ssl-client.xml -> /etc/hadoop/conf/ssl-client.xml
lrwxrwxrwx 1 root root 30 Jan  6 11:46 yarn-site.xml -> /etc/hadoop/conf/yarn-site.xml
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# rm hadoop_conf/hive-site.xml
rm: remove symbolic link 'hadoop_conf/hive-site.xml'? y
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# cp /etc/hive/conf/hive-site.xml hadoop_conf/

# Change the value of "hive.execution.engine" from "tez" to "mr"
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# vim hadoop_conf/hive-site.xml
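The `hive.execution.engine` change can also be made without an editor. A rough sed-based sketch (`switch_engine_to_mr` is a hypothetical helper); it assumes the `<value>` element sits on the line after the `<name>` element, as in the stock EMR hive-site.xml:

```shell
# Hypothetical helper: flip hive.execution.engine from "tez" to "mr" in place.
# Matches the <name> line, then rewrites the following <value> line only.
switch_engine_to_mr() {
  sed -i '/<name>hive.execution.engine<\/name>/{n;s|<value>tez</value>|<value>mr</value>|;}' "$1"
}

# Usage on the master node, after copying hive-site.xml into hadoop_conf/:
[ -f "${KYLIN_HOME:-.}/hadoop_conf/hive-site.xml" ] && switch_engine_to_mr "${KYLIN_HOME:-.}/hadoop_conf/hive-site.xml"
```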
Kylin on EMR 5.31 with the working directory set to S3
EmrFileSystem not found
If you configure "kylin.env.hdfs-working-dir=s3://XXX/kylin/", you will face a ClassNotFoundException.
21/01/06 13:07:15 INFO JobWorker: Start running job.
21/01/06 13:07:15 INFO SparkApplication: Executor task org.apache.kylin.engine.spark.job.ResourceDetectBeforeCubingJob with args : {"distMetaUrl":"kylin_default_instance@hdfs,path=s3://xiaoxiang-yu/kylin-workingdir/121212/kylin_default_instance/learn_kylin/job_tmp/6af48fa3-35a7-4007-bb54-81a2b3ae9c67-00/meta","submitter":"ADMIN","dataRangeEnd":"1356998400000","targetModel":"0928468a-9fab-4185-9a14-6f2e7c74823f","dataRangeStart":"1325376000000","project":"learn_kylin","className":"org.apache.kylin.engine.spark.job.ResourceDetectBeforeCubingJob","segmentName":"20120101000000_20130101000000","parentId":"6af48fa3-35a7-4007-bb54-81a2b3ae9c67","jobId":"6af48fa3-35a7-4007-bb54-81a2b3ae9c67","outputMetaUrl":"kylin_default_instance@jdbc,url=jdbc:mysql://ip-172-31-1-253.cn-northwest-1.compute.internal:3306/kylin,driverClassName=org.mariadb.jdbc.Driver,username=xxyu,password=newpassword","segmentId":"25dd0543-f7e2-813a-610c-dbd892ad4c99","cuboidsNum":"9","cubeName":"kylin_sales_cube_S3","jobType":"BUILD","cubeId":"6415d237-6671-284c-b7bc-12c02ef124eb","segmentIds":"25dd0543-f7e2-813a-610c-dbd892ad4c99"}
21/01/06 13:07:15 INFO MetaDumpUtil: Ready to load KylinConfig from uri: kylin_default_instance@hdfs,path=s3://xiaoxiang-yu/kylin-workingdir/121212/kylin_default_instance/learn_kylin/job_tmp/6af48fa3-35a7-4007-bb54-81a2b3ae9c67-00/meta
21/01/06 13:07:15 ERROR SparkApplication: The spark job execute failed!
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2395)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3256)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3288)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:120)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3339)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3307)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:473)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
    at org.apache.kylin.engine.spark.utils.MetaDumpUtil.loadKylinConfigFromHdfs(MetaDumpUtil.java:122)
    at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:225)
    at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:89)
    at org.apache.spark.application.JobWorker$$anon$2.run(JobWorker.scala:55)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2299)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393)
    ... 14 more
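The exception means no emrfs assembly jar is on the bundled Spark's classpath. A small pre-flight check sketch (`has_emrfs_jar` is a hypothetical helper) to detect this before submitting a build:

```shell
# Hypothetical helper: succeed only if an emrfs assembly jar exists in the given dir.
has_emrfs_jar() {
  ls "$1"/emrfs-hadoop-assembly-*.jar >/dev/null 2>&1
}

# Usage on the master node:
#   has_emrfs_jar "$SPARK_HOME/jars" || echo "emrfs assembly missing; copy it from /usr/share/aws/emr/emrfs/lib/"
```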
Copy emrfs-hadoop-assembly.jar
To fix this, copy the related jar from the EMR environment into Kylin's Spark jars directory.
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# cp /usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.43.0.jar $SPARK_HOME/jars/
Screenshots
- Monitor Pages
- Sparder (Query Engine) UI
Kylin on EMR 6.0.0
Every step is the same as on EMR 5.31 except for "Replace jars under $KYLIN_HOME/spark/jars".
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ hadoop version
Hadoop 3.2.1-amzn-0
Source code repository git@aws157git.com:/pkg/Aws157BigTop -r 702dcbb487699cf833043bee677ea99c0136673e
Compiled by ec2-user on 2020-02-19T04:10Z
Compiled with protoc 2.5.0
From source with checksum d467a0d98b48769d63fc56b247d9b7e1
This command was run using /usr/lib/hadoop/hadoop-common-3.2.1-amzn-0.jar
Replace jars under $KYLIN_HOME/spark/jars
// Step 0 : Add the JDBC driver (if you use mariadb)
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# find /usr/lib -name "*mariadb*"
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# mkdir ext
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# cp /usr/lib/hive/lib/mariadb-connector-java.jar ext

// Step 1 : Replace Hadoop-related jars
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ cp commons-configuration-1.10.jar lib/
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ rm -rf $SPARK_HOME/jars/hadoop-*.jar
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ cp /usr/lib/spark/jars/hadoop-*.jar $SPARK_HOME/jars/
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ cp /usr/lib/spark/jars/emr-spark-goodies.jar $SPARK_HOME/jars/
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ cp /usr/lib/spark/jars/htrace-core4-4.1.0-incubating.jar $SPARK_HOME/jars/
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ cp /usr/lib/hadoop/lib/woodstox-core-5.0.3.jar $SPARK_HOME/jars/
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ cp /usr/lib/spark/jars/commons-configuration2-2.1.1.jar $SPARK_HOME/jars/
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ cp /usr/lib/spark/jars/re2j-1.1.jar $SPARK_HOME/jars/
[root@ip-172-31-1-253 apache-kylin-4.0.0-SNAPSHOT-bin]# cp /usr/lib/hadoop/lib/stax2-api-3.1.4.jar $SPARK_HOME/jars/
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ cp /usr/lib/spark/jars/hive-exec-1.2.1-spark2-amzn-1.jar $SPARK_HOME/jars/
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ rm spark/jars/hive-exec-1.2.1.spark2.jar

// Step 2 : Change the value of "hive.execution.engine" from "tez" to "mr"
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ rm hadoop_conf/hive-site.xml
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ cp /etc/hive/conf/hive-site.xml hadoop_conf/
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ vim hadoop_conf/hive-site.xml

// Step 3 : Add lzo jars
[hadoop@ip-172-31-8-116 apache-kylin-4.0.0-SNAPSHOT-bin]$ cp /usr/lib/hadoop-lzo/lib/hadoop-lzo-0.4.19.jar $SPARK_HOME/jars
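After the EMR 6.0 replacement, two common mistakes are leaving old Hadoop 2.7 jars behind and ending up with two hive-exec jars on the classpath. A sanity-check sketch (`check_spark_jars` is a hypothetical helper):

```shell
# Hypothetical helper: verify the jar replacement left a consistent jars dir.
check_spark_jars() {
  jars_dir="$1"
  # No jars from the Hadoop 2.7 build of Spark should remain.
  if ls "$jars_dir"/hadoop-*2.7*.jar >/dev/null 2>&1; then
    echo "ERROR: stale Hadoop 2.7 jars remain in $jars_dir" >&2
    return 1
  fi
  # At most one hive-exec jar should be present (the EMR-provided one).
  hive_exec_count=$(ls "$jars_dir"/hive-exec-*.jar 2>/dev/null | wc -l)
  if [ "$hive_exec_count" -gt 1 ]; then
    echo "ERROR: multiple hive-exec jars in $jars_dir; remove the non-amzn one" >&2
    return 1
  fi
  return 0
}

# Usage on the master node:
#   check_spark_jars "$SPARK_HOME/jars" && echo "jar replacement looks consistent"
```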