Apache Kylin : Analytical Data Warehouse for Big Data


Part-I Why we need a read-write separation deployment


Background

As you may know, Kylin 4.0 uses Spark to both build and query cubes. If the build (Spark) jobs and the query (Spark) jobs run in the same Hadoop cluster, build and query performance will suffer because the two workloads compete for resources.

Now, Kylin 4.0 supports running build jobs and query jobs on two separate Hadoop clusters. Build (Spark) jobs are sent to the build cluster to build cube data, and the resulting cube/cuboid data is written directly to the HDFS of the query cluster, so that the query-related workload is moved to the query cluster.

With a read-write separation deployment, we can separate the build and query computation workloads (mostly Yarn resources). In the current implementation, the HDFS of the query cluster is not exclusive to the query cluster, because the build cluster reads data from the query cluster's HDFS when merging segments.

Architecture of Read-Write Separation

Note

  1. Kylin 4.0 uses the Yarn resources of the build cluster to build cube data, and then writes the cube data directly to the HDFS of the query cluster.
  2. Build jobs read the Hive data sources, which are on the build cluster.
  3. When executing a pushdown query, Kylin 4.0 reads the Hive data from the build cluster.


Part-II How to set up a read-write separation deployment


Preparation

  1. Make sure the Hadoop version (HDP or CDH) of the build cluster and the query cluster is supported by Kylin.
  2. The Hadoop versions of the build cluster and the query cluster must be the same.
  3. Check that commands such as hdfs and hive work properly and can access cluster resources (items 3 and 4 can be verified with the sketch after this list).
  4. If both clusters have HDFS NameNode HA enabled, make sure their HDFS nameservice names are different. If they are the same, change one of them to avoid a conflict.
  5. Make sure the network latency between the two clusters is low enough, as a large amount of data is moved back and forth during the model build process.
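
A quick way to sanity-check items 3 and 4 from the node where Kylin will run is sketched below; the two configuration directory paths are placeholders for wherever each cluster's Hadoop client configuration lives:

    # Verify that the hdfs and hive clients respond and can reach cluster resources
    hdfs dfs -ls /
    hive -e "show databases;"

    # Print the HDFS nameservice name configured for each cluster; the two values must differ
    hdfs --config /path/to/build-cluster-conf getconf -confKey dfs.nameservices
    hdfs --config /path/to/query-cluster-conf getconf -confKey dfs.nameservices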

How to

  1. Install Kylin 4.0, following the installation guide, on a node that contains the Hadoop client configuration of the query cluster.
  2. Create a directory named build_hadoop_conf under $KYLIN_HOME and copy the Hadoop configuration files of the build cluster into it (note: copy the real configuration files, not the symbolic links). A combined sketch of steps 2 to 5 follows this list.
  3. Set 'kylin.engine.submit-hadoop-conf-dir' in '$KYLIN_HOME/conf/kylin.properties' to the directory created in step 2.
  4. Copy hive-site.xml from the build cluster into the right directory on the query cluster (Hive configuration is typically located at /etc/hive/conf), so that queries can load Hive metadata from the build cluster properly.
  5. Set 'kylin.engine.spark-conf.spark.yarn.queue' in '$KYLIN_HOME/conf/kylin.properties' to a Yarn queue of the build cluster.
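
Steps 2 to 5 boil down to a few shell commands and kylin.properties entries. The sketch below is illustrative only: /etc/hadoop/conf.build, build-cluster-node, and the queue name build_queue are assumed placeholders, not names defined by Kylin.

    # Step 2: collect the build cluster's Hadoop client configuration.
    # cp -rL dereferences symbolic links, so the real files are copied, not the links.
    mkdir -p $KYLIN_HOME/build_hadoop_conf
    cp -rL /etc/hadoop/conf.build/* $KYLIN_HOME/build_hadoop_conf/

    # Step 4: make the build cluster's Hive metadata visible on the query cluster side.
    scp build-cluster-node:/etc/hive/conf/hive-site.xml /etc/hive/conf/

Then add the following to $KYLIN_HOME/conf/kylin.properties, using the absolute path of the directory created in step 2:

    # Step 3: Hadoop configuration directory that build jobs are submitted with
    kylin.engine.submit-hadoop-conf-dir=/usr/local/kylin/build_hadoop_conf

    # Step 5: Yarn queue on the build cluster to which build jobs are submitted
    kylin.engine.spark-conf.spark.yarn.queue=build_queue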


Part-III FAQ


  • $KYLIN_HOME/bin/sample.sh is not supported in this deployment mode.