Apache Kylin : Analytical Data Warehouse for Big Data


Welcome to Kylin Wiki.

Kylin 4.0 needs to build cube data before it can serve queries, so if build jobs and query jobs run in the same cluster, the service may become unstable because of resource contention.

Kylin 4.0 now supports running build jobs and query jobs on two different Hadoop clusters, which we call the build cluster and the query cluster. Build jobs are sent to the build cluster to build cube data, and the cube data is then written directly to HDFS on the query cluster, so that queries read cube data from the query cluster.

With a read-write separation deployment, we can completely isolate build and query workloads.

Read-Write Separation Architecture

Notes:

  1. Kylin 4.0 uses the Yarn resources of the build cluster to build cube data and then writes the cube data directly to HDFS on the query cluster.
  2. Build jobs read the Hive data sources on the build cluster.
  3. When executing a pushdown query, Kylin 4.0 reads Hive data from the build cluster.

Prepare

  1. Make sure the Hadoop version (HDP or CDH) of the build cluster and the query cluster is supported by Kylin.
  2. The Hadoop versions of the build cluster and the query cluster must be the same.
  3. Check that commands like hdfs and hive work properly and can access cluster resources (see the sketch after this list).
  4. If both clusters have HDFS NameNode HA enabled, check that their HDFS nameservice names are different. If they are the same, change one of them to avoid a conflict.
  5. Make sure the network latency between the two clusters is low enough, as a large amount of data is moved back and forth during the model build process.
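
A minimal way to run the checks in steps 3 and 4 from the Kylin server shell is sketched below; it assumes the Hadoop client on that machine is configured for the cluster being checked, and the commands should be repeated for each cluster:

    # step 3: verify that hdfs and hive commands work and can reach the cluster
    hdfs dfs -ls /
    hive -e "show databases;"

    # step 4: print the HDFS nameservice name (only relevant when NameNode HA is enabled);
    # the value reported by the build cluster and the query cluster must differ
    hdfs getconf -confKey dfs.nameservices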

Configuration

  1. Install Kylin 4.0 on the Kylin server by following the installation guide.
  2. Create a directory called 'build_hadoop_conf' in $KYLIN_HOME and copy the Hadoop configuration files of the build cluster into it (Note: make sure to copy the real configuration files, not symbolic links).
  3. Set the value of 'kylin.engine.submit-hadoop-conf-dir' in '$KYLIN_HOME/conf/kylin.properties' to the directory created in step 2.
  4. Copy the hive-site.xml of the build cluster into the Hive configuration directory on the query cluster, for example /etc/hive/conf.
  5. Set 'kylin.engine.spark-conf.spark.yarn.queue' in '$KYLIN_HOME/conf/kylin.properties' to a Yarn queue of the build cluster (see the example after this list).
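
The sketch below puts steps 2, 3 and 5 together; the source path of the build cluster's configuration files, the Kylin installation path and the queue name 'default' are placeholders, not required values:

    # step 2: create the directory and copy the build cluster's Hadoop configuration;
    # cp -L dereferences symbolic links so the real files are copied
    mkdir -p $KYLIN_HOME/build_hadoop_conf
    cp -L /path/to/build-cluster-hadoop-conf/* $KYLIN_HOME/build_hadoop_conf/

    # steps 3 and 5: settings in $KYLIN_HOME/conf/kylin.properties
    kylin.engine.submit-hadoop-conf-dir=/path/to/kylin/build_hadoop_conf
    kylin.engine.spark-conf.spark.yarn.queue=default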

Note

  • $KYLIN_HOME/bin/sample.sh is not supported in this deployment mode.
