...

  1. Create $SPARK_HOME/carbonlib.
  2. Copy ~/cwiki/carbondata-parent-2.1.0/assembly/target/scala-2.11/apache-carbondata-2.1.0-bin-spark2.4.5-hadoop2.7.2.jar into the $SPARK_HOME/carbonlib directory.
  3. Copy carbon.properties.template from the /home/chen/cwiki/carbondata-parent-2.1.0/conf directory into $SPARK_HOME/conf/ and rename it to carbon.properties.
  4. Compress the carbonlib folder into carbondata.tar.gz (the name referenced by spark.yarn.dist.archives below) and move the archive into the carbonlib folder.
  5. Configure the properties in the $SPARK_HOME/conf/spark-defaults.conf file.

    Note: use your own values for spark.master and spark.eventLog.dir.

    Code Block
    languagepowershell
    themeDJango
    titleshell
    # Example:
    spark.master spark://ubuntu:7077
    spark.yarn.dist.files /usr/local/spark/conf/carbon.properties
    spark.yarn.dist.archives /usr/local/spark/carbonlib/carbondata.tar.gz
    spark.executor.extraJavaOptions -Dcarbon.properties.filepath=carbon.properties
    spark.executor.extraClassPath carbondata.tar.gz/carbonlib/*
    spark.driver.extraClassPath /usr/local/spark/carbonlib/*
    spark.driver.extraJavaOptions -Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
    # spark.eventLog.enabled true
    spark.eventLog.enabled true
    spark.eventLog.dir hdfs://localhost:9000/directory
    # spark.serializer org.apache.spark.serializer.KryoSerializer
    # spark.driver.memory 5g
    # spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
    spark.yarn.historyServer.address=localhost:7777
    spark.history.ui.port=7777
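Steps 1–4 above can be sketched as shell commands. The real paths would be SPARK_HOME=/usr/local/spark and the CarbonData build tree from the listing; the sketch below stubs both under temporary directories so it can be tried without touching a real Spark installation.

```shell
# Stand-ins so the sketch is self-contained; replace with your real directories.
SPARK_HOME=$(mktemp -d)     # stands in for /usr/local/spark
BUILD=$(mktemp -d)          # stands in for ~/cwiki/carbondata-parent-2.1.0
mkdir -p "$BUILD/assembly/target/scala-2.11" "$BUILD/conf" "$SPARK_HOME/conf"
touch "$BUILD/assembly/target/scala-2.11/apache-carbondata-2.1.0-bin-spark2.4.5-hadoop2.7.2.jar"
touch "$BUILD/conf/carbon.properties.template"

mkdir -p "$SPARK_HOME/carbonlib"                                                    # step 1
cp "$BUILD"/assembly/target/scala-2.11/apache-carbondata-*.jar "$SPARK_HOME/carbonlib/"  # step 2
cp "$BUILD/conf/carbon.properties.template" "$SPARK_HOME/conf/carbon.properties"    # step 3
cd "$SPARK_HOME"
tar -zcf carbondata.tar.gz carbonlib/        # step 4: archive the carbonlib folder...
mv carbondata.tar.gz carbonlib/              # ...and move the archive back inside it
```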

  6. Verify the installation.

    Code Block
    languagepowershell
    themeDJango
    titleshell
    ./bin/spark-shell \
    --master yarn-client \
    --driver-memory 1G \
    --executor-memory 2G \
    --executor-cores 2

    Note: once Spark's bin directory is added to PATH, commands such as spark-shell can be run from any directory.
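For example, assuming Spark is installed at /usr/local/spark (adjust to your layout), the PATH entry mentioned in the note would typically be added to ~/.bashrc:

```shell
# Assumed install location; adjust SPARK_HOME to your own installation.
export SPARK_HOME=/usr/local/spark
export PATH="$PATH:$SPARK_HOME/bin"
echo "$PATH"   # spark-shell, spark-submit, etc. are now reachable from any directory
```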


Using CarbonData from the Spark shell


Code Block
languagepowershell
themeDJango
titleshell
# 1. Use the jar you built and packaged yourself
spark-shell --conf spark.sql.extensions=org.apache.spark.sql.CarbonExtensions --jars /usr/local/carbondata-parent-2.1.0/assembly/target/scala-2.11/apache-carbondata-2.1.0-bin-spark2.4.5-hadoop2.7.2.jar
# 2. Use the official release jar
spark-shell --conf spark.sql.extensions=org.apache.spark.sql.CarbonExtensions --jars /usr/local/spark/carbonlib/apache-carbondata-2.1.0-bin-spark2.4.5-hadoop2.7.2.jar
# Pick either one of the two

...

Code Block
languagescala
themeDJango
titleshell
// Use the built-in SparkSession object `spark`
spark.sql(
s"""
| CREATE TABLE IF NOT EXISTS cwikitest_table(
| id string,
| name string,
| city string,
| age Int)
| STORED AS carbondata
""".stripMargin)

// Alternative: USING creates a Spark datasource table (with IF NOT EXISTS this is
// a no-op if the STORED AS table above was already created)
spark.sql(
s"""
| CREATE TABLE IF NOT EXISTS cwikitest_table(
| id string,
| name string,
| city string,
| age Int)
| USING carbondata
""".stripMargin)

Note 1: in Spark, USING creates a Spark datasource table, while STORED AS creates a Hive-format table.

Note 2: likewise, a spark-warehouse directory is created under the directory where spark-shell was started, and inside it a subdirectory with the same name as the cwikitest_table table.

...

Code Block
languagescala
themeDJango
titlespark shell
// Load data from a local file
spark.sql("LOAD DATA INPATH 'file:///home/chen/carbondata/sample.csv' INTO TABLE cwikitest_table")
// Load data from an HDFS file
spark.sql("LOAD DATA INPATH 'hdfs://localhost:9000/user/chen/carbondata/sample.csv' INTO TABLE cwikitest_table")
// Paths default to HDFS
spark.sql("LOAD DATA INPATH '/user/chen/carbondata/sample.csv' INTO TABLE cwikitest_table")

Note: since the default path is HDFS, prefix local paths with file:// when loading local files. Also, when loading a local file on a cluster, make sure the file exists at the same local path on every node.
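The contents of sample.csv are not shown on this page; a hypothetical file matching the cwikitest_table columns (id, name, city, age) would look like the following. (When a CSV has no header row, CarbonData's LOAD DATA supports a FILEHEADER option to name the columns instead.)

```shell
# Hypothetical sample.csv with a header row matching the cwikitest_table schema.
cat > /tmp/sample.csv <<'EOF'
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
EOF
cat /tmp/sample.csv
```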

Run queries

Code Block
languagescala
themeDJango
titlespark shell
spark.sql("SELECT * FROM cwikitest_table").show()

spark.sql(
           s"""
              | SELECT city, avg(age), sum(age)
              | FROM cwikitest_table
              | GROUP BY city
           """.stripMargin).show()
