
Configuring Hive

A number of configuration variables in Hive can be used by the administrator to change the behavior for their installations and user sessions. These variables can be configured in any of the following ways, shown in the order of preference:

  • Using the set command in the CLI to set session-level values for a configuration variable; the value applies to all statements subsequent to the set command. e.g.
      set hive.exec.scratchdir=/tmp/mydir;
    
    sets the scratch directory (which is used by Hive to store temporary output and plans) to /tmp/mydir for all subsequent statements.
  • Using the -hiveconf option on the CLI for the entire session. e.g.
      bin/hive -hiveconf hive.exec.scratchdir=/tmp/mydir
    
  • In hive-site.xml. This is used for setting values for the entire Hive configuration. e.g.
      <property>
        <name>hive.exec.scratchdir</name>
        <value>/tmp/mydir</value>
        <description>Scratch space for Hive jobs</description>
      </property>
    
  • hive-default.xml - This configuration file contains the default values for the various configuration variables that come prepackaged in a Hive distribution. These should not be changed by the administrator. In order to override any of the values, create hive-site.xml instead and set the value in that file as shown above.

hive-default.xml is located in the conf directory in your installation root. hive-site.xml should also be created in the same directory.
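
To check what a variable actually resolves to once all of these sources are combined, the set command can also be used read-only from the CLI. A minimal sketch (the variable name is just an example; exact output format varies by version):

    -- print the current value of a single variable
    set hive.exec.scratchdir;
    -- print all variables that have been overridden by the user or by Hive
    set;
    -- print all variables, including the Hadoop ones
    set -v;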

Broadly, the configuration variables are categorized into:

Hive Configuration Variables

| Variable Name | Description | Default Value |
| --- | --- | --- |
| hive.exec.script.wrapper | Wrapper around any invocations of the script operator, e.g. if this is set to python, the script passed to the script operator will be invoked as python <script command>. If the value is null or not set, the script is invoked as <script command>. | null |
| hive.exec.plan |   | null |
| hive.exec.scratchdir | This directory is used by Hive to store the plans for the different map/reduce stages of the query, as well as the intermediate outputs of these stages. | /tmp/<user.name>/hive |
| hive.querylog.location | Directory where structured Hive query logs are created. One file per session is created in this directory. If this variable is set to an empty string, structured logs will not be created. | /tmp/<user.name> |
| hive.exec.submitviachild | Determines whether map/reduce jobs should be submitted through a separate JVM in non-local mode. | false (by default, jobs are submitted through the same JVM as the compiler) |
| hive.exec.script.maxerrsize | Maximum number of serialization errors allowed in a user script invoked through the TRANSFORM, MAP, or REDUCE constructs. | 100000 |
| hive.exec.compress.output | Determines whether the output of the final map/reduce job in a query is compressed or not. | false |
| hive.exec.compress.intermediate | Determines whether the output of the intermediate map/reduce jobs in a query is compressed or not. | false |
| hive.jar.path | The location of hive_cli.jar that is used when submitting jobs in a separate JVM. |   |
| hive.aux.jars.path | The location of the plugin jars that contain implementations of user-defined functions and SerDes. |   |
| hive.partition.pruning | A strict value for this variable indicates that an error is thrown by the compiler if no partition predicate is provided on a partitioned table. This is used to protect against a user inadvertently issuing a query against all the partitions of the table. | nonstrict |
| hive.map.aggr | Determines whether map-side aggregation is on or not. | true |
| hive.join.emit.interval |   | 1000 |
| hive.map.aggr.hash.percentmemory |   | (float)0.5 |
| hive.default.fileformat | Default file format for CREATE TABLE statements. Options are TextFile, SequenceFile, and RCFile. | TextFile |
| hive.merge.mapfiles | Merge small files at the end of a map-only job. | true |
| hive.merge.mapredfiles | Merge small files at the end of a map-reduce job. | false |
| hive.merge.size.per.task | Size of merged files at the end of the job. | 256000000 |
| hive.merge.smallfiles.avgsize | When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true. | 16000000 |
| hive.enforce.bucketing | If enabled, enforces that inserts into bucketed tables are also bucketed. | false |
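
As an example of putting several of these variables together, the following session-level settings would compress job outputs and fold small result files into larger ones. The values shown are illustrative, not recommendations:

    -- compress final and intermediate map/reduce outputs
    set hive.exec.compress.output=true;
    set hive.exec.compress.intermediate=true;
    -- also merge small files after map-reduce jobs, targeting ~256 MB files
    set hive.merge.mapredfiles=true;
    set hive.merge.size.per.task=256000000;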

Hive MetaStore Configuration Variables

| Variable Name | Description | Default Value |
| --- | --- | --- |
| hive.metastore.metadb.dir |   |   |
| hive.metastore.warehouse.dir | Location of the default database for the warehouse. |   |
| hive.metastore.uris |   |   |
| hive.metastore.usefilestore |   |   |
| hive.metastore.rawstore.impl |   |   |
| hive.metastore.local |   |   |
| javax.jdo.option.ConnectionURL | JDBC connect string for a JDBC metastore. |   |
| javax.jdo.option.ConnectionDriverName | Driver class name for a JDBC metastore. |   |
| javax.jdo.option.ConnectionUserName |   |   |
| javax.jdo.option.ConnectionPassword |   |   |
| org.jpox.autoCreateSchema | Creates the necessary schema (tables, columns, etc.) on startup if one doesn't exist. Set to false after creating it once. |   |
| org.jpox.fixedDatastore | Whether the datastore schema is fixed. |   |
| hive.metastore.checkForDefaultDb |   |   |
| hive.metastore.ds.connection.url.hook | Name of the hook to use for retrieving the JDO connection URL. If empty, the value in javax.jdo.option.ConnectionURL is used as the connection URL. |   |
| hive.metastore.ds.retry.attempts | The number of times to retry a call to the backing datastore if there is a connection error. | 1 |
| hive.metastore.ds.retry.interval | The number of milliseconds between datastore retry attempts. | 1000 |
| hive.metastore.server.min.threads | Minimum number of worker threads in the Thrift server's pool. | 200 |
| hive.metastore.server.max.threads | Maximum number of worker threads in the Thrift server's pool. | 10000 |
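
To illustrate how the javax.jdo.* variables above fit together, the following hive-site.xml fragment points the metastore at an embedded Derby database. The URL and driver class shown are one common combination; replace them with those of whatever JDBC-compliant database you actually use:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.apache.derby.jdbc.EmbeddedDriver</value>
      <description>Driver class name for a JDBC metastore</description>
    </property>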

Hive Configuration Variables used to interact with Hadoop

| Variable Name | Description | Default Value |
| --- | --- | --- |
| hadoop.bin.path | The location of the hadoop script, which is used to submit jobs to Hadoop when submitting through a separate JVM. | $HADOOP_HOME/bin/hadoop |
| hadoop.config.dir | The location of the configuration directory of the Hadoop installation. | $HADOOP_HOME/conf |
| fs.default.name |   | file:/// |
| map.input.file |   | null |
| mapred.job.tracker | The URL of the jobtracker. If this is set to local, map/reduce is run in local mode. | local |
| mapred.reduce.tasks | The number of reducers for each map/reduce stage in the query plan. | 1 |
| mapred.job.name | The name of the map/reduce job. | null |
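
Since these Hadoop variables can be overridden from within Hive just like Hive's own, a common use is to run a small test query entirely in local mode, or to raise the reducer count for a single session, e.g. (values illustrative):

    -- run map/reduce locally instead of submitting to the jobtracker
    set mapred.job.tracker=local;
    -- use 4 reducers for subsequent queries in this session
    set mapred.reduce.tasks=4;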

Hive Variables used to pass run-time information

| Variable Name | Description | Default Value |
| --- | --- | --- |
| hive.session.id | The id of the Hive session. |   |
| hive.query.string | The query string passed to the map/reduce job. |   |
| hive.query.planid | The id of the plan for the map/reduce stage. |   |
| hive.jobname.length | The maximum length of the job name. | 50 |
| hive.table.name | The name of the Hive table. This is passed to the user scripts through the script operator. |   |
| hive.partition.name | The name of the Hive partition. This is passed to the user scripts through the script operator. |   |
| hive.alias | The alias being processed. This is also passed to the user scripts through the script operator. |   |
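
The last three variables are visible to user scripts run through the script operator. As a sketch only, and assuming Hive exports them into the script's environment with dots replaced by underscores (the sanitization Hadoop streaming applies to configuration names; verify this against your Hive version), a transform script might read them like this:

    #!/bin/sh
    # hypothetical transform script; hive_table_name and hive_alias assume
    # that dots in the variable names become underscores in the environment
    while read row; do
      printf '%s\t%s\t%s\n' "$hive_table_name" "$hive_alias" "$row"
    done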

Temporary Folders

Hive uses temporary folders both on the machine running the Hive client and on the default HDFS instance. These folders are used to store per-query temporary/intermediate data sets and are normally cleaned up by the Hive client when the query is finished. However, in cases of abnormal Hive client termination, some data may be left behind. The configuration details are as follows:

  • On the HDFS cluster this is set to /tmp/hive-<username> by default and is controlled by the configuration variable hive.exec.scratchdir
  • On the client machine, this is hardcoded to /tmp/<username>

Note that when writing data to a table/partition, Hive will first write to a temporary location on the target table's filesystem (using hive.exec.scratchdir as the temporary location) and then move the data to the target table. This applies in all cases - whether tables are stored in HDFS (normal case) or in file systems like S3 or even NFS.
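
If a client dies before cleanup happens, the leftover data can be found and, after verifying nothing is using it, removed by hand. A minimal sketch, with myuser standing in for the actual user name:

    # leftover per-query data on HDFS (myuser is a placeholder)
    hadoop fs -ls /tmp/hive-myuser
    # leftover data on the client machine
    ls /tmp/myuser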

Log Files

The Hive client produces logs and history files on the client machine. Please see Error Logs for configuration details.
