Comment: add some docs about opening PRs

...

First of all, you need the Hive source code. As of April 2015 Hive has moved to git for its repository.

Get the source code on your local drive using git. See Understanding Hive Branches below to understand which branch you should be using.

Code Block
git clone https://github.com/apache/hive

Setting Up Eclipse Development Environment (Optional)

...

Understanding Hive Branches

As of June 2015, Hive has a few "main lines", master and branch-X.

All new feature work and bug fixes in Hive are contributed to the master branch. Releases are done from branch-X. Major versions such as 2.x are not necessarily backwards compatible with 1.x versions. branch-1 is used to build stable, backward compatible releases. Releases from this branch are numbered 1.x (where 1.3 will be the first release from it, as 1.2 was released from master prior to the creation of branch-1). Until at least June 2016 all critical bug fixes (crashes, wrong results, security issues) applied to master must also be applied to branch-1. The decision to port a feature from master to branch-1 is at the discretion of the contributor and committer. However, no features that break backwards compatibility will be accepted on branch-1.

In addition to these main lines Hive has two types of branches, release branches and feature branches.

Release branches are made from branch-1 (for 1.x) or master (for 2.x) when the community is preparing a Hive release. Release branches match the number of the release (e.g., branch-1.2 for Hive 1.2). For patch releases the branch is made from the existing release branch (to avoid picking up new features from the main line). For example, if a 1.2.1 release were being made, branch-1.2.1 would be made from the tip of branch-1.2. Once a release branch has been made, inclusion of additional patches on that branch is at the discretion of the release manager. After a release has been made from a branch, additional bug fixes can still be applied to that branch in anticipation of the next patch release. Any bug fix applied to a release branch must first be applied to master (and branch-1 if applicable).
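
The release-branch rules in this paragraph can be sketched as a small shell helper. This is only an illustration: release_branch_cmd is a hypothetical name, not part of Hive's tooling, and it prints the git branch command it would suggest instead of running it. Versions outside 1.x and 2.x are rejected because the text above does not cover them.

```shell
# Hypothetical helper (not part of Hive): print the git command that would
# create the release branch for a given version, following the rules above.
release_branch_cmd() {
  version=$1
  case $version in
    *.*.*) base="branch-${version%.*}" ;;  # patch release: tip of the existing release branch
    1.*)   base="branch-1" ;;              # 1.x releases are cut from branch-1
    2.*)   base="master" ;;                # 2.x releases are cut from master
    *)     echo "no rule for version $version" >&2; return 1 ;;
  esac
  echo "git branch branch-$version $base"
}

release_branch_cmd 1.2.1   # patch release
release_branch_cmd 1.3     # new 1.x release
release_branch_cmd 2.0     # new 2.x release
```

The patch-release case is checked first so that a version such as 1.2.1 is cut from branch-1.2 and never from the main line.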

Feature branches are used to develop new features without destabilizing the rest of Hive. The intent of a feature branch is that it will be merged back into master once the feature has stabilized.

For general information about Hive branches, see Hive Versions and Branches.

Hadoop Dependencies

Hadoop dependencies are handled differently in master and branch-1.

branch-1

In branch-1 both Hadoop 1.x and 2.x are supported. The Hive build downloads a number of different Hadoop versions via Maven in order to compile "shims" which allow for compatibility with these Hadoop versions. However, the rest of Hive is only built and tested against a single Hadoop version.

The Maven build has two profiles, hadoop-1 for Hadoop 1.x and hadoop-2 for Hadoop 2.x. When building, you must specify which profile you wish to use via Maven's -P command line option (see How to build all source).

...

branch-2

Hadoop 1.x is no longer supported in Hive's master branch. There is no need to specify a profile for most Maven commands, as Hadoop 2.x will always be chosen.

Info
titleHadoop Version Information

On this page we assume you are building from the master branch and do not include the profile in the example Maven commands. If you are building on branch-1 you will need to select the appropriate profile for the version of Hadoop you are building against.
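
As a rough sketch of the difference between the two lines, the helper below prints the build command with or without a profile. The name build_cmd is hypothetical, and treating every branch whose name starts with branch-1 (for example branch-1.2) as needing a profile is an assumption, not something this page states.

```shell
# Hypothetical helper (not part of the Hive build): print the Maven build
# command for a branch, adding a Hadoop profile only where one is required.
build_cmd() {
  branch=$1
  hadoop_major=$2   # 1 or 2; only used on branch-1
  cmd="mvn clean install -DskipTests"
  case $branch in
    branch-1*) cmd="$cmd -Phadoop-$hadoop_major" ;;  # hadoop-1 or hadoop-2 profile
  esac
  echo "$cmd"
}

build_cmd branch-1 2   # profile selected explicitly
build_cmd master 2     # no profile: Hadoop 2.x is always chosen
```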

Unit Tests

Please make sure that all unit tests succeed before and after applying your patch and that no new javac compiler warnings are introduced by your patch. Also see the information in the previous section about testing with different Hadoop versions if you want to verify compatibility with something other than the default Hadoop version.

When submitting a patch, it is highly recommended that you locally run the tests you believe will be impacted, in addition to any new tests. The full test suite can be executed by Hive PreCommit Patch Testing. Hive Developer FAQ describes how to execute a specific set of tests.

Code Block
> cd hive-trunk
> mvn clean install -DskipTests
> mvn test -Dtest=SomeTest

After a while, if you see

Code Block
[INFO] BUILD SUCCESS

all is ok, but if you see

Code Block
[INFO] BUILD FAILURE

then you should fix things before proceeding.

Unit tests take a long time (several hours) to run sequentially even on a very fast machine; for information on how to run them in parallel, see Hive PreCommit Patch Testing.

Add a Unit Test

There are two kinds of unit tests that can be added: those that test an entire component of Hive, and those that run a query to test a feature.

...

  • Add a new class (name must start with Test) in the component's */src/test/java directory.
  • To test only the new testcase, run mvn test -Dtest=TestAbc (where TestAbc is the name of the new class), which will be faster than mvn test which tests all testcases.

...

  • Add a new XXXXXX.q file in ql/src/test/queries/clientpositive. (Optionally, add a new XXXXXX.q file for a query that is expected to fail in ql/src/test/queries/clientnegative.)
  • Run mvn test -Dtest=TestCliDriver -Dqfile=XXXXXX.q -Dtest.output.overwrite=true. This will generate a new XXXXXX.q.out file in ql/src/test/results/clientpositive.
    • If you want to run multiple .q files in the test run, you can specify comma separated .q files, for example -Dqfile="X1.q,X2.q". You can also specify a Java regex, for example -Dqfile_regex='join.*'. (Note that it takes Java regex, i.e., 'join.*' and not 'join*'.) The regex match first removes the .q from the file name before matching regex, so specifying join*.q will not work.
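
The two ways of selecting query files can be illustrated with a short shell sketch. The helper names qfile_arg and matches_qfile_regex are hypothetical. The second helper only approximates the behaviour described above: it uses grep -E in place of Java's regex engine and assumes the whole file name (minus .q) must match.

```shell
# Join several .q files into the comma-separated form expected by -Dqfile.
qfile_arg() {
  ( IFS=,; printf '%s%s\n' '-Dqfile=' "$*" )
}
# Print (not run) the resulting Maven command.
echo "mvn test -Dtest=TestCliDriver $(qfile_arg X1.q X2.q)"

# Approximate -Dqfile_regex: strip the .q suffix, then match the whole name.
# grep -E stands in for Java's regex engine here (an assumption).
matches_qfile_regex() {
  regex=$1
  file=$2
  printf '%s\n' "${file%.q}" | grep -Eq "^($regex)\$"
}

for f in join1.q join_filters.q union3.q; do
  if matches_qfile_regex 'join.*' "$f"; then echo "selected: $f"; fi
done
```

With this approximation, 'join*.q' does not select join1.q, because the suffix is already gone by the time the pattern is applied.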

...

If the feature is added in contrib:

  • Do the steps above, replacing ql with contrib, and TestCliDriver with TestContribCliDriver. If you are using Hive 0.11.0 or later, you can specify -Dmodule=contrib.

See the FAQ "How do I add a test case?" for more details.

...

  • -Dqfile=XXXXXX.q - Runs one or more specific query file tests. For the exact format, see the Query Unit Test paragraph. If not provided, only the query files from the ql/src/test/queries/clientpositive directory that are listed in the beeline.positive.include parameter of itests/src/test/resources/testconfiguration.properties will be run.

  • -Dtest.output.overwrite=true - Rewrites the q.out files in ql/src/test/results/clientpositive/beeline. The default value is false, in which case the current output is checked against the golden files.

  • -Dtest.beeline.compare.portable - If this parameter is true, the generated and the golden query output files will be filtered before comparing them. This way the existing query tests can be run against different configurations using the same golden output files. The result of the following commands will be filtered out from the output files: EXPLAIN, DESCRIBE, DESCRIBE EXTENDED, DESCRIBE FORMATTED, SHOW TABLES, SHOW FORMATTED INDEXES and SHOW DATABASES.
    The default value is false.
  • -Djunit.parallel.threads=1 - The number of parallel threads running the tests. The default is 1. There was some flakiness caused by parallelization.
  • -Djunit.parallel.timeout=10 - The tests are terminated after the given timeout. The parameter is set in minutes and the default is 10 minutes. (As of Hive 3.0.0.)
  • The BeeLine tests can run against an existing cluster or, if none is provided, against a MiniHS2 cluster created during the tests.
    • -Dtest.beeline.url - The JDBC URL which should be used to connect to the existing cluster. If not set then a MiniHS2 cluster will be created instead.
    • -Dtest.beeline.user - The user which should be used to connect to the cluster. If not set "user" will be used.
    • -Dtest.beeline.password - The password which should be used to connect to the cluster. If not set "password" will be used.
    • -Dtest.data.dir - The test data directory on the cluster. If not set <HIVEROOT>/data/files will be used.
    • -Dtest.results.dir - The test results directory to compare against. If not set the default configuration will be used.
    • -Dtest.init.script - The test init script. If not set the default configuration will be used.
    • -Dtest.beeline.shared.database - If true, then the default database will be used, otherwise a test-specific database will be created for every run. The default value is false.
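
As a minimal sketch of how the cluster-related parameters fit together, the helper below prints the flags for an existing cluster. The name beeline_cluster_flags is hypothetical, and the JDBC URL, user and password in the example are placeholders. When no URL is given it prints nothing, which corresponds to letting the tests create a MiniHS2 cluster.

```shell
# Hypothetical helper: print the -D flags that point the BeeLine tests at an
# existing cluster. The fallbacks mirror the documented defaults
# ("user" / "password"); an empty URL prints nothing (MiniHS2 is used).
beeline_cluster_flags() {
  url=$1
  user=${2:-user}
  password=${3:-password}
  [ -n "$url" ] || return 0
  echo "-Dtest.beeline.url=$url -Dtest.beeline.user=$user -Dtest.beeline.password=$password"
}

# Placeholder connection details, for illustration only.
beeline_cluster_flags 'jdbc:hive2://localhost:10000' hive secret
```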

Debugging

Please see Debugging Hive code in Development Guide.

Submitting a PR

There are many excellent how-tos on submitting pull requests to GitHub projects. The following is one way to approach it:

Set up a repo with two remotes. I would recommend using your GitHub username as the remote name, as it may make things easier if you need to add someone else's repo as well.

Code Block
# clone the apache/hive repo from github
git clone --origin apache https://github.com/apache/hive
cd hive
# add your own fork as a remote
git remote add GITHUB_USER git@github.com:GITHUB_USER/hive

You will need a separate branch to make your changes; you need to do this for every PR.

Code Block
# fetch all changes - so you will create your feature branch on top of the current master
git fetch --all
# create a feature branch; the name can be anything, but including the ticket id may help identify the branch later
git branch HIVE-9999-something apache/master
git checkout HIVE-9999-something
# push your feature branch to your github fork - and set that branch as upstream to this branch
git push GITHUB_USER -u HEAD

Make your change

Code Block
# make your changes; you should include the ticketid + message in the first commit message
git commit -m 'HIVE-9999: Something' -a
# a simple push will deliver your changes to the github branch
git push
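
Before pushing, the commit-message convention used in the commands above can be checked with a small helper. This is a sketch only: has_ticket_prefix is a hypothetical name, and the exact pattern (HIVE-, digits, a colon, a space, then text) is inferred from the example message, not from a stated rule.

```shell
# Hypothetical check (not part of Hive's tooling): does a commit message start
# with a ticket id, as in 'HIVE-9999: Something'?
has_ticket_prefix() {
  printf '%s\n' "$1" | grep -Eq '^HIVE-[0-9]+: .+'
}

for msg in 'HIVE-9999: Something' 'Something without a ticket'; do
  if has_ticket_prefix "$msg"; then
    echo "ok:      $msg"
  else
    echo "missing: $msg"
  fi
done
```

Inside a clone, the subject of the latest commit could be passed in with has_ticket_prefix "$(git log -1 --format=%s)".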

If you think your changes are ready to be tested and reviewed, you can open a PR on the https://github.com/apache/hive page.

If you need to make changes, just push further commits to the branch. Keep in mind that any new commit will trigger a new test run, and the time needed to execute the tests is measured in hours.

Creating a Patch

After you have committed a change or a set of changes to your local repository, you need to create a patch to post on the JIRA. The naming convention for patches is:

...