This guide is optional for contributors. It is not necessary to use GitHub to contribute patches.
Note: This content was moved over from https://wiki.apache.org/hadoop/GithubIntegration
There are several ways to set up Git for committers and contributors. Contributors can safely set up Git any way they choose, but committers should take extra care since they can push new commits to the trunk at Apache, and various policies there make backing out mistakes problematic. To keep the commit history clean, take note of the use of `--squash` below when merging.
Git setup for Committers
This describes a setup with one local repo and two remotes. It allows you to push the code on your machine to either your GitHub repo or to gitbox.apache.org. You will want to fork GitHub's `apache/hadoop` to your own account on GitHub; this will enable pull requests of your own. Cloning this fork locally will set up "origin" to point to your remote fork on GitHub as the default remote, so `git push origin trunk` will go to GitHub.
To attach to the Apache git repo do the following:
To check your remote setup:
You should see something like this:
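A minimal sketch of both steps, assuming the Apache repository lives at `https://gitbox.apache.org/repos/asf/hadoop.git`, that the second remote is named `apache`, and that `your-github-id` stands in for your GitHub account (all three are assumptions, not taken from the text above). It runs in a scratch repository so that nothing touches a real clone; `git remote -v` should then list a fetch and a push line for each remote.

```shell
# Scratch repository standing in for a local clone of your GitHub fork,
# so the sketch can run without touching a real checkout or the network.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# `git clone` normally creates "origin"; it is added by hand here.
# "your-github-id" is a placeholder for your own GitHub account.
git remote add origin https://github.com/your-github-id/hadoop.git

# Attach the Apache repository as a second remote named "apache"
# (remote name and URL are assumptions of this sketch).
git remote add apache https://gitbox.apache.org/repos/asf/hadoop.git

# Check the remote setup: one fetch and one push line per remote.
git remote -v
```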
Now if you want to experiment with a branch, everything, by default, points to your GitHub account because `origin` is the default remote. You can work as normal using only GitHub until you are ready to merge with the apache remote. Some conventions will integrate with Apache Jira ticket numbers.
Once you are ready to commit to the apache remote you can merge and push your changes directly or, better yet, create a PR.
We recommend creating new branches under `feature/` to help group ongoing work, especially now that, as of November 2015, forced updates are disabled on ASF branches. We hope to reinstate that ability on feature branches to aid development.
How to create a PR (committers)
Push your branch to GitHub:
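As a rough sketch of this step, the commands below create a `feature/hadoop-xxxx` branch and push it to `origin`. To stay runnable offline, a local bare repository plays the role of your GitHub fork; the committer identity and the file being committed are placeholders.

```shell
# Offline stand-ins: a bare repository plays the role of your GitHub
# fork, and a scratch repository plays your working copy.
work=$(mktemp -d)
git init -q --bare "$work/fork.git"
git init -q "$work/hadoop"
cd "$work/hadoop"
git config user.name "Example Committer"       # placeholder identity
git config user.email "committer@example.org"
git remote add origin "$work/fork.git"

# Create the feature branch and commit some work on it.
git checkout -q -b feature/hadoop-xxxx
echo "change" > change.txt
git add change.txt
git commit -q -m "HADOOP-XXXX. Work in progress."

# Push the branch to the fork; with a real clone this goes to GitHub.
git push -q origin feature/hadoop-xxxx
```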
- Go to your `feature/hadoop-xxxx` branch on GitHub. Since you forked it from GitHub's `apache/hadoop`, any PR will default to targeting that repository.
- Click the green "Compare, review, and create pull request" button.
- You can edit the to and from for the PR if they aren't correct. The "base fork" should be `apache/hadoop` unless you are collaborating separately with one of the committers on the list. The "base" will be trunk. Don't submit a PR to one of the other branches unless you know what you are doing. The "head fork" will be your forked repo and the "compare" will be your `feature/hadoop-xxxx` branch.
- Click the "Create pull request" button and name the request "HADOOP-XXXX", all caps. This will connect the comments of the PR to the mailing list and Jira comments.
- From now on the PR lives on GitHub's `apache/hadoop` repository. You use the commenting UI there.
If you are looking for a review or sharing with someone else, say so in the comments, but don't worry about automated merging of your PR; you will have to do that later. The PR is tied to your branch, so you can respond to comments, make fixes, and commit them from your local repo. They will appear on the PR page and be mirrored to Jira and the mailing list. When you are satisfied and want to push it to Apache's remote repo, proceed with the steps in "Merging a PR (yours or contributors)".
How to create a PR (contributors)
Create pull requests: https://help.github.com/articles/creating-a-pull-request/.
Pull requests are made to the `apache/hadoop` repository on GitHub. In the GitHub UI you should pick the trunk branch as the target of the PR, as described for committers. The PR will be reviewed and commented on, so the merge is not automatic. A PR can also be used for discussing contributions in progress.
How to run Jenkins precommit job for a PR (committers)
The Jenkins precommit job runs automatically for PRs created by committers, but not for PRs created by non-committers. If you are a committer and want to run the Jenkins precommit job manually, log in to https://builds.apache.org/view/H-L/view/Hadoop/job/hadoop-multibranch/ and click the "Scan Repository Now" link.
Merging a PR (yours or contributors)
Start by reading https://help.github.com/articles/checking-out-pull-requests-locally/.
Remember that a pull request is equivalent to a remote GitHub branch with potentially a multitude of commits. It is recommended to squash the remote commit history so that there is one commit per issue, rather than merging in a multitude of a contributor's commits. To do that, and to close the PR at the same time, use squash commits.
Merging a pull request is equivalent to a "pull" of a contributor's branch:
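A sketch of such a squash pull, runnable offline under a few assumptions: local bare repositories stand in for the Apache repository and for the contributor's fork (which in the real workflow would be a GitHub URL such as `https://github.com/cuser/hadoop`), the Apache remote is named `apache`, and the file contents and identity are placeholders.

```shell
# Offline stand-ins: bare repositories play the Apache repository and
# the contributor's fork; a scratch repository plays the committer's clone.
work=$(mktemp -d)
git init -q --bare "$work/apache.git"
git init -q --bare "$work/cuser.git"
git init -q "$work/hadoop"
cd "$work/hadoop"
git config user.name "Example Committer"       # placeholder identity
git config user.email "committer@example.org"
git remote add apache "$work/apache.git"

# Seed trunk on the Apache stand-in.
git checkout -q -b trunk
echo "base" > file.txt
git add file.txt
git commit -q -m "Initial commit"
git push -q apache trunk

# The contributor's branch "cbranch" carries two commits on top of trunk.
git checkout -q -b cbranch
echo "one" >> file.txt
git commit -q -a -m "first try"
echo "two" >> file.txt
git commit -q -a -m "fix typo"
git push -q "$work/cuser.git" cbranch

# The merge itself, as a committer would run it.
git checkout -q trunk                           # switch to local trunk
git pull -q apache trunk                        # bring trunk up to date
git pull -q --squash "$work/cuser.git" cbranch  # stage the PR as one change
git status --short                              # staged, but not yet committed
```

After the squash pull, trunk still has its original history; the contributor's two commits are present only as one set of staged changes waiting for a single commit.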
The `--squash` option ensures all PR history is squashed into a single commit, and allows the committer to use their own message. Read the git help for `merge` or `pull` for more information about the `--squash` option. In this example we assume that the contributor's GitHub handle is "cuser" and the PR branch name is "cbranch". Next, resolve conflicts, if any, or ask the contributor to rebase on top of trunk if the PR has gone out of sync.
If you are ready to merge your own (committer's) PR you probably only need to merge (not pull), since you have a local copy that you've been working on. This is the branch that you used to create the PR.
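For your own branch, a local squash merge might look like the sketch below. It runs in a scratch repository; the branch name `feature/hadoop-xxxx` follows the convention used earlier, while the identity and file contents are placeholders. In a real clone you would first bring trunk up to date from the Apache remote.

```shell
# Scratch repository with a trunk and a local feature branch.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Example Committer"       # placeholder identity
git config user.email "committer@example.org"

git checkout -q -b trunk
echo "base" > file.txt
git add file.txt
git commit -q -m "Initial commit"

git checkout -q -b feature/hadoop-xxxx
echo "work" >> file.txt
git commit -q -a -m "HADOOP-XXXX. Work in progress."

# Merge (not pull): the branch is already local. --squash stages the
# combined change on trunk without creating a commit.
git checkout -q trunk
git merge -q --squash feature/hadoop-xxxx
git status --short
```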
Remember to run regular patch checks, build with tests enabled, and change CHANGES.TXT (not applicable for Hadoop versions 2.8.0 and later) for the appropriate part of the project.
If everything is fine, you can now commit the squashed request along these lines:
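One possible shape for that commit, sketched in a scratch repository. `XXXX` and `ZZ` are deliberately left as placeholders, the description text is invented, and crediting the contributor through `--author` (with a made-up name and address) is an assumption of this sketch.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Example Committer"       # placeholder identity
git config user.email "committer@example.org"

# Stand-in for the staged result of the squash merge.
echo "squashed change" > file.txt
git add file.txt

# XXXX and ZZ are placeholders for the real Jira issue number and PR
# number; the --author value is a hypothetical contributor.
git commit -q --author "Contributor Name <contributor@example.org>" \
    -m "HADOOP-XXXX. Short description. closes apache/hadoop#ZZ"

# Show who is credited and the subject line that was recorded.
git log -1 --format='%an <%ae>: %s'
```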
Here HADOOP-XXXX is all caps and ZZ is the pull request number on the `apache/hadoop` repository. Including `closes apache/hadoop#ZZ` will close the PR automatically. More information is found at https://help.github.com/articles/closing-issues-via-commit-messages. Next, push to gitbox.apache.org:
(this will require Apache handle credentials).
The PR, once pushed, will get mirrored to GitHub. To update your personal GitHub version push there too:
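The two pushes could be sketched as follows, assuming the remotes are named `apache` and `origin`. Local bare repositories stand in for gitbox.apache.org and for your GitHub fork so that the example runs offline and without credentials.

```shell
# Offline stand-ins for gitbox.apache.org ("apache") and your fork ("origin").
work=$(mktemp -d)
git init -q --bare "$work/apache.git"
git init -q --bare "$work/fork.git"
git init -q "$work/hadoop"
cd "$work/hadoop"
git config user.name "Example Committer"       # placeholder identity
git config user.email "committer@example.org"
git remote add apache "$work/apache.git"
git remote add origin "$work/fork.git"

# Stand-in for the squashed commit on trunk.
git checkout -q -b trunk
echo "squashed change" > file.txt
git add file.txt
git commit -q -m "HADOOP-XXXX. Short description. closes apache/hadoop#ZZ"

git push -q apache trunk   # the real remote asks for Apache credentials
git push -q origin trunk   # keep your personal GitHub fork in step
```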
Note on squashing: Since squash discards remote branch history, repeated PRs from the same remote branch are difficult to merge. The workflow implies that every new PR starts with a new rebased branch. This is more important for contributors to know than for committers, because if a new PR is not mergeable, GitHub will warn about it up front. In any case, watch for duplicate PRs (based on the same source branches); they are bad practice.
Closing a PR without committing (for committers)
Hadoop committers can now close GitHub pull requests directly. If you are a committer and don't have the privilege, you need to link your ASF and GitHub accounts via https://gitbox.apache.org/setup/
Apache/github integration features
Read https://blogs.apache.org/infra/entry/improved_integration_between_apache_and. Comments and PRs with Hadoop issue handles should post to mailing lists and Jira. Hadoop issue handles must be in the form `HADOOP-YYYYY` (all capitals). Usually it makes sense to file a JIRA issue first, and then create a PR whose description names that issue.
In this case all subsequent comments will automatically be copied to JIRA without having to mention the JIRA issue explicitly in each comment of the PR.
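Because the handle has to be in capitals, a quick local check before opening a PR can catch slips. The helper below is purely hypothetical (its name, the requirement that the handle comes first, and the example titles with their issue numbers are all assumptions); it only tests whether a string starts with `HADOOP-` followed by digits.

```shell
# Hypothetical helper: succeeds when the text starts with an issue
# handle of the form HADOOP-<digits>, in capitals.
starts_with_issue_handle() {
    printf '%s\n' "$1" | grep -Eq '^HADOOP-[0-9]+'
}

# Made-up example titles: the first matches, the second does not
# because grep is case-sensitive by default.
for title in "HADOOP-12345: Fix the thing" "hadoop-12345: Fix the thing"; do
    if starts_with_issue_handle "$title"; then
        echo "ok:      $title"
    else
        echo "missing: $title"
    fi
done
```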
Avoiding accidentally committing private branches to the ASF repo
It's dangerously easy, especially when using IDEs, to accidentally commit changes to the ASF repo, be it directly to `trunk`, `branch-2` or another standard branch on which you are developing, or to a private branch you had intended to keep on GitHub (or a private repo).
Committers can avoid this by having the directory in which they develop code set up with read-only access to the ASF repository on GitHub, without the Apache repository added. A separate directory should be set up with write access to the ASF repository as well as read access to your other repositories. Merging operations and pushes back to the ASF repo are done from this directory, isolated from all local development.
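One way to approximate the read-only development directory is sketched below. Pointing the push URL at an invalid value is a general git technique swapped in here as an extra guard; it is not something this guide prescribes, and the value `no_push` is arbitrary.

```shell
# Scratch directory standing in for the development clone.
dev=$(mktemp -d)
git init -q "$dev"
cd "$dev"

# Only the GitHub mirror is configured; no "apache" remote is added.
git remote add origin https://github.com/apache/hadoop.git

# Extra guard (an assumption of this sketch): an invalid push URL means
# `git push origin ...` fails instead of reaching the ASF repository.
git remote set-url --push origin no_push

# Fetch still points at GitHub; push points at the invalid value.
git remote -v
```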
If you accidentally commit a patch to an ASF branch, do not attempt to roll back the branch and force out a new update. Simply commit and push out a new patch revoking the change.
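A sketch of that recovery using `git revert`, which records a new commit undoing the earlier one rather than rewriting history. It runs in a scratch repository; the identity, file, and commit messages are placeholders, and the final push (shown only as a comment) assumes a remote named `apache`.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Example Committer"       # placeholder identity
git config user.email "committer@example.org"

git checkout -q -b trunk
echo "base" > file.txt
git add file.txt
git commit -q -m "Initial commit"

# The accidental commit.
echo "oops" >> file.txt
git commit -q -a -m "Accidental change"

# Undo it with a new commit; history is preserved, so no forced update
# is needed.
git revert --no-edit HEAD
git log --oneline

# In the real workflow the revert is then pushed as usual, for example:
#   git push apache trunk
```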
If you do accidentally commit a branch to the ASF repo, the infrastructure team can delete it, but they cannot stop it propagating to GitHub and potentially being visible. Try not to do that.
Avoiding accidentally committing private keys to Amazon AWS, Microsoft Azure or other cloud infrastructures
All the cloud integration projects under `hadoop-tools` expect a resource file, `resources/auth-keys.xml`, to contain the credentials for authenticating with cloud infrastructures. These files are explicitly excluded from git through entries in `.gitignore`. To avoid running up large bills and/or exposing private data, it is critical to keep all of your credentials secret.
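To confirm that a credentials file really is ignored, `git check-ignore` can be used, as in the sketch below. The ignore rule and the module path `hadoop-tools/example-module` are illustrative stand-ins created in a scratch repository; the actual rules are whatever Hadoop's own `.gitignore` contains.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Illustrative ignore rule and hypothetical module path.
echo "auth-keys.xml" > .gitignore
mkdir -p hadoop-tools/example-module/resources
echo "<configuration/>" > hadoop-tools/example-module/resources/auth-keys.xml

# Prints the matching rule; exits non-zero if the file is NOT ignored.
git check-ignore -v hadoop-tools/example-module/resources/auth-keys.xml

# The credentials file should be absent from the untracked files listed.
git status --short
```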
For maximum security here, clone your Hadoop repository into a separate directory for cloud tests, one with read-only access. Create the `auth-keys.xml` files there. This guarantees that you cannot commit the credentials, albeit with a somewhat more complex workflow, as patches must be pushed to a git repository before being pulled into the cloud-enabled directory and tested.
Accidentally committing secret credentials is a very expensive mistake. You will not only need to revoke your keys, you will also need to kill all bitcoin-mining machines created in all EC2 zones, and cancel all outstanding spot-price bids for them.