This guide documents the best way to make various types of contribution to Apache Spark, including what is required before submitting a code change.
Contributing to Spark doesn't just mean writing code. Helping new users on the mailing list, testing releases, and improving documentation are also welcome. In fact, proposing significant code changes usually requires first gaining experience and credibility within the community by helping in other ways. This is also a guide to becoming an effective contributor.
Contributing by Helping Other Users
A great way to contribute to Spark is to help answer user questions on the firstname.lastname@example.org mailing list. There are always many new Spark users; taking a few minutes to help answer a question is a very valuable community service.
Contributors should subscribe to this list and follow it in order to keep up to date on what's happening in Spark. Answering questions is an excellent and visible way to help the community, which also demonstrates your expertise.
Contributing by Testing Releases
Spark's release process is community-oriented, and members of the community can vote on new releases on the email@example.com mailing list. Spark users are invited to subscribe to this list to receive announcements (e.g. the Spark 1.3.1 release vote), and test their workloads on newer release and provide feedback on any performance or correctness issues found in the newer release.
Contributing by Reviewing Changes
Changes to Spark source code are proposed, reviewed and committed via Github (described later) at http://github.com/apache/spark/pulls. Anyone can view and comment on active changes here. Reviewing others' changes is a good way to learn how the change process works and gain exposure to activity in various parts of the code. You can help by reviewing the changes and asking questions or pointing out issues -- as simple as typos or small issues of style.
See also https://spark-prs.appspot.com for a convenient way to view and filter open PRs.
Contributing Documentation Changes
To have us add a link to an external tutorial you wrote, simply email the developer mailing list.
To modify the built-in documentation, edit the Markdown source files in Spark's docs directory, whose README file shows how to build the documentation locally to test your changes.
The process to propose a doc change is otherwise the same as the process for proposing code changes below. Note that changes to the site outside of docs must be handled manually by committers, since the rest of the http://spark.apache.org/ site is hosted at Apache and not versioned in Github. In these cases, a patch can be attached to a JIRA instead (also described below).
Contributing User Libraries to Spark
Just as Java and Scala applications can access a huge selection of libraries and utilities, none of which are part of Java or Scala themselves, Spark aims to support a rich ecosystem of libraries. Many new useful utilities or features belong outside of Spark rather than in the core. For example: language support probably has to be a part of core Spark, but, useful machine learning algorithms can happily exist outside of MLlib.
To that end, large and independent new functionality is often rejected for inclusion in Spark itself, but, can and should be hosted as a separate project and repository, and included in the http://spark-packages.org/ collection.
Contributing Bug Reports
Ideally, bug reports are accompanied by a proposed code change to fix the bug. This isn't always possible, as those who discover a bug may not have the experience to fix it. A bug may be reported by creating a JIRA but without creating a pull request (see below).
Bug reports are only useful however if they include enough information to understand, isolate and ideally reproduce the bug. Simply encountering an error does not mean a bug should be reported; as below, search JIRA and search and inquire on the mailing lists first. Unreproducible bugs, or simple error reports, may be closed.
It is possible to propose new features as well. These are generally not helpful unless accompanied by detail, such as a design document and/or code change. Large new contributions should consider http://spark-packages.org first (see above), or be discussed on the mailing list first. Feature requests may be rejected, or closed after a long period of inactivity.
Preparing to Contribute Code Changes
Choosing What to Contribute
Spark is an exceptionally busy project, with a new JIRA or pull request every few hours on average. Review can take hours or days of committer time. Everyone benefits if contributors focus on changes that are useful, clear, easy to evaluate, and already pass basic checks.
Before proceeding, contributors should evaluate if the proposed change is likely to be relevant, new and actionable:
Is it clear that code must change? Proposing a JIRA and pull request is appropriate only when a clear problem or change has been identified. If simply having trouble using Spark, use the mailing lists first, rather than consider filing a JIRA or proposing a change. When in doubt, email firstname.lastname@example.org first about the possible change
Search the email@example.com and firstname.lastname@example.org mailing list archives for related discussions. Use http://search-hadoop.com/?q=&fc_project=Spark or similar search tools. Often, the problem has been discussed before, with a resolution that doesn't require a code change, or recording what kinds of changes will not be accepted as a resolution.
Search JIRA for existing issues: https://issues.apache.org/jira/browse/SPARK
Type "spark [search terms]" at the top right search box. If a logically similar issue already exists, then contribute to the discussion on the existing JIRA and pull request first, instead of creating a new one.
Is the scope of the change matched to the contributor's level of experience? Anyone is qualified to suggest a typo fix, but refactoring core scheduling logic requires much more understanding of Spark. Some changes require building up experience first (see above).
MLlib-specific Contribution Guidelines
While a rich set of algorithms is an important goal for MLLib, scaling the project requires that maintainability, consistency, and code quality come first. New algorithms should:
Be widely known
Be used and accepted (academic citations and concrete use cases can help justify this)
Be highly scalable
Be well documented
Have APIs consistent with other algorithms in MLLib that accomplish the same thing
Come with a reasonable expectation of developer support.
Code Review Criteria
Before considering how to contribute code, it's useful to understand how code is reviewed, and why changes may be rejected. Simply put, changes that have many or large positives, and few negative effects or risks, are much more likely to be merged, and merged quickly. Risky and less valuable changes are very unlikely to be merged, and may be rejected outright rather than receive iterations of review.
Fixes the root cause of a bug in existing functionality
Adds functionality or fixes a problem needed by a large number of users
Maintains or improves consistency across Python, Java, Scala
Easily tested; has tests
Reduces complexity and lines of code
Change has already been discussed and is known to committers
Band-aids a symptom of a bug only
Introduces complex new functionality, especially an API that needs to be supported
Adds complexity that only helps a niche use case
Adds user-space functionality that does not need to be maintained in Spark, but could be hosted externally and indexed by http://spark-packages.org
Changes a public API or semantics (rarely allowed)
Adds large dependencies
Changes versions of existing dependencies
Adds a large amount of code
- Makes lots of modifications in one "big bang" change
Contributing Code Changes
Please review the preceding section before proposing a code change. This section documents how to do so.
When you contribute code, you affirm that the contribution is your original work and that you license the work to the project under the project's open source license. Whether or not you state this explicitly, by submitting any copyrighted material via pull request, email, or other means you agree to license the material under the project's open source license and warrant that you have the legal authority to do so.
Generally, Spark uses JIRA to track logical issues, including bugs and improvements, and uses Github pull requests to manage the review and merge of specific code changes. That is, JIRAs are used to describe what should be fixed or changed, and high-level approaches, and pull requests describe how to implement that change in the project's source code. For example, major design decisions are discussed in JIRA.
Find the existing Spark JIRA that the change pertains to.
Do not create a new JIRA if creating a change to address an existing issue in JIRA; add to the existing discussion and work instead
Look for existing pull requests that are linked from the JIRA, to understand if someone is already working on the JIRA
If the change is new, then it usually needs a new JIRA. However, trivial changes, where the what should change is virtually the same as the how it should change do not require a JIRA. Example: "Fix typos in Foo scaladoc"
If required, create a new JIRA:
Provide a descriptive Title. "Update web UI" or "Problem in scheduler" is not sufficient. "Kafka Streaming support fails to handle empty queue in YARN cluster mode" is good.
Write a detailed Description. For bug reports, this should ideally include a short reproduction of the problem. For new features, it may include a design document.
Set required fields:
Issue Type. Generally, Bug, Improvement and New Feature are the only types used in Spark.
Priority. Set to Major or below; higher priorities are generally reserved for committers to set. JIRA tends to unfortunately conflate "size" and "importance" in its Priority field values. Their meaning is roughly:
Blocker: pointless to release without this change as the release would be unusable to a large minority of users
Critical: a large minority of users are missing important functionality without this, and/or a workaround is difficult
Major: a small minority of users are missing important functionality without this, and there is a workaround
Minor: a niche use case is missing some support, but it does not affect usage or is easily worked around
Trivial: a nice-to-have change but unlikely to be any problem in practice otherwise
Affects Version. For Bugs, assign at least one version that is known to exhibit the problem or need the change
Do not set the following fields:
Fix Version. This is assigned by committers only when resolved.
Target Version. This is assigned by committers to indicate a PR has been accepted for possible fix by the target version.
Do not include a patch file; pull requests are used to propose the actual change. (Changes to the Spark site, outside of docs/, must use a patch since these are not hosted in Github.)
If the change is a large change, consider inviting discussion on the issue at email@example.com first before proceeding to implement the change.
Clone your fork, create a new branch, push commits to the branch.
Consider whether documentation or tests need to be added or updated as part of the change, and add them as needed.
Run all tests with
/dev/run-teststo verify that the code still compiles, passes tests, and passes style checks.
If style checks fail, review the Spark Code Style Guide
Open a pull request against the master branch of apache/spark. (Only in special cases would the PR be opened against other branches.)
The PR title should be of the form
[SPARK-xxxx] [COMPONENT]Title, where SPARK-xxxx is the relevant JIRA number, COMPONENT is one of the PR categories shown at https://spark-prs.appspot.com/ and Title may be the JIRA's title or a more specific title describing the PR itself.
If the pull request is still a work in progress, and so is not ready to be merged, but needs to be pushed to Github to facilitate review, then add
[WIP]after the component.
Consider identifying committers or other contributors who have worked on the code being changed. Find the file(s) in Github and click "Blame" to see a line-by-line annotation of who changed the code last. You can add
@usernamein the PR description to ping them immediately.
Please state that the contribution is your original work and that you license the work to the project under the project's open source license.
The related JIRA, if any, will be marked as "In Progress" and your pull request will automatically be linked to it. There is no need to be the Assignee of the JIRA to work on it, though you are welcome to comment that you have begun work.
The Jenkins automatic pull request builder will test your changes
If it is your first contribution, Jenkins will wait for confirmation before building your code and post "Can one of the admins verify this patch?"
A committer can authorize testing with a comment like "ok to test"
A committer can automatically allow future pull requests from a contributor to be tested with a comment like "Jenkins, add to whitelist"
After about 1.5 hours, Jenkins will post the results of the test to the pull request, along with a link to the full results on Jenkins.
Watch for the results, and investigate and fix failures promptly
Fixes can simply be pushed to the same branch from which you opened your pull request
Jenkins will automatically re-test when new commits are pushed
If the tests failed for reasons unrelated to the change (e.g. Jenkins outage), then a committer can request a re-test with "Jenkins, retest this please". Ask if you need a test restarted.
The Review Process
Other reviewers, including committers, may comment on the changes and suggest modifications. Changes can be added by simply pushing more commits to the same branch.
Lively, polite, rapid technical debate is encouraged from everyone in the community. The outcome may be a rejection of the entire change.
Reviewers can indicate that a change looks suitable for merging with a comment such as: "I think this patch looks good". Spark uses the LGTM convention for indicating the strongest level of technical sign-off on a patch: simply comment with the word "LGTM". It specifically means: "I've looked at this thoroughly and take as much ownership as if I wrote the patch myself". If you comment LGTM you will be expected to help with bugs or follow-up issues on the patch. Consistent, judicious use of LGTMs is a great way to gain credibility as a reviewer with the broader community.
Sometimes, other changes will be merged which conflict with your pull request's changes. The PR can't be merged until the conflict is resolved. This can be resolved with "git fetch origin" followed by "git merge origin/master" and resolving the conflicts by hand, then pushing the result to your branch.
Try to be responsive to the discussion rather than let days pass between replies
Closing Your Pull Request / JIRA
If a change is accepted, it will be merged and the pull request will automatically be closed, along with the associated JIRA if any
Note that in the rare case you are asked to open a pull request against a branch besides master, that you will actually have to close the pull request manually
The JIRA will be Assigned to the primary contributor to the change as a way of giving credit. If the JIRA isn't closed and/or Assigned promptly, comment on the JIRA.
If your pull request is ultimately rejected, please close it promptly
... because committers can't close PRs directly
Pull requests will be automatically closed by an automated process at Apache after about a week if a committer has made a comment like "mind closing this PR?" This means that the committer is specifically requesting that it be closed.
If a pull request has gotten little or no attention, consider improving the description or the change itself and ping likely reviewers again after a few days. Consider proposing a change that's easier to include, like a smaller and/or less invasive change.
If it has been reviewed but not taken up after weeks, after soliciting review from the most relevant reviewers, or, has met with neutral reactions, the outcome may be considered a "soft no". It is helpful to withdraw and close the PR in this case.
- If a pull request is closed because it is deemed not the right approach to resolve a JIRA, then leave the JIRA open. However if the review makes it clear that the issue identified in the JIRA is not going to be resolved by any pull request (not a problem, won't fix) then also resolve the JIRA.