The Apache Spark team welcomes all types of contributions, whether they be bug reports, providing help to new users, documentation, or code patches.
- Reporting, Answering, and Triaging Issues
- Testing and Voting on Spark Releases
- Contributing Code
- Starter Tasks
- Development Discussions
- IDE Setup
Reporting, Answering, and Triaging Issues
The Spark community has two platforms for discussing user issues and requirements:
- The spark-user mailing list is the place to go for usage questions.
- The Apache Spark JIRA is the project's issue tracker, for reporting reproducible bugs or posting feature requests.
A great way to contribute to Spark is to help answer user questions on the mailing list. There are always many new Spark users; taking a few minutes to help answer a question is a very valuable community service! On the JIRA issue tracker, helping to investigate, isolate, and reproduce reported bugs is a great way to get more familiar with Spark components, and a good first step towards contributing code to those components.
Testing and Voting on Spark Releases
Spark's release process is community-oriented, and members of the community can vote on new releases on the spark-dev mailing list. Spark users are invited to test their workloads on newer releases and provide feedback on any performance or correctness issues they find. This type of testing is a valuable and greatly appreciated contribution.
Contributing Code
We prefer to receive contributions in the form of GitHub pull requests. Start by opening an issue for your change on the Spark Project JIRA (after searching to make sure there isn't an existing issue). For code reviews, we use the github.com/apache/spark repository.
Please follow the steps below to propose a contribution:
- Break your work into small, single-purpose patches if possible. It’s much harder to merge in a large change with a lot of disjoint features.
- Review the criteria for inclusion of patches.
- Create an issue for your patch on the Spark Project JIRA.
- If you are proposing a larger change, attach a design document to your JIRA first (example) and email the dev mailing list to discuss it.
- Submit the patch as a GitHub pull request. For a tutorial, see the GitHub guides on forking a repo and sending a pull request. Name your pull request with the JIRA name and include the Spark module or WIP if relevant. NOTE: If you do not reference a JIRA in the title, you may not be credited in our release notes, since our credits are generated by JIRA.
- Follow the Spark Code Style Guide. Before sending in your pull request, you can run ./dev/lint-python to validate the style.
- Make sure that your code passes the automated tests (see Automated Testing below).
- Add new tests for your code. We use ScalaTest for testing. Just add a new Suite in core/src/test, or methods to an existing Suite.
- Update the documentation (in the docs folder) if you add a new feature or configuration parameter.
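The pull request naming step above can be checked mechanically before you open the request. The sketch below is a hypothetical helper, not part of Spark's tooling. It assumes titles take the form [SPARK-<number>], optionally followed by bracketed tags such as a module name or WIP, and then a description. That bracket layout, the function name check_pr_title, and the issue number SPARK-1234 are assumptions for illustration; only the requirement to reference the JIRA in the title comes from the steps above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark): verify that a pull request title
# starts with a JIRA key. The "[SPARK-<number>][TAG] description" layout is an
# assumed convention used here for illustration only.
check_pr_title() {
  # -E: extended regular expressions, -q: quiet (exit status only)
  if printf '%s\n' "$1" | grep -Eq '^\[SPARK-[0-9]+\](\[[A-Za-z0-9 ]+\])* .+'; then
    echo "ok"
  else
    echo "missing JIRA reference"
    return 1
  fi
}

# Example calls (SPARK-1234 is a placeholder issue number):
#   check_pr_title "[SPARK-1234][SQL] Fix null handling in joins"
#   check_pr_title "Fix null handling in joins"
```

The function returns a non-zero status for a title without a JIRA key, so it could be wired into a local script that refuses to continue until the title is fixed.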
If you’d like to report a bug but don’t have time to fix it, you can still post it to our issue tracker, or email the mailing list.
Coding Style Guide and Interface Design
Please follow the Spark Code Style Guide for coding style.
Please also read this presentation about interface design.
Criteria for Inclusion or Rejection of Patches
When Spark committers consider a patch for merging, we take several factors into account. Certain types of patches will be reviewed and merged almost instantly: patches that address correctness issues in Spark, are small, and/or benefit a large number of users are likely to get a lot of attention. Other patches might take more time to review. In a small number of cases, patches are rejected. Patches might be rejected for the following reasons:
- Correctness concerns: If a patch touches a lot of code and it is difficult to verify its correctness, it might be rejected.
- User space functionality: If a patch is adding features that could exist in a third-party package rather than Spark itself, we sometimes encourage users to publish utilities in their own library. This is especially true for large standalone modules.
- Too complex: Spark aims to keep its codebase simple and maintainable. If features are very complex relative to their benefit, they may be rejected.
- Regressing behavior: If a patch regresses behavior that is implicitly or explicitly depended on by users, it might be rejected on this basis.
- Introducing new APIs: Patches that propose new public or experimental APIs must meet a high bar in Spark due to our API compatibility guidelines.
- Not applicable to enough users: Optimizations or features might be rejected on the basis of being too esoteric and not useful to a broad enough audience.
- Introduction of dependencies: Due to the complex nature of Spark, we are conservative about introducing new dependencies. If patches add new dependencies to Spark, they may not be merged.
Small patches are almost never rejected, so starting with small patches is a good strategy for new contributors. Keep in mind that Spark committers act as volunteers; patches with major correctness issues might be rejected without significant review, since such review is very costly in terms of time. If this happens, consider finding smaller patches or simpler features to contribute, then building up your confidence and abilities over time.
Code Review Process
Community code review is Spark's fundamental quality assurance process. When reviewing a patch, your goal should be to help streamline the committing process by giving committers confidence that the patch has been verified by an additional party. Reviewers are encouraged to (politely) submit technical feedback to the author to identify areas for improvement or potential bugs.
If you feel a patch is ready for inclusion in Spark, indicate this to committers with a comment such as: "I think this patch looks good". Spark uses the LGTM convention for indicating the strongest level of technical sign-off on a patch: simply comment with the word "LGTM". An LGTM is a strong statement with specific semantics. It should be interpreted as the following: "I've looked at this thoroughly and take as much ownership as if I wrote the patch myself". If you comment LGTM, you will be expected to help with bugs or follow-up issues on the patch. Consistent, judicious use of LGTMs is a great way to gain credibility as a reviewer with the broader community.
Reviewers are also welcome to argue against the inclusion of a feature or patch. Simply indicate this in the comments.
Contributing New Algorithms to MLlib
While a rich set of algorithms is an important goal for MLlib, scaling the project requires that maintainability, consistency, and code quality come first. New algorithms should:
- Be widely known
- Be used and accepted (academic citations and concrete use cases can help justify this)
- Be highly scalable
- Be well documented
- Have APIs consistent with other algorithms in MLlib that accomplish the same thing
- Come with a reasonable expectation of developer support
Automated Testing
Spark comes with a fairly comprehensive suite of unit tests, functional tests, and integration tests. All pull requests are automatically tested on Jenkins, currently hosted by the Berkeley AMPLab.
To run the whole test suite (along with the code style and binary compatibility checks), run ./dev/run-tests.
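The style check and the test run mentioned on this page can be chained into a single pre-submission step. The sketch below is a hypothetical wrapper (the function name pre_submit_checks is not part of Spark). It assumes that dev/lint-python and dev/run-tests live under the root of a Spark checkout and that each script exits with a non-zero status on failure.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Spark): run the style check and then the
# full test suite from the root of a Spark checkout.
# Assumption: both scripts exit non-zero when they fail.
pre_submit_checks() {
  spark_home="${1:-.}"   # path to the Spark checkout; defaults to current dir

  # Fail early with a clear message if either script is missing.
  for script in dev/lint-python dev/run-tests; do
    if [ ! -x "$spark_home/$script" ]; then
      echo "cannot find $script under $spark_home" >&2
      return 1
    fi
  done

  # Run in a subshell so the caller's working directory is left unchanged.
  # The test suite only starts if the style check succeeds.
  (
    cd "$spark_home" || exit 1
    ./dev/lint-python && ./dev/run-tests
  )
}

# Example call, from anywhere (the path is a placeholder):
#   pre_submit_checks /path/to/spark
```

Running the style check first is a design choice: it is the cheaper of the two steps, so style problems surface before the much longer test run begins.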
Starter Tasks
If you are new to Spark and want to contribute, you can browse through the list of starter tasks on our JIRA. These tasks are typically small and simple, and are excellent problems to get you ramped up.
If you'd like to contribute documentation, there are two ways:
- To have us add a link to an external tutorial you wrote, simply email the developer mailing list.
- To modify the built-in documentation, edit the Markdown source files in Spark's docs directory, and send a patch against the Spark GitHub repository. The README file in docs says how to build the documentation locally to test your changes.
Development Discussions
To keep up to date with the latest discussions, join the developer mailing list.
IDE Setup
While many of the Spark developers use SBT or Maven on the command line, the most common IDE we use is IntelliJ IDEA. You can get the community edition for free (Apache committers can get free IntelliJ Ultimate Edition licenses) and install the JetBrains Scala plugin from Preferences > Plugins.
To create a Spark project for IntelliJ:
- Download IntelliJ and install the Scala plug-in for IntelliJ.
- Go to "File -> Import Project", locate the Spark source directory, and select "Maven Project".
- In the Import wizard, it's fine to leave settings at their default. However, it is usually useful to enable "Import Maven projects automatically", since changes to the project structure will then automatically update the IntelliJ project.
- As documented in Building Spark, some build configurations require specific profiles to be enabled. The same profiles that are enabled with -P[profile name] above may be enabled on the Profiles screen in the Import wizard. For example, if developing for Hadoop 2.4 with YARN support, enable the yarn and hadoop-2.4 profiles. These selections can be changed later by accessing the "Maven Projects" tool window from the View menu, and expanding the Profiles section.
- "Rebuild Project" can fail the first time the project is compiled, because generated source files are not created automatically. Try clicking the "Generate Sources and Update Folders For All Projects" button in the "Maven Projects" tool window to manually generate these sources.
- Compilation may fail with an error like "scalac: bad option: -P:/home/jakub/.m2/repository/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar". If so, go to Preferences > Build, Execution, Deployment > Scala Compiler and clear the "Additional compiler options" field. The build will then work, although the option will come back when the project is reimported. If you try to build any of the projects using quasiquotes (e.g., sql), you will need to make that jar a compiler plugin (just below "Additional compiler options"). Otherwise you will see errors like:
Eclipse can be used to develop and test Spark. The following configuration is known to work:
- Eclipse Juno
- Scala IDE v 3.0.3
- ScalaTest
Scala IDE can be installed using Help | Eclipse Marketplace... and searching for Scala IDE. Remember to include ScalaTest as a Scala IDE plugin. To install ScalaTest after installing Scala IDE, follow these steps:
- Select Help | Install New Software
- Select http://download.scala-ide.org... in the "Work with" combo box
- Expand Scala IDE plugins, select ScalaTest for Scala IDE, and install
SBT can create Eclipse .classpath files. To create these files for each Spark sub project, use this command:
To import a specific project, e.g. spark-core, select File | Import | Existing Projects into Workspace. Do not select "Copy projects into workspace". Importing all Spark sub projects at once is not recommended.
ScalaTest can execute unit tests by right clicking a source file and selecting Run As | Scala Test.
If Java memory errors occur, it might be necessary to increase the settings in eclipse.ini in the Eclipse install directory. Increase the following setting as needed:
If the following error occurs when running ScalaTest
It is due to an incorrect Scala library in the classpath. To fix it:
- Right click on the project
- Select Build Path | Configure Build Path
- Add Library | Scala Library
- scala-library-2.10.4.jar - lib_managed\jars
In the event of "Could not find resource path for Web UI: org/apache/spark/ui/static", it's due to a classpath issue (some classes were probably not compiled). To fix this, it is sufficient to run a test from the command line:
There are some dependencies to run Python tests locally:
The unit tests will try to run with Python 2.6 (the oldest supported version) if it's available. Python 2.6 needs unittest2 to run the tests, which can be installed with pip2.6.
NumPy 1.4+ is needed to run the MLlib Python tests, and should also be installed for Python 2.6.
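Before running the Python tests, it can help to confirm that the interpreter can import the dependencies listed above. This is a minimal sketch, assuming the interpreter is on the PATH as python2.6 and that the packages import as unittest2 and numpy. The helper name check_python_modules is hypothetical, and it checks importability only, not the NumPy version.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark): report which Python modules the
# given interpreter cannot import.
# Usage: check_python_modules <interpreter> <module> [<module> ...]
check_python_modules() {
  py="$1"
  shift
  missing=""
  for mod in "$@"; do
    # "import <module>" exits non-zero if the module is not installed.
    if ! "$py" -c "import $mod" >/dev/null 2>&1; then
      missing="$missing $mod"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "all modules found"
}

# Example call (assumes an interpreter named python2.6 is installed):
#   check_python_modules python2.6 unittest2 numpy
```

If a module is reported as missing, install it for that same interpreter (for example with pip2.6, as noted above) and run the check again.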
After that, you can run all the Python unittests by