The Apache Spark team welcomes all types of contributions, whether they be bug reports, documentation, or new patches.

Reporting Issues

If you'd like to report a bug in Spark or ask for a new feature, open an issue on the Apache Spark JIRA. For general usage help, you should email the user mailing list.

Contributing Code

We prefer to receive contributions in the form of GitHub pull requests. Start by opening an issue for your change on the Spark Project JIRA (search first to make sure an existing issue doesn't already cover it). For code reviews, we use the github.com/apache/spark repository.

Please follow the steps below to propose a contribution:

  1. Break your work into small, single-purpose patches if possible. It’s much harder to merge in a large change with a lot of disjoint features.
  2. Review the criteria for inclusion of patches.
  3. Create an issue for your patch on the Spark Project JIRA.
  4. If you are proposing a larger change, attach a design document to your JIRA first (example) and email the dev mailing list to discuss it.
  5. Submit the patch as a GitHub pull request. For a tutorial, see the GitHub guides on forking a repo and sending a pull request. Name your pull request after the JIRA issue and include the Spark module or WIP if relevant. NOTE: If you do not reference a JIRA in the title, you may not be credited in our release notes, since our credits are generated from JIRA.
  6. Follow the Spark Code Style Guide. Before sending in your pull request, you can run ./dev/lint-scala and ./dev/lint-python to validate the style.
  7. Make sure that your code passes the automated tests (see Automated Testing below).
  8. Add new tests for your code. We use ScalaTest for testing. Just add a new Suite in core/src/test, or methods to an existing Suite (see the sketch after this list for the general shape).
  9. Update the documentation (in the docs folder) if you add a new feature or configuration parameter.
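
For reference, here is a minimal sketch of what a new test might look like. The package, class, and test names below are hypothetical; they only illustrate the ScalaTest FunSuite shape that the existing suites in core/src/test follow:

// Hypothetical example suite; the names are illustrative, not part of Spark.
package org.apache.spark.rdd

import org.scalatest.FunSuite

import org.apache.spark.SparkContext

class MyFeatureSuite extends FunSuite {
  test("my feature doubles every element") {
    val sc = new SparkContext("local", "MyFeatureSuite")
    try {
      // Exercise the behavior under test on a small local dataset.
      val result = sc.parallelize(1 to 10).map(_ * 2).collect()
      assert(result.sum == 110)
    } finally {
      sc.stop()
    }
  }
}

Keeping each test small and self-contained (creating and stopping its own SparkContext) makes failures much easier to track down.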

If you’d like to report a bug but don’t have time to fix it, you can still post it to our issue tracker, or email the mailing list.

Tip: Use descriptive titles for your pull requests

SPARK-123: Add some feature to Spark

[STREAMING] SPARK-123: Add some feature to Spark streaming

[MLLIB] [WIP] SPARK-123: Some potentially useful feature for MLLib

Criteria for Inclusion or Rejection of Patches

When Spark committers consider a patch for merging, we take several factors into account. Certain types of patches will be reviewed and merged almost instantly: patches that address correctness issues in Spark, are small, and/or benefit a large number of users are likely to get a lot of attention. Other patches might take more time to review. In a small number of cases, patches are rejected. Patches might be rejected for the following reasons:

  1. Correctness concerns: If a patch touches a lot of code and its correctness is difficult to verify, it might be rejected.
  2. User space functionality: If a patch is adding features that could exist in a third-party package rather than Spark itself, we sometimes encourage users to publish utilities in their own library. This is especially true for large standalone modules.
  3. Too complex: Spark aims to keep its codebase simple and maintainable. If features are very complex relative to their benefit, they may be rejected.
  4. Regressing behavior: If a patch regresses behavior that is implicitly or explicitly depended on by users, it might be rejected on this basis.
  5. Introducing new APIs: Patches that propose new public or experimental APIs must meet a high bar in Spark due to our API compatibility guidelines.
  6. Not applicable to enough users: Optimizations or features might be rejected on the basis of being too esoteric and not useful to a broad enough audience.
  7. Introduction of dependencies: Due to the complex nature of Spark, we are conservative about introducing new dependencies. If patches add new dependencies to Spark, they may not be merged.

Small patches are almost never rejected, so it's a good strategy for new contributors to start with small patches. Keep in mind that Spark committers act as volunteers; patches with major correctness issues might be rejected without significant review, since such review is very costly in terms of time. If this happens, consider finding smaller patches or simpler features to contribute, then building up your confidence and abilities over time.

Contributing New Algorithms to MLLib

While a rich set of algorithms is an important goal for MLLib, scaling the project requires that maintainability, consistency, and code quality come first. New algorithms should:

  • Be widely known
  • Be used and accepted (academic citations and concrete use cases can help justify this)
  • Be highly scalable
  • Be well documented
  • Have APIs consistent with other algorithms in MLLib that accomplish the same thing (see the sketch after this list)
  • Come with a reasonable expectation of developer support.
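
As a rough illustration of API consistency (MyAlgorithm and MyAlgorithmModel below are hypothetical, not existing MLLib classes), a new algorithm would typically expose a model class plus an object with train methods, mirroring the pattern used by the existing MLLib algorithms:

// Hypothetical sketch only: MyAlgorithm and MyAlgorithmModel are not real MLLib classes.
package org.apache.spark.mllib.myalgorithm

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// The fitted model returned to users, exposing predict like other MLLib models.
class MyAlgorithmModel(val weights: Vector) extends Serializable {
  def predict(point: Vector): Double = ???
}

// Static-style train entry points, so callers use the same pattern as the rest of MLLib.
object MyAlgorithm {
  def train(data: RDD[Vector], numIterations: Int): MyAlgorithmModel = ???
}

Matching the existing naming and train(...) conventions keeps the user-facing API predictable across algorithms.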

Automated Testing

Spark comes with a fairly comprehensive suite of unit, functional, and integration tests. All pull requests are automatically tested on Jenkins, currently hosted by the Berkeley AMPLab.

To run through the whole suite of tests (along with the code style check and binary compatibility checks), run ./dev/run-tests.

Starter Tasks

If you are new to Spark and want to contribute, you can browse through the list of starter tasks on our JIRA. These tasks are typically small and simple, and are excellent problems to get you ramped up.

Documentation

If you'd like to contribute documentation, there are two ways:

  • To have us add a link to an external tutorial you wrote, simply email the developer mailing list
  • To modify the built-in documentation, edit the Markdown source files in Spark's docs directory, and send a patch against the Spark GitHub repository. The README file in docs explains how to build the documentation locally to test your changes.

Development Discussions

To keep up to date with the latest discussions, join the developer mailing list.

IDE Setup

IntelliJ

While many of the Spark developers use SBT or Maven on the command line, the most common IDE we use is IntelliJ IDEA. You can get the community edition for free (Apache committers can get free IntelliJ Ultimate Edition licenses) and install the JetBrains Scala plugin from Preferences > Plugins.

To create a Spark project for IntelliJ, simply check out the repository and use the import functionality in IntelliJ to import Spark as a Maven project.

Eclipse

Eclipse can be used to develop and test Spark. The following configuration is known to work:

Scala IDE can be installed using Help | Eclipse Marketplace... and searching for Scala IDE. Remember to include ScalaTest as a Scala IDE plugin. To install ScalaTest after installing Scala IDE, follow these steps:

  • Select Help | Install New Software
  • Select http://download.scala-ide.org... in the "Work with" combo box
  • Expand Scala IDE plugins, select ScalaTest for Scala IDE and install

SBT can create Eclipse .project and .classpath files. To create these files for each Spark sub-project, use this command:

sbt/sbt eclipse

To import a specific project, e.g. spark-core, select File | Import | Existing Projects into Workspace. Do not select "Copy projects into workspace". Importing all Spark sub-projects at once is not recommended.

ScalaTest can execute unit tests by right-clicking a source file and selecting Run As | Scala Test.

If Java memory errors occur, it might be necessary to increase the settings in eclipse.ini in the Eclipse install directory. Increase the following setting as needed:

--launcher.XXMaxPermSize
256M

ScalaTest Issues

If the following error occurs when running ScalaTest:

An internal error occurred during: "Launching XYZSuite.scala".
java.lang.NullPointerException

It is due to an incorrect Scala library in the classpath. To fix it, right-click on the project and select Build Path | Configure Build Path, then:

  • Add Library | Scala Library
  • Remove scala-library-2.10.4.jar - lib_managed\jars

In the event of "Could not find resource path for Web UI: org/apache/spark/ui/static", it is due to a classpath issue (some classes were probably not compiled). To fix this, it is sufficient to run a test from the command line:

sbt/sbt "test-only org.apache.spark.rdd.SortingSuite"