
Reducing Build Times

Spark's default build strategy is to assemble a jar including all of its dependencies. This can be cumbersome when doing iterative development. When developing locally, it is possible to create an assembly jar including all of Spark's dependencies and then re-package only Spark itself when making changes.

Fast Local Builds
$ build/sbt clean assembly # Create a normal assembly
$ ./bin/spark-shell # Use Spark with the normal assembly
$ export SPARK_PREPEND_CLASSES=true
$ ./bin/spark-shell # Now it's using compiled classes
# ... do some local development ... #
$ build/sbt compile
# ... do some local development ... #
$ build/sbt compile
$ unset SPARK_PREPEND_CLASSES
$ ./bin/spark-shell # Back to normal, using Spark classes from the assembly jar
 
# You can also use ~ to let sbt do incremental builds on file changes without running a new sbt session every time
$ build/sbt ~compile
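
If you are only touching a single module, you can also scope the incremental compile to one sbt project. A minimal sketch, assuming the code you are editing lives in the sbt project named core:

# Continuously compile only the core project on file changes
$ build/sbt ~core/compile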

Note: in some earlier versions of Spark, fast local builds used an sbt task called assemble-deps. SPARK-1843 removed assemble-deps and introduced the environment variable described above. For those older versions:

Fast Local Builds
$ build/sbt clean assemble-deps
$ build/sbt package
# ... do some local development ... #
$ build/sbt package
# ... do some local development ... #
$ build/sbt package
# ...
 
# You can also use ~ to let sbt do incremental builds on file changes without running a new sbt session every time
$ build/sbt ~package

Checking Out Pull Requests

Git provides a mechanism for fetching remote pull requests into your own local repository. This is useful when reviewing code or testing patches locally. If you haven't yet cloned the Spark Git repository, use the following command:

$ git clone https://github.com/apache/spark.git
$ cd spark

To enable this feature you'll need to configure the git remote repository to fetch pull request data. Do this by modifying the .git/config file inside of your Spark directory. Note that the remote may not be named "origin" if you've given it a different name:

.git/config
[remote "origin"]
  url = git@github.com:apache/spark.git
  fetch = +refs/heads/*:refs/remotes/origin/*
  fetch = +refs/pull/*/head:refs/remotes/origin/pr/*   # Add this line
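
If you prefer not to edit .git/config by hand, the same fetch line can be added from the command line. A sketch, assuming your remote is named origin:

# Add a fetch refspec for pull requests to the "origin" remote
$ git config --add remote.origin.fetch '+refs/pull/*/head:refs/remotes/origin/pr/*'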

Once you've done this, you can fetch remote pull requests:

# Fetch remote pull requests
$ git fetch origin
# Checkout a remote pull request
$ git checkout origin/pr/112
# Create a local branch from a remote pull request
$ git checkout origin/pr/112 -b new-branch

Running Individual Tests

Often it is useful to run individual tests in Maven or SBT.

# sbt
$ build/sbt "test-only org.apache.spark.io.CompressionCodecSuite"
$ build/sbt "test-only org.apache.spark.io.*"

# Maven, run Scala test
$ build/mvn test -DwildcardSuites=org.apache.spark.io.CompressionCodecSuite -Dtest=none
$ build/mvn test -DwildcardSuites=org.apache.spark.io.* -Dtest=none
 
# The above will work, but will take time to iterate through each project.  If you want
# to only run tests in one subproject, first run "install", then use "-pl <project>"
# with the tests
$ build/mvn <options> install
$ build/mvn <other options> -pl org.apache.spark:spark-hive_2.11 test -DwildcardSuites=org.apache.spark.sql.hive.execution.HiveTableScanSuite -Dtest=none
 
# Maven, run Java test
$ build/mvn test -DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite
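
To run just a single test case within a suite with sbt, arguments after -- are passed through to ScalaTest; its -z flag selects tests whose names contain a given substring. A sketch (the suite and the "snappy" substring are only examples):

# sbt: run only test cases whose names contain "snappy"
$ build/sbt "test-only org.apache.spark.io.CompressionCodecSuite -- -z snappy"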

Generating Dependency Graphs

$ # sbt
$ build/sbt dependency-tree

$ # Maven
$ build/mvn -DskipTests install
$ build/mvn dependency:tree
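
If the tree is large, it can help to write it to a file or to filter it down to a single dependency. A sketch using standard maven-dependency-plugin options (the guava filter is only an example):

$ # Write the full tree to a file
$ build/mvn dependency:tree -DoutputFile=deps.txt
$ # Show only paths that include a particular artifact
$ build/mvn dependency:tree -Dincludes=com.google.guava:guava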

Running Build Targets For Individual Projects

$ # sbt
$ build/sbt assembly/assembly
$ # Maven
$ build/mvn package -DskipTests -pl assembly
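
If the project you are building depends on other modules you have changed, Maven's -am ("also make") flag builds those upstream modules too. A sketch using the core module as an example:

$ # Build core plus the modules it depends on
$ build/mvn package -DskipTests -pl core -am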

ScalaTest Issues

If the following error occurs when running ScalaTest:

An internal error occurred during: "Launching XYZSuite.scala".
java.lang.NullPointerException

This is due to an incorrect Scala library on the classpath. To fix it, right-click on the project and select Build Path | Configure Build Path:

  • Add Library | Scala Library
  • Remove scala-library-2.10.4.jar from lib_managed/jars

If you see "Could not find resource path for Web UI: org/apache/spark/ui/static", this is a classpath issue (some classes were probably not compiled). To fix it, it is sufficient to run a test from the command line:

build/sbt "test-only org.apache.spark.rdd.SortingSuite"

Python Tests

Running the Python tests locally requires a few dependencies:

The unit tests will try to run with Python 2.6 (the oldest supported version) if it is available. Python 2.6 needs unittest2 to run the tests, which can be installed with pip2.6.

NumPy 1.4+ is needed to run the MLlib Python tests; it should also be installed for Python 2.6.
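
A sketch of installing these dependencies with pip (adjust the pip executable to match your Python version):

# Install the Python test dependencies for Python 2.6
$ pip2.6 install unittest2
$ pip2.6 install numpy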

After that, you can run all of the Python unit tests with:

python/run-tests

R Tests

To run the SparkR tests you will need to install the R package 'testthat' (run `install.packages("testthat")` from an R shell). You can run just the SparkR tests using the command:

R/run-tests.sh
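
If you prefer installing testthat non-interactively from the shell, something like the following should work (a sketch; it assumes Rscript is on your PATH and uses a default CRAN mirror):

# Install testthat, then run the SparkR tests
$ Rscript -e "install.packages('testthat', repos='http://cran.r-project.org')"
$ R/run-tests.sh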

Running Different Test Permutations on Jenkins

When running tests for a pull request on Jenkins, you can add special phrases to the title of your pull request to change testing behavior. This includes:

  • "[test-maven]" - signals to test the pull request using maven
  • "[test-hadoop1.0]" - signals to test using Spark's Hadoop 1.0 profile (other options include Hadoop 2.0, 2.2, and 2.3)

Running Docker Integration Tests

In order to run the Docker integration tests, you have to install the Docker engine on your machine. Installation instructions can be found at https://docs.docker.com/engine/installation/. Once installed, the Docker service needs to be started, if not already running. On Linux, this can be done by running sudo service docker start.
These integration tests run as part of a regular Spark unit test run, so the Docker engine must be installed and running if you want all Spark tests to pass.
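
A quick sanity check that the daemon is up before kicking off the tests (a sketch for a Linux box, using the service command mentioned above):

# Start the Docker daemon if needed and verify it responds
$ sudo service docker start
$ docker info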

Organizing Imports

You can use the IntelliJ Imports Organizer from Aaron Davidson to help organize the imports in your code. It can be configured to match the import ordering from the style guide.

 

IDE Setup

IntelliJ

While many of the Spark developers use SBT or Maven on the command line, the most common IDE we use is IntelliJ IDEA. You can get the community edition for free (Apache committers can get free IntelliJ Ultimate Edition licenses) and install the JetBrains Scala plugin from Preferences > Plugins.

To create a Spark project for IntelliJ:

  1. Download IntelliJ and install the Scala plug-in for IntelliJ.
  2. Go to "File -> Import Project", locate the spark source directory, and select "Maven Project".
  3. In the Import wizard, it's fine to leave settings at their default. However, it is usually useful to enable "Import Maven projects automatically", so that changes to the project structure automatically update the IntelliJ project.
  4. As documented in Building Spark, some build configurations require specific profiles to be enabled. The same profiles that are enabled with -P[profile name] on the command line may be enabled on the Profiles screen in the Import wizard (the equivalent command-line build is sketched after this list). For example, if developing for Hadoop 2.4 with YARN support, enable the yarn and hadoop-2.4 profiles. These selections can be changed later by accessing the "Maven Projects" tool window from the View menu and expanding the Profiles section.
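
For reference, the equivalent command-line Maven build with those example profiles enabled looks roughly like this (a sketch; use whichever profiles match your configuration):

$ build/mvn -Pyarn -Phadoop-2.4 -DskipTests package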

Other tips:

  • "Rebuild Project" can fail the first time the project is compiled, because generated source files are not created automatically. Try clicking the "Generate Sources and Update Folders For All Projects" button in the "Maven Projects" tool window to manually generate these sources.
  • Some of the modules have pluggable source directories based on Maven profiles (e.g. to support both Scala 2.11 and 2.10 or to allow cross-building against different versions of Hive). In some cases IntelliJ does not correctly detect use of the maven-build-plugin to add source directories. In these cases, you may need to add source locations explicitly to compile the entire project. If so, open the "Project Settings" and select "Modules". Based on your selected Maven profiles, you may need to add source folders to the following modules:

      • spark-hive: add v0.13.1/src/main/scala
      • spark-streaming-flume-sink: add target/scala-2.10/src_managed/main/compiled_avro
  • Compilation may fail with an error like "scalac: bad option: -P:/home/jakub/.m2/repository/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar". If so, go to Preferences > Build, Execution, Deployment > Scala Compiler and clear the "Additional compiler options" field. Compilation will then work, although the option will reappear when the project is reimported. If you try to build any of the projects that use quasiquotes (e.g., sql), you will need to make that jar a compiler plugin (just below "Additional compiler options"). Otherwise you will see errors like:

 

/Users/irashid/github/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
Error:(147, 9) value q is not a member of StringContext
 Note: implicit class Evaluate2 is not applicable here because it comes after the application point and it lacks an explicit result type
        q"""
        ^ 

Eclipse

Eclipse can be used to develop and test Spark. The following configuration is known to work:

The easiest way is to download the Scala IDE bundle from the Scala IDE download page. It comes pre-installed with ScalaTest. Alternatively, use the Scala IDE update site or Eclipse Marketplace.

SBT can create Eclipse .project and .classpath files. To create these files for each Spark sub project, use this command:

sbt/sbt eclipse

To import a specific project, e.g. spark-core, select File | Import | Existing Projects into Workspace. Do not select "Copy projects into workspace".

If you want to develop on Scala 2.10 you need to configure a Scala installation for the exact Scala version that's used to compile Spark. At the time of this writing that is Scala 2.10.4. Since Scala IDE bundles the latest versions (2.10.5 and 2.11.6 at this point), you need to add one in Eclipse Preferences -> Scala -> Installations by pointing to the lib/ directory of your Scala 2.10.4 distribution. Once this is done, select all Spark projects, right-click, choose Scala -> Set Scala Installation and point to the 2.10.4 installation. This should clear all errors about invalid cross-compiled libraries. A clean build should then succeed.

ScalaTest can execute unit tests by right-clicking a source file and selecting Run As | Scala Test.

If Java memory errors occur, it might be necessary to increase the settings in eclipse.ini in the Eclipse install directory. Increase the following setting as needed:

--launcher.XXMaxPermSize
256M

Nightly Builds

Packages are built regularly off of Spark's master branch and release branches. These provide Spark developers access to the bleeding-edge of Spark master or the most recent fixes not yet incorporated into a maintenance release. These should only be used by Spark developers, as they may have bugs and have not undergone the same level of testing as releases. Spark nightly packages are available at:

Spark also publishes SNAPSHOT releases of its Maven artifacts for both master and maintenance branches on a nightly basis. To link against a SNAPSHOT you need to add the ASF snapshot repository, http://repository.apache.org/snapshots/, to your build. Note that SNAPSHOT artifacts are ephemeral and may change or be removed.

  • groupId: org.apache.spark
  • artifactId: spark-core_2.10
  • version: 1.5.0-SNAPSHOT
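
As a quick check that a nightly SNAPSHOT actually resolves from the snapshot repository, you can ask Maven to fetch the artifact directly. A sketch (the exact version string will differ over time):

# Resolve a SNAPSHOT artifact from the ASF snapshot repository
$ build/mvn dependency:get \
    -DremoteRepositories=http://repository.apache.org/snapshots/ \
    -Dartifact=org.apache.spark:spark-core_2.10:1.5.0-SNAPSHOT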

 
