Installing Python interpreters
Apache Beam supports multiple Python versions. You might be able to iterate on the Beam code using one Python version provided by your OS, assuming this version is also supported by Beam. However, you will need interpreters for all supported versions to run test suites locally using Gradle and to work on Beam releases. Therefore, we recommend either installing a Python interpreter for each supported version or launching a Docker-based development environment, which should have these interpreters preinstalled, using start-build-env.sh.
There are several ways to install multiple Python versions.
- You can download, build, and install CPython from source.
- If you are an Ubuntu user, you could add a third-party repository 'Deadsnakes' and install the missing versions via apt. If you install from Deadsnakes, make sure to also install
- You can use pyenv to download and install Python versions (recommended).
Installation steps may look as follows:
- Follow the steps below in How to setup pyenv.
Install a Python interpreter for each supported Python minor version. For example:
For the major.minor.patch versions currently used by the Jenkins cluster, see Current Installations.
Make installed interpreters available in your shell by running
(OPTIONAL) pyenv sometimes fails to make these interpreters directly available without a local configuration. If you see errors when trying to use python3.x, also run
After these steps, all python3.x interpreters should be available in your shell. The first version in the list passed to pyenv global will be used as the default python / python3 interpreter if the minor version is not specified.
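As a quick way to confirm that the interpreters are visible on your PATH, a minimal sketch along the following lines could be used. The function name and the list of minor versions are hypothetical placeholders, not Beam's actual supported set; adjust them to the versions you installed.

```python
import shutil


def find_interpreters(minor_versions):
    """Map each python3.<minor> name to its location on PATH, or None if missing."""
    found = {}
    for minor in minor_versions:
        name = f"python3.{minor}"
        # shutil.which returns the full path of the executable, or None.
        found[name] = shutil.which(name)
    return found


if __name__ == "__main__":
    # Hypothetical list of minor versions; replace with the ones you installed.
    for name, path in find_interpreters([7, 8, 9]).items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

Any interpreter reported as NOT FOUND is not reachable from the current shell, which usually means the shell configuration step has not taken effect yet.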
Please contribute to these instructions if they didn't work for you.
Developing the Python SDK
- Beam Gradle tooling can build and test the Python SDK, and Jenkins jobs use it, so it needs to be maintained.
- You can use the Python toolchain directly instead of having Gradle orchestrate it, which may be faster for you; the choice is yours. If you want to use Python tools directly, we recommend setting up a virtual environment before testing your code.
- The commands below assume that you're in the
Virtual Environment Setup
Setting up a virtual environment is required for running tests directly, such as via pytest or an IDE like PyCharm. To create an environment and install the Python SDK from source with test and GCP dependencies, follow the instructions below:
Use the following code:
For certain systems, particularly Macs with M1 chips, this installation method may not generate URNs correctly. If running python gen_protos.py doesn't resolve the issue, consult https://github.com/apache/beam/issues/22742#issuecomment-1218216468 for further guidance.
Use the following code:
You can deactivate the
Virtual Environments with
- A more advanced option, pyenv allows you to download, build, and install locally any version of Python, regardless of which versions your distribution supports.
- pyenv also has a virtualenv plugin, which manages the creation and activation of
- The caveat is that you'll have to take care of any build dependencies, and those are probably still constrained by your distribution.
- These instructions were made on a Linux/Debian system.
How to setup pyenv (with pyenv-virtualenv plugin)
Install prerequisites for your distribution.
- Add the required lines to ~/.bashrc (as returned by the script).
- Note (12/10/2021): You may have to manually modify .bashrc as described here: https://github.com/pyenv/pyenv-installer/issues/112#issuecomment-971964711. Remove this note if no longer applicable.
- Open a new shell. If the pyenv command is still not available in PATH, you may need to restart the login session.
Example on Ubuntu:
Example: How to Run Unit Tests with PyCharm Using Python 3.8.9 in a
- Install Python 3.8.9 and create a
Upgrade packages (recommended)
- Set up PyCharm
- Start by adding a new project interpreter (from the bottom right or in Settings).
- Select Existing environment and the interpreter, which should be under
- Switch interpreters at the bottom right.
Cleaning up environments
To delete all environments created with pyenv, run:
If you have issues, see the troubleshooting tips at pyenv common build problems.
Error: No module named distutils. (23/07/2021)
As of 23/07/2021, users of some versions of Debian are experiencing the error "ModuleNotFoundError: No module named 'distutils.util'" when using the Python Beam SDK. This typically happens because the desired version of the Python interpreter is no longer available in the distribution. It can be fixed by installing Python via pyenv rather than relying on the packages included with the Debian distribution. See Installing Python interpreters above.
The error may also manifest in environments created with the virtualenv tool, even with Python interpreters installed via pyenv. The workaround can be to downgrade to virtualenv==16.1 or to use pyenv or venv to create virtual environments. You will also likely need to clean the previously created environment:
rm -rf /path/to/beam/build/gradlenv
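To check whether a given interpreter or virtual environment is affected, you can ask it whether it can locate distutils.util. The following is a minimal diagnostic sketch; the helper name module_available is hypothetical and not part of Beam.

```python
import importlib.util


def module_available(name):
    """Return True if the current interpreter can locate the module `name`."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (for example, distutils itself) is missing.
        return False


if __name__ == "__main__":
    print("distutils.util available:", module_available("distutils.util"))
```

Run it with the same interpreter (or inside the same virtual environment) that produced the error, since the result depends on which Python executes the script.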
If you update any of the cythonized files in the Python SDK, you must first install the cython package and then run the following command before testing your changes:
Running Tests Using
If you've set up a virtualenv above, you can now run tests directly using
Running Integration Tests Using
To run an integration test you may need to specify additional parameters for the runner.
Unless you are using the Direct runner, you must build the Beam SDK tarball. To do so, run the following commands from the root of the git repository.
We will use the tarball built by this command in the
It is helpful to emit the test logs to the console immediately when they occur. You can do so with the -o log_cli=True option. You could additionally customize the logging level with the
Timeouts in Integration Tests
While integration tests running on Jenkins have timeouts that are set with an adequate buffer (4500 secs), tests that are invoked locally via python -m pytest ... may encounter timeout failures. This is because the timeout property defined in our pytest.ini file is set to 600 secs, which may not be enough time for a particular integration test. To get around this, either change the value of timeout to a higher number, or add a pytest timeout decorator above the function(s) inside your
For more information about timeouts in pytest, go to this site.
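A per-test override could look like the sketch below. It assumes the timeout setting is provided by the pytest-timeout plugin, whose marker takes a value in seconds; the test name and body are hypothetical placeholders, and 4500 simply mirrors the Jenkins buffer mentioned above.

```python
import pytest


# Assumes the pytest-timeout plugin is installed; the value is in seconds.
# 4500 mirrors the Jenkins buffer and is only an illustration.
@pytest.mark.timeout(4500)
def test_hypothetical_integration_pipeline():
    # Placeholder body; a real integration test would run a pipeline here.
    assert True
```

The marker affects only the decorated test, so the default in pytest.ini keeps applying to everything else.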
Running Unit Tests Using tox
Here are some tips for running tests using tox:
- Tox does not require a virtualenv with Beam + dependencies installed. It creates its own.
- It also runs tests faster, utilizing multiple processes (via
- For a list of environments, run
- tox also supports passing arguments after double dashes to
Execute the following code for running tests using tox:
Running Tests Using gradle
Integration test suites on Jenkins are configured in groovy files that launch certain gradle tasks (example). You could launch test suites locally by executing the gradle targets directly (for example: ./gradlew :sdks:python:test-suites:dataflow:py37:postCommitPy37). This option may only be available to committers, as by default the test suites are configured to use
Lint and Formatting Checks
To run all consistency checks, run the following commands:
To auto-format the code in place, run:
Running lint and yapf Automatically on Each Commit with pre-commit Hooks
To enable pre-commit, run:
When the pre-commit hook for yapf applies formatting changes in place, the check fails with the error "files were modified by this hook", and you have to re-run `
To disable the pre-commit, run:
Running yapf formatter manually
To run manually:
For consistency, use the current version of
To format changed files in your branch:
To format just a single directory:
To format files with uncommitted changes, run:
- If you need to exclude one particular file or pattern from formatting, add it to the
Run hello world against modified SDK Harness
To run a hello world against a modified SDK Harness, execute the following code:
Run hello world against modified Dataflow Fn API Runner Harness and SDK Harness
To run a hello world against a modified Dataflow Fn API Runner Harness and SDK Harness, execute the following code:
Run Integration Test from IDE (TODO: please revise these instructions now that we migrated to PyTest)
To run an integration test from an IDE in a debug mode, you can create a Nosetests configuration. For example, to run a VR test on Dataflow runner from IntelliJ/PyCharm, you can adjust the configuration as follows:
- Set Target to Module and point to the
Set Additional arguments (sample, adjust as needed):
- Set Working directory to
screen diff integration Test for Interactive Beam
For Interactive Beam/Notebooks, we need to verify that the visual presentation of executing a notebook is stable. A screen diff integration test that executes a test notebook and compares the results with a golden screenshot does the trick. To run a screen diff integration test for Interactive Beam:
Execute the following code for preparation work:
To run the tests:
Golden screenshots are temporarily taken and stored by the system platform. The currently supported platforms are Darwin (macOS) and Linux.
Each test will generate a stable unique hexadecimal id. The golden screenshots are named after that id.
To add new tests, put a new test
.ipynb) under the
Add a single test under the apache_beam/runners/interactive/testing/integration/tests directory. A test is as simple as:
How to Install an Unreleased Python SDK without Building It
The SDK source zip archive and wheels are continuously built after commits are merged to https://github.com/apache/beam.
- Click on a recent `Build python source distribution and wheels job` that ran successfully on the github.com/apache/beam master branch from this list.
- Click on List files on Google Cloud Storage Bucket on the right-side panel.
- Expand List file on Google Cloud Storage Bucket in the main panel.
- Locate and Download the ZIP file. For example,
- It’s simplest to download the file using your browser by replacing the prefix “gs://” with “https://storage.googleapis.com/” . For example, https://storage.googleapis.com/beam-wheels-staging/master/02bf081d0e86f16395af415cebee2812620aff4b-207975627/apache-beam-2.25.0.dev0.zip
- Or follow these instructions to download using the
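The prefix replacement described above can be sketched as a small helper. The function name gcs_to_https is hypothetical; the example URL is the one quoted in the list.

```python
GCS_PREFIX = "gs://"
HTTPS_PREFIX = "https://storage.googleapis.com/"


def gcs_to_https(url):
    """Rewrite a gs:// URL into its storage.googleapis.com equivalent."""
    if not url.startswith(GCS_PREFIX):
        raise ValueError(f"not a gs:// URL: {url}")
    # Keep everything after the gs:// prefix (bucket name and object path).
    return HTTPS_PREFIX + url[len(GCS_PREFIX):]


if __name__ == "__main__":
    print(gcs_to_https(
        "gs://beam-wheels-staging/master/"
        "02bf081d0e86f16395af415cebee2812620aff4b-207975627/"
        "apache-beam-2.25.0.dev0.zip"))
```

The resulting https URL can be opened in a browser or passed to any HTTP download tool.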
Install the downloaded zip file. For example:
--sdk_location flag pointed at the same ZIP file.
How to update dependencies that are installed in Python container images
When we build Python container images for the Apache Beam SDK, we install the PyPI packages of Apache Beam and some additional PyPI dependencies that will likely benefit users. The complete list of dependencies is specified in base_image_requirements.txt files, one for each Python minor version. These files are generated from the Beam SDK requirements, specified in setup.py, and a short list of additional dependencies specified in base_image_requirements_manual.txt.
We expect all Beam dependencies (including transitive dependencies, and deps for some of the 'extra's, like [gcp]) to be specified with exact versions in the requirements files. Therefore, you may need to regenerate the requirements files when you modify the Python SDK's dependencies in setup.py.
Regenerate the requirements files by running ./gradlew :sdks:python:container:generatePythonRequirementsAll and committing the changes. Execution can take up to 5 min per Python version and is somewhat resource-demanding. You can also regenerate the dependencies individually per version with targets like
To run the command successfully, you will need Python interpreters for all versions supported by Beam. See: Installing Python Interpreters.
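The expectation that every entry in a requirements file carries an exact version can be spot-checked with a short script. This is a simplified sketch, not part of the Beam tooling: the helper name is hypothetical, and it only looks for an "==" pin on each line without handling environment markers or URL-based requirements.

```python
def unpinned_requirements(lines):
    """Return requirement lines that lack an exact '==' version pin."""
    unpinned = []
    for raw in lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options such as -r
        if "==" not in line:
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    sample = ["numpy==1.21.0", "# a comment", "", "requests>=2.0", "six"]
    print(unpinned_requirements(sample))
```

To check a real file, pass its lines, for example the result of open(path).read().splitlines() for one of the generated requirements files.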
NOTE for RELEASE MANAGERS: The updated Python dependency files must be merged into Beam's master branch before cutting the release branch.
You may also see the pip command cause a segmentation fault. If this happens, remove the Python version from pyenv and reinstall it like this.
There have been issues with older Python versions. See here for details.