Developing with the Python SDK
- Gradle can build and test Python, and the Jenkins jobs use it, so it needs to be maintained.
- You can also use the Python toolchain directly instead of having Gradle orchestrate it. This may be faster, but the choice is yours. If you use the Python tools directly, we recommend setting up a virtual environment before testing your code.
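When you run the Python tools directly, it helps to confirm that the interpreter is actually running inside a virtual environment before you install or test anything. The following is a minimal sketch that uses only the standard library; the function name `in_virtual_environment` is a hypothetical helper and is not part of Beam.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if the current interpreter runs inside a virtual environment."""
    # venv and recent virtualenv releases point sys.prefix at the environment,
    # while sys.base_prefix keeps pointing at the base Python installation.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    # Older virtualenv releases exposed sys.real_prefix instead.
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


if __name__ == "__main__":
    if in_virtual_environment():
        print(f"Using virtual environment at {sys.prefix}")
    else:
        print("No virtual environment is active")
```

Run it with the same interpreter that you plan to use for the tests, since the check only describes the interpreter that executes it.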
If you update any of the cythonized files in the Python SDK, you must install the
cython package before running the following command to test your code:
- The commands below assume that you're in the
Virtual Environment Setup
Setting up a virtualenv is required for running tests directly, such as via pytest or an IDE like PyCharm. The setup installs the Python SDK from source and includes the test and GCP dependencies.
Use the following code:
You can deactivate the
Virtual Environments with
- A more advanced option, pyenv allows you to download, build, and install locally any version of Python, regardless of which versions your distribution supports.
pyenv also has a virtualenv plugin, which manages the creation and activation of
- The caveat is that you'll have to take care of any build dependencies, and those are probably still constrained by your distribution.
- These instructions were made on a Linux/Debian system.
Error: No module named distutils. (23/07/2021)
As of 23/07/2021, users of some versions of Debian encounter the error "ModuleNotFoundError: No module named 'distutils.util'" when using the Python Beam SDK. To fix this, use pyenv for your Python installation rather than relying on the packages included with the Debian distribution.
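To find out whether a given interpreter is affected, you can check whether `distutils.util` can be located at all. The following is a minimal sketch using the standard library; `module_available` is a hypothetical helper name and is not part of Beam or pyenv.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the module `name` can be located by the import system."""
    try:
        # find_spec imports parent packages, but not the submodule itself.
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (here: distutils) is itself missing.
        return False


if __name__ == "__main__":
    if module_available("distutils.util"):
        print("distutils.util is available")
    else:
        print("distutils.util is missing; consider a pyenv-managed Python")
```

Note that distutils was removed from the standard library in Python 3.12, so on those versions the check can report the module as missing regardless of the distribution.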
How to set up pyenv (with pyenv-virtualenv plugin)
Install prerequisites for your distribution.
- Add the required lines to ~/.bashrc (as returned by the script).
- Open a new shell.
Example on Ubuntu:
Example: How to Run Unit Tests with PyCharm Using Python 3.7.9 in a
- Install Python 3.7.9 and create a
Upgrade packages (recommended)
- Set up PyCharm
- Start by adding a new project interpreter (from the bottom right or in Settings).
- Select Existing environment and the interpreter, which should be under
- Switch interpreters at the bottom right.
Cleaning up environments
To delete all environments created with pyenv, run:
If you run into issues, see pyenv common build problems for troubleshooting.
Running Tests Using
If you've set up a virtualenv above, you can now run tests directly using
Running Tests Using tox
Here are some tips for running tests using tox:
- tox does not require a virtualenv with Beam and its dependencies installed. It creates its own.
- It also runs tests faster, utilizing multiple processes (via
- For a list of environments, run
- tox also supports passing arguments after double dashes to
Execute the following code for running tests using tox:
Lint and Formatting Checks
To run all consistency checks, run the following commands:
To auto-format the code in place, run:
Running lint and yapf Automatically on Each Commit with pre-commit Hooks
To enable pre-commit, run:
When the pre-commit hook for yapf applies formatting changes in place, the check fails with the error "files were modified by this hook", and you have to re-run `
To disable pre-commit, run:
Running yapf formatter manually
To run manually:
For consistency, use the current version of
To format changed files in your branch:
To format just a single directory:
To format files with uncommitted changes, run:
- If you need to exclude one particular file or pattern from formatting, add it to the
This step is required only for testing SDK code changes remotely (not using directrunner). To do this, you must build the Beam tarball.
From the root of the git repository, run:
Pass the --sdk_location flag to use the newly built version. For example:
Run hello world against modified SDK Harness
To run a hello world against a modified SDK Harness, execute the following code:
Run hello world against modified Dataflow Fn API Runner Harness and SDK Harness
To run a hello world against a modified Dataflow Fn API Runner Harness and SDK Harness, execute the following code:
Run Integration Test
To run an integration test, execute the following code:
Use the --sdk_location flag if the tarball is needed and was built from python setup.py sdist; otherwise, the tarball under the default location (the target directory of the Gradle build) will be used.
Run a ValidatesRunner test
To run a ValidatesRunner test, execute the following code:
The command above runs tests that have the ValidatesRunner attribute from apache_beam/transforms/util_test.py in streaming mode. You can manually edit the attribute in the test file (for example, to ValidatesRunner1) to limit which tests you would like to run.
Run Integration Test from IDE
To run an integration test from an IDE in debug mode, you can create a Nosetests configuration. For example, to run a ValidatesRunner (VR) test on the Dataflow runner from IntelliJ/PyCharm, adjust the configuration as follows:
- Set Target to Module and point to the
Set Additional arguments (sample, adjust as needed):
- Set Working directory to
screen diff integration Test for Interactive Beam
For Interactive Beam/Notebooks, we need to verify that the visual presentation of executing a notebook is stable. A screen diff integration test does this by executing a test notebook and comparing the results with a golden screenshot. To run a screen diff integration test for Interactive Beam:
Execute the following code for preparation work:
To run the tests:
Golden screenshots are temporarily taken and stored by the system platform. The currently supported platforms are Darwin (macOS) and Linux.
Each test will generate a stable unique hexadecimal id. The golden screenshots are named after that id.
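To illustrate what "stable" means here, the sketch below derives an id that stays the same across runs and machines by hashing a test name. This is only an assumption for illustration: the hashing scheme, the function names `stable_hex_id` and `golden_screenshot_name`, the sample notebook name, and the `.png` extension are all hypothetical and are not taken from Beam's implementation.

```python
import hashlib


def stable_hex_id(test_name: str, length: int = 16) -> str:
    """Derive a deterministic hexadecimal id from a test name (hypothetical scheme)."""
    # A cryptographic hash of the name is identical on every run and platform,
    # unlike the built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(test_name.encode("utf-8")).hexdigest()
    return digest[:length]


def golden_screenshot_name(test_name: str) -> str:
    """Name a golden screenshot after the test's id (the extension is an assumption)."""
    return f"{stable_hex_id(test_name)}.png"


if __name__ == "__main__":
    # "example_notebook.ipynb" is a made-up test name.
    print(golden_screenshot_name("example_notebook.ipynb"))
```

Because the id depends only on the input string, renaming a test would change its id, so the matching golden screenshot would need to be regenerated under this scheme.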
To add new tests, put a new test
.ipynb) under the
Add a single test under the apache_beam/runners/interactive/testing/integration/tests directory. A test is as simple as:
How to Install an Unreleased Python SDK without Building It
The SDK source zip archive and wheels are built continuously after commits are merged to https://github.com/apache/beam
- Click on a recent `Build python source distribution and wheels` job that ran successfully on the github.com/apache/beam master branch from this list.
- Click on List files on Google Cloud Storage Bucket on the right-side panel.
- Expand List file on Google Cloud Storage Bucket in the main panel.
- Locate and Download the ZIP file. For example,
- It’s simplest to download the file using your browser by replacing the prefix “gs://” with “https://storage.googleapis.com/”. For example, https://storage.googleapis.com/beam-wheels-staging/master/02bf081d0e86f16395af415cebee2812620aff4b-207975627/apache-beam-2.25.0.dev0.zip
- Or follow these instructions to download using the
Install the downloaded zip file. For example:
When you run your pipeline, pass the --sdk_location flag pointed at the same ZIP file.
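The prefix replacement described in the download step, turning a "gs://" location into a browser-friendly "https://storage.googleapis.com/" URL, can be sketched as follows. The helper name `gcs_to_https` is hypothetical, and the sketch assumes the object path contains no characters that need URL encoding.

```python
GCS_PREFIX = "gs://"
HTTPS_PREFIX = "https://storage.googleapis.com/"


def gcs_to_https(gcs_uri: str) -> str:
    """Rewrite a gs:// URI into its https://storage.googleapis.com/ equivalent."""
    if not gcs_uri.startswith(GCS_PREFIX):
        raise ValueError(f"Expected a URI starting with {GCS_PREFIX!r}: {gcs_uri!r}")
    # Keep the bucket name and object path; only the scheme prefix changes.
    return HTTPS_PREFIX + gcs_uri[len(GCS_PREFIX):]


if __name__ == "__main__":
    uri = (
        "gs://beam-wheels-staging/master/"
        "02bf081d0e86f16395af415cebee2812620aff4b-207975627/"
        "apache-beam-2.25.0.dev0.zip"
    )
    print(gcs_to_https(uri))
```

The sample URI corresponds to the example URL given in the download step; substitute the location of the build you selected.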