This Confluence has been LDAP enabled, if you are an ASF Committer, please use your LDAP Credentials to login. Any problems file an INFRA jira ticket please.

Page tree
Skip to end of metadata
Go to start of metadata

Developing with the Python SDK

Gradle can build and test python, and is used by the Jenkins jobs, so needs to be maintained.

You can directly use the Python toolchain instead of having Gradle orchestrate it, which may be faster for you, but it is your preference. If you do want to use Python tools directly, we recommend setting up a virtual environment before testing your code.

If you update any of the cythonized files in Python SDK, you must install the cython package before running following command to properly test your code.

The following commands should be run in the sdks/python directory. This installs Python from source and includes the test and gcp dependencies.

On macOS/Linux:

# Initialize virtual environment called "env" in local directory.
virtualenv env

# Activate virtual environment.
. ./env/bin/activate (env)

# Install packages.
pip install -e .[gcp,test]

On Windows:

> c:\Python27\python.exe -m virtualenv > env\Scripts\activate (env) > pip install -e .[gcp,test]

This command runs all Python tests. The nose dependency is installed by [test] in pip install.

(env) $ python nosetests

You can use following command to run a single test method.

(env) $ python nosetests --tests <module>:<test class>.<test method>

For example:

(env) $ python nosetests --tests

You can deactivate the virtualenv when done.

(env) $ deactivate

To check just for Python lint errors, run the following command.

$ ../../gradlew lint

Or use tox commands to run the lint tasks:

$ tox -e py27-lint # For python 2.7
$ tox -e py3-lint # For python 3
$ tox -e py27-lint3 # For python 2-3 compatibility

Remote testing

This step is only required for testing SDK code changes remotely (not using directrunner). In order to do this you must build the Beam tarball. From the root of the git repository, run:

$ cd sdks/python/
$ python sdist

Pass the --sdk_location flag to use the newly built version.

For example:

$ python sdist > /dev/null && \
    python -m apache_beam.examples.wordcount ... \
    --sdk_location dist/apache-beam-2.5.0.dev0.tar.gz

Run hello world against modified SDK Harness

# Build the Flink job server (default job server for PortableRunner) that stores the container locally.
./gradlew :beam-runners-flink_2.11-job-server-container:docker

# Build portable SDK Harness which builds and stores the container locally.
./gradlew :beam-sdks-python-container:docker

# Run the pipeline.
python -m apache_beam.examples.wordcount --runner PortableRunner --input <local input file> --output <local output file>

Run hello world against modified Dataflow Fn API Runner Harness and SDK Harness

# Build portable worker
./gradlew :beam-runners-google-cloud-dataflow-java-fn-api-worker:build -x spotlessJava -x rat -x test
./gradlew :beam-runners-google-cloud-dataflow-java-fn-api-worker:shadowJar

# Build portable Pyhon SDK harness and publish it to GCP
./gradlew$USER/beam -p sdks/python/container docker
gcloud docker -- push$USER/beam/python:latest

# Initialize python
cd sdks/python
virtualenv env
. ./env/bin/activate

# run pipeline
python -m apache_beam.examples.wordcount   --runner DataflowRunner   --num_workers 1   --project <gcp_project_name>   --output <gs://path>   --temp_location <gs://path>   --worker_harness_container_image$USER/beam/python:latest   --experiment beam_fn_api   --sdk_location build/apache-beam-2.12.0.dev0.tar.gz   --dataflow_worker_jar './runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.12.0-SNAPSHOT.jar'   --debug

  • No labels