
Developing with the Python SDK

Gradle can build and test the Python SDK, and it is used by the Jenkins jobs, so it needs to be maintained.

You can use the Python toolchain directly instead of having Gradle orchestrate it, which may be faster; it is your preference. If you do use the Python tools directly, we recommend setting up a virtual environment before testing your code.

If you update any of the cythonized files in the Python SDK, you must install the cython package before running the following commands in order to properly test your code.
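A minimal sketch of that extra step, assuming the virtual environment described below is already active:

# Install cython so the cythonized modules are rebuilt when the SDK is installed.
(env) $ pip install cython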

The following commands should be run in the sdks/python directory. They install the Python SDK from source, including the test and gcp dependencies.

On macOS/Linux:

# Initialize virtual environment called "env" in local directory.
virtualenv env

# Activate virtual environment.
. ./env/bin/activate

# Install packages.
pip install -e .[gcp,test]

On Windows:

> c:\Python27\python.exe -m virtualenv env
> env\Scripts\activate
> pip install -e .[gcp,test]

The following command runs all Python tests. The nose dependency is installed by the [test] extra in the pip install step above.

(env) $ python setup.py nosetests

You can use the following command to run a single test method.

(env) $ python setup.py nosetests --tests <module>:<test class>.<test method>

For example:

(env) $ python setup.py nosetests --tests apache_beam.io.textio_test:TextSourceTest.test_progress


You can deactivate the virtualenv when done.

(env) $ deactivate


To check just for Python lint errors, run the following command.

$ ../../gradlew lint


Or use tox commands to run the lint tasks:

$ tox -e py27-lint # For python 2.7
$ tox -e py3-lint # For python 3
$ tox -e py27-lint3 # For python 2-3 compatibility
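These commands assume tox is available on your path; if it is not, a minimal sketch of installing it into the virtual environment:

# Install tox into the active virtual environment.
(env) $ pip install tox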


Remote testing

This step is only required for testing SDK code changes remotely (i.e. not using the DirectRunner). In order to do this you must build the Beam tarball. From the root of the git repository, run:

$ cd sdks/python/
$ python setup.py sdist
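The tarball is written to the dist/ directory; the exact file name depends on the development version you are building (the listing below is only illustrative):

# List the built source distribution.
$ ls dist/
apache-beam-2.5.0.dev0.tar.gz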


Pass the --sdk_location flag to use the newly built version.

For example:

$ python setup.py sdist > /dev/null && \
    python -m apache_beam.examples.wordcount ... \
    --sdk_location dist/apache-beam-2.5.0.dev0.tar.gz

Run hello world against modified SDK Harness

# Build the Flink job server (default job server for PortableRunner) that stores the container locally.
./gradlew :beam-runners-flink_2.11-job-server-container:docker


# Build the portable SDK harness, which builds and stores the container locally.
./gradlew :beam-sdks-python-container:docker


# Run the pipeline.
python -m apache_beam.examples.wordcount --runner PortableRunner --input <local input file> --output <local output file>
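For example, with hypothetical local paths substituted for the placeholders (the file names are illustrative only):

# Input and output paths below are hypothetical examples.
python -m apache_beam.examples.wordcount \
    --runner PortableRunner \
    --input /tmp/kinglear.txt \
    --output /tmp/wordcount_output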


Run hello world against modified Dataflow Fn API Runner Harness and SDK Harness

# Build portable worker
./gradlew :beam-runners-google-cloud-dataflow-java-fn-api-worker:build -x spotlessJava -x rat -x test
./gradlew :beam-runners-google-cloud-dataflow-java-fn-api-worker:shadowJar

# Build the portable Python SDK harness and publish it to GCP
./gradlew -Pdocker-repository-root=gcr.io/dataflow-build/$USER/beam -p sdks/python/container docker
gcloud docker -- push gcr.io/dataflow-build/$USER/beam/python:latest
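Optionally, you can confirm that the harness image was pushed; the listing command below is an assumption for illustration, not part of the original instructions.

# List images in the repository to verify the push (illustrative check).
gcloud container images list --repository=gcr.io/dataflow-build/$USER/beam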

# Initialize the Python virtual environment
cd sdks/python
virtualenv env
. ./env/bin/activate

# Run the pipeline.
python -m apache_beam.examples.wordcount \
    --runner DataflowRunner \
    --num_workers 1 \
    --project <gcp_project_name> \
    --output <gs://path> \
    --temp_location <gs://path> \
    --worker_harness_container_image gcr.io/dataflow-build/$USER/beam/python:latest \
    --experiment beam_fn_api \
    --sdk_location build/apache-beam-2.12.0.dev0.tar.gz \
    --dataflow_worker_jar './runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.12.0-SNAPSHOT.jar' \
    --debug



