Developing with the Python SDK
Gradle can build and test python, and is used by the Jenkins jobs, so needs to be maintained.
You can directly use the Python toolchain instead of having Gradle orchestrate it, which may be faster for you, but it is your preference. If you do want to use Python tools directly, we recommend setting up a virtual environment before testing your code.
If you update any of the cythonized files in Python SDK, you must install the cython
package before running following command to properly test your code.
The following commands should be run in the sdks/python
directory. This installs Python from source and includes the test and gcp dependencies.
On macOS/Linux:
# Initialize virtual environment called "env" in local directory. virtualenv env # Activate virtual environment. . ./env/bin/activate (env) # Install packages. pip install -e .[gcp,test]
On Windows:
> c:\Python27\python.exe -m virtualenv > env\Scripts\activate (env) > pip install -e .[gcp,test]
This command runs all Python tests. The nose dependency is installed by [test] in pip install.
(env) $ python setup.py nosetests
You can use following command to run a single test method.
(env) $ python setup.py nosetests --tests <module>:<test class>.<test method>
For example:
(env) $ python setup.py nosetests --tests apache_beam.io.textio_test:TextSourceTest.test_progress
You can deactivate the virtualenv when done.
(env) $ deactivate
To check just for Python lint errors, run the following command.
$ ../../gradlew lint
Or use tox
commands to run the lint tasks:
$ tox -e py27-lint # For python 2.7 $ tox -e py3-lint # For python 3 $ tox -e py27-lint3 # For python 2-3 compatibility
Remote testing
This step is only required for testing SDK code changes remotely (not using directrunner). In order to do this you must build the Beam tarball. From the root of the git repository, run:
$ cd sdks/python/ $ python setup.py sdist
Pass the --sdk_location
flag to use the newly built version.
For example:
$ python setup.py sdist > /dev/null && \ python -m apache_beam.examples.wordcount ... \ --sdk_location dist/apache-beam-2.5.0.dev0.tar.gz
Run hello world against modified SDK Harness
# Build the Flink job server (default job server for PortableRunner) that stores the container locally. ./gradlew :beam-runners-flink_2.11-job-server-container:docker # Build portable SDK Harness which builds and stores the container locally. ./gradlew :beam-sdks-python-container:docker # Run the pipeline. python -m apache_beam.examples.wordcount --runner PortableRunner --input <local input file> --output <local output file>
Run hello world against modified Dataflow Fn API Runner Harness and SDK Harness
# Build portable worker ./gradlew :beam-runners-google-cloud-dataflow-java-fn-api-worker:build -x spotlessJava -x rat -x test ./gradlew :beam-runners-google-cloud-dataflow-java-fn-api-worker:shadowJar # Build portable Pyhon SDK harness and publish it to GCP ./gradlew -Pdocker-repository-root=gcr.io/dataflow-build/$USER/beam -p sdks/python/container docker gcloud docker -- push gcr.io/dataflow-build/$USER/beam/python:latest # Initialize python cd sdks/python virtualenv env . ./env/bin/activate # run pipeline python -m apache_beam.examples.wordcount --runner DataflowRunner --num_workers 1 --project <gcp_project_name> --output <gs://path> --temp_location <gs://path> --worker_harness_container_image gcr.io/dataflow-build/$USER/beam/python:latest --experiment beam_fn_api --sdk_location build/apache-beam-2.12.0.dev0.tar.gz --dataflow_worker_jar './runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.12.0-SNAPSHOT.jar' --debug