Prerequisites 

  1. To use the MADlib 1.17.0 and 1.18.0 deep learning features, users must install TensorFlow and the other dependencies separately.

Configurations

  1. The MADlib 1.17.0 and 1.18.0 deep learning module is tested on the following configurations:
    1. Greenplum v6.5+, CentOS 6/7 (CentOS 7 strongly recommended; see known issue 2 below regarding CentOS 6)
    2. Greenplum v6.5+, Ubuntu 18
    3. Greenplum v5.24, CentOS 6
    4. Greenplum v5.24, Ubuntu 18
    5. Postgres 11 and 12, CentOS 7
    6. Postgres 11 and 12, Ubuntu 18
    7. Deep learning library versions 
      1. TensorFlow: 1.14
      2. SciPy: 1.2.1
      3. cuDNN: 7.4.2
      4. CUDA: Cuda compilation tools, release 10.0, V10.0.130
    8. GPU configuration: Nvidia Tesla P100-PCIE-16GB
  2. The MADlib 1.17.0 and 1.18.0 deep learning module is supported on the following versions:
    1. Greenplum 5, and Greenplum 6 or newer; Greenplum 6.5.0+ recommended (see known issue 8 below).
    2. Python 2.7
    3. TensorFlow <= 1.14 (TensorFlow 1.14 is highly recommended)

Known issues with the MADlib 1.17.0 and 1.18.0 deep learning module on Greenplum/Postgres:

  1. PYTHONPATH:
    When compiling Greenplum from source, there are known issues related to changing the PYTHONPATH environment variable after compilation. For the MADlib deep learning features to work properly, Keras, TensorFlow, and their dependencies must be installed in the same Python directory that PYTHONPATH pointed to before Greenplum was compiled. If a new directory is added to PYTHONPATH later, the change is not reflected on the segments unless Greenplum is recompiled and restarted.
    NOTE: If Greenplum is installed using gppkg or another binary package and PYTHONPATH is left at its default, users should be able to `pip install` Keras (2.2.4), TensorFlow (1.14), and all other dependencies into the location used by the MADlib deep learning functions. A sketch for verifying the installation from within the database is given after this list.
  2. Support for Keras/TensorFlow on Greenplum/Postgres on CentOS 6: 
    CentOS 6 ships with glibc 2.12, while TensorFlow requires at least glibc 2.17. Using the MADlib deep learning module on CentOS 6 therefore requires installing Keras and TensorFlow, which may in turn require compiling glibc from source, and running a newer glibc with Greenplum 5 may affect database behavior. Summary: CentOS 7 is strongly recommended. Note also that CentOS 6 reaches end of life on 30 November 2020.
  3. GPU memory management:
    1. Currently CUDA GPU memory cannot be released until the postgres process holding it is terminated.
    2. When any of the deep learning functions is called with GPUs, Greenplum internally creates a process (also called a slice) which calls TensorFlow to do the computation. This process holds the GPU memory until one of the following happens:
      1. The query finishes and the user logs out of the Postgres client/session.
      2. The query finishes and the user waits for the timeout set by `gp_vmem_idle_resource_timeout` to elapse. The default value for this timeout is 18 seconds, per https://gpdb.docs.pivotal.io/6-5/ref_guide/config_params/guc-list.html
    3. Recommendation:
      1. Log out/reconnect to the session after every GPU query, or
      2. Wait for `gp_vmem_idle_resource_timeout` to elapse before running another GPU query. You can also set the `gp_vmem_idle_resource_timeout` GUC to a lower value (see the sketch after this list).
  4. It is advisable to log out of the current psql session when switching between CPU and GPU computation. Internally, the CUDA environment variable `CUDA_VISIBLE_DEVICES` is set based on the `use_gpus` flag. Once this variable is set to -1 (GPU disabled), there is no way to re-enable GPU use in that session, and it will always use only the CPU.
  5. Recommended GPU configuration: one GPU available per segment. If the number of GPUs per segment host is less than the number of segments per segment host, different segments share the same GPU, which may fail in some scenarios.
  6. Recommended way to specify metric, optimizer, and loss values in the `compile_params` argument: spell out the value by name, e.g., loss='mean_squared_error' (see the example after this list).
  7. Keras JSON serialization does not appear to be compatible across different Keras versions. For example, we had an issue loading a model built locally with Keras 2.2.4 into a cluster that had Keras 2.1.6 installed.

  8. If `madlib_keras_fit_multiple_model()` is run on GPDB 5 or some versions of GPDB 6, the database keeps consuming additional disk space (in proportion to the model size) and only releases it once the fit multiple query has completed execution. This is not the case for GPDB 6.5.0+, where disk space is released during the fit multiple query.
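
Regarding known issue 1 (PYTHONPATH): one way to check that TensorFlow is importable by the Python interpreter the database actually uses is a small PL/Python function. This is a minimal sketch; it assumes the `plpythonu` language is installed, and the function name `tf_version` is an illustrative placeholder.

```sql
-- Hypothetical helper: report the TensorFlow version visible to the database's
-- Python interpreter (assumes the plpythonu language is installed).
CREATE OR REPLACE FUNCTION tf_version()
RETURNS text AS
$$
import tensorflow as tf
return tf.__version__
$$ LANGUAGE plpythonu;

SELECT tf_version();
```

If this succeeds on the master but MADlib deep learning calls still fail, check that the same packages are installed (and PYTHONPATH matches) on every segment host.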
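
Regarding known issue 3 (GPU memory management): a minimal sketch of inspecting and lowering `gp_vmem_idle_resource_timeout` for the current session so that GPU memory held by idle slices is released sooner. The value shown is illustrative; the GUC takes a value in milliseconds.

```sql
-- Show the current idle resource timeout (default is 18 seconds).
SHOW gp_vmem_idle_resource_timeout;

-- Lower the timeout for this session; 10000 ms is an illustrative value only.
SET gp_vmem_idle_resource_timeout = 10000;
```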
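
Regarding known issue 6 (compile_params): a hedged example of how metric, optimizer, and loss values might be passed to `madlib.madlib_keras_fit()`. Table names, the model id, and all parameter values are placeholders, and the argument order is only a sketch; consult the `madlib_keras_fit()` documentation for your MADlib version.

```sql
-- Illustrative only: table names, model id, and parameter values are placeholders.
SELECT madlib.madlib_keras_fit(
    'train_data_packed',     -- source table produced by the training preprocessor
    'model_out',             -- output table for the fitted model
    'model_arch_library',    -- model architecture table
    1,                       -- model architecture id
    $$ loss='mean_squared_error', optimizer='adam', metrics=['mae'] $$,  -- compile_params
    $$ batch_size=32, epochs=1 $$,                                       -- fit_params
    5                        -- num_iterations
);
```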