MADlib 2.X requires Python 3.9. Other Python 3 versions might work as well. Python 2.x is not supported.
MADlib requires the GNU M4 Unix macro processor, which must be present for installation to succeed.
Currently supported database versions: GPDB 6 (with python3 extension), GPDB 7, PostgreSQL 15
The following Python libraries are required for their associated modules:
Installation: pyyaml==6.0.1, pyxb-x==1.2.6.1
Various: numpy==1.25.2
Deep Learning: dill==0.3.7, grpcio==1.57.0, protobuf==3.19.4, hyperopt==0.2.5, tensorflow==2.10, scikit-learn==1.3.0
XGBoost: pandas==2.0.3, xgboost==1.7.6
KNN: scipy==1.11.2
Unit tests: pgsanity
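The interpreter and tool requirements above can be checked up front with a small script. The following is a minimal sketch and not part of MADlib: the function names `missing_tools` and `is_python_39` are hypothetical, and the strict 3.9 match is an assumption taken from the stated requirement (other Python 3 versions might work as well).

```shell
#!/bin/sh
# Print each named command that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed only for version strings of the form 3.9 or 3.9.x.
is_python_39() {
    case "$1" in
        3.9|3.9.*) return 0 ;;
        *) return 1 ;;
    esac
}

# GNU M4 and a Python 3 interpreter are the tools named above.
missing=$(missing_tools m4 python3)
if [ -n "$missing" ]; then
    echo "Missing from PATH: $missing"
else
    py_version=$(python3 -c 'import sys; print("%d.%d.%d" % sys.version_info[:3])')
    if is_python_39 "$py_version"; then
        echo "Python $py_version matches the required 3.9"
    else
        echo "Python $py_version found; MADlib 2.X requires 3.9"
    fi
fi
```

The script only reports what it finds; it does not install anything or change the environment.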
Binary packages of MADlib are currently available for Greenplum Database.
Defining the environment variables listed below can save you some typing.
Greenplum:
On Red Hat / CentOS, run the following as gpadmin:
gppkg install <madlib_package>
Ensure that psql, postgres, and pg_config are in your path:
which psql postgres pg_config
Ensure that the database is started and running:
psql -c 'select version()'
The above may need user/port/password settings depending on how the database has been configured.
Greenplum Database:
/usr/local/madlib/bin/madpack -p greenplum install
This form works if the environment variables are defined. Otherwise, use a fully defined connection string:
/usr/local/madlib/bin/madpack -s madlib -p greenplum -c [user[/password]@][host][:port][/database] install
The command above may need user/port/password setting depending on how the database has been configured.
After installation gpadmin should grant all privileges on schema madlib to users who will be accessing MADlib functions. Otherwise, users will get "ERROR: permission denied for schema MADlib." Also, install checks (see next step below) will fail if CREATE TEMP TABLE privileges are not granted on the schema where MADlib is installed. See the PostgreSQL docs for information on schemas and privileges.
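The grant step above can be scripted. The sketch below only prints the SQL so that it can be reviewed before being run. The helper name `madlib_grant_sql` and the role names `analyst1` and `analyst2` are hypothetical, and the schema name `madlib` assumes the schema used in the install commands.

```shell
#!/bin/sh
# Print one GRANT statement per role for the given schema.
# Usage: madlib_grant_sql <schema> <role>...
madlib_grant_sql() {
    schema=$1
    shift
    for role in "$@"; do
        printf 'GRANT ALL PRIVILEGES ON SCHEMA %s TO %s;\n' "$schema" "$role"
    done
}

# Dry run: show the statements for two example (hypothetical) roles.
madlib_grant_sql madlib analyst1 analyst2
```

To apply the statements, the output can be piped to psql as gpadmin, for example `madlib_grant_sql madlib analyst1 | psql -d <database>`. Identifiers are not quoted here, so the sketch suits only simple role names.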
Test your installation
Greenplum Database:
/usr/local/madlib/bin/madpack -p greenplum install-check
The command above may need user/port/password setting depending on how the database has been configured.
Note that if the optimizer_control GUC is set to off in Greenplum, the following install checks will fail and the corresponding MADlib functions will not work: decision tree, random forest, LDA, k-Means, PMML export for decision tree, and PMML export for random forest. This will be fixed in a future release (MADLIB-1109). The parameter optimizer_control controls whether the server configuration parameter optimizer can be changed. The parameter optimizer controls whether the GPORCA optimizer is enabled when running SQL queries.
Requirements for compiling and installing MADlib:
python3 -m venv venv
Postgres platform notes:
/usr/local/madlib/bin/madpack -s madlib -p postgres install
madpack.py : INFO : Detected PostgreSQL version 9.5.
madpack.py : INFO : *** Installing MADlib ***
madpack.py : INFO : MADlib tools version = 1.9.1 (//usr/local/madlib/Versions/1.9.1/bin/../madpack/madpack.py)
madpack.py : INFO : MADlib database version = None (host=localhost:5432, db=postgres, schema=madlib)
madpack.py : INFO : Testing PL/Python environment...
madpack.py : INFO : > Creating language PL/Python...
madpack.py : ERROR : SQL command failed:
SQL: CREATE LANGUAGE plpythonu;
ERROR: could not access file "$libdir/plpython2": No such file or directory
madpack.py : ERROR : Cannot create language plpythonu. Please check if you have configured and installed portid (your platform) with `--with-python` option. Stopping installation...
madpack.py : ERROR : MADlib installation failed
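The error above indicates that the PostgreSQL server cannot load PL/Python. One way to investigate, sketched here under the assumption that the `pg_config` on the PATH belongs to the server in question, is to look for `--with-python` in the recorded configure options. The helper name `has_python_flag` is hypothetical. Packaged distributions may ship PL/Python as a separate package, so the flag alone does not prove that the library is installed.

```shell
#!/bin/sh
# Succeed if a configure-options string contains --with-python.
has_python_flag() {
    case "$1" in
        *--with-python*) return 0 ;;
        *) return 1 ;;
    esac
}

if command -v pg_config >/dev/null 2>&1; then
    # pg_config --configure prints the options PostgreSQL was built with.
    if has_python_flag "$(pg_config --configure)"; then
        echo "PostgreSQL was configured with --with-python"
    else
        echo "PostgreSQL was configured without --with-python"
    fi
else
    echo "pg_config not found on PATH"
fi
```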
Ensure that the prerequisites and necessary Python dependencies are installed.
In the $MADLIB_ROOT directory (location of the MADlib source), run the following commands:
mkdir build
cd build
cmake ..   # pass -DCXX11=1 when compiling with OSX
make -j8   # if this causes issues, switch back to a plain `make`
Above, we built the executables in the build folder. This can, however, be any user-named folder (henceforth called $BUILD_ROOT).
Install MADlib into the database with the MADlib package manager madpack, located under $BUILD_ROOT/src/bin.
Run the MADlib deployment utility to install MADlib into each database in which you want to use it:
Postgres:
$BUILD_ROOT/src/bin/madpack -s madlib -p postgres install
This form works if the environment variables are defined. Otherwise, use a fully defined connection string:
$BUILD_ROOT/src/bin/madpack -s madlib -p postgres -c [user[/password]@][host][:port][/database] install
Greenplum Database:
$BUILD_ROOT/src/bin/madpack -p greenplum install
The above may need user/port/password settings depending on how the database has been configured.
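Since the install is run once for each database, the commands can be generated in a loop. The sketch below is a dry run that only prints the commands. The helper name `madpack_install_commands` and the database names `sales` and `analytics` are hypothetical, and the database-only connection string `/<database>` assumes that the remaining connection settings come from the environment.

```shell
#!/bin/sh
# Print one madpack install command per database (dry run; nothing is executed).
# Usage: madpack_install_commands <madpack_path> <platform> <database>...
madpack_install_commands() {
    madpack_bin=$1
    platform=$2
    shift 2
    for db in "$@"; do
        # "/<database>" is the database-only form of the connection string.
        printf '%s -s madlib -p %s -c /%s install\n' "$madpack_bin" "$platform" "$db"
    done
}

# Single quotes keep $BUILD_ROOT unexpanded in the printed commands.
madpack_install_commands '$BUILD_ROOT/src/bin/madpack' postgres sales analytics
```

Printing the commands first makes it possible to review them before running each one by hand.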
To install:
$BUILD_ROOT/src/bin/madpack -p postgres -c [user[/password]@][host][:port][/database] install
To make sure that the installation is successful:
$BUILD_ROOT/src/bin/madpack -p postgres -c [user[/password]@][host][:port][/database] install-check
For more information on the usage of madpack:
$BUILD_ROOT/src/bin/madpack --help
git clone https://github.com/apache/madlib.git
cd madlib
git checkout madlib2-master
# source GPDB7 environment
source $GPHOME/greenplum_path.sh
rm -rf $GPHOME/lib/python/yaml/
# Uninstall libboost to avoid version conflict with MADlib and use the one downloaded at build time
mkdir -p build   # create the build directory if it does not exist yet
cd build
python3 -m venv venv   # only needed once to bootstrap virtual env
pip3 install pyyaml pyxb-x
cmake ..   # pass -DCXX11=1 when compiling with OSX
make -j8
# May cause a failure when trying to download libboost for the first time
# re-run make if fails
./src/bin/madpack -p greenplum -c /<database> install
The variables below will be automatically used by the madpack installer if no connection string is provided:
PGUSER or USER (defaults to OS username)
PGPASSWORD (defaults to empty)
PGHOST (defaults to 'localhost')
PGDATABASE (defaults to OS username)
PGPORT (defaults to 5432)
An example of deploying MADlib using the environment variables:
export PGPORT=5430
export PGHOST=127.0.0.1
export PGDATABASE=madlibtest
$BUILD_ROOT/src/bin/madpack -p postgres install
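To see which connection the environment variables amount to, the defaults listed above can be resolved into the connection string format accepted by `-c`. This is a sketch of the documented defaults, not madpack's own resolution logic, and `madpack_conn_string` is a hypothetical helper name.

```shell
#!/bin/sh
# Build a connection string of the form user[/password]@host:port/database
# from the PG* variables, applying the defaults listed above.
madpack_conn_string() {
    # Fall back to a placeholder if the OS user name cannot be resolved.
    os_user=$(id -un 2>/dev/null || echo unknown)
    user=${PGUSER:-${USER:-$os_user}}
    host=${PGHOST:-localhost}
    port=${PGPORT:-5432}
    database=${PGDATABASE:-$os_user}
    if [ -n "${PGPASSWORD:-}" ]; then
        user="$user/$PGPASSWORD"
    fi
    printf '%s@%s:%s/%s\n' "$user" "$host" "$port" "$database"
}

# Show the string that corresponds to the current environment.
# Note that this prints PGPASSWORD if it is set.
madpack_conn_string
```

With the three exports from the example above and PGUSER set to a hypothetical `alice`, the function prints `alice@127.0.0.1:5430/madlibtest`.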
The variables below can be set in GPDB in case memory-related issues show up. Feel free to adjust them based on the specifics of the installed system.
set max_statement_mem='50GB';
set statement_mem='50GB';
set memory_spill_ratio=80;
set gp_resqueue_memory_policy=auto;
set work_mem='4GB';
set gp_vmem_protect_limit=20000;
Upgrading gppkg to a higher version of MADlib:
For example, to upgrade from 2.0.0 to 2.1.0 on Red Hat / CentOS, run the following as gpadmin:
gppkg install <madlib_package_upgrading_to>
Then upgrade the MADlib deployment in the database:
madpack -p <platform> -c <connection> upgrade