This page explains how to quickly get started with MADlib® using a sample problem.
Once you have MADlib installed, you can use the available Jupyter notebooks for many MADlib algorithms.
Please refer to the Installation Guide for MADlib for instructions on how to install from binaries, as well as step-by-step descriptions of how to compile from source.
Please note that a Greenplum Database sandbox VM with MADlib pre-installed is also available to get started quickly, as an alternative to installing MADlib yourself.
To interact with the data using MADlib, use the standard psql terminal provided by the database. You could also use a tool like pgAdmin.
DROP TABLE IF EXISTS patients, patients_logregr, patients_logregr_summary;

CREATE TABLE patients( id INTEGER NOT NULL,
                       second_attack INTEGER,
                       treatment INTEGER,
                       trait_anxiety INTEGER);

INSERT INTO patients VALUES
(1,  1, 1, 70),
(3,  1, 1, 50),
(5,  1, 0, 40),
(7,  1, 0, 75),
(9,  1, 0, 70),
(11, 0, 1, 65),
(13, 0, 1, 45),
(15, 0, 1, 40),
(17, 0, 0, 55),
(19, 0, 0, 50),
(2,  1, 1, 80),
(4,  1, 0, 60),
(6,  1, 0, 65),
(8,  1, 0, 80),
(10, 1, 0, 60),
(12, 0, 1, 50),
(14, 0, 1, 35),
(16, 0, 1, 50),
(18, 0, 0, 45),
(20, 0, 0, 60);
Call the MADlib built-in function to train a classification model, using the training data table as input. Note that the 1 entry in ARRAY[1, treatment, trait_anxiety] denotes an additional bias term in the model, in the standard way, to allow for a non-zero intercept value.
SELECT madlib.logregr_train(
    'patients',                            -- source table
    'patients_logregr',                    -- output table
    'second_attack',                       -- labels
    'ARRAY[1, treatment, trait_anxiety]',  -- features
    NULL,                                  -- grouping columns
    20,                                    -- max number of iterations
    'irls'                                 -- optimizer
);
View the model that has just been trained:
-- Set extended display on for easier reading of output (\x is for psql only)
\x on
SELECT * from patients_logregr;

-- ************ --
--    Result    --
-- ************ --
coef                     | [-6.36346994178187, -1.02410605239327, 0.119044916668606]
log_likelihood           | -9.41018298389
std_err                  | [3.21389766375094, 1.17107844860319, 0.0549790458269309]
z_stats                  | [-1.97998524145759, -0.874498248699549, 2.16527796868918]
p_values                 | [0.0477051870698128, 0.38184697353045, 0.0303664045046168]
odds_ratios              | [0.0017233763092323, 0.359117354054954, 1.12642051220895]
condition_no             | 326.081922792
num_rows_processed       | 20
num_missing_rows_skipped | 0
num_iterations           | 5
variance_covariance      | [[10.3291381930637, -0.47430466519573, -0.171995901260052], [-0.47430466519573, 1.37142473278285, -0.00119520703381598], [-0.171995901260052, -0.00119520703381598, 0.00302269548003977]]

-- Alternatively, unnest the arrays in the results for easier reading of output (\x is for psql only)
\x off
SELECT unnest(array['intercept', 'treatment', 'trait_anxiety']) as attribute,
       unnest(coef) as coefficient,
       unnest(std_err) as standard_error,
       unnest(z_stats) as z_stat,
       unnest(p_values) as pvalue,
       unnest(odds_ratios) as odds_ratio
FROM patients_logregr;

-- ************ --
--    Result    --
-- ************ --
+---------------+---------------+------------------+-----------+-----------+--------------+
| attribute     | coefficient   | standard_error   | z_stat    | pvalue    | odds_ratio   |
|---------------+---------------+------------------+-----------+-----------+--------------|
| intercept     | -6.36347      | 3.2139           | -1.97999  | 0.0477052 | 0.00172338   |
| treatment     | -1.02411      | 1.17108          | -0.874498 | 0.381847  | 0.359117     |
| trait_anxiety | 0.119045      | 0.054979         | 2.16528   | 0.0303664 | 1.12642      |
+---------------+---------------+------------------+-----------+-----------+--------------+
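The reported odds_ratios column is simply the element-wise exponential of coef. A minimal Python sketch (run outside the database, using only the numbers reported above) confirms this relationship:

```python
import math

# Coefficients and odds ratios as reported in the patients_logregr output above
coef = [-6.36346994178187, -1.02410605239327, 0.119044916668606]
odds_ratios = [0.0017233763092323, 0.359117354054954, 1.12642051220895]

# An odds ratio is exp(coefficient): a one-unit increase in the feature
# multiplies the odds of a second attack by this factor.
for c, reported in zip(coef, odds_ratios):
    assert abs(math.exp(c) - reported) < 1e-9

# For example, each extra point of trait_anxiety raises the odds by about 12.6%.
```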
Now use the model to predict the dependent variable (second heart attack within 1 year) using the logistic regression model. For the purpose of demonstration, we will use the original data table to perform the prediction. Typically a different test dataset with the same features as the original training dataset would be used for prediction.
-- Display prediction value along with the original value
SELECT p.id, madlib.logregr_predict(m.coef, ARRAY[1, p.treatment, p.trait_anxiety]),
       p.second_attack
FROM patients p, patients_logregr m
ORDER BY p.id;

-- ************ --
--    Result    --
-- ************ --
+------+-------------------+-----------------+
| id   | logregr_predict   | second_attack   |
|------+-------------------+-----------------|
| 1    | True              | 1               |
| 2    | True              | 1               |
| 3    | False             | 1               |
| 4    | True              | 1               |
| 5    | False             | 1               |
| 6    | True              | 1               |
| 7    | True              | 1               |
| 8    | True              | 1               |
| 9    | True              | 1               |
| 10   | True              | 1               |
| 11   | True              | 0               |
| 12   | False             | 0               |
| 13   | False             | 0               |
| 14   | False             | 0               |
| 15   | False             | 0               |
| 16   | False             | 0               |
| 17   | True              | 0               |
| 18   | False             | 0               |
| 19   | False             | 0               |
| 20   | True              | 0               |
+------+-------------------+-----------------+

-- Predicting the probability of the dependent variable being TRUE.
-- Display prediction value along with the original value
SELECT p.id, madlib.logregr_predict_prob(coef, ARRAY[1, treatment, trait_anxiety])
FROM patients p, patients_logregr m
ORDER BY p.id;

-- ************ --
--    Result    --
-- ************ --
+------+------------------------+
| id   | logregr_predict_prob   |
|------+------------------------|
| 1    | 0.720223               |
| 2    | 0.894355               |
| 3    | 0.19227                |
| 4    | 0.685513               |
| 5    | 0.167748               |
| 6    | 0.798098               |
| 7    | 0.928568               |
| 8    | 0.959306               |
| 9    | 0.877576               |
| 10   | 0.685513               |
| 11   | 0.586701               |
| 12   | 0.19227                |
| 13   | 0.116032               |
| 14   | 0.0383829              |
| 15   | 0.0674976              |
| 16   | 0.19227                |
| 17   | 0.545871               |
| 18   | 0.267675               |
| 19   | 0.398619               |
| 20   | 0.685513               |
+------+------------------------+
If the probability is greater than 0.5, the prediction is given as True. Otherwise it is given as False.
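As a sanity check, a prediction can be reproduced from the fitted coefficients with the logistic function 1 / (1 + e^(-x·w)). The sketch below (plain Python, nothing MADlib-specific assumed) recomputes the probability for patient 1 (bias 1, treatment 1, trait_anxiety 70) using the coefficients reported above:

```python
import math

def predict_prob(coef, features):
    """Logistic (sigmoid) function of the dot product of coefficients and features."""
    z = sum(c * f for c, f in zip(coef, features))
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients reported by madlib.logregr_train above
coef = [-6.36346994178187, -1.02410605239327, 0.119044916668606]

# Patient 1: bias term 1, treatment = 1, trait_anxiety = 70
p = predict_prob(coef, [1, 1, 70])
print(round(p, 6))  # ~0.720223, matching logregr_predict_prob for id = 1
print(p > 0.5)      # True, so logregr_predict reports True for patient 1
```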