This page explains how to quickly get started with MADlib® using a sample problem.
Once you have MADlib installed, you can use the available Jupyter notebooks for many MADlib algorithms.
Please refer to the Installation Guide for MADlib for instructions on how to install from binaries, as well as step-by-step descriptions of how to compile from source.
Please note that a Greenplum Database sandbox VM with MADlib pre-installed is also available to get started quickly, as an alternative to installing MADlib yourself.
To interact with the data using MADlib, use the standard psql terminal provided by the database. You could also use a tool like pgAdmin.
DROP TABLE IF EXISTS patients, patients_logregr, patients_logregr_summary;

CREATE TABLE patients( id INTEGER NOT NULL,
                       second_attack INTEGER,
                       treatment INTEGER,
                       trait_anxiety INTEGER);

INSERT INTO patients VALUES
(1,  1, 1, 70),
(3,  1, 1, 50),
(5,  1, 0, 40),
(7,  1, 0, 75),
(9,  1, 0, 70),
(11, 0, 1, 65),
(13, 0, 1, 45),
(15, 0, 1, 40),
(17, 0, 0, 55),
(19, 0, 0, 50),
(2,  1, 1, 80),
(4,  1, 0, 60),
(6,  1, 0, 65),
(8,  1, 0, 80),
(10, 1, 0, 60),
(12, 0, 1, 50),
(14, 0, 1, 35),
(16, 0, 1, 50),
(18, 0, 0, 45),
(20, 0, 0, 60);
Call the MADlib built-in function to train a classification model, using the training data table as input. Note that the 1 entry in ARRAY[1, treatment, trait_anxiety] denotes an additional bias term in the model, in the standard way, to allow for a non-zero intercept value.
SELECT madlib.logregr_train(
    'patients',                            -- source table
    'patients_logregr',                    -- output table
    'second_attack',                       -- labels
    'ARRAY[1, treatment, trait_anxiety]',  -- features
    NULL,                                  -- grouping columns
    20,                                    -- max number of iterations
    'irls'                                 -- optimizer
);
View the model that has just been trained:
-- Set extended display on for easier reading of output (\x is for psql only)
\x on
SELECT * from patients_logregr;

-- ************ --
--    Result    --
-- ************ --
coef                     | [-6.36346994178187, -1.02410605239327, 0.119044916668606]
log_likelihood           | -9.41018298389
std_err                  | [3.21389766375094, 1.17107844860319, 0.0549790458269309]
z_stats                  | [-1.97998524145759, -0.874498248699549, 2.16527796868918]
p_values                 | [0.0477051870698128, 0.38184697353045, 0.0303664045046168]
odds_ratios              | [0.0017233763092323, 0.359117354054954, 1.12642051220895]
condition_no             | 326.081922792
num_rows_processed       | 20
num_missing_rows_skipped | 0
num_iterations           | 5
variance_covariance      | [[10.3291381930637, -0.47430466519573, -0.171995901260052], [-0.47430466519573, 1.37142473278285, -0.00119520703381598], [-0.171995901260052, -0.00119520703381598, 0.00302269548003977]]

-- Alternatively, unnest the arrays in the results for easier reading of output (\x is for psql only)
\x off
SELECT unnest(array['intercept', 'treatment', 'trait_anxiety']) as attribute,
       unnest(coef) as coefficient,
       unnest(std_err) as standard_error,
       unnest(z_stats) as z_stat,
       unnest(p_values) as pvalue,
       unnest(odds_ratios) as odds_ratio
FROM patients_logregr;

-- ************ --
--    Result    --
-- ************ --
+---------------+---------------+------------------+-----------+-----------+--------------+
| attribute     | coefficient   | standard_error   | z_stat    | pvalue    | odds_ratio   |
|---------------+---------------+------------------+-----------+-----------+--------------|
| intercept     | -6.36347      | 3.2139           | -1.97999  | 0.0477052 | 0.00172338   |
| treatment     | -1.02411      | 1.17108          | -0.874498 | 0.381847  | 0.359117     |
| trait_anxiety | 0.119045      | 0.054979         | 2.16528   | 0.0303664 | 1.12642      |
+---------------+---------------+------------------+-----------+-----------+--------------+
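The reported odds_ratios column is simply the element-wise exponential of coef. A minimal Python sketch (run outside the database, using only the numbers reported above) confirms this relationship:

```python
import math

# Coefficients and odds ratios as reported in the patients_logregr output above
coef = [-6.36346994178187, -1.02410605239327, 0.119044916668606]
odds_ratios = [0.0017233763092323, 0.359117354054954, 1.12642051220895]

# An odds ratio is exp(coefficient): a one-unit increase in the feature
# multiplies the odds of a second attack by this factor.
for c, reported in zip(coef, odds_ratios):
    assert abs(math.exp(c) - reported) < 1e-9

# For example, each extra point of trait_anxiety raises the odds by about 12.6%.
```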
Now use the model to predict the dependent variable (second heart attack within 1 year) using the logistic regression model. For the purpose of demonstration, we will use the original data table to perform the prediction. Typically a different test dataset with the same features as the original training dataset would be used for prediction.
-- Display prediction value along with the original value
SELECT p.id, madlib.logregr_predict(m.coef, ARRAY[1, p.treatment, p.trait_anxiety]),
       p.second_attack
FROM patients p, patients_logregr m
ORDER BY p.id;

-- ************ --
--    Result    --
-- ************ --
+------+-------------------+-----------------+
| id   | logregr_predict   | second_attack   |
|------+-------------------+-----------------|
| 1    | True              | 1               |
| 2    | True              | 1               |
| 3    | False             | 1               |
| 4    | True              | 1               |
| 5    | False             | 1               |
| 6    | True              | 1               |
| 7    | True              | 1               |
| 8    | True              | 1               |
| 9    | True              | 1               |
| 10   | True              | 1               |
| 11   | True              | 0               |
| 12   | False             | 0               |
| 13   | False             | 0               |
| 14   | False             | 0               |
| 15   | False             | 0               |
| 16   | False             | 0               |
| 17   | True              | 0               |
| 18   | False             | 0               |
| 19   | False             | 0               |
| 20   | True              | 0               |
+------+-------------------+-----------------+

-- Predicting the probability of the dependent variable being TRUE.
-- Display prediction value along with the original value
SELECT p.id, madlib.logregr_predict_prob(coef, ARRAY[1, treatment, trait_anxiety])
FROM patients p, patients_logregr m
ORDER BY p.id;

-- ************ --
--    Result    --
-- ************ --
+------+------------------------+
| id   | logregr_predict_prob   |
|------+------------------------|
| 1    | 0.720223               |
| 2    | 0.894355               |
| 3    | 0.19227                |
| 4    | 0.685513               |
| 5    | 0.167748               |
| 6    | 0.798098               |
| 7    | 0.928568               |
| 8    | 0.959306               |
| 9    | 0.877576               |
| 10   | 0.685513               |
| 11   | 0.586701               |
| 12   | 0.19227                |
| 13   | 0.116032               |
| 14   | 0.0383829              |
| 15   | 0.0674976              |
| 16   | 0.19227                |
| 17   | 0.545871               |
| 18   | 0.267675               |
| 19   | 0.398619               |
| 20   | 0.685513               |
+------+------------------------+
If the probability is greater than 0.5, the prediction is given as True. Otherwise it is given as False.
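As a sanity check, a prediction can be reproduced from the fitted coefficients with the logistic function 1 / (1 + e^(-x·w)). The sketch below (plain Python, nothing MADlib-specific assumed) recomputes the probability for patient 1 (bias 1, treatment 1, trait_anxiety 70) using the coefficients reported above:

```python
import math

def predict_prob(coef, features):
    """Logistic (sigmoid) function of the dot product of coefficients and features."""
    z = sum(c * f for c, f in zip(coef, features))
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients reported by madlib.logregr_train above
coef = [-6.36346994178187, -1.02410605239327, 0.119044916668606]

# Patient 1: bias term 1, treatment = 1, trait_anxiety = 70
p = predict_prob(coef, [1, 1, 70])
print(round(p, 6))  # ~0.720223, matching logregr_predict_prob for id = 1
print(p > 0.5)      # True, so logregr_predict reports True for patient 1
```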