
This page explains how to quickly get started with MADlib® using a sample problem. 

  1. Install MADlib
  2. Sample problem using logistic regression
  3. Next steps

Once you have MADlib installed, you can use the available Jupyter notebooks for many MADlib algorithms.

Install MADlib

Please refer to the Installation Guide for MADlib for instructions on installing from binaries, as well as step-by-step instructions for compiling from source.

Please note that a Greenplum database sandbox VM with MADlib pre-installed is also available to get started quickly, as an alternative to installing MADlib yourself.

Sample Problem Using Logistic Regression

  1. The sample data set and an introduction to logistic regression are described here.

    The MADlib function used in this example is described in the MADlib logistic regression documentation.

    Suppose that we are working with doctors on a project related to heart failure. The dependent variable in the data set is whether the patient has had a second heart attack within 1 year (yes=1). We have two independent variables: one is whether the patient completed a treatment on anger control (yes=1), and the other is a score on a trait anxiety scale (higher score means more anxious).

    The idea is to train a model using labeled data, then use this model to predict second heart attack occurrence for other patients.
     
  2. To interact with the data using MADlib, use the standard psql terminal provided by the database. You could also use a tool like pgAdmin.

  3. Call the MADlib built-in function to train a classification model using the training data table as input:

     

  4. View the model that has just been trained:

     

  5. Now use the logistic regression model to predict the dependent variable (second heart attack within 1 year). For the purpose of demonstration, we will use the original data table to perform the prediction. Typically a different test dataset with the same features as the original training dataset would be used for prediction.
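
As a rough sketch, steps 3 to 5 might look like the following in psql. The table name patients, its column names (id, second_attack, treatment, trait_anxiety) and the output table name patients_logregr are assumptions made for illustration; substitute the names used in your own database. The iteration limit and optimizer are example values, not recommendations.

```sql
-- Assumption: a table patients(id, second_attack, treatment, trait_anxiety)
-- already holds the labeled training data. All table and column names here
-- are illustrative.

-- Step 3: train the model. logregr_train will not overwrite an existing
-- output table, so drop any earlier results first.
DROP TABLE IF EXISTS patients_logregr, patients_logregr_summary;
SELECT madlib.logregr_train(
    'patients',                            -- source table
    'patients_logregr',                    -- output (model) table
    'second_attack',                       -- dependent variable
    'ARRAY[1, treatment, trait_anxiety]',  -- independent variables; 1 is the intercept term
    NULL,                                  -- no grouping columns
    20,                                    -- maximum number of iterations (example value)
    'irls'                                 -- optimizer (example value)
);

-- Step 4: view the trained model. \x on switches psql to expanded display.
\x on
SELECT * FROM patients_logregr;
\x off

-- Step 5: predict on the original table, for demonstration only.
SELECT p.id,
       madlib.logregr_predict(m.coef, ARRAY[1, p.treatment, p.trait_anxiety]) AS predicted,
       p.second_attack AS actual
FROM patients p, patients_logregr m
ORDER BY p.id;
```

The model table holds a single row because no grouping columns are used, so the join without a condition simply pairs every patient with that one set of coefficients.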

The 1 entry in the ARRAY denotes an additional bias term in the model in the standard way, to allow for a non-zero intercept value.

If the probability is greater than 0.5, the prediction is given as True. Otherwise it is given as False.
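
To see the probability behind each prediction, a query along the following lines could be used. It is only a sketch: it assumes a data table named patients with columns id, treatment and trait_anxiety, and a trained model table named patients_logregr, all of which are illustrative names. The comparison against 0.5 is written out explicitly to show the thresholding described above.

```sql
-- Assumption: patients (input data) and patients_logregr (trained model)
-- are illustrative table names; adjust them to match your database.
SELECT p.id,
       -- estimated probability that the dependent variable is true
       madlib.logregr_predict_prob(m.coef, ARRAY[1, p.treatment, p.trait_anxiety]) AS probability,
       -- the same probability compared against the 0.5 cutoff
       madlib.logregr_predict_prob(m.coef, ARRAY[1, p.treatment, p.trait_anxiety]) > 0.5 AS predicted
FROM patients p, patients_logregr m
ORDER BY p.id;
```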

Next Steps

 
