

1 - MADlib General

Q1-1  What is MADlib?

  • MADlib is an open source library for scalable in-database analytics.  It provides data-parallel implementations of mathematical, statistical, graph and machine learning methods for structured and unstructured data.   

  • MADlib is a Top Level Project in the Apache Software Foundation.

  • MADlib home page: http://madlib.apache.org/

Q1-2  What database platforms does MADlib support and what is the upgrade matrix?

  • MADlib runs on Greenplum Database (GPDB) and PostgreSQL.  The last three major versions of these databases are tested directly with each MADlib release.  Older versions may also work, but they are not tested explicitly.

MADlib Version | Greenplum    | Pivotal HDB / Apache HAWQ (incubating) | PostgreSQL | Upgrade Path from Previous MADlib Versions*
1.15.1         | 5.x, 4.3.x   | n/a                                    | 10.x, 9.6  | 1.15, 1.14, 1.13
1.15           | 5.x, 4.3.x   | n/a                                    | 10.x, 9.6  | 1.14, 1.13, 1.12
1.14           | 5.x, 4.3.x   | n/a                                    | 10.x, 9.6  | 1.13, 1.12, 1.11
1.13           | 5.x, 4.3.x   | n/a                                    | 9.6, 9.5   | 1.12, 1.11, 1.10
1.12           | 5.x, 4.3.x   | 2.x                                    | 9.6, 9.5   | 1.11, 1.10, 1.9.1
1.11           | 5.x, 4.3.x   | 2.x                                    | 9.6, 9.5   | 1.10, 1.9.1, 1.9
1.10           | 4.3.x, 4.2.x | 2.x, 1.3.x                             | 9.6, 9.5   | 1.9.1, 1.9, 1.8
1.9.1          | 4.3.x, 4.2.x | 2.x, 1.3.x                             | 9.6, 9.5   | 1.9 and all earlier versions

* If you are on an older version of MADlib you may need to do multiple hops to update to the latest version.
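The multi-hop note above can be made concrete with a small sketch. It assumes each matrix row means "this release can be upgraded to directly from the listed versions", and it stands in for "1.9 and all earlier versions" with just 1.9 and 1.8. The names `UPGRADE_FROM` and `upgrade_hops` are hypothetical helpers for illustration and are not part of MADlib.

```python
from collections import deque

# Direct upgrade sources per target release, transcribed from the matrix.
# Assumption: "1.9 and all earlier versions" is represented by 1.9 and 1.8 only.
UPGRADE_FROM = {
    "1.15.1": ["1.15", "1.14", "1.13"],
    "1.15": ["1.14", "1.13", "1.12"],
    "1.14": ["1.13", "1.12", "1.11"],
    "1.13": ["1.12", "1.11", "1.10"],
    "1.12": ["1.11", "1.10", "1.9.1"],
    "1.11": ["1.10", "1.9.1", "1.9"],
    "1.10": ["1.9.1", "1.9", "1.8"],
    "1.9.1": ["1.9", "1.8"],
}


def upgrade_hops(current, target):
    """Return a shortest list of versions from current to target, or None."""
    # Invert the table: for each installed version, which releases accept it.
    next_versions = {}
    for newer, sources in UPGRADE_FROM.items():
        for older in sources:
            next_versions.setdefault(older, []).append(newer)

    # Breadth-first search, so the first path that reaches the target
    # uses the fewest hops.
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in next_versions.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    print(upgrade_hops("1.9", "1.15.1"))
```

Calling `upgrade_hops("1.14", "1.15.1")` returns a single direct hop, while an installation on 1.9 needs intermediate releases before it can reach 1.15.1.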

  • OS support generally follows that of the underlying databases:

    • GPDB 4.3.x

      1. RHEL 5.5-5.7, 6.1, 6.2, 6.4, 6.5

      2. CentOS 5.5-5.7, 6.1, 6.2

    • GPDB 5+ (does not support RHEL 5 or CentOS 5)
      1. RHEL 6.1, 6.2, 6.4, 6.5, 7.x

      2. CentOS 6.1, 6.2, 7.x

      3. Ubuntu 16.04
    • HDB/HAWQ

      1. RHEL 6.1, 6.2, 6.4, 6.5

      2. CentOS 6.1, 6.2

    • PostgreSQL

      1. RHEL 5.x, 6.x, 7.x

      2. CentOS 5.x, 6.x, 7.x

      3. Mac OSX 10.6+

      4. Ubuntu 16.04

Q1-3  What are the main advantages of MADlib?

  • Use MPP architecture’s full compute power

  • Use MPP architecture’s entire memory to process large data sets, so there is no need to sample

  • Familiar SQL interface

Q1-4  Who uses MADlib?

  • Mostly data scientists

  • Wide range of verticals including financial services, healthcare, retail, energy, manufacturing and government

Q1-5  What are the benefits of MADlib compared with products like R and scikit-learn?

  • Performance

    • MADlib is a fully parallelized implementation on GPDB and HAWQ, so on large data sets it can run much faster than single-node R or Python libraries.

  • Scalability

    • Add more nodes to achieve higher performance as your data scales.  R and Python libraries are limited by the amount of data you can load into memory on a single node.

    • Using all data, not a sample, can improve accuracy

  • Familiar, user friendly SQL interface

  • Ease of data preparation

    • Supports commonly used database data formats

Q1-6  Where can I find the details of how existing MADlib algorithms have been implemented?

Q1-7  What is the MADlib security model?

2 - MADlib Usage

Q2-1  What are my options if MADlib does not have the algorithm that I need?

  • In this case, you might think about building separate models for different chunks of a larger dataset (partition by state, range of user IDs, product category, etc.)

  • This is referred to as data parallelism:  break up the problem into a number of parallel tasks, provided you can ensure there is no dependency (or communication) between those tasks.

  • Then you can use a procedural language such as PL/Python or PL/R on each chunk and combine the results downstream.

  • And of course, you can build and contribute the module to MADlib for the benefit of the community
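The data-parallel approach described above can be sketched in plain Python. This is only an illustration of the pattern, not MADlib or PL/Python code: the rows, the partition key (a state code) and the helper names `fit_line` and `fit_per_partition` are all made up for the example, and a simple least-squares line stands in for whatever algorithm you actually need. It assumes every chunk has at least two distinct x values.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = a + b*x on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_per_partition(rows):
    """Group (key, x, y) rows by key and fit one independent model per group."""
    chunks = defaultdict(list)
    for key, x, y in rows:
        chunks[key].append((x, y))
    # Each chunk is fitted without reference to any other chunk, so these
    # calls have no dependencies and could run in parallel, one per worker.
    return {key: fit_line(points) for key, points in chunks.items()}


# Hypothetical data: (partition key, x, y)
rows = [
    ("CA", 1.0, 3.0), ("CA", 2.0, 5.0), ("CA", 3.0, 7.0),
    ("NY", 1.0, 1.0), ("NY", 2.0, 0.0), ("NY", 3.0, -1.0),
]
models = fit_per_partition(rows)

if __name__ == "__main__":
    for key, (intercept, slope) in sorted(models.items()):
        print(key, intercept, slope)
```

In a database setting, the grouping step would be done by the partitioning clause of the query and the per-chunk function would run inside the procedural language; the dictionary of per-key models corresponds to the results that are combined downstream.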

Q2-2  What are the differences in functionality between the GPDB and HDB/HAWQ versions of MADlib?

  • There are very few differences and they are listed below.

  • K-means clustering

    • On GPDB and PostgreSQL you can specify a user-defined distance function.  HAWQ does not support UDFs for distance, so you are restricted to the built-in distance functions provided.

  • The deprecated modules quartile and profile have some minor differences and limitations between HAWQ and GPDB.  See the documentation for details.

Q2-3  Can I export models from MADlib to PMML?

  • Yes.  MADlib models can be exported in PMML format for use in scoring by a PMML evaluator.  

  • The following MADlib algorithms can be exported in PMML format:

    • Linear regression

    • Logistic regression

    • GLM

    • Multinomial regression

    • Ordinal regression

    • Decision trees

    • Random forest

Q2-4  What is PMML?

  • The Predictive Model Markup Language (PMML) is an XML-based file format that provides a way for applications to describe and exchange models produced by data mining and machine learning algorithms.

  • For more information, please see  http://www.dmg.org/
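To give a feel for the format, here is a minimal sketch that reads a small hand-written PMML 4.1 regression document with Python's standard library and scores one record. The document, field names and coefficients are invented for illustration and are not actual MADlib output; `load_linear_model` and `score` are hypothetical helpers, and a real deployment would normally use a full PMML evaluator instead.

```python
import xml.etree.ElementTree as ET

# Hand-written example document (illustrative only, not produced by MADlib).
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="hand-written example"/>
  <DataDictionary numberOfFields="3">
    <DataField name="size" optype="continuous" dataType="double"/>
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="price" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="size"/>
      <MiningField name="age"/>
      <MiningField name="price" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="size" coefficient="2.0"/>
      <NumericPredictor name="age" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

# PMML elements live in a versioned XML namespace.
NS = {"pmml": "http://www.dmg.org/PMML-4_1"}


def load_linear_model(pmml_text):
    """Extract the intercept and per-field coefficients from the document."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(intercept, coefficients, record):
    """Apply the linear model to one record given as a dict of field values."""
    return intercept + sum(c * record[name] for name, c in coefficients.items())


intercept, coefficients = load_linear_model(PMML_DOC)
prediction = score(intercept, coefficients, {"size": 3.0, "age": 4.0})

if __name__ == "__main__":
    print(prediction)
```

The point of the sketch is that the model is fully described by the XML, so any application that understands PMML can score new data without access to the system that trained the model.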

Q2-5  What is JPMML?

Q2-6  Can I import models from PMML to MADlib?

  • Not currently.  You can only export from MADlib into PMML as described above.

Q2-7  Can I call MADlib from Python?

Q2-8  What are some examples of connecting MADlib with 3rd party products?

3 - PivotalR

Q3-1  What is PivotalR?

  • PivotalR is a package that lets users of R, the most popular open source statistical programming language and environment, work with large data sets in GPDB, HAWQ and the open source database PostgreSQL. It does so by providing an interface to operations on tables and views in the database.

Q3-2  Where can I learn more about PivotalR?

Please refer to this PivotalR wiki page.

4 - Other

Q4-1  Your question here

  • Let us know if you have a question and we will add it here.
