We appreciate all forms of project contributions, including bug reports, help for new users, documentation, and code patches.

This page lists some starter projects that new contributors could work on as a way of getting more familiar with MADlib®.  These starter JIRAs are tagged with the label "starter" in https://issues.apache.org/jira/browse/MADLIB/.

Please also refer to the Contribution Guidelines and Quick Start Guide for Developers.  

Documentation

No. | Item | Description | Link
1 | Improve module documentation | Review the latest MADlib documentation at http://madlib.apache.org/docs/latest/ and make any needed updates to content or accuracy. You can also add examples. | MADLIB-922
2 | Improve online help | Standardize the online help so that the syntax is the same for all modules. | MADLIB-923
3 | Create sample data science notebooks | Create Jupyter or Apache Zeppelin notebooks showing how to use various modules in Apache MADlib. These are maintained at https://github.com/apache/madlib-site/tree/asf-site/community-artifacts | MADLIB-1127


Bug Fixes and Improvements

No. | Item | Description | Link
1 | Improved error message for Elastic Net predict() | When we pass the selected coefficients to elastic net's "predict()" function, it throws an ugly error message that does not indicate the real error. | MADLIB-835
2 | Confusing error messages while running elastic net prediction function | Fix the confusing error message. | MADLIB-787
3 | LDA (parsed) model table and output table disagree | Investigate and determine if this is an issue. If it is, repair it. | MADLIB-899
4 | PivotalR test failures indicate potential bugs in MADlib GLM | These problems may be purely numerical issues caused by condition numbers that are too large or a training set that is too small. To be investigated. | MADLIB-896
5 | Implement skipping of arrays-with-NULL for elastic net predict | Better NULL handling for elastic net predict. | MADLIB-919
6 | Improve RF output format for variable importance | Provide an easier way to access the variable importance output from random forest, so that users can see which variables are most important. | MADLIB-925
7 | Covariance matrix | Add a parameter to Pearson's correlation function to output the covariance matrix. | MADLIB-941



New Non-Iterative Modules

No. | Item | Description | Link
1 | k-Nearest Neighbors | Initial implementation of k-NN. | MADLIB-927
2 | k-Nearest Neighbors - Phase 2 | Add additional features to k-NN. | MADLIB-1129, MADLIB-1059, MADLIB-1060, MADLIB-1061
3 | Stratified sampling | Utility to perform stratified, randomized, proportional sampling and labeling. | MADLIB-986
4 | URI utilities | A set of utilities for parsing and extracting URIs from text. | MADLIB-910
5 | Anonymization | Utility for anonymization. | MADLIB-911
6 | Sessionization | Utility to partition event streams into sessions by timeouts and identifiers. | MADLIB-909
7 | Mixed Effects Modeling | Mixed-effects model containing fixed-effects and random-effects components. | MADLIB-987
8 | New Methods: Expectation Maximization | Gaussian mixture modeling and others. | MADLIB-410
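
For contributors unfamiliar with the term, the sessionization item above can be illustrated with a minimal Python sketch. It only shows the idea of splitting each identifier's event stream wherever the gap between consecutive events exceeds a timeout. The function name `sessionize`, its parameters, and the sample data are hypothetical and are not part of MADlib; an actual MADlib module would be implemented in the database rather than in client-side Python.

```python
from collections import defaultdict


def sessionize(events, timeout):
    """Assign a session number to each (user_id, timestamp) event.

    Hypothetical helper for illustration only, not a MADlib API.
    A new session starts for a user whenever the gap since that user's
    previous event exceeds `timeout` (same units as the timestamps).
    Returns (user_id, timestamp, session_id) tuples sorted by user and
    time; session ids restart at 1 for each user.
    """
    # Group timestamps by identifier so each stream is handled separately.
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    result = []
    for user_id in sorted(by_user):
        session_id = 0
        previous = None
        for ts in sorted(by_user[user_id]):
            # The first event, or a gap larger than the timeout, opens a new session.
            if previous is None or ts - previous > timeout:
                session_id += 1
            result.append((user_id, ts, session_id))
            previous = ts
    return result


# Made-up sample data: user "a" is idle for 350 time units before the last event.
events = [("a", 0), ("a", 50), ("b", 10), ("a", 400), ("b", 20)]
sessions = sessionize(events, timeout=100)
```

With these sample events, user "a" gets two sessions because of the long idle gap, while user "b" stays in a single session.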



New Iterative Modules

No. | Item | Description | Link
1 | Model parameter weighting | Assign weights to training samples or observations. | MADLIB-988
2 | New clustering algorithm | OPTICS and/or DBSCAN clustering algorithm. | MADLIB-1017



PivotalR

PivotalR is an R package that lets users of R, a widely used open source statistical programming language and environment, work with large data sets in the Greenplum database, HDB/HAWQ, and PostgreSQL. It does so by providing an interface to operations on tables and views in the database.

It would be very valuable to add support for more MADlib modules in PivotalR. Please refer to this PivotalR wiki page for more information on how to do this.

