We appreciate all forms of contribution to the project, including bug reports, help for new users, documentation, and code patches.
This page lists some starter projects that new contributors could work on as a way of getting more familiar with MADlib®. These starter JIRAs are tagged with the label "starter" in https://issues.apache.org/jira/browse/MADLIB/.
Please also refer to the Contribution Guidelines and Quick Start Guide for Developers.
Documentation
No. | Item | Description | Link |
---|---|---|---|
1 | Improve module documentation | Review the latest MADlib documentation http://madlib.apache.org/docs/latest/ and make any needed updates to content or accuracy. You can also add additional examples. | |
2 | Improve online help | Standardize online help so the syntax is the same for all modules. | |
3 | Create sample data science notebooks | Create Jupyter or Apache Zeppelin notebooks showing how to use various modules in Apache MADlib. These are maintained at https://github.com/apache/madlib-site/tree/asf-site/community-artifacts | |
Bug Fixes and Improvements
No. | Item | Description | Link |
---|---|---|---|
1 | Improved error message for Elastic Net predict() | When the selected coefficients are passed to elastic net's "predict()" function, it throws an ugly error message that does not indicate the real error. | |
2 | Confusing Error Messages while running elastic net prediction function | Fix confusing error message | |
3 | LDA (parsed) model table and output table disagree | Investigate and determine if this is an issue. If it is, repair it. | |
4 | PivotalR test failures indicate potential bugs in MADlib GLM | These failures may just be numerical issues caused by overly large condition numbers or too small a training set. To be investigated. | |
5 | Implement skipping of arrays-with-NULL for elastic net predict | Better NULL handling for elastic net predict. | |
6 | Improve RF output format for variable importance | Provide an easier way to access the variable importance output from random forest, so that users can see which variables are most important. | |
7 | Covariance matrix | Add a parameter to Pearson's correlation function to output the covariance matrix. | |
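For contributors looking at the covariance matrix item, the relationship between the two quantities can be sketched in plain Python. This is only an illustration, not MADlib code: the function names `covariance_matrix` and `correlation_from_covariance` are hypothetical, and the sample (n - 1) normalization is an assumption that the actual module may not share.

```python
from math import sqrt

def covariance_matrix(columns):
    """Sample covariance matrix of equal-length numeric columns.

    Assumption: normalizes by (n - 1); a real implementation might
    use n instead.
    """
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [
        [
            sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
            for cb, mb in zip(columns, means)
        ]
        for ca, ma in zip(columns, means)
    ]

def correlation_from_covariance(cov):
    """Pearson correlation: cov[i][j] / (sd[i] * sd[j])."""
    size = len(cov)
    sd = [sqrt(cov[i][i]) for i in range(size)]
    return [
        [cov[i][j] / (sd[i] * sd[j]) for j in range(size)]
        for i in range(size)
    ]

# Two perfectly linearly related columns (made-up illustration data).
columns = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
cov = covariance_matrix(columns)
corr = correlation_from_covariance(cov)
```

Because the correlation is derived from the covariance, a function that computes one already has what it needs to return the other, which is why exposing the covariance matrix through a parameter is a plausible starter task.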
New Non-Iterative Modules
No. | Item | Description | Link |
---|---|---|---|
1 | k-Nearest Neighbors | Initial implementation of k-NN | |
2 | k-Nearest Neighbors - Phase 2 | Add additional features to k-NN | |
3 | Stratified sampling | Utility to perform stratified, randomized, proportional sampling and labeling. | |
4 | URI utilities | A set of utilities for parsing and extracting URIs from text. | |
5 | Anonymization | Utility for anonymization. | |
6 | Sessionization | Utility to partition event streams into sessions by timeouts and identifiers. | |
7 | Mixed Effects Modeling | Mixed-effects model containing fixed-effects and random-effects components. | |
8 | New Methods: Expectation Maximization | Gaussian mixture modeling and others | |
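To give a feel for the sessionization item in the table above, here is a minimal Python sketch of partitioning events into sessions by identifier and timeout. It is not the MADlib interface: the function name `sessionize`, the `(user_id, timestamp)` event layout, and the `timeout` parameter are assumptions made for illustration, and an in-database version would most likely be written with SQL window functions instead.

```python
from itertools import groupby
from operator import itemgetter

def sessionize(events, timeout):
    """Assign a session number to each (user_id, timestamp) event.

    A new session starts for a user when the gap since that user's
    previous event exceeds `timeout` (same units as the timestamps).
    Session numbers start at 0 and are counted per user.
    """
    result = []
    # Sort by user, then by time, so each user's events are contiguous.
    ordered = sorted(events, key=itemgetter(0, 1))
    for user_id, user_events in groupby(ordered, key=itemgetter(0)):
        session_id = 0
        previous_ts = None
        for _, ts in user_events:
            if previous_ts is not None and ts - previous_ts > timeout:
                session_id += 1
            result.append((user_id, ts, session_id))
            previous_ts = ts
    return result

# Made-up events: user "a" is idle for 350 time units before the third event.
events = [("a", 0), ("a", 50), ("a", 400), ("b", 10), ("b", 20)]
sessions = sessionize(events, timeout=300)
```

With a timeout of 300, the third event for user "a" falls into a second session, while both events for user "b" stay in one.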
New Iterative Modules
No. | Item | Description | Link |
---|---|---|---|
1 | Model parameter weighting | Assign weights to training samples or observations. | |
2 | New clustering algorithm | OPTICS and/or DBSCAN clustering algorithm | |
PivotalR
PivotalR is a package that enables users of R, the most popular open source statistical programming language and environment, to interact with the Greenplum database, HDB/HAWQ and PostgreSQL on large data sets. It does so by providing an interface to the operations on tables/views in the database.
It would be very valuable to add support for more MADlib modules in PivotalR. Please refer to this PivotalR wiki page for more information on how to do this.