Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. Introduction
  2. Naming Conventions
    • 2.1. Schema
    • 2.2. Functions & Aggregates
  3. Functions and Languages
  4. Function Name Overloading
  5. Guide to Driver UDFs
    • 5.1. Input Definition
    • 5.2. Output Definition
    • 5.3. Logging
    • 5.4. Parameter Validation
    • 5.5. Multi-User and Multi-Session Execution
  6. Support Modules
    • 6.1. DB connectivity module: plpy.py
    • 6.2. Python abstraction layer.

1. Introduction

The purpose of this document is to define the SQL Interface interface for MADlib algorithms.

2. Naming Conventions

Names should use lower - case characters separated with by underscores.

This is applicable to all database objects (tables, views, functions, function parameters, datatypesdata types, operators, etc).

2.1. Schema

All database objects should be created in the default MADlib schema. Use MADLIB_SCHEMA as the schema prefix for your tables/views/functions/etc. in any scripts. This  This literal will be replaced during the installation with the target schema name (configured by the user in Config.yml). Code  Code examples below use prefix madlib for illustration purposes only.

...

Module specific routines should have a SHORT and COMMON prefix based on the module they belong to; for .  For example: 

  • Multi-linear regression functions could start with **mregr_**:
Code Block
languagesql
madlib.mregr_coef(...)

  *

  • Naive-Bayes classification functions could start with **nb_**: 
Code Block
languagesql
madlib.nb_create_view(...)

...

General purpose routines should be named without a reference to any module and should be created inside a general purpose MADlib modules module (e.g. Array Operations). For  For example:

  • Function returning the key of the row for which value is maximal:

...

Code Block
languagesql
madlib.argmax (integer key, float8 value)

 

 

...

3. Functions and Languages

To simplify this guide, we'd like to introduce three categories of user-defined functions:

  • UDAs - user-defined aggregates, which perform a single scan of the data source and return an aggregated value for a group of rows. All  All UDA component functions should be written in PL/C (C/C++) for performance and portability reasons.

  • Row Level UDFs - functions that operate on their arguments only and do not dispatch any SQL statements. These  These functions generate a result for each argument set, or for each tuple they are executed on. Recommended language is the same as for UDAs.

  • Driver UDFs - functions that usually drive an execution of an algorithm, and may perform multiple SQL operations including data modification. In  In order to make this part of the code portable we suggest using PL/Python wrapper functions based on plain Python modules. The  The DB access inside the Python modules should be implemented using "classic" PyGreSQL interface ([[http://www.pygresql.org/pg.html]]).

This topic will be covered in much more detail in [[Design Patterns & Best Practices]].*

44. Function Name Overloading

...

Some platforms (like PostgreSQL) allow one to use the ANYELEMENT/ANYARRAY datatype, which can be used by MADlib routines (whenever it makes sense) in order to minimize code duplication.

If ANYELEMENT/ANYARRAY functionality is not available or not feasible, function - name overloading can be used for different argument data types. For  For example, function F1 from module M1 can have the following versions:

...

Code Block
languagesql
madlib.m1_f1( arg1 TEXT)

 *

  • NUMERIC data type example:
Code Block
languagesql
madlib.m1_f1( arg1 BIGINT/FLOAT/etc.)

 

Argument Sets

Overloading mechanisms should also be used for different sets of parameters. For  For example, if (reqarg1, ..., reqargN) is a set of required parameters for function F1 from module M1, then the following definitions would be correct:

...

  • A version for both required and optional arguments:

 

Code Block
languagesql
madlib.m1_f1( reqarg1, ..., reqargN, optarg1, ..., optargN)

...

  • Should follow the naming conventions described in section Section 2.
  • Should follow the function overloading rules as described in section Section 4. On  On Greenplum and PostgreSQL this can be achieved via PL/Python wrapper UDFs based on the same main Python code.

...

Parameters of the execution should be supplied directly in the function call (as opposed to passing a reference ID to a parameter-set stored in a table); for .  For example: 

Code Block
languagesql
SELECT madlib.m1_f1( par1 TEXT/INT/etc, par2 TEXT[]/INT[]/etc, ...)

...

  • Can be located in any schema as long as the database user executing the function has read permissions.
  • Should be defined in the method documentation; for .  For example:
Code Block
languagesql
TABLE|VIEW (
  col_x INT, 
  col_y FLOAT, 
  col_z TEXT
)

...

  • The input relation and its attributes needed by the function should be validated using primitive funtctions functions from the helper.py module. See section Section 5.4 for more information.

...

We recommend using Standard Output to return a predefined single record structure for all cases when the results of a method or a model definition is in a human readable format. See examples below:

 

Code Block
languagesql
SELECT mregr_coef(price, array[1, bedroom, bath, size]) from houses;
             mregr_coef             
------------------------------------
 {27923.4,-35524.8,2269.34,130.794}

 

130.794}

 

  • Returning results (FM SketchReturning results ([[FM Sketch|http://doc.madlib.net/groupgrpfmsketch.html]]):
Code Block
languagesql
SELECT madlib.fmsketch_dcount(pronargs) FROM pg_proc;  
 fmsketch_dcount 
-----------------
              10

...

Code Block
languagesql
CREATE TYPE madlib.kmeans_result AS (
  clusters          INTEGER,
  pct_input_used    PERCENT,
  output_schema     TEXT,
  cluster_table     TEXT,
  point_table       TEXT
);

...

 

2) If using the recommended PL/Python language (see section Section 3 for more info) you can use the following example to generate a single row of output inside a Python routine:

...

If specified by the user (verbose flag/parameter), long-running methods can send runtime status to the log. But be aware that this information may not be propagated to clients in many cases, and it will enlarge the stored log file. Informational  Informational logging should be turned off by default, and activated only with an explicit user command. Use  Use plpy.info( message) function (see section Section 6.1) to properly generate information logs. Example  Example log output:

Code Block
languagesql
SQL> select madlib.kmeans_run( 'my_schema.data_set_1', 10, 1, 'run1', 'my_schema', 1);
INFO: Parameters:
INFO:  * k = 10 (number of centroids)
INFO:  * input_table = my_schema.data_set_1
INFO:  * goodness = 1 (GOF test on)
INFO:  * run_id = run1
INFO:  * output_schema = my_schema
INFO:  * verbose = 1 (on)
INFO: Seeding 10 centroids...
INFO: Using sample data set for analysis... (9200 out of 10000 points)
INFO: ...Iteration 1
INFO: ...Iteration 2
INFO: Exit reason: fraction of reassigned nodes is smaller than the limit: 0.001
INFO: Expanding cluster assignment to all points...
INFO: Calculating goodness of fit...
...

 

...

5.4. Parameter Validation

...

For simple arguments (scalar, array) sanity checks should be done by the author. Some  Some common parameters with known value domains should be validated using SQL domains, for general use, for .  For example:

  • Percent
Code Block
languagesql
CREATE DOMAIN percent AS FLOAT
CHECK(
   VALUE >= 0.0 AND VALUE <= 100.0
);
  • Probability

 

Code Block
languagesql
CREATE DOMAIN probability AS FLOAT
CHECK(
   VALUE >= 0.0 AND VALUE <= 1.0
);

For table/view and column arguments please see section Section 6.2 (describing usage of the helper.py module).

...

This module serves as the database access layer. Even though currently not used this module will provide easy portability between various MADlib platforms and interfaces. To clarify: PostrgreSQL PL/Python language currently uses an internal plpy.py module to implement seamless DB access (using "classic" PyGreSQL interface - see [[http://www.pygresql.org/pg.html]]). By  By adding a MADlib version of plpy.py we'll be able to more easily port code written for MADlib.

...

Code Block
languagepy
   def connect ( dbname, host, port, user, passwd)
   def close()
   def execute( sql)
   def info( msg)
   def error( msg)

...

6.2. Python abstraction layer (TO DO)

...

  • table/view existence check

...

 

Code Block
languagepy
    def __check_rel_exist( relation):
        if relation ~ schema.table:
           check if exists
        if relation ~ table:
           find the first schema using SEARCH_PATH order
        returns:
           - (schema,table)
           - None (if not found)
  • relation column existence check (assuming relation exists)

...

Code Block
languagepy
    def __check_rel_column( relation, column):
        returns:
           - (schema,table,column)
           - None (if not found)
  • relation column data type check (assuming table & column exist)

 

...

Code Block
languagepy
    def __check_rel_column( relation, column, data_type):
        returns:
           - (schema,table,column,type)
           - None (if not found)