Collection (De-)Serialization
Added by Steve Rowe, last edited by Ted Dunning on Jan 31, 2008

This is a whiteboard for discussing the design of collection/dataset (de-)serialization.

Desired Use Cases

  • Supports real-valued dense or sparse matrices with optional row and column labels

This is similar to an R data.frame or matrix. It is listed as a separate use case because purely numeric matrices will often require special formats to achieve high performance.
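
As a rough illustration of this use case, the sketch below stores only the non-zero cells of a real-valued matrix together with optional row and column labels, and round-trips it through JSON. The class name `LabeledSparseMatrix`, its fields, and the JSON layout are all hypothetical, chosen for illustration only; a high-performance format would more likely be binary.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class LabeledSparseMatrix:
    """Hypothetical container: only non-zero cells are stored."""
    n_rows: int
    n_cols: int
    cells: Dict[Tuple[int, int], float] = field(default_factory=dict)
    row_labels: Optional[List[str]] = None  # optional, may be absent
    col_labels: Optional[List[str]] = None  # optional, may be absent

    def to_json(self) -> str:
        # Store cells as [row, col, value] triples, sorted for stable output.
        triples = [[r, c, v] for (r, c), v in sorted(self.cells.items())]
        return json.dumps({
            "shape": [self.n_rows, self.n_cols],
            "row_labels": self.row_labels,
            "col_labels": self.col_labels,
            "cells": triples,
        })

    @classmethod
    def from_json(cls, text: str) -> "LabeledSparseMatrix":
        doc = json.loads(text)
        cells = {(r, c): float(v) for r, c, v in doc["cells"]}
        return cls(doc["shape"][0], doc["shape"][1], cells,
                   doc["row_labels"], doc["col_labels"])


# A 2x3 matrix with two non-zero cells, row labels, and no column labels.
m = LabeledSparseMatrix(2, 3, {(0, 2): 1.5, (1, 0): -2.0},
                        row_labels=["a", "b"])
restored = LabeledSparseMatrix.from_json(m.to_json())
```

Absent labels are written as JSON `null`, so a reader can distinguish "no labels" from an empty label list.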

  • Supports dense or sparse matrices containing numbers and strings with optional row and column labels

This is similar to an R data.frame that contains non-numeric data.
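
One way to picture this use case is a delimited text table whose cells are read as numbers where possible and as strings otherwise. The sketch below assumes, purely for illustration, a CSV layout in which the first row holds column labels and the first column holds row labels; the function names `parse_cell` and `read_labeled_table` are hypothetical, and a real format would also need to handle tables without labels.

```python
import csv
import io


def parse_cell(text):
    """Return a float when the cell parses as a number, else the raw string."""
    try:
        return float(text)
    except ValueError:
        return text


def read_labeled_table(text):
    """Read CSV whose first row holds column labels and first column row labels."""
    rows = list(csv.reader(io.StringIO(text)))
    col_labels = rows[0][1:]  # skip the corner cell above the row labels
    table = {}
    for row in rows[1:]:
        values = (parse_cell(c) for c in row[1:])
        table[row[0]] = dict(zip(col_labels, values))
    return col_labels, table


sample = "id,height,species\nr1,1.5,oak\nr2,0.75,birch\n"
col_labels, table = read_labeled_table(sample)
```

Note that this type guessing is naive: strings such as `nan` or `inf` also parse as floats, so a real format would probably declare column types explicitly.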

  • Supports semi-structured lists of instances which contain named fields whose values are either numbers, strings or semi-structured data structures.

This could be supported by, for instance, the IBM implementation of JSON used by Jaql.
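
To make the semi-structured case concrete, here is a minimal sketch that writes one instance per line as JSON and reads it back. It uses Python's standard `json` module rather than the Jaql implementation mentioned above; the helper names and the sample field names (`id`, `label`, `features`, `score`) are assumptions made for illustration.

```python
import json


def dump_instances(instances):
    """Serialize one instance per line (a newline-delimited JSON layout)."""
    return "\n".join(json.dumps(inst, sort_keys=True) for inst in instances)


def load_instances(text):
    """Parse one JSON instance per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Instances need not share the same fields, and values may be nested.
instances = [
    {"id": 1, "label": "spam",
     "features": {"length": 120, "tokens": ["buy", "now"]}},
    {"id": 2, "label": "ham", "score": 0.25},
]
round_tripped = load_instances(dump_instances(instances))
```

Keeping each instance on its own line means a large dataset can be split at line boundaries and processed in pieces.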

ARFF format