The Apache MetaModel CSV module is one of the most advanced CSV implementations around, especially considering how simple a file format CSV is. The implementation's main features are:

  • Full implementation of DataContext and UpdateableDataContext.
  • Streaming query support without memory leaks, tested on billion-record data sets.
  • Support for parallelized row-parsing when multiline values are turned OFF. In that mode the Row objects served for queries have not yet been parsed when they are handed out, so the parsing work can be distributed across the consuming threads.
  • Support for sample-based COUNT queries when the query's COUNT select item has the "allow function approximation" flag set. This means that applications can get a quick approximation of the number of rows, even in a really big file.
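As a sketch of the last feature, the approximation flag can be set on the COUNT select item before executing the query. This assumes the SelectItem.setFunctionApproximationAllowed(boolean) method introduced in MetaModel 5.x, and the file name is of course hypothetical:

```java
import java.io.File;

import org.apache.metamodel.DataContext;
import org.apache.metamodel.csv.CsvDataContext;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.query.Query;
import org.apache.metamodel.query.SelectItem;
import org.apache.metamodel.schema.Table;

public class ApproximateCountExample {
    public static void main(String[] args) {
        DataContext dataContext = new CsvDataContext(new File("huge_file.csv"));
        Table table = dataContext.getDefaultSchema().getTable(0);

        Query query = dataContext.query().from(table).selectCount().toQuery();
        // flag the COUNT select item so the CSV module may answer with a
        // sample-based approximation instead of scanning the whole file
        for (SelectItem item : query.getSelectClause().getItems()) {
            item.setFunctionApproximationAllowed(true);
        }

        try (DataSet dataSet = dataContext.executeQuery(query)) {
            dataSet.next();
            System.out.println("Approximate row count: " + dataSet.getRow().getValue(0));
        }
    }
}
```

For small files the result will typically be exact anyway; the approximation only pays off when the file is large enough that sampling a subset is meaningfully cheaper than a full scan.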

Creating from plain old java code - CsvDataContext

This is really simple:
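A minimal sketch, assuming a local file named my_file.csv (the file name is just a placeholder):

```java
import java.io.File;

import org.apache.metamodel.DataContext;
import org.apache.metamodel.csv.CsvConfiguration;
import org.apache.metamodel.csv.CsvDataContext;

public class CreateCsvDataContextExample {
    public static void main(String[] args) {
        // simplest form: default configuration (column names on line 1,
        // comma-separated, double-quoted values)
        DataContext dataContext = new CsvDataContext(new File("my_file.csv"));

        // or pass a CsvConfiguration to control the header line number,
        // encoding, separator, quote and escape characters explicitly
        CsvConfiguration configuration = new CsvConfiguration(
                CsvConfiguration.DEFAULT_COLUMN_NAME_LINE, "UTF-8", ',', '"', '\\');
        DataContext configured = new CsvDataContext(new File("my_file.csv"), configuration);

        System.out.println(configured.getDefaultSchema().getName());
    }
}
```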

Creating from properties - CsvDataContextFactory

If you wish to construct your CSV DataContext from properties, this is also possible. For instance:
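A sketch of the properties-based approach, assuming the factory registry from the org.apache.metamodel.factory package and a placeholder resource path:

```java
import org.apache.metamodel.DataContext;
import org.apache.metamodel.factory.DataContextFactoryRegistryImpl;
import org.apache.metamodel.factory.DataContextPropertiesImpl;

public class CsvFromPropertiesExample {
    public static void main(String[] args) {
        DataContextPropertiesImpl properties = new DataContextPropertiesImpl();
        // 'type' selects the CSV module; 'resource' points to the file
        properties.put("type", "csv");
        properties.put("resource", "/path/to/my_file.csv");
        // optional parsing properties, e.g. a semicolon separator
        properties.put("separator-char", ";");

        DataContext dataContext = DataContextFactoryRegistryImpl
                .getDefaultInstance().createDataContext(properties);
    }
}
```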

The relevant properties for this type of instantiation are:

Property | Example value | Required | Description
type | csv | Yes | Must be set to 'csv' or else another type of DataContext will be constructed.
resource | /path/to/file.csv | Yes | Must reference the resource path to read/write CSV data from/to.
quote-char | " | No | The enclosing quote character to use for values in the CSV file.
separator-char | , | No | The separator character to use for separating values in the CSV file.
escape-char | \ | No | The escape character to use for escaping CSV parsing of special characters.
encoding | UTF-8 | No | The character set encoding of the data.
column-name-line-number | 1 | No | The line number which holds column names / headers.
fail-on-inconsistent-row-format | true | No | Whether or not to fail (throw an exception) on inconsistent row lengths, or to suppress these parsing issues.
multiline-values | true | No | Whether or not the data contains values spanning multiple lines (if this never happens, a faster parsing approach can be applied).

Updating CSV data

Modifying CSV data is done just like with any other MetaModel module - by implementing an update script that is then submitted to the UpdateableDataContext's executeUpdate(...) method. This approach guarantees isolation and coherence in all update operations. Here is a simple example:

File myFile = new File("unexisting_file.csv");

UpdateableDataContext dataContext = DataContextFactory.createCsvDataContext(myFile);
final Schema schema = dataContext.getDefaultSchema();
dataContext.executeUpdate(new UpdateScript() {
  public void run(UpdateCallback callback) {

    // create the table (with the columns used below) and insert a couple of records
    Table table = callback.createTable(schema, "my_table")
        .withColumn("name").withColumn("gender").withColumn("age").execute();
    callback.insertInto(table).value("name", "John Doe").value("gender", 'M').value("age", 42).execute();
    callback.insertInto(table).value("name", "Jane Doe").value("gender", 'F').value("age", 42).execute();
  }
});

If you just want to insert, update or delete a single record, you can skip the UpdateScript implementation and use the pre-built InsertInto, Update or DeleteFrom classes. Beware, though, that you then don't have any transaction boundaries or isolation between those calls:

Table table = schema.getTableByName("my_table");
dataContext.executeUpdate(new InsertInto(table).value("name", "Polly the Sheep").value("age", -1));
dataContext.executeUpdate(new Update(table).where("name").eq("Polly the Sheep").value("age", 10));
dataContext.executeUpdate(new DeleteFrom(table).where("name").eq("Polly the Sheep"));

... And just to go full circle, here's how you can continue to explore the data:  

System.out.println("Columns: " + Arrays.toString(table.getColumnNames()));
DataSet ds = dataContext.query().from(table).select(table.getColumns()).orderBy(table.getColumnByName("name")).execute();
while (ds.next()) {
   System.out.println("Row: " + Arrays.toString(ds.getRow().getValues()));
}

This snippet will print out:  

Columns: [name, gender, age]
Row: [Jane Doe, F, 42]
Row: [John Doe, M, 42]