The Apache MetaModel CSV module is one of the most advanced CSV implementations around, especially considering how simple the CSV file format is. The implementation's main features are:
- Full implementation of DataContext and UpdateableDataContext.
- Streaming query support without memory issues, tested on billion-record data sets.
- Support for parallelized row-parsing when multiline values are turned OFF. In that mode, the Row objects served for queries have not yet been parsed, so consumers can perform the parsing in parallel.
- Support for sample-based COUNT queries when the query's COUNT select item has the "allow function approximation" flag set. This means that applications can get a quick approximation of the number of rows, even in a really big file.
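The sample-based COUNT idea in the last bullet can be sketched in plain Java. This is a minimal illustration of the approximation concept (estimate the row count by dividing the file size by the average byte-length of a few sampled lines), not MetaModel's actual implementation; the class and method names here are hypothetical:

```java
import java.util.List;

public class ApproximateRowCount {

    /**
     * Estimates the number of rows in a CSV file without scanning it:
     * divide the total file size by the average byte-length of a few
     * sampled lines (+1 per line for the stripped newline character).
     */
    static long approximateCount(long fileSizeInBytes, List<String> sampledLines) {
        if (sampledLines.isEmpty()) {
            return 0;
        }
        double avgLineLength = sampledLines.stream()
                .mapToInt(line -> line.length() + 1)
                .average()
                .orElse(1.0);
        return Math.round(fileSizeInBytes / avgLineLength);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a 1 GB file, with two 13-character lines sampled
        List<String> sample = List.of("John Doe,M,42", "Jane Doe,F,42");
        System.out.println(approximateCount(1_000_000_000L, sample));
    }
}
```

The trade-off is accuracy for speed: only a handful of lines are read regardless of file size, so the estimate degrades when line lengths vary a lot across the file.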
Creating from plain old Java code - CsvDataContext
This is really simple:
Resource csvResource = new FileResource("/path/to/my/file.csv");
CsvConfiguration configuration = new CsvConfiguration(
        // arguments here to fit the resource
);
DataContext dataContext = new CsvDataContext(csvResource, configuration);
Creating from properties - CsvDataContextFactory
If you wish to construct your CSV DataContext from properties, this is also possible. For instance:
final DataContextPropertiesImpl properties = new DataContextPropertiesImpl();
properties.put("type", "csv");
properties.put("resource", "/path/to/my/file.csv");
DataContext dataContext = DataContextFactoryRegistryImpl.getDefaultInstance().createDataContext(properties);
The relevant properties for this type of instantiation are:
Property | Example value | Required | Description
---|---|---|---
type | csv | Yes | Must be set to 'csv' or else another type of DataContext will be constructed.
resource | /data/stuff.csv | Yes | The resource path to read/write CSV data from/to.
quote-char | " | No | The enclosing quote character to use for values in the CSV file.
separator-char | , | No | The separator character to use for separating values in the CSV file.
escape-char | \ | No | The escape character to use for escaping CSV parsing of special characters.
encoding | UTF-8 | No | The character set encoding of the data.
column-name-line-number | 1 | No | The line number which holds the column names / headers.
fail-on-inconsistent-row-length | true | No | Whether or not to fail (throw an exception) on inconsistent row lengths, or to suppress these parsing issues.
multiline-values | false | No | Whether or not the data contains values spanning multiple lines (if this never happens, a faster parsing approach can be applied).
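To make the quote-char, separator-char and escape-char settings concrete, here is a minimal hand-rolled parser for a single CSV line. It only illustrates what those three characters mean during parsing, and is not MetaModel's parser: the class name is hypothetical, and it deliberately ignores edge cases such as the escape character being the quote character itself:

```java
import java.util.ArrayList;
import java.util.List;

public class CsvLineParser {

    /** Splits one CSV line into values, honoring quote and escape characters. */
    static List<String> parseLine(String line, char separator, char quote, char escape) {
        List<String> values = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == escape && i + 1 < line.length()) {
                // escape character: take the next character literally
                current.append(line.charAt(++i));
            } else if (c == quote) {
                // quote character: toggle "inside quoted value" state
                inQuotes = !inQuotes;
            } else if (c == separator && !inQuotes) {
                // unquoted separator: the current value is complete
                values.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        values.add(current.toString());
        return values;
    }

    public static void main(String[] args) {
        // The comma inside the quoted value is not treated as a separator
        System.out.println(parseLine("\"Doe, John\",42", ',', '"', '\\'));
    }
}
```

With this picture in mind, the multiline-values flag also makes sense: when a quoted value may contain a line break, the parser can no longer treat each physical line as one row, which rules out the faster line-by-line strategy.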
Updating CSV data
Modifying CSV data is done just like with any other MetaModel module: by implementing an UpdateScript that is then submitted to the UpdateableDataContext's executeUpdate(...)
method. This approach guarantees isolation and coherence in all update operations. Here is a simple example:
File myFile = new File("unexisting_file.csv");
UpdateableDataContext dataContext = DataContextFactory.createCsvDataContext(myFile);
final Schema schema = dataContext.getDefaultSchema();
dataContext.executeUpdate(new UpdateScript() {
public void run(UpdateCallback callback) {
// CREATING A TABLE
Table table = callback.createTable(schema, "my_table")
.withColumn("name").ofType(ColumnType.VARCHAR)
.withColumn("gender").ofType(ColumnType.CHAR)
.withColumn("age").ofType(ColumnType.INTEGER)
.execute();
// INSERTING SOME ROWS
callback.insertInto(table).value("name","John Doe").value("gender",'M').value("age",42).execute();
callback.insertInto(table).value("name","Jane Doe").value("gender",'F').value("age",42).execute();
}
});
If you just want to insert or update a single record, you can skip the UpdateScript implementation and use the pre-built InsertInto, Update or DeleteFrom classes. Beware, though, that you then don't have any transaction boundaries or isolation between those calls:
Table table = schema.getTableByName("my_table");
dataContext.executeUpdate(new InsertInto(table).value("name", "Polly the Sheep").value("age", -1));
dataContext.executeUpdate(new Update(table).where("name").eq("Polly the Sheep").value("age", 10));
dataContext.executeUpdate(new DeleteFrom(table).where("name").eq("Polly the Sheep"));
... And just to come full circle, here's how you can continue to explore the data:
System.out.println("Columns: " + Arrays.toString(table.getColumnNames()));
DataSet ds = dataContext.query().from(table)
        .select(table.getColumns())
        .orderBy(table.getColumnByName("name"))
        .execute();
while (ds.next()) {
    System.out.println("Row: " + Arrays.toString(ds.getRow().getValues()));
}
ds.close();
This snippet will print out:
Columns: [name, gender, age]
Row: [Jane Doe,F,42]
Row: [John Doe,M,42]