A Binding is a set of (variable, value) pairs. It corresponds to "solution mapping" in SPARQL.
There are a number of activities that require being able to serialize, and read back, bindings.
A shared "bindings I/O" would mean all activities could use one, tuned, set of serialization and I/O classes.
This mini-language incorporates high-level compression (marking duplicates, use of prefixes) so,
unlike formats like SPARQL TSV Results, the number of bytes can be much less.
A sequence of bindings is written assuming there is a list of variables in force.
Position in the row determines which variable is bound to which value
(=> compression of variable names).
Turtle-style prefixes can be used (=> compression for IRIs) and the value of a slot
in a row can be "same as the row before" (=> compression for repeated terms) or undefined.
Rows end in a DOT - this is not strictly necessary but adds robustness against truncated data and bugs.
Every row is the same length, in number of terms, as the list of variables in force.
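
The row model above can be sketched in Java as follows. This is only an illustration, not the actual reader: the class name RowExpander is hypothetical, rows are assumed to be already tokenized into one string per term, and rejecting a "*" that has no term to repeat is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: expands tokenized rows into (variable -> term) maps.
final class RowExpander {
    static final String SAME = "*";   // same term as the row before
    static final String UNDEF = "-";  // variable is not bound in this row

    static List<Map<String, String>> expand(List<String> vars, List<List<String>> rows) {
        List<Map<String, String>> result = new ArrayList<>();
        // Terms of the previous row by position; null means "no term".
        String[] previous = new String[vars.size()];
        for (List<String> row : rows) {
            // Every row must have one slot per variable in force.
            if (row.size() != vars.size()) {
                throw new IllegalArgumentException(
                    "Row has " + row.size() + " terms, expected " + vars.size());
            }
            String[] current = new String[vars.size()];
            Map<String, String> binding = new LinkedHashMap<>();
            for (int i = 0; i < row.size(); i++) {
                String token = row.get(i);
                String term;
                if (UNDEF.equals(token)) {
                    term = null;
                } else if (SAME.equals(token)) {
                    term = previous[i];
                    if (term == null) {
                        // Assumption of this sketch: "*" needs a real term to repeat.
                        throw new IllegalArgumentException("'*' with no previous term at slot " + i);
                    }
                } else {
                    term = token;
                }
                current[i] = term;
                if (term != null) {
                    binding.put(vars.get(i), term);
                }
            }
            previous = current;
            result.add(binding);
        }
        return result;
    }
}
```

With variables x and y, the rows (":a", "1"), ("*", "-"), ("*", "2") expand to x=:a y=1, then x=:a alone, then x=:a y=2.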
Directives are lines that start with a keyword and also end in DOT.
The directives are PREFIX and VARS.
PREFIX: like Turtle's prefix declaration, except keyword-based to fit with this being a keyword-driven mini-language.
Blank nodes are encoded as
_:label with the additional rule that "label" is
an encoding of the real label. Reading in BindingIO format preserves labels. N-Triples rules for bNode labels mean
only ASCII letters and numbers are legal, and the label must not start with a digit.
The encoding is:
- The first letter of the label is "B" (this ensures a letter is first)
- Any character outside A-Za-z0-9 is encoded as Xnn where nn is the byte value (after UTF-8 encoding).
- X is encoded as XX.
(@@ revisit this encoding sometime)
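
The encoding rules above can be sketched in Java as below. This is an illustration under stated assumptions, not the actual implementation: the class name BNodeLabelCodec is hypothetical, and "nn" is assumed to be two uppercase hexadecimal digits, which the text does not spell out. With that assumption "XX" cannot be confused with an escaped byte, because "X" is not a hex digit.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical codec for blank node labels, following the three rules above.
final class BNodeLabelCodec {

    static String encode(String label) {
        StringBuilder sb = new StringBuilder("B");  // rule 1: a letter comes first
        for (byte b : label.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v == 'X') {
                sb.append("XX");                    // rule 3: X is doubled
            } else if (isPlain(v)) {
                sb.append((char) v);
            } else {
                // rule 2: Xnn, assuming nn is two uppercase hex digits
                sb.append('X').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    static String decode(String encoded) {
        if (encoded.isEmpty() || encoded.charAt(0) != 'B') {
            throw new IllegalArgumentException("Encoded label must start with 'B': " + encoded);
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 1;
        while (i < encoded.length()) {
            char c = encoded.charAt(i);
            if (c != 'X') {
                bytes.write(c);
                i++;
            } else if (i + 1 < encoded.length() && encoded.charAt(i + 1) == 'X') {
                bytes.write('X');
                i += 2;
            } else {
                if (i + 3 > encoded.length()) {
                    throw new IllegalArgumentException("Truncated escape in: " + encoded);
                }
                bytes.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 3;
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    private static boolean isPlain(int v) {
        return (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z') || (v >= '0' && v <= '9');
    }
}
```

Under these assumptions the label "a-b" is written as "BaX2Db" and "X1" as "BXX1", and decode reverses both.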
There is no BASE directive. IRIs are not subject to IRI resolution.
VARS: set the variables in force for subsequent rows,
until the next VARS directive.
A binding row is a sequence of terms, encoded as in Turtle but without using triple-quoted strings.
This includes prefixed names and short forms for numbers (more compression).
In addition STAR ("*") means "same term as the row before" and DASH ("-") means undef.
Do not use * to repeat a - (undef) from the previous row; write - again.
Rows end in DOT. Preferred style is one space after each term. This makes writing safe.
For presentation reasons only, blank lines are allowed (this would all get lost in the lexing/tokenization anyway).
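
The writing side of the row rules can be sketched in Java as below. This is an illustration only: the class name RowWriter is hypothetical, terms are assumed to be already in their Turtle-like written form, and the "VARS ?x ?y ." shape of the directive line is an assumption, since the text does not give its exact syntax.

```java
import java.util.List;
import java.util.Map;

// Hypothetical writer: emits a VARS line, then one row per binding.
final class RowWriter {

    static String write(List<String> vars, List<Map<String, String>> bindings) {
        StringBuilder out = new StringBuilder();
        // Assumed directive shape: keyword, ?-prefixed names, DOT.
        out.append("VARS ");
        for (String v : vars) {
            out.append('?').append(v).append(' ');
        }
        out.append(".\n");

        Map<String, String> previous = null;
        for (Map<String, String> binding : bindings) {
            for (String v : vars) {
                String term = binding.get(v);
                String token;
                if (term == null) {
                    token = "-";   // undef is always written out, never as "*"
                } else if (previous != null && term.equals(previous.get(v))) {
                    token = "*";   // same term as the row before
                } else {
                    token = term;
                }
                out.append(token).append(' ');  // one space after each term
            }
            out.append(".\n");                  // rows end in DOT
            previous = binding;
        }
        return out.toString();
    }
}
```

For variables x and y and the bindings {x=:a, y=1}, {x=:a}, {x=:a, y=1}, this produces the lines "VARS ?x ?y .", ":a 1 .", "* - ." and "* 1 .". The last row writes "1" in full because the previous row had no term for y.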
The format is text. We are writing strings anyway, so a binary form, rather than a delimited text form, is unlikely to give much advantage, and it cannot reuse the standard bytes<->chars conversion without intermediate copies.
This would all be hidden behind an interface anyway. A binary tokenizer and a binary OutputLangUtils would enable binary output.
Prefixes can be chosen dynamically.