DFDL provides user-visible regular expressions with a feature set that includes the intersection of the features of Java 7 regular expressions, and the regular expression features implemented by the ICU libraries.
However, this page is specifically about internal use of regular expressions by Daffodil in its implementation of delimiter matching.
Features needed
- Supports matching against binary data (not just character data)
- Implements POSIX longest leftmost match algorithm
- Uses non-backtracking algorithm to avoid exponential worst-time behavior
Survey of Java regular expression libraries
Library | Algorithm | POSIX longest leftmost match | License | URL |
java.util.regex | Backtracking | No | Java | (included as part of Java) |
Jakarta ORO | Backtracking | No | Apache | http://attic.apache.org/projects/jakarta-oro.html |
Jakarta Regex | Backtracking | No | Apache | http://attic.apache.org/projects/jakarta-regexp.html |
com.stevesoft.pat | Backtracking | Unknown | LGPL | http://www.javaregex.com/home.html |
dk.brics.automaton | DFA | Unknown | BSD | http://www.brics.dk/automaton/index.html |
JRegex | Backtracking | Unknown | BSD | http://jregex.sourceforge.net/ |
dregex | DFA | Unknown | BSD | https://index.scala-lang.org/marianobarrios/dregex |
Of these, the automaton library looks the most interesting from a DFA perspective.
No POSIX longest match regex library for java/scala (as far as we can tell)
Notes for implementing our own regular expression engine
- Interesting: http://matt.might.net/articles/implementation-of-regular-expression-matching-in-scheme-with-derivatives/
- A follow-on post: A non-blocking lexing toolkit for Scala in less than 800 lines of code, from regex derivatives (http://matt.might.net/articles/nonblocking-lexing-toolkit-based-on-regex-derivatives/)
- A great series of articles by Russ Cox: Implementing Regular Expressions (http://swtch.com/~rsc/regexp/)
Regular Expression Matching Can Be Simple And Fast (http://swtch.com/~rsc/regexp/regexp1.html) - Introduces several implementation techniques
Regular Expression Matching: the Virtual Machine Approach (http://swtch.com/~rsc/regexp/regexp2.html) - Discusses several implementation details, including a section on POSIX longest leftmost matching. Includes some links to some test suites for POSIX matching rules.
Regular Expression Matching in the Wild (http://swtch.com/~rsc/regexp/regexp3.html) - Tour of the re2 library (written in C++) - useful for translating the implementation to Java or Scala (which no one has apparently done yet)
- Someone started a binary parser and pickler combinator library. https://github.com/teigen/byteme Still looks immature, but might provide some good ideas.
- I started a Scala implementation of a regular expression language specific to XML Schema last year. It uses the automaton classes in scala.util.automata. It's a really basic implementation, but I'd be happy to contribute it. The automata classes in particular are not specific to characters, so they could be adapted to match binary data instead. --Jonathan Cranford
Miscellaneous notes:
java scanner - takes Readable as input (good), doesn't allow regex that matches at beginning of stream.
Implement longest match by rotating through mulitiple regex matchers, which determine if the pattern matches immediately at the beginning of the supplied characters.
Java regex Matcher class takes only CharSequence as input.
CharSequence is finite (has length() method) which precludes (most likely) creating a virtualized stream-like behavior underneath our own CharSequence implementation.
---------------------
Scala ||| combinator of regex parsers produces longest match to the combined regex's.
Scala combinator parsers read from a Reader.
The reader can virtualize the character stream up until a decode error, or size limit of our choice, and can hide all details of when additional blocks of data must be retrieved.
val rdr = new InputStreamReader(Channels.newInputStream(rbc : ReadableByteChannel)
// Must figure out what happens if character decoding gets an error.
// Worst case, we have to implement Reader interface ourselves with code that calls the decoder.
// length limit on delimiter match could be implemented by creating a finites substream from the rbc, instead of just handing the original rbc to it.
val pr = Parser.parse(rdr)