Problem Statement and Requirements

The problem: we are parsing a stream of messages from an input stream, and bad data is encountered. Can we recover, somehow find our way to the next good data, and continue parsing correct data?

At minimum the parse should not hang when trying to recover.

We want to characterize the situations when recovery is possible, and how it might be carried out.

We assume there is no back path to the sender, the sender being the system that is putting messages (or bad data) onto the stream that the Daffodil parser is reading from.

We assume that we can break/close the connection so that the sender must reconnect. However, we assume the sender has no idea if the data they are sending is good or bad. Forcing them to break and reconnect may only cause them to retransmit bad data on the new connection.

So, the sending-side system doesn't know it is sending bad data. It's just sending, sending, sending....

The consuming system (Daffodil parsing) has to cope and not fail no matter what is being sent. It can log errors, but it can't hang or fail.

Requirements on the DFDL Schema

Any "message format" DFDL schema has to not consume to end-of-stream in the case of malformed data. That is, a parse() call cannot hang.

The DFDL schema has to be designed to recover by having an element definition for an "unrecognized" element, available as a last choice branch.

A good design principle for DFDL schemas is to have an external variable which controls whether the parse fails on unrecognized data, or the schema provides for creating and populating an "unrecognized" element with the unrecognized data.
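As a sketch of that principle (the variable name, the type name, and the assert wiring are invented for illustration), the external variable might be declared in a top-level annotation and then consulted by an assert on the last choice branch:

```xml
<!-- Top-level schema annotation; "captureUnrecognized" is an invented name -->
<xs:annotation>
  <xs:appinfo source="http://www.ogf.org/dfdl/">
    <dfdl:defineVariable name="captureUnrecognized" type="xs:boolean"
                         external="true" defaultValue="false"/>
  </xs:appinfo>
</xs:annotation>

<!-- Last branch of the message choice; this branch fails (and hence the
     whole parse fails) unless the external variable has been set true -->
<xs:element name="unrecognized" type="tns:unrecognizedType">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:assert test="{ $tns:captureUnrecognized }"/>
    </xs:appinfo>
  </xs:annotation>
</xs:element>
```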

The question is, if the data is unrecognized, how do we know how much to capture from the input stream?

The answer depends on the nature of the DFDL schema-described format, and whether we have any information about how much malformed data to consume from the data stream.

Failure Modes

The principle here has to be that the amount of data consumed is always at least bounded by some upper limit.

There are two kinds of failures to consider:

  1. Unknown structure - the schema is parsing successfully and the header is recognized, but some unknown payload within the message is unrecognized. There are two variants here:
    a. we parsed enough to have positive knowledge of the message length
    b. the length cannot be determined positively
  2. Garbage - the parse is failing - some fields in the header are malformed, so we can't even parse the header.

Unknown Payload - Known Length

In case (1a), we can determine the length, so we can consume that much data into an unrecognized element. This is the easy case. However, we do need to deal with the case where the given length is entirely unreasonable: if the corruption falls exactly within the length field of the header, the length value may be vast. So the length needs to be limited to some finite maximum. If the length exceeds this maximum, then we can treat the case the same as the Unknown Payload - Unknown Length case below, or the Garbage data case below.
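A sketch of such a capped, length-driven capture (the header path tns:header/tns:messageLength and the 65536-byte cap are invented for illustration):

```xml
<xs:element name="unrecognized" type="xs:hexBinary"
            dfdl:lengthKind="explicit" dfdl:lengthUnits="bytes"
            dfdl:length="{ ../tns:header/tns:messageLength }">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <!-- reject absurd lengths so a corrupted length field cannot
           force a near-unbounded read -->
      <dfdl:assert test="{ ../tns:header/tns:messageLength le 65536 }"/>
    </xs:appinfo>
  </xs:annotation>
</xs:element>
```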

Unknown Payload - Unknown Length

In case (1b), the ability to recover depends on the ability to determine the length somehow. This is schema-specific. In the case of textual data, useful heuristics include looking for a line-ending, form-feed, or other "likely terminator", that is, something that signifies the likely end of the unrecognized payload. The heuristic could also look for a few bytes that are typical at the start of a new message. This is just a heuristic though.

But there also has to be a finite upper bound. A value might be chosen based on the minimum/maximum size of these messages if that is knowable.

I think if the element that grabs the unrecognized part of the message uses lengthKind="pattern" to express these heuristics, then Daffodil's tunable limits on regex match size will actually suffice to make this finite, and those limits can be tuned depending on system needs.
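For example, a text-format capture element using lengthKind="pattern" might look like the following sketch (the regex and the 2048-character bound are invented heuristics; it consumes characters up to, but not including, the first line ending, or at most 2048 characters, and matches length 0 if the very next character is a line ending):

```xml
<!-- Sketch; the regex and the 2048 bound are invented heuristics -->
<xs:element name="unrecognized" type="xs:string"
            dfdl:lengthKind="pattern"
            dfdl:lengthPattern="[^\n\r]{0,2048}"/>
```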

So long as there is an upper bound on the length it will consume, the parse will not hang, so long as the sender system continues to send more data.

If it cannot find some heuristic end of message within some number of bytes, it should probably just parse the header fields (if possible), not consume any additional data, and try parsing again immediately after the header. Note that if the unrecognized element uses lengthKind="pattern", this is the default behavior, as failure to match is equivalent to finding length 0.

If this comes back with length 0, then I think it makes sense to use Daffodil's lookahead feature (not in standard DFDL as yet) to grab, say, the first 8 bytes of the message for use in the unrecognized element. We may want to distinguish unrecognized with known length, unrecognized with heuristic length, and unrecognized starting with these 8 bytes. That may be overkill.

If there is no reasonable heuristic to determine a likely run of malformed data, then treat as in the Garbage data case below.

Garbage In Data Stream

In case (2), the data is garbage and we get a failed parse. So long as the data stream is not at its end, what we want to do is advance 1 unit of data. This might be 1 byte for byte-oriented data formats, but could be smaller. It could be 4 bits for hex/nibble-oriented data, or even just 1 bit for bit-granularity data. For this discussion we'll assume 1 byte.

So we want to consume 1 byte and try the parse again.

This can still be done in the DFDL schema, by having a final choice branch that consists of:

<xs:element name="malformed" type="tns:invalidByte"/>

where the type tns:invalidByte is defined as:

<xs:simpleType name="invalidByte" dfdl:representation="binary" dfdl:lengthKind="implicit">
    <xs:restriction base="xs:unsignedByte">
        <xs:maxExclusive value="0"/> <!-- can never pass. Always will be invalid. -->
    </xs:restriction>
</xs:simpleType>

This will create a "well-formed" infoset containing these elements that are named "malformed". One byte of data will be consumed from the input stream.

Validation against the DFDL schema's facets will show that this data is invalid. However, in this case invalidity really means the data is malformed. Based on this invalidity, an application can discard the element (possibly logging the issue) and parse again, now starting 1 byte later. It will continue to consume bytes in this way until it finds something that parses successfully; if there are well-formed messages later in the stream, it should re-synchronize with them.

A good design here would avoid logging for every byte and instead aggregate consecutive malformed bytes into a single log message about a run of such bytes. There might also be detection of a run longer than N bytes, which could trigger a higher-level failure that actually stops the processing (perhaps by breaking the connection).

An assumption here is that the parses will likely fail fast until finally one succeeds. Formats that are not highly selective about identifying bad data won't work so well here.
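The skip-and-retry loop with run aggregation might be driven from the application side roughly as follows. This is an illustrative simulation, not the Daffodil API: the b"MSG"-prefixed framing, the function names, and the 64-byte run limit are all invented for the sketch.

```python
# Illustrative simulation of the advance-one-byte recovery loop.
# A "valid message" here is the invented framing: b"MSG" + 1 length byte + payload.

def try_parse(data, pos):
    """Return (payload, new_pos) on a successful parse, or None on failure."""
    if data[pos:pos + 3] != b"MSG" or pos + 4 > len(data):
        return None
    length = data[pos + 3]
    end = pos + 4 + length
    if end > len(data):
        return None
    return data[pos + 4:end], end

def recover_parse(data, max_run=64):
    """Parse messages, skipping garbage one byte at a time and aggregating
    consecutive malformed bytes into (start, length) run reports."""
    pos, run_start = 0, None
    messages, runs = [], []
    while pos < len(data):
        result = try_parse(data, pos)
        if result is not None:
            if run_start is not None:
                runs.append((run_start, pos - run_start))  # close out garbage run
                run_start = None
            msg, pos = result
            messages.append(msg)
        else:
            if run_start is None:
                run_start = pos
            pos += 1  # advance exactly one byte and retry the parse
            if pos - run_start > max_run:
                # a run longer than max_run triggers a higher-level failure
                raise RuntimeError("garbage run exceeded %d bytes" % max_run)
    if run_start is not None:
        runs.append((run_start, pos - run_start))
    return messages, runs
```

Note that one log entry is produced per run of malformed bytes, not per byte, and the run-length cap provides the higher-level stop condition discussed above.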

This technique is used in an example on openDFDL: https://github.com/OpenDFDL/examples/tree/master/hexWords

Timeouts - Finite Data Size and Finite Delay

For the 1a and 1b techniques described above, recovery requires that when data is unrecognized and doesn't match the schema, we can parse an "unrecognized" element with a finite maximum size, so that the process of accumulating this unrecognized data will not take forever.

The finiteness needs also to apply to time. That is, the amount of time to wait while parsing 1 message, from the time it gets any data to the time when it gives up, must also be finite. So we can block indefinitely waiting for 0 bytes, but once we get at least 1 byte, we start a timer, and if we don't receive any more bytes, so that we can't finish the parse within the time limit, we should give up. Each time we get some data, even just 1 more byte, the timer should reset. So it's not the total time to parse a message; it's the delay time between receipts of data for a message.

So, imagine the hard bound on max element size was 2K, but the stream blocks trying to get more data to find a terminator, and it hasn't reached the 2K limit yet. How long should we wait? I would claim something between 10 seconds and a minute is probably about right.
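As an application-side sketch of this inter-receipt timer (the helper function and its parameters are invented), note that Python's socket.settimeout applies to each blocking recv call individually, which gives exactly the reset-on-each-receipt behavior described:

```python
import socket

def read_message_bytes(sock, max_bytes, inter_byte_timeout):
    """Accumulate up to max_bytes from sock, giving up if the gap between
    successive receipts exceeds inter_byte_timeout seconds."""
    # settimeout applies to each recv() call, so the timer effectively
    # resets every time any data arrives. (The text above allows blocking
    # indefinitely for the first byte; for simplicity this sketch applies
    # the timeout from the start.)
    sock.settimeout(inter_byte_timeout)
    buf = bytearray()
    while len(buf) < max_bytes:
        try:
            chunk = sock.recv(max_bytes - len(buf))
        except socket.timeout:
            break  # inter-byte delay exceeded: give up on this stream
        if not chunk:
            break  # sender closed the connection
        buf.extend(chunk)
    return bytes(buf)
```

On timeout the caller should then break the connection, per the principle below that a timed-out stream is finished for both parties.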

The key is that after some timeout, the connection is broken, so that daffodil gets an end-of-data for the parser, and both daffodil (listening and reading from) and the sender (connecting and writing to) get a broken connection and have to go around a loop and re-establish the connection again.

A key principle is if you timeout on a read request, that's it for that stream. The connection needs to be re-established from both parties.

Note that this timeout-related technique does not apply to recovery per the Garbage In Data Stream technique described above, since that technique waits for exactly 1 byte (or less) of data at a time.

Non-Solution: Full Message Handoffs - Using Delimiters or Headers

It has been suggested that the sending system should somehow separate messages so that we can tell when reading where one message ends and the next begins without having to parse the data.

This can be a valuable technique, but is not a general solution to the problem.

First, not all data can be delimited - binary data generally cannot support delimiters as any given delimiter pattern could appear within legal binary data.

Even when data is textual and can be delimited, there may not be any characters that can be reserved for delimiting.  That is, the textual format may allow any character string at all.

Because delimiters don't work for binary data, one must instead encapsulate the data or precede each message with a header that gives the length of the message. This is a common strategy and there are many semi-standard message header formats such as mil-std-2045, JREAP-C, NACT, and SIMPLE which are used with the binary military messaging formats.
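A minimal length-prefix framing sketch (a 4-byte big-endian length header with a sanity cap; the header layout, names, and 1 MiB cap are invented for illustration, not any of the standard header formats named above):

```python
import struct

MAX_LEN = 1 << 20  # sanity cap: reject absurd lengths from corrupted headers

def frame(payload: bytes) -> bytes:
    """Prepend a 4-byte big-endian length header to the payload."""
    return struct.pack(">I", len(payload)) + payload

def deframe(data: bytes, pos: int):
    """Return (payload, new_pos), or None if the data at pos is incomplete
    or the length field is implausible."""
    if pos + 4 > len(data):
        return None
    (length,) = struct.unpack_from(">I", data, pos)
    if length > MAX_LEN:
        return None  # corrupted length field: fall back to byte-advance recovery
    end = pos + 4 + length
    if end > len(data):
        return None
    return data[pos + 4:end], end
```

The MAX_LEN check illustrates the next paragraph's point: even with a framing header, a corrupted length field must be bounded by a sanity limit.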

One difficulty here is that a problem remains if the header information itself is corrupted. We have presumably made the problem simpler, because the header format is simpler than the general data format. But we still have these challenges if the header, with its length information, has been corrupted in transport.

In addition, somebody has to "go first" and be the process that parses raw data messages "the hard way" so that standard headers can be created and the messages enhanced with those headers. It makes sense that this first processing stage is a parser, e.g., DFDL/Daffodil. 

So ultimately, headers/delimiters aren't a solution.
