2024-07-17

The Daffodil runtime performance, as of version 3.8.0, is too slow.

This memo describes a proposed set of studies and actions to be taken to improve the performance.

Parsing speed is the first target.

Runtime Performance Goals

Our goal is to get within a small constant factor of the speed of a reasonable handwritten parser/unparser.

Only very extreme performance needs should require creating a handwritten Java parser/unparser for a format. Daffodil should be sufficient for most cases.

The generality of DFDL, e.g., every field could have a different charset encoding, should not create a performance penalty for the far more common case where only one charset encoding, or only a few, are used throughout the schema. Similarly, DFDL allows bit and byte order to change frequently, but in most realistic DFDL schemas they will either not change, or will change once per parse/unparse top-level call due to an envelope/payload idiom where the envelope format uses a different bit/byte order than the payload.

Below are areas of investigation and specific steps to take to help determine where to focus efforts to improve parsing performance. Some of these areas are based on initial investigations or testing that revealed potential areas of slowness. Others are based on areas where more information is needed to guide further investigations.

ARINC424 Performance

A user reports that Daffodil 3.7.0 parses a 5-million record ARINC424 file in 12 minutes, whereas the same file takes under 5 minutes with industry tools. We need to better understand Daffodil performance compared to "ideal" performance and come up with a realistic goal for Daffodil.

ACTION: Parse using the existing ARINC424 schema, confirm the poor performance, and determine potential environmental variables (e.g. CPU, memory) that could affect this.
RESULT:

Tested with Daffodil 3.8.0, parsing a ~5 million record file (~640 MiB) using a pre-compiled binary and creating an XML infoset redirected to /dev/null. The JVM was given 9GB; memory usage never exceeded 3GB.

Parse times:

  • ARINC-choiceDispatch.dfdl.xsd: 230 seconds (3.8 minutes)
  • ARINC-speculative.dfdl.xsd: 1025 seconds (17.0 minutes)

The speculative parsing schema is significantly slower, which is not unexpected. The way the speculative schema is written, it first attempts to parse a line as a specific record type. If any fields are invalid for that record (e.g. Subsection Code, Application Type), then it backtracks to the beginning of the line and tries the next record type. Worst case this is O(N²) to find the correct record type, and each attempt at a record usually requires parsing multiple fields before determining it parsed the wrong record type and backtracking. One possible fix is for Daffodil to detect which fields are discriminated on, parse only those bits, and then exclude branches that are known to fail the discriminator, essentially automating something similar to choice dispatch (see the sketch below). The complexity of such an algorithm might not be worth the effort. We may just recommend that speculative parsing only be used as a last resort, and to use choice dispatch where possible.
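
For intuition, the difference between the two strategies resembles the following Java sketch (the RecordParser interface, the key offset, and the record map are hypothetical, not ARINC424's actual layout):

  import java.nio.charset.StandardCharsets;
  import java.util.List;
  import java.util.Map;

  interface RecordParser {
      Object parse(byte[] line) throws Exception; // throws on a failed discriminator
  }

  class Strategies {
      // Speculative: try each candidate in order, discarding partial work on
      // failure. Worst case every candidate is attempted for every record.
      static Object speculative(List<RecordParser> candidates, byte[] line) throws Exception {
          for (RecordParser p : candidates) {
              try {
                  return p.parse(line);
              } catch (Exception discriminatorFailed) {
                  // backtrack: fall through and try the next record type
              }
          }
          throw new Exception("no record type matched");
      }

      // Choice dispatch: read the discriminating field once, then jump
      // directly to the correct parser. One attempt per record.
      static Object dispatch(Map<String, RecordParser> byKey, byte[] line) throws Exception {
          String key = new String(line, 4, 2, StandardCharsets.US_ASCII); // hypothetical key offset
          RecordParser p = byKey.get(key);
          if (p == null) throw new Exception("unknown record type: " + key);
          return p.parse(line);
      }
  }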

The user reported that the choice dispatch variant parsed in 3.5 minutes and the speculative variant in 15 minutes, which matches the above results. The poor performance is therefore likely not due to different input files, a slower CPU, insufficient memory, or some other environmental factor.

ACTION: Create a Java "reasonable handcoded" implementation to compare against.
RESULT:
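
While this action is pending, the following minimal Java sketch illustrates what a "reasonable handcoded" parser for a fixed-width line format might look like; the field offsets below are hypothetical, not the actual ARINC424 layout:

  import java.io.BufferedReader;
  import java.io.FileReader;

  public class HandcodedParser {
      public static void main(String[] args) throws Exception {
          long count = 0;
          try (BufferedReader r = new BufferedReader(new FileReader(args[0]), 1 << 16)) {
              String line;
              while ((line = r.readLine()) != null) {
                  // Fixed-offset field extraction: no backtracking, no regex,
                  // no per-field parser objects.
                  String recordType = line.substring(4, 6);      // hypothetical offsets
                  String ident = line.substring(13, 18).trim();  // hypothetical offsets
                  count++;
              }
          }
          System.out.println(count + " records");
      }
  }

A parser along these lines gives a lower bound on per-record cost: one line read, a few substring calls, and none of DFDL's generality overhead.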


ACTION: Obtain existing tools (some open source libraries are available: https://github.com/topics/arinc424) and compare.
RESULT:

Binary vs Textual

Although poor performance has mainly been reported with text-based ARINC424, we must determine whether it is specific to text or is a more general issue that affects binary formats as well.

ACTION: Create a Java "similar binary" format: same number of fields as ARINC424, but binary data for all numeric values, enums for symbolic values. I.e., minimize strings.
RESULT:
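
While this action is pending, a sketch of the kind of record such a binary format might contain, read with plain Java I/O (the field layout is hypothetical):

  import java.io.BufferedInputStream;
  import java.io.DataInputStream;
  import java.io.EOFException;
  import java.io.FileInputStream;

  public class BinaryReader {
      public static void main(String[] args) throws Exception {
          long count = 0;
          try (DataInputStream in = new DataInputStream(
                  new BufferedInputStream(new FileInputStream(args[0]), 1 << 16))) {
              try {
                  while (true) {
                      // Hypothetical record: 1-byte type enum, 4-byte int, 8-byte double.
                      int type = in.readUnsignedByte();
                      int value = in.readInt();
                      double measure = in.readDouble();
                      count++;
                  }
              } catch (EOFException done) {
                  // end of input
              }
          }
          System.out.println(count + " records");
      }
  }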


ACTION: Create a DFDL schema that parses this binary format, and compare Daffodil performance with the ideal Java implementation.
RESULT:

Tested with Daffodil 3.8.0, parsing a ~5 million record ARINC text file (~617 MB) and a ~5 million record binary file (~603 MB) using a pre-compiled binary and creating an XML infoset redirected to /dev/null. The JVM was given 9GB; memory usage never exceeded 4GB.

Parse times:

  • ARINC-choiceDispatch.dfdl.xsd (text): 222 seconds (3.7 minutes)
  • binaryFormat-choiceDispatch.dfdl.xsd (binary): 114 seconds (1.9 minutes)

InfosetOutputters

Converting the internal infoset to XML, EXI, JSON, etc. incurs overhead that some testing shows can be fairly large. We must isolate the time required to parse with no infoset output, and determine the overhead of writing XML/EXI/etc. via the various InfosetOutputters.

ACTION: Run Daffodil with the Null, XML, and EXI InfosetOutputters and calculate the overhead of XML/EXI.
RESULT:

Tested with Daffodil 3.8.0, parsing a ~5 million record file (~640 MiB) using a pre-compiled binary and creating different infosets redirected to /dev/null. The JVM was given 9GB.


           ARINC-choiceDispatch.dfdl.xsd   ARINC-speculative.dfdl.xsd
  XML      151 seconds (2.51 minutes)      591 seconds (9.85 minutes)
  EXI      189 seconds (3.15 minutes)      635 seconds (10.6 minutes)
  EXI SA   174 seconds (2.9 minutes)       613 seconds (10.2 minutes)
  NULL     126 seconds (2.1 minutes)       549 seconds (9.1 minutes)

With speculative parsing, the overhead of outputting to either XML or EXI seems to be overshadowed by excessive backtracking and I/O operations. But with the more efficient choice dispatch schema, the overhead becomes more obvious, incurring roughly an additional 20-50% over the NULL baseline. Improving the speed of our InfosetOutputters could give noticeable gains. Note, however, that EXI is a separate library that we do not have the resources to optimize, and the XML InfosetOutputter is fairly simple, so finding places to optimize might be difficult.


Profiling found possible overhead when remapping strings to the Unicode private use area (PUA) when converting to XML. This was on a format where remapping should not be necessary. It's possible this is just due to scanning to determine whether any PUA-remappable characters exist. We should investigate this algorithm and determine if any improvements are possible. Note that remapping only occurs for xs:string types and types derived from xs:string, so any changes would likely not affect performance for primarily binary/numeric formats.
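
The general scan-then-remap pattern under investigation looks roughly like the following Java sketch (the legality check and PUA mapping are illustrative simplifications, not Daffodil's exact rules); the key point is that the common no-remap case should return the original string without allocating:

  final class PuaRemap {
      static String remapToPUA(String s) {
          final int n = s.length();
          int i = 0;
          while (i < n && isXmlLegal(s.charAt(i))) i++;   // pure scan, no allocation
          if (i == n) return s;                           // common case: nothing to remap
          StringBuilder sb = new StringBuilder(n).append(s, 0, i);
          for (; i < n; i++) {
              char c = s.charAt(i);
              sb.append(isXmlLegal(c) ? c : (char) (0xE000 + c)); // shift into PUA (illustrative)
          }
          return sb.toString();
      }

      static boolean isXmlLegal(char c) {
          // Simplified XML 1.0 Char check (ignores surrogate pairing for brevity)
          return c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xD7FF) || c >= 0xE000;
      }
  }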

ACTION: Parse a file that does not need any remapping, but do so once with remapping logic enabled and again with all remapping logic commented out. This should provide an estimate of how much overhead exists even when remapping is not required. Based on the results, determine if any optimizations can be made to the remap algorithm, or if the remap logic can be skipped entirely in some cases.
RESULT: Removing the line that remaps illegal characters to the PUA in XMLTextInfosetOutputter showed no noticeable difference in performance.
ACTION: Aside from illegal XML characters mapped to the PUA, some characters are legal but must be escaped, such as <, >, &, and quote. We should comment out the code that checks whether any of these characters exist and need escaping, determine the overhead, and see if there is anything we can do to minimize it.
RESULT: Removing character escaping showed no noticeable difference in performance.

EXI output is notably slower than XML output. We need to determine if this is because of something in Daffodil or the EXI library.

ACTION: For EXI output, Daffodil connects the SAXInfosetOutputter to the EXIficient library. We should modify -I sax so that it ignores all SAX events, compare that with -I null, and then compare those numbers with -I exi. This should help us determine whether the overhead is in our SAXInfosetOutputter or in the EXI library.
RESULT: Modifying -I sax to ignore SAX events resulted in performance similar to using -I null. Therefore the overhead of -I exi does not come from how we create SAX events; it lies in the external EXIficient library.
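
For reference, the "ignore all SAX events" handler used for this kind of experiment can be as simple as the following sketch; comparing it against -I null isolates the cost of producing SAX events from the cost of consuming them:

  import org.xml.sax.Attributes;
  import org.xml.sax.helpers.DefaultHandler;

  // Receives every SAX event but does nothing with it.
  public class IgnoreEventsHandler extends DefaultHandler {
      @Override public void startElement(String uri, String local, String qName, Attributes atts) { }
      @Override public void endElement(String uri, String local, String qName) { }
      @Override public void characters(char[] ch, int start, int length) { }
  }
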
ACTION: Investigate the output stream with -I exi and confirm that output is buffered. If not, modify it so it is buffered and compare the results. Buffered output has been known to make a noticeable difference for the XMLTextInfosetOutputter, so the same may be true for EXI.
RESULT: Verified by looking through the EXIficient code that it buffers the output if it is not already buffered.

BucketingInputSource vs ByteBufferInputSource

Daffodil has two kinds of input sources. The first is the BucketingInputSource, which streams data from an InputStream and supports unbounded data size (assuming things like no excessive backtracking, an infoset that can be streamed out with resources freed, etc.). The second is the ByteBufferInputSource, which requires all bytes to be read into memory and has a limit of 2GB. It has been observed that the BucketingInputSource is noticeably slower than the ByteBufferInputSource.

ACTION: Parse multiple formats using a ByteBufferInputSource vs a BucketingInputSource to quantify the performance differences.
RESULT:

Modified Daffodil CLI to use a ByteBufferInputSource backed by a MappedByteBuffer for a 640MiB ARINC file with a null infoset outputter. Compared against unmodified Daffodil using a BucketingInputSource. Also compared against reading all bytes into an array and wrapping the array in a ByteBuffer to compare mapped vs non-mapped.


                                       ARINC-choiceDispatch.dfdl.xsd   ARINC-speculative.dfdl.xsd
  BucketingInputSource                 180 seconds (3.0 minutes)       865 seconds (14.4 minutes)
  ByteBufferInputSource (mapped)       151 seconds (2.5 minutes)       745 seconds (12.4 minutes)
  ByteBufferInputSource (non-mapped)   156 seconds (2.6 minutes)       795 seconds (13.2 minutes)

With choiceDispatch, there appears to be about 20% overhead when using the BucketingInputSource. The overhead is smaller, but still noticeable, with the speculative parsing schema, likely because the excessive backtracking plays more of a role than the input source.

It would likely be worth investigating the bucketing algorithm for potential improvements.


ACTION: Investigate algorithmic changes to the BucketingInputSource that could speed it up.
RESULT:

See DAFFODIL-2920 and PR #1273 (this has been merged), which reduced overhead compared to ByteBufferInputSource from about 15% to 5%.

Note that one change that was not included was an increase in the default bucket size to something larger than 8k. I tested various default sizes ranging as high as 68MB and did not see a change in performance. This may be worth reinvestigating with different format types and file sizes, but being relatively close in performance to the ByteBufferInputSource, we are likely to see only minimal gains at best.

A large suite of tests was run with this change and compared. There are too many numbers to reasonably include and many schemas are not publicly available, but based on the results we can conclude a few things:

  1. This change in the mentioned PR did make a noticeable difference for a large group of files, ranging from just a few percent faster up to 15-25+% faster.
  2. In many of the cases where improvements were seen, bucketing ended up less than 5% slower than non-bucketing, usually only 1-2% slower. If performance is critical one should use the non-bucketing algorithm, but if switching to it is not possible (e.g. streaming, or files larger than 2GB), then in most cases the overhead is fairly minimal.
  3. A small number of formats did not see benefits from this change, or did but still experienced slower performance with the bucketing algorithm. Of the publicly available schemas, this includes BMP (no improvement) and NITF (5-10% improvement but still 20% slower). Additional investigations are needed to determine issues specific to these formats.


ACTION: If a performance gap remains, modify the CLI so parsing files under 2GB uses the ByteBufferInputSource backed by a MappedByteBuffer. This takes advantage of operating system optimizations when mmap-ing files and avoids BucketingInputSource overhead. Add documentation about the overhead of the BucketingInputSource.
RESULT:

A small gap of about 5% still remains after optimizing the BucketingInputSource.

See DAFFODIL-2921 and PR #1274, which modifies CLI parse to use a mapped byte buffer when parsing files <= 2GB.
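
The core of such a CLI change is ordinary Java NIO file mapping; a minimal sketch:

  import java.io.IOException;
  import java.nio.MappedByteBuffer;
  import java.nio.channels.FileChannel;
  import java.nio.file.Path;
  import java.nio.file.StandardOpenOption;

  final class MapInput {
      // Memory-map the input rather than streaming it: the OS pages bytes in on
      // demand with no user-space copy loop. FileChannel.map is limited to
      // Integer.MAX_VALUE bytes, which is where the 2GB restriction comes from.
      static MappedByteBuffer mapInput(Path file) throws IOException {
          try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
              // The mapping remains valid after the channel is closed.
              return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
          }
      }
  }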

Decoding Performance

Character decode performance is known to be quite slow. Part of this is because our character set decoding algorithms are custom written, likely unoptimized, and read single bytes and decode single characters at a time. A new draft implementation modifies the Daffodil decoders to use built-in Java decoders and decode larger chunks of bytes at a time. Not only does this avoid byte-at-a-time decoding, it also reduces maintenance since we can rely on the Java decoders. Initial results show that this can make a vast improvement in raw decode speed, coming closer to ideal Java decode performance.
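
The core idea of the draft, bulk decoding via Java's built-in decoders, resembles this sketch:

  import java.nio.ByteBuffer;
  import java.nio.CharBuffer;
  import java.nio.charset.CharsetDecoder;
  import java.nio.charset.CodingErrorAction;
  import java.nio.charset.StandardCharsets;

  final class BulkDecode {
      // Decode a whole buffer of bytes in one call instead of looping one
      // byte/character at a time; the JDK decoder's inner loop is heavily optimized.
      static CharBuffer decodeChunk(ByteBuffer bytes) {
          CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
              .onMalformedInput(CodingErrorAction.REPLACE)
              .onUnmappableCharacter(CodingErrorAction.REPLACE);
          CharBuffer out = CharBuffer.allocate(bytes.remaining()); // UTF-8: chars <= bytes
          dec.decode(bytes, out, true);
          dec.flush(out);
          out.flip();
          return out;
      }
  }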

Note that performance improvements likely only help when decoding large strings. If a file is made up of many small strings (<5 characters), decoder performance likely will not make a drastic difference. Files like ARINC424 are mostly small strings, so may not see a large benefit from this change, but this has not been thoroughly investigated yet.

Caveats:

  • The draft implementation only changes character sets where Java has a built-in decoder (e.g. ASCII, UTF-8). Other character sets, such as densely packed non-byte-size military character sets, still use the old single byte/char algorithm. However, formats that use these character sets do not make heavy use of strings, so performance improvements to these decoders would likely provide little value. Improvements to them should only be made if they are found to be a bottleneck.
  • Daffodil's delimiter scanner relies on the old decoders that decode one character at a time. In order to merge the new decoders, we likely need to change the delimiter scanning implementation so it scans for bytes instead of characters, removing the reliance on character decoding. This likely needs to be done anyway to match the DFDL specification, which requires this behavior. It may also speed up delimiter scanning since it avoids decoding characters. Note that formats like ARINC424 do not make heavy use of delimiters, so they would likely not see an improvement from this change.
ACTION: Compare parsing ARINC424 with and without the draft implementation. Changes will likely be needed to remove delimiters from the schema.
RESULT:

Tests saw about a 9-10% increase in performance when using the draft decode implementation. Note that ARINC is primarily made up of relatively short strings, so a relatively small increase in performance is not unexpected, since decoder performance improves the most when parsing fairly long strings. That said, 10% is still a noticeable improvement and worth incorporating into Daffodil, especially considering some text formats might have fairly long strings.

Note that later investigations, reported in the "Large Number of Infoset Elements" section, compare Daffodil 3.8.0 with and without the draft decoder change (labeled "Updated Variant") using non-ARINC data/schemas. Although there is some improvement when decoding shorter strings, significant performance increases are not observed until larger strings (e.g. 10+ characters) are decoded. This may indicate that the majority of the overhead is not in the decoders and lies elsewhere, but it also shows that decoder improvements can lead to significant performance gains in specific cases.

The primary barrier to merging the draft is that delimiter scanning only works for fixed width encodings, which means no UTF-8 delimiter scanning. Supporting this with the new decoders likely requires rewriting the delimiter scanning logic to be byte based rather than character based. This is a worthwhile endeavor as it will probably also speed up delimiter scanning, but this is likely not trivial to implement.
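
A byte-based scan would encode the delimiter once and search raw bytes, something like this sketch (a production version would use a streaming search rather than a nested loop, and must account for variable-width encodings where a naive byte match could misalign; UTF-8 is safe because its byte sequences are self-synchronizing):

  import java.nio.charset.StandardCharsets;

  final class DelimiterScan {
      // Returns the offset of the delimiter's first byte, or -1 if absent.
      static int findDelimiter(byte[] data, int from, String delim) {
          byte[] d = delim.getBytes(StandardCharsets.UTF_8); // encode once, up front
          outer:
          for (int i = from; i <= data.length - d.length; i++) {
              for (int j = 0; j < d.length; j++) {
                  if (data[i + j] != d[j]) continue outer;
              }
              return i; // no character decoding performed at all
          }
          return -1;
      }
  }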

ACTION: Rewrite delimiter scanning to be byte-based rather than character-based. This should improve performance: it avoids the double decoding we do now and allows use of the faster decode algorithms.
RESULT:


Entity Replacement

The ARINC424 schema was found to do things like dfdl:choiceDispatchKey="{ dfdl:encodeDFDLEntities(../foo) }", which is needed because ../foo could contain spaces. This results in multiple "cookers" that are slow enough to show up while profiling, since these cookers are relatively complex and make heavy use of regular expressions.

ACTION: Modify expressions in ARINC424 to do something like dfdl:choiceDispatchKey="{ if (../foo eq ' ') then '__SPACE__' else ../foo }". This makes for more complex expressions but means the dispatch values do not require any cooking. Determine the performance impact.
RESULT:

The only character that needs to be encoded in the schema and test data is a 'space' character used in choiceDispatchKey/choiceBranchKey, which must be encoded to %SP;.

To test the overhead of handling 'space', the following command was run to avoid the call to dfdl:encodeDFDLEntities by manually encoding the space character.

sed -i "s/dfdl:encodeDFDLEntities(\(.*\))/if (\1 eq ' ') then '%SP;' else \1/" src/main/resources/org/mitre/arinc424/xsd/ARINC-choiceDispatch.dfdl.xsd

This leads to a slightly more complex expression, but avoids complexity related to the encodeDFDLEntities function, which must support many more characters than just 'space'.

Tests were run with this modified schema.

Then the following command was run to remove instances of "%SP;" used in either the choiceDispatchKey or choiceBranchKey. This avoids a great deal of complexity related to replacing entities.

sed -i "s/%SP;/__SPACE__/" src/main/resources/org/mitre/arinc424/xsd/ARINC-choiceDispatch.dfdl.xsd

Tests were run with this modified schema.

The results (using -I null, since this change does not affect infoset output):

  • Original schema: 159 seconds (2.6 minutes)
  • No encodeDFDLEntities: 158 seconds (2.6 minutes)
  • No encodeDFDLEntities or %SP;: 141 seconds (2.3 minutes)

There was virtually no difference from removing encodeDFDLEntities; the overhead of that function must be minimal.

However, by removing the use of "%SP;" and thus the runtime entity replacement of SP, we get about a 10% improvement. Note that this is a particularly bad case, since about 370k of the 500k lines require invoking the EntityReplacer, so improvements to the EntityReplacer may not yield performance improvements in other formats. That said, entity replacement is currently implemented using a variety of complex regular expressions which could very likely be replaced with something more efficient (see the sketch below).
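
For illustration, a single-pass replacement loop of the sort that could replace the regex-based approach (the entity table is abbreviated and handling of the escaped "%%" form is omitted; this is a sketch, not Daffodil's EntityReplacer):

  import java.util.Map;

  final class Entities {
      static final Map<String, String> TABLE =
          Map.of("SP", " ", "HT", "\t", "CR", "\r", "LF", "\n"); // abbreviated

      static String replaceEntities(String s) {
          int pct = s.indexOf('%');
          if (pct < 0) return s;                  // fast path: no entities present
          StringBuilder sb = new StringBuilder(s.length());
          int i = 0;
          while (pct >= 0) {
              sb.append(s, i, pct);
              int semi = s.indexOf(';', pct);
              String repl = (semi > pct) ? TABLE.get(s.substring(pct + 1, semi)) : null;
              if (repl != null) {
                  sb.append(repl);
                  i = semi + 1;
              } else {
                  sb.append('%');                 // not a recognized entity; keep literal
                  i = pct + 1;
              }
              pct = s.indexOf('%', i);
          }
          return sb.append(s, i, s.length()).toString();
      }
  }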


ACTION: Investigate cooker logic for possible optimizations.
RESULT:

Large Number of Infoset Elements

Testing revealed a schema where parsing a few large elements is significantly faster than parsing many small elements. This is not totally unexpected, since more elements in an infoset lead to more steps to parse data, more allocations to create the infoset, more time spent walking the infoset and converting to XML, etc., but the tests seem to scale poorly. A number of areas to consider for the cause:

  • The infoset data structure has excessive allocations per element
  • The infoset walker/outputters have excessive overhead to convert the internal infoset to XML/EXI
  • Individual parsers have excessive overhead
  • The InputSource has excessive overhead for individual reads
ACTION: Create a simple schema that parses the same amount of data, but either as many small strings or a few large strings (where small is on the order of 1-5 characters and large is 50-100), and compare the differences.
RESULT:

Tested with Daffodil 3.8.0 and the updated character decoding variant, parsing a ~2 million record file (~700 MB) using a pre-compiled binary and creating different infosets redirected to /dev/null. The JVM was given 9GB.


  Element Character Size (Number of Elements)   Daffodil 3.8.0    Updated Variant
  1 (400)                                       435s (7.25min)    406s (6.76min)
  2 (200)                                       215s (3.58min)    205s (3.42min)
  5 (80)                                        97s (1.62min)     84s (1.4min)
  10 (40)                                       61s (1.02min)     45s (0.75min)
  25 (16)                                       40s (0.67min)     25s (0.42min)
  50 (8)                                        37s (0.62min)     16s (0.27min)
  100 (4)                                       32s (0.53min)




Test files: strings1.dfdl.xsd, strings2.dfdl.xsd, strings5.dfdl.xsd, strings10.dfdl.xsd, strings25.dfdl.xsd, strings50.dfdl.xsd, strings100.dfdl.xsd, mkbig.sh


ACTION: Create a simple schema that parses to the same number of elements in the infoset, with the schema having 1, 2, 5, 10, 25, 50, or 100 characters per element, and compare the differences.
RESULT:

Tested with Daffodil 3.8.0 and 3.9.0-SNAPSHOT, parsing files that result in a 16-million-element infoset, using a pre-compiled binary and creating different infosets redirected to /dev/null. The JVM was given 9GB.

  Element Character Size   Input File Size
  1                        16MB
  2                        32MB
  5                        80MB
  10                       160MB
  25                       400MB
  50                       800MB
  100                      1.6GB


  Element Character Size (Number of Elements in Schema)   Daffodil 3.8.0   Updated Variant (3.9.0-SNAPSHOT)
  1 (400)                                                 15s (0.25min)    15s (0.25min)
  2 (200)                                                 17s (0.28min)    17s (0.28min)
  5 (80)                                                  17s (0.28min)    15s (0.25min)
  10 (40)                                                 20s (0.33min)    15s (0.25min)
  25 (16)                                                 27s (0.45min)    17s (0.28min)
  50 (8)                                                  39s (0.65min)    19s (0.32min)
  100 (4)                                                 60s (1min)       23s (0.38min)



Tested with Daffodil 3.8.0 and 3.9.0-SNAPSHOT, parsing files that result in a 160-million-element infoset, using a pre-compiled binary and creating different infosets redirected to /dev/null. The JVM was given 9GB.

  Element Character Size   Input File Size
  1                        160MB
  2                        320MB
  5                        800MB
  10                       1.6GB
  25                       4GB
  50                       8GB
  100                      16GB


  Element Character Size (Number of Elements in Schema)   Daffodil 3.8.0    Updated Variant (3.9.0-SNAPSHOT)
  1 (400)                                                 115s (1.91min)    116s (1.93min)
  2 (200)                                                 133s (2.22min)    125s (2.08min)
  5 (80)                                                  135s (2.25min)    119s (1.98min)
  10 (40)                                                 160s (2.67min)    121s (2.0min)
  25 (16)                                                 232s (3.87min)    138s (2.3min)
  50 (8)                                                  369s (6.15min)    149s (2.48min)
  100 (4)                                                 587s (9.78min)    195s (3.25min)



Test files: strings1.dfdl.xsd, strings2.dfdl.xsd, strings5.dfdl.xsd, strings10.dfdl.xsd, strings25.dfdl.xsd, strings50.dfdl.xsd, strings100.dfdl.xsd




ACTION: Profile to determine if there are any hot spots that can be optimized.
RESULT:

Excess Allocations

Allocations have been known to cause slowdowns, especially in inner loops. We should profile Daffodil for memory allocations and determine if any can be removed or replaced with a different approach that does not require memory allocations.

ACTION: Profile and remove allocations in inner loops.
RESULT: A number of inner-loop allocations were fairly easily removed, specifically:
  • Tuple returned by parseOneInstanceWithMaybePOU
  • Lambda allocations in InfosetWalker when calling doOutputter
  • Conditionally call captureValueLength (which allocates a ValueLengthState) in various string parsers so that we do not capture if it isn't actually needed by any expressions

The combination of these changes made virtually no difference in performance (at best 2-5%, and usually within the margin of error) and made the code much more complex. While these allocations may be important to remove in the future as larger bottlenecks are found and fixed, they are within the noise and not worth the added complexity for now.
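
For reference, the tuple-removal pattern applied above looks like this (names are illustrative, not Daffodil's actual classes):

  // The caller supplies a reusable mutable carrier instead of receiving a
  // freshly allocated tuple from every call in the inner loop.
  final class ParseOutcome {
      boolean success;
      long bitPosition;
  }

  final class InnerLoop {
      void parseOneInstance(ParseOutcome out) {
          // ... do the work, then write results into the caller-owned object ...
          out.success = true;
          out.bitPosition += 8;
      }

      void parseAll() {
          ParseOutcome out = new ParseOutcome(); // allocated once, outside the loop
          for (int i = 0; i < 1_000_000; i++) {
              parseOneInstance(out);             // zero allocations per iteration
          }
      }
  }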
