Refreshing Apache Tika's Large-scale Regression Corpus
Since the last efforts to refresh the regression corpus (see ApacheTikaHtmlEncodingStudy and TIKA-2038), Common Crawl has added important metadata items to its indices, including mime-detected, languages, and charset. I opened TIKA-2750 to track progress on updating our corpus, and I describe the steps taken here.
We are enormously grateful to Sebastian Nagel and Common Crawl for using Tika to detect file types and for running it on the entire crawl. The synergy of these two open source/open data projects is phenomenal.
As always, we're enormously grateful to Rackspace for hosting our regression testing VM.
There are three primary goals of TIKA-2750: include more recent files, include more "interesting" files, and refetch some of the files that are truncated in Common Crawl. I don't have a precise definition of "interesting," but the goal is to include broad coverage of file formats and languages. See "Code Coverage Metrics" below.
While I recognize that the new metadata is automatically generated and may contain errors, it allows for more accurate oversampling of the file formats/charsets that are of interest.
I started by downloading the 300 index files for September 2018's crawl: CC-MAIN-2018-39 (~226GB).
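Each index line pairs a URL key and a timestamp with a small JSON record that (as of 2018) carries fields such as mime, mime-detected, charset, and languages. As a rough orientation, here is a minimal sketch of pulling those fields out of a line; the field names and the example line are assumptions based on the published index format, not taken from this write-up:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Minimal sketch: pull a JSON field out of a Common Crawl CDX-J index line.
 * An index line is roughly "<url key> <timestamp> {json}"; the field names
 * and the example line below are assumptions for illustration.
 */
public class IndexLineExample {

    private static String getField(String json, String name) {
        Matcher m = Pattern.compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"")
                .matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "org,example)/doc.pdf 20180920120000 " +
                "{\"url\": \"http://example.org/doc.pdf\", \"mime\": \"application/pdf\", " +
                "\"mime-detected\": \"application/pdf\", \"charset\": \"UTF-8\", \"languages\": \"eng\"}";
        String json = line.substring(line.indexOf('{'));
        System.out.println(getField(json, "mime-detected") + " / " + getField(json, "charset"));
    }
}
```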
The top 10 'detected mimes' are:
mime | count
text/html | 2,070,375,191
application/xhtml+xml | 749,683,874
image/jpeg | 6,207,029
application/pdf | 4,128,740
application/rss+xml | 3,495,173
application/atom+xml | 2,868,625
application/xml | 1,353,092
image/png | 585,019
text/plain | 492,429
text/calendar | 470,624
Given the work on TIKA-2038 and the focus on country top level domains (TLDs), I also counted the number of mimes by TLD and the number of charsets by TLD (here).
Finally, I calculated the counts for pairs of 'mime' (as alleged by the http-header) and the 'detected-mime', and that is available here.
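For the TLD rollups above, the counting itself is straightforward; a minimal sketch of the idea (not the actual counting code) is to take the last label of the URL's host as the TLD and tally detected mimes per TLD:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative tally of detected mimes by top-level domain, taking the TLD
 * to be the last label of the URL's host. A sketch, not the actual code.
 */
public class TldTally {
    private final Map<String, Map<String, Long>> counts = new HashMap<>();

    public void add(String url, String detectedMime) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return; // skip malformed URLs
        }
        if (host == null) {
            return;
        }
        String tld = host.substring(host.lastIndexOf('.') + 1).toLowerCase();
        counts.computeIfAbsent(tld, k -> new HashMap<>())
              .merge(detectedMime, 1L, Long::sum);
    }

    public Map<String, Map<String, Long>> getCounts() {
        return counts;
    }
}
```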
Step 1: Select and Retrieve the Files from Common Crawl
My sense from our JIRA and our user list is that people are primarily interested in office-ish files (PDF, MSOffice, RTF, eml, etc) and/or HTML. I therefore chose to break the sampling into three passes:
1. PDFs, MSOffice and other office-ish files
2. Other binaries
3. HTML/Text
I wanted to keep the corpus to below 1 TB and on the order of a few million files.
The sampling frame tables are available here; there's one sampling frame for each of the three file classes.
NOTE: I hesitate even to use the terms "sampling" and "sampling frame" because I do not mean to imply that I used much rigor. I manually calculated the sampling frames based on the total counts so that we'd have roughly the desired number of files and file types. As I describe below, there are some file types that I thought we should have more of (e.g. 'octet-stream').
The code for everything described here is available on GitHub.
Office formats
The top 10 file formats of this category include:
mime | count
application/pdf | 4,128,740
application/vnd.openxmlformats-officedocument.wordprocessingml.document | 53,579
application/msword | 52,087
application/rtf | 22,509
application/vnd.ms-excel | 22,067
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | 16,290
application/vnd.oasis.opendocument.text | 8,314
application/vnd.openxmlformats-officedocument.presentationml.presentation | 6,835
application/vnd.ms-powerpoint | 5,799
application/vnd.openxmlformats-officedocument.presentationml.slideshow | 2,465
select mime, sum(count) cnt
from detected_mimes
where (mime ilike '%pdf%'
    or mime similar to '%(word|doc|power|ppt|excel|xls|application.*access|outlook|msg|visio|rtf|iwork|pages|numbers|keynot)%')
group by mime
order by cnt desc
Given how quickly the tail drops off, we could afford to take all of the non-PDFs. For PDFs, we created a sampling frame by TLD.
We used org.tallison.cc.index.mappers.DownSample to select files for downloading from Common Crawl.
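The general approach of such a mapper, sketched below with made-up ratios (an illustration of per-mime down-sampling only, not the actual DownSample class, and it ignores the per-TLD keying used for PDFs), is to keep each index record with a probability tied to its detected mime:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Illustrative down-sampling pass: keep a record with a per-mime probability.
 * The ratios below are placeholders, not the actual sampling frame.
 */
public class DownSampleSketch {
    private final Map<String, Double> ratios = new HashMap<>();

    public DownSampleSketch() {
        ratios.put("application/pdf", 0.05);    // sample PDFs
        ratios.put("application/msword", 1.0);  // keep all of the small tail
        ratios.put("application/rtf", 1.0);
    }

    public boolean select(String detectedMime) {
        double p = ratios.getOrDefault(detectedMime, 0.0);
        return ThreadLocalRandom.current().nextDouble() < p;
    }
}
```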
Other Binaries
These are the top 10 other binaries:
mime | count
image/jpeg | 6,207,029
application/rss+xml | 3,495,173
application/atom+xml | 2,868,625
application/xml | 1,353,092
image/png | 585,019
application/octet-stream | 330,029
application/json | 237,232
application/rdf+xml | 229,766
image/gif | 166,851
application/gzip | 151,940
select mime, sum(count) cnt
from detected_mimes
where (mime not ilike '%pdf%'
    and mime not similar to '%(word|doc|power|ppt|excel|xls|application.*access|outlook|msg|visio|rtf|iwork|pages|numbers|keynot)%'
    and mime not ilike '%html%'
    and mime not ilike '%text%')
group by mime
order by cnt desc
I created the sampling ratios for these by preferring non-XML but likely text-containing file types. Further, I wanted to include a fairly large portion of octet-stream so that we might be able to see how we can improve Tika's file detection.
We used org.tallison.cc.index.mappers.DownSample to select files for downloading from Common Crawl.
HTML/Text
For the HTML/text files, I wanted to oversample files that were not ASCII/UTF-8 English, and I wanted to oversample files that had no charset detected.
We used org.tallison.cc.index.mappers.DownSampleLangCharset to select the files for downloading from Common Crawl.
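Again as an illustration only (placeholder probabilities, not the actual DownSampleLangCharset logic), the selection rule boils down to boosting the keep-probability when the charset is missing or not ASCII/UTF-8, or when the detected language is not English:

```java
import java.util.Locale;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Illustrative selection rule for HTML/text records: heavily over-sample
 * records whose charset is missing or not ASCII/UTF-8, and records whose
 * detected language is not English. Probabilities are placeholders.
 */
public class LangCharsetSampleSketch {

    public boolean select(String charset, String language) {
        double p;
        if (charset == null || charset.isEmpty()) {
            p = 0.5;   // no charset detected -- keep many of these
        } else if (!charset.toUpperCase(Locale.ROOT).matches("UTF-8|US-ASCII|ASCII")) {
            p = 0.2;   // non-ASCII/UTF-8 charsets
        } else if (language == null || !language.startsWith("eng")) {
            p = 0.1;   // ASCII/UTF-8 but not English
        } else {
            p = 0.001; // plain English ASCII/UTF-8 -- sample lightly
        }
        return ThreadLocalRandom.current().nextDouble() < p;
    }
}
```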
The Output
In addition to storing the files, I generated a table for each pull that records information stored in the WARC file, including the http-headers as archived by Common Crawl. The three table files are available here (116MB!).
Step 2: Refetch Likely Truncated Files
Common Crawl truncates files at 1MB. We've found it useful to have truncated files in our corpus, but this disproportionately affects some file formats, such as PDF and MSAccess files, and we wanted to have some recent, largish files in the corpus. We selected those files that were close to 1MB or were marked as truncated:
select url, cc_digest
from crawled_files
where (cc_mime_detected ilike '%tika%'
    or cc_mime_detected ilike '%power%'
    or cc_mime_detected ilike '%access%'
    or cc_mime_detected ilike '%rtf%'
    or cc_mime_detected ilike '%pdf%'
    or cc_mime_detected ilike '%sqlite%'
    or cc_mime_detected ilike '%openxml%'
    or cc_mime_detected ilike '%word%'
    or cc_mime_detected ilike '%rfc822%'
    or cc_mime_detected ilike '%apple%'
    or cc_mime_detected ilike '%excel%'
    or cc_mime_detected ilike '%sheet%'
    or cc_mime_detected ilike '%onenote%'
    or cc_mime_detected ilike '%outlook%')
    and (actual_length > 990000 or warc_is_truncated='TRUE')
order by random()
A rollup, by mime type, of the files to be refetched is here:
mime | count
application/pdf | 121,386
application/vnd.openxmlformats-officedocument.presentationml.presentation | 3,929
application/x-tika-msoffice | 3,830
application/vnd.ms-powerpoint | 2,942
application/msword | 2,783
application/vnd.openxmlformats-officedocument.wordprocessingml.document | 2,722
application/x-tika-ooxml | 2,612
application/vnd.openxmlformats-officedocument.presentationml.slideshow | 1,663
application/rtf | 1,569
application/vnd.ms-excel | 1,186
The full table is here.
We used org.tallison.cc.WReGetter, a wrapper around 'wget', to refetch the files from their original URLs. If a refetched file was > 50MB, we deleted it; and if a refetch took longer than two minutes, we killed the process and deleted whatever bytes had been retrieved.
We refetched these files to a new directory and stored them by their new digest. Each thread in WReGetter wrote to a table recording the mapping of the original digest to the new digest and whether the new file was successfully refetched and/or was too big. Because of disk space limitations, we stopped the refetch procedure after refetching 98,000 documents, comprising 440GB of data.
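A rough sketch of those refetch rules (a simplified stand-in for WReGetter; the wget flags and paths are illustrative) looks like this:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

/**
 * Simplified stand-in for the refetch rules described above: call wget on the
 * original URL, kill it after two minutes, and discard anything larger than
 * 50MB. Not the actual WReGetter implementation.
 */
public class RefetchSketch {
    private static final long MAX_BYTES = 50L * 1024 * 1024;

    public static boolean refetch(String url, Path target) throws Exception {
        Process p = new ProcessBuilder("wget", "-q", "-O", target.toString(), url)
                .redirectErrorStream(true)
                .start();
        if (!p.waitFor(2, TimeUnit.MINUTES)) {
            p.destroyForcibly();          // took too long -- kill and clean up
            Files.deleteIfExists(target);
            return false;
        }
        if (p.exitValue() != 0 || Files.size(target) > MAX_BYTES) {
            Files.deleteIfExists(target); // failed or too big -- delete
            return false;
        }
        return true;
    }
}
```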
We then randomly deleted 80% of the original truncated files and moved the other 20% to /commoncrawl3_truncated.
Finally, we moved the refetched files into the /commoncrawl3_refetched directory.
Step 3 – Areas for Improvement
We carried out this work on one of our TB drives. We have to figure out what to keep from our older commoncrawl2 collection and then merge the two collections. We may consider deleting some of the ISO-8859-1/Windows-1252/UTF-8, English text files. We could also identify truncated files based on parser exceptions and move those into /commoncrawl3_truncated.
Step 4 – Comparison of Contents
Top 20 "container" file mimes:
Mime | Count
application/pdf | 528,617
text/plain; charset=ISO-8859-1 | 184,019
application/msword | 78,210
application/vnd.openxmlformats-officedocument.wordprocessingml.document | 75,739
text/html; charset=UTF-8 | 75,156
text/plain; charset=windows-1252 | 74,144
text/plain; charset=UTF-8 | 56,462
application/octet-stream | 54,278
application/zip | 44,989
application/rss+xml | 34,213
image/jpeg | 30,968
application/atom+xml | 28,934
image/png | 28,173
text/html; charset=windows-1252 | 26,232
application/xhtml+xml; charset=UTF-8 | 25,130
text/html; charset=ISO-8859-1 | 24,515
application/vnd.google-earth.kml+xml | 23,391
application/xhtml+xml; charset=windows-1252 | 22,304
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | 22,084
application/rtf | 21,811
Top 20 Languages (including embedded files) as identified by language id:
Language | Number of Files
en | 1,803,350
null | 242,442
ru | 155,934
de | 109,953
fr | 96,192
it | 73,781
es | 59,069
ja | 50,941
pl | 47,044
pt | 35,490
ko | 35,251
ca | 30,717
fa | 26,202
zh-cn | 25,379
nl | 23,554
ro | 23,259
tr | 23,111
da | 21,967
br | 21,420
vi | 19,305
Code Coverage Metrics
Tobias Ospelt and Rohan Padhye (the author of https://github.com/rohanpadhye/jqf) both noted on our dev list that we could use coverage analysis to identify a minimal corpus that would cover as much of our code base as possible. Obviously, a minimal corpus designed for our current codebase would not be guaranteed to cover new features, and we'd want to leave plenty of extra files around in the hope that some of them would capture new code paths.
Nevertheless, if we could use jqf or another tool to reduce the corpus, that would help make our runs more efficient.
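To make the idea concrete, a greedy set cover over per-file coverage traces is one simple way to pick a minimal corpus; the sketch below illustrates the general approach and is not how afl-cmin or jqf is implemented:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Illustration of corpus minimization: given a coverage trace (set of covered
 * branch/line ids) per file, greedily pick the file that adds the most
 * not-yet-covered ids until nothing new is added. A sketch of the general
 * approach, not the afl-cmin or jqf algorithm.
 */
public class GreedyCorpusMin {

    public static List<String> minimize(Map<String, Set<Integer>> coveragePerFile) {
        List<String> kept = new ArrayList<>();
        Set<Integer> covered = new HashSet<>();
        while (true) {
            String best = null;
            int bestGain = 0;
            for (Map.Entry<String, Set<Integer>> e : coveragePerFile.entrySet()) {
                Set<Integer> gain = new HashSet<>(e.getValue());
                gain.removeAll(covered);
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = e.getKey();
                }
            }
            if (best == null) {
                return kept;          // no remaining file adds new coverage
            }
            kept.add(best);
            covered.addAll(coveragePerFile.get(best));
        }
    }
}
```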
On TIKA-2750, Tobias reported that his experiment with afl-cmin.py showed that it would take roughly four months on our single VM just to create traces (~300 files per hour).
Other Resources
See ComparisonTikaAndPDFToText201811 for notes on a comparison of the output of pdftotext and Tika.