Apache Tika's HTML Encoding Study
In support of TIKA-2038, we gathered a new subset of HTML pages from CC-MAIN-2017-04.
This page offers a rough first draft of the process. Some of the code is available on a personal GitHub site. That code relies heavily on Dominik Stadler's CommonCrawlDocumentDownload, and the author of SimpleCommonCrawlExtractor is extremely grateful to Dominik.
1. Determined which top-level domains (TLDs) were of interest
2. Downloaded the 300 index files from Common Crawl via a Groovy script (217 GB of data)
3. Counted the number of pages per TLD that had "text/html" in the HTTP Content-Type header
4. Created sampling frequencies per TLD, with a target of 50k pages per TLD (100k for ".com"). This was done by loading mime_tld_total.txt into a database and running a few group-by queries. See tld_mimes.txt.
5. Randomly sampled records from the 300 index files according to the per-TLD sampling frequencies
6. Pulled the data from Common Crawl
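The sampling-frequency arithmetic in step 4 can be sketched as follows. The actual work was done with database group-by queries, and the counts below are invented examples, not the real CC-MAIN-2017-04 totals; the function name and target values other than 50k/100k are assumptions for illustration only.

```python
# Hypothetical sketch of step 4: turn per-TLD page counts into
# per-TLD sampling probabilities. Targets follow the text:
# 50k pages per TLD, except 100k for ".com".
TARGETS = {"com": 100_000}   # special case from the text
DEFAULT_TARGET = 50_000      # every other TLD

def sampling_frequencies(tld_counts):
    """Return, per TLD, the probability with which to keep a page."""
    freqs = {}
    for tld, count in tld_counts.items():
        target = TARGETS.get(tld, DEFAULT_TARGET)
        # A TLD with fewer pages than its target is kept in full,
        # so the frequency is capped at 1.0.
        freqs[tld] = min(1.0, target / count)
    return freqs

# Invented example counts, purely to show the shape of the output.
counts = {"com": 2_000_000, "de": 400_000, "lv": 30_000}
freqs = sampling_frequencies(counts)
# com: 100k / 2M = 0.05; de: 50k / 400k = 0.125; lv: capped at 1.0
```

Capping at 1.0 matters for small TLDs: a domain with fewer than 50k "text/html" pages simply contributes everything it has.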
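Step 5 then amounts to a Bernoulli draw per index record against its TLD's frequency. The original code is Groovy/Java; this is a minimal Python sketch, and the function name, record shape, and seed are assumptions, not the project's actual API.

```python
import random

def sample_records(records, freqs, seed=42):
    """Keep each (url, tld) index record with its TLD's sampling
    probability; a fixed seed makes the draw reproducible."""
    rng = random.Random(seed)
    kept = []
    for url, tld in records:
        # rng.random() is in [0.0, 1.0), so frequency 1.0 keeps
        # every record and 0.0 keeps none.
        if rng.random() < freqs.get(tld, 0.0):
            kept.append((url, tld))
    return kept
```

In practice this filter would be streamed over the 300 index files rather than materialized in memory, but the per-record decision is the same.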