The tika-eval module was initially designed to run locally with an in-memory H2 database.  For collaboration, we've run tika-eval on the ~4 million files that we've made available on our regression virtual machine.  We're publishing this data via Datasette, which runs on top of a SQLite database of tika-eval data.

We'll make two types of data available: 1) file profiles (file type, sha256, byte size) and 2) extract profiles (profiles of the text extracted by Tika).

As a first step, we're making the file profiles available.  See below for some useful SQL to run against this dataset: https://corpora.tika.apache.org/datasette.

File Profile

NOTE

As of Nov 2020, there's a bug in Datasette that prevents some of the links from working: the links do not include the base_url.  One way to work around this for now is to add /datasette/ into the URL.  For example, if you click on 'json', you'll be directed to: https://corpora.tika.apache.org/file_profiles.json?sql=select...  If you insert /datasette/ like so: https://corpora.tika.apache.org/datasette/file_profiles.json?sql=select..., the link works.
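The workaround above can also be applied programmatically.  The following is a minimal sketch, not part of Datasette or tika-eval: the helper name add_base_prefix is hypothetical, and it assumes the only thing missing from a broken link is the /datasette path prefix.

```python
from urllib.parse import urlsplit, urlunsplit

# Path prefix that the generated links omit (taken from the note above).
BASE_PREFIX = "/datasette"


def add_base_prefix(url: str, prefix: str = BASE_PREFIX) -> str:
    """Insert the prefix at the start of the URL path unless it is already there."""
    parts = urlsplit(url)
    if parts.path == prefix or parts.path.startswith(prefix + "/"):
        return url  # already a working link; leave it untouched
    return urlunsplit(parts._replace(path=prefix + parts.path))


broken = "https://corpora.tika.apache.org/file_profiles.json?sql=select+1"
fixed = add_base_prefix(broken)
```

The query string is passed through unchanged, so the SQL in the link is not re-encoded.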

As mentioned above, "File Profile" includes the path of the file, the file name, the file extension, the sha256, the length (in bytes) and the file type as identified by Apache Tika and by the Linux `file` command.  "File Profile" does not include any of the more interesting data that we can get once we parse the files and then run Extract Profile.
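To experiment with the queries on this page without touching the hosted data, a toy copy of the two tables can be built in an in-memory SQLite database.  This is a sketch under assumptions: the table and column names are the ones used in the queries on this page, but the column types, the constraints and every row of sample data (including the placeholder hashes) are invented for illustration and may not match the real schema.

```python
import sqlite3

# Assumed schema: names come from the queries on this page, types are guesses.
SCHEMA = """
create table file_mimes (
    mime_id integer primary key,
    mime_string text
);
create table file_profiles (
    file_path text,
    file_name text,
    file_extension text,
    length integer,
    sha256 text,
    tika_mime_id integer references file_mimes(mime_id),
    file_mime_id integer references file_mimes(mime_id)
);
"""


def build_sample_db() -> sqlite3.Connection:
    """Create an in-memory database holding a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "insert into file_mimes values (?, ?)",
        [
            (1, "application/pdf"),
            (2, "application/x-tika-msoffice"),
            (3, "application/msword"),
        ],
    )
    conn.executemany(
        "insert into file_profiles values (?, ?, ?, ?, ?, ?, ?)",
        [
            ("a/one.pdf", "one.pdf", "pdf", 1200, "aaa", 1, 1),
            ("b/copy.pdf", "copy.pdf", "pdf", 1200, "aaa", 1, 1),
            ("c/report.doc", "report.doc", "doc", 5000, "bbb", 2, 3),
        ],
    )
    conn.commit()
    return conn


# Join each profile to its Tika mime string and its `file` mime string.
FULL_DATA = """
select fp.file_path, fp.length, tm.mime_string, tf.mime_string
from file_profiles fp
join file_mimes tm on tm.mime_id = fp.tika_mime_id
join file_mimes tf on tf.mime_id = fp.file_mime_id
order by fp.length desc
"""

conn = build_sample_db()
rows = conn.execute(FULL_DATA).fetchall()
```

The third sample row mimics the case where Tika reports application/x-tika-msoffice while `file` reports a more specific type.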

In this first run of File Profile (November 12, 2020), we did not include "container detection" in Apache Tika's detector.  This means that where `file` might identify a 'doc', 'ppt' or 'xls' file, Tika will identify 'application/x-tika-msoffice'.  We look forward to repopulating the file_profiles table with Tika's container detection.

The following SQL relies on some SQLite-specific syntax.  Your mileage may vary if you use H2 or another database.

We used 'file version 5.38' and 'Tika 1.24.1'.

Full data:

select FILE_PATH, FILE_NAME, fp.FILE_EXTENSION,
LENGTH, SHA256, tm.mime_string, tf.mime_string
from file_profiles fp
join file_mimes tm on tm.mime_id=fp.tika_mime_id
join file_mimes tf on tf.mime_id=fp.file_mime_id
order by length desc
limit 100
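Queries like the one above can also be sent to the JSON endpoint from a script.  The sketch below only builds the URL and does not make a network request.  The endpoint path is taken from the example link in the note near the top of this page; the helper name query_url is hypothetical, and the _shape=array parameter is assumed to be supported by the deployed Datasette version.

```python
from urllib.parse import urlencode

# JSON endpoint as it appears in the example link in the note on this page.
DATASETTE_JSON = "https://corpora.tika.apache.org/datasette/file_profiles.json"


def query_url(sql: str, shape: str = "array") -> str:
    """Build a JSON query URL; runs of whitespace in the SQL are collapsed first."""
    params = {"sql": " ".join(sql.split()), "_shape": shape}
    return DATASETTE_JSON + "?" + urlencode(params)


url = query_url("""
    select count(1)
    from file_profiles
""")
```

The resulting URL can be opened in a browser or passed to any HTTP client.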

Total Files (4,263,744)

select count(1) from file_profiles

Total Distinct Files (3,726,422)

select count(1) from (select distinct(sha256) from file_profiles)

Top 100 mime types as identified by Tika:

select tm.mime_string, count(1) as cnt
from file_profiles fp
join file_mimes tm on tm.mime_id=fp.tika_mime_id
group by tm.mime_string
order by cnt desc
limit 100

Counts of 100 most common SHA256s

select SHA256, count(1) as cnt
from file_profiles fp
group by SHA256
order by cnt desc
limit 100

List files associated with the 100 most common SHA256s

select shas.sha256, cnt, fp.file_path from
(select
fp.SHA256 as sha256, count(1) as cnt
from file_profiles fp
group by fp.sha256
order by count(1) desc
limit 100
) as shas
join file_profiles fp on shas.sha256=fp.sha256
order by shas.cnt desc, shas.sha256, file_path

Get the file paths for 100 PDFs

select FILE_PATH 
from file_profiles fp
join file_mimes m on m.mime_id=fp.tika_mime_id
where m.mime_string like '%pdf'
limit 100

Get the URLs for 100 PDFs

select
'https://corpora.tika.apache.org/base/docs/'||FILE_PATH as URL
from file_profiles fp
  join file_mimes m on m.mime_id=fp.tika_mime_id
where m.mime_string like '%pdf'
limit 100
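The same concatenation can be done on the client side once the FILE_PATH values have been fetched.  This is a minimal sketch: the helper name doc_url and the sample path are hypothetical, and percent-encoding is added on the assumption that some paths may contain spaces or other characters that are not safe in a URL.

```python
from urllib.parse import quote

# Base URL used in the SQL concatenation above.
DOCS_BASE = "https://corpora.tika.apache.org/base/docs/"


def doc_url(file_path: str) -> str:
    """Join a FILE_PATH value onto the docs base, percent-encoding unsafe characters."""
    # quote() leaves "/" unescaped by default, so directory separators survive.
    return DOCS_BASE + quote(file_path.lstrip("/"))


example = doc_url("some dir/file name.pdf")  # made-up path for illustration
```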

I want all files, but I don't want a file if its SHA256 is already in the list

select first_value(file_path) over ( partition by sha256 order by file_path) as path
from file_profiles fp
group by sha256
order by file_path
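The idea of keeping one path per SHA256 can be checked on a small in-memory table.  The sketch below is a swapped-in formulation, not the query from this page: it uses min(file_path) with group by, which always keeps the alphabetically first path for each hash.  The two-column table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, made-up version of file_profiles with only the columns needed here.
conn.execute("create table file_profiles (file_path text, sha256 text)")
conn.executemany(
    "insert into file_profiles values (?, ?)",
    [
        ("b/copy.pdf", "aaa"),    # duplicate content of a/one.pdf
        ("a/one.pdf", "aaa"),
        ("c/report.doc", "bbb"),
    ],
)

# One row per distinct hash, keeping the smallest path in each group.
DEDUP = """
select min(file_path) as path
from file_profiles
group by sha256
order by path
"""

paths = [row[0] for row in conn.execute(DEDUP)]
```

With the sample rows, b/copy.pdf is dropped because it shares a hash with a/one.pdf.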

I want all PDFs, but I don't want a file if its SHA256 is already in the list

select first_value(file_path) over ( partition by sha256 order by file_path) as path
from file_profiles p
join file_mimes m on p.tika_mime_id=mime_id
where mime_string like '%pdf'
group by sha256
order by file_path

Most common differences between Tika and File mime identification

select tm.mime_string as 'tika_mime', fm.mime_string as 'file_mime', count(1) as cnt
from file_profiles fp
join file_mimes tm on fp.tika_mime_id=tm.mime_id
join file_mimes fm on fp.file_mime_id=fm.mime_id
where tm.mime_string <> fm.mime_string
group by tm.mime_string, fm.mime_string
order by cnt desc

Most common file types that 'file' identifies but Tika doesn't

select
fm.mime_string as 'file_mime', count(1) as cnt
from
file_profiles fp
join file_mimes tm on fp.tika_mime_id = tm.mime_id
join file_mimes fm on fp.file_mime_id = fm.mime_id
where
tm.mime_string <> fm.mime_string
and (
tm.mime_string is null
or length(tm.mime_string) == 0
or tm.mime_string = 'application/octet-stream'
)
group by fm.mime_string
order by cnt desc

Most common file types that Tika identifies but 'file' doesn't

select tm.mime_string as 'tika_mime', count(1) as cnt
from
file_profiles fp
join file_mimes tm on fp.tika_mime_id = tm.mime_id
join file_mimes fm on fp.file_mime_id = fm.mime_id
where
tm.mime_string <> fm.mime_string
and (
fm.mime_string is null
or length(fm.mime_string) == 0
or fm.mime_string = 'application/octet-stream'
)
group by
tm.mime_string
order by cnt desc