Here are some stats contributed by Mark Kerzner and Amanda Towler from Hyperion Gray.

Total number of images to process: about 300,000
Average time per image: about 1 sec
Total run time required: about 10 days
Our run time across various batches: about 1 day total
OCR quality: decent

Future Work

  • Use Tika rather than calling Tesseract directly
  • Scale it up with Spark or Hadoop
  • Polish the tool so that other teams and projects can use it later
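
As a rough way to reason about the Spark or Hadoop item, the sketch below estimates ideal wall-clock time when the images are split evenly across workers. The function name `estimate_days` and the worker counts are illustrative assumptions, not part of the original pipeline. It assumes perfect linear speedup and ignores I/O, scheduling, and startup overhead, so real runs (including the totals reported above) will differ.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_days(num_images: int, seconds_per_image: float, workers: int = 1) -> float:
    """Return ideal wall-clock days for an OCR batch.

    Assumes work divides evenly across workers with no overhead
    (an idealization; hypothetical helper, not from the original project).
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    total_seconds = num_images * seconds_per_image
    return total_seconds / workers / SECONDS_PER_DAY


if __name__ == "__main__":
    # Rounded inputs from the stats above; worker counts are made up.
    for workers in (1, 4, 16):
        days = estimate_days(300_000, 1.0, workers)
        print(f"{workers:>2} workers: {days:.2f} days")
```

Since the inputs are rounded averages, the output is best read as an order-of-magnitude guide for sizing a cluster rather than a prediction.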