Notes on setting up the Tika Virtual Machine
See TIKA-1331
Install software (this has been updated for Ubuntu)
- sudo apt update
- gpg
- sudo apt install gnupg
- java
wget -qO - https://adoptopenjdk.jfrog.io/adoptopenjdk/api/gpg/key/public | sudo apt-key add -
sudo apt-get install -y software-properties-common
sudo add-apt-repository --yes https://adoptopenjdk.jfrog.io/adoptopenjdk/deb/
sudo apt-get install adoptopenjdk-8-hotspot
sudo apt-get install adoptopenjdk-11-hotspot
sudo apt-get install adoptopenjdk-14-hotspot
sudo apt-get install fontconfig (https://github.com/AdoptOpenJDK/openjdk-build/issues/693 via Dominik Stadler)
sudo apt install ttf-dejavu (same as above)
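With several JDKs installed, the default can be switched via Debian's alternatives system (a sketch; the adoptopenjdk packages should register themselves with update-alternatives, but confirm with the first command):
update-alternatives --list java
sudo update-alternatives --config java
sudo update-alternatives --config javac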
- sudo apt-get install groovy
- sudo apt-get install maven
- sudo apt-get install subversion
- sudo apt-get install git
- sudo apt-get install file
- installed docker following: https://docs.docker.com/engine/install/ubuntu/
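For reference, the steps on that page were roughly the following at the time (a sketch; check the current page before running, since the repository setup changes occasionally):
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io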
Datasette
On 12 November 2020, I ran tika-eval's new FileProfile on the corpus. This includes file type detection by Tika and by 'file', digests and file sizes.
We configured the reverse proxy for /datasette:
ProxyPreserveHost On
ProxyPass /datasette http://0.0.0.0:8001
ProxyPassReverse /datasette http://0.0.0.0:8001
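Note that mod_proxy and mod_proxy_http must be enabled for the ProxyPass directives to take effect (assuming the stock Ubuntu Apache layout):
sudo a2enmod proxy proxy_http
sudo systemctl restart apache2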
The .db is in /data1/publish. cd to that directory and then:
docker run --name datasette -d -p 8001:8001 -v `pwd`:/mnt datasetteproject/datasette datasette -p 8001 -h 0.0.0.0 /mnt/file_profiles.db --config sql_time_limit_ms:120000 --config max_returned_rows:100000 --config base_url:/datasette/
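To check on or restart the container later (standard docker commands; "datasette" is the container name set by --name above):
docker ps
docker logs datasette
docker restart datasette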
HTTPD
config files are in /etc/apache2
public directories are symlinks in /usr/share/corpora
Everything below here needs to be updated for Ubuntu
config/admin stuff
nano /etc/ssh/sshd_config
- disallow root to log in
- add allowed users to AllowUsers
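The relevant sshd_config directives look roughly like this (usernames are placeholders); restart sshd after editing:
PermitRootLogin no
AllowUsers <user1> <user2>
systemctl restart sshd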
Install, configure and run fail2ban. Thanks to tecmint.
yum install fail2ban
vi /etc/fail2ban/jail.conf
systemctl start fail2ban
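A minimal sshd jail as a sketch (better kept in /etc/fail2ban/jail.local so package upgrades don't overwrite it; the values below are examples, not necessarily what is configured on the VM):
[sshd]
enabled = true
maxretry = 5
bantime = 3600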
opening port 9998 for TIKA-1301
To open port 9998:
firewall-cmd --zone=public --add-port=9998/tcp --permanent
firewall-cmd --reload
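To confirm the port is open afterwards:
firewall-cmd --zone=public --list-ports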
permission management
1. adduser <user> sudo
2. passwd <user>
3. groupadd <admingroup>
4. usermod -g <admingroup> <user>
5. modify /etc/ssh/sshd_config to add user to AllowUsers
6. systemctl restart sshd.service (to restart sshd on RHEL 7)
mkdirs
/public/corpora/govdocs1 ...
prep govdocs1
1. cp zipfilelist /public/corpora/govdocs1/archive
2. wget -i zipfilelist.txt
3. (go get some coffee)
4. cd govdocs1/scripts
5. groovy unzip.groovy 0 (see the sketch after this list)
6. (go get some more coffee)
7. groovy rmBugged.groovy
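The real logic lives in govdocs1/scripts (unzip.groovy, rmBugged.groovy). Purely as an illustration of step 5 (not the actual script; it ignores the script's numeric argument, and the extracted/ path is hypothetical), the unzip step amounts to extracting each archive into its own directory:
for z in /public/corpora/govdocs1/archive/*.zip; do
  unzip -q "$z" -d /public/corpora/govdocs1/extracted/$(basename "$z" .zip)
done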
prep nsfpolardata
1. scp -r <user>@nsfpolardata.dyndns.org:/usr/local/ndeploy/data/AcadisCrawl .
2. (go get some coffee)
3. scp -r <user>@nsfpolardata.dyndns.org:/usr/local/ndeploy/data/AcadisCrawl2 .
4. (go get some coffee)
5. scp -r <user>@nsfpolardata.dyndns.org:/home/mattmann/polar-data/nutch_trunk/runtime/local/bin/crawlId .
6. (go get some coffee)
7. cd /data1/public/archives/nsf-polar-data/
8. export NUTCH_OPTS="-Xmx8192m -XX:MaxPermSize=8192m"
9. ./bin/nutch dump -outputDir out -segment /data1/public/archives/nsf-polar-data/acadis/AcadisCrawl/segments/
10. ./bin/nutch dump -outputDir out2 -segment /data1/public/archives/nsf-polar-data/acadis/AcadisCrawl2/segments/
11. ./bin/nutch dump -outputDir out3 -segment /data1/public/archives/nsf-polar-data/nasa-amd/crawlId/segments/
add more disk
From Rackspace website, add block storage volume and attach it to server.
1. mkfs.ext3 /dev/xvdb
2. mkdir /data1
3. mount /dev/xvdb /data1
4. nano /etc/fstab
add these lines (e.g.):
/dev/xvdb /data1 ext3 defaults 1 2
/dev/xvdc /data2 ext3 defaults 1 2
4b. When you wreck the fstab file and can't log into your system after a hard reboot and you are in recovery mode:
blkid
to figure out the drive types
mount -t ext3 /dev/xvdb1 /mnt
to mount the system
4c. Before you hit 4b, try mount -fav
to see if there are any errors in your fstab file.
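Note: device names like /dev/xvdb can change across reboots; a more robust fstab entry uses the UUID reported by blkid (the UUID below is a placeholder):
UUID=<uuid-from-blkid> /data1 ext3 defaults 1 2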
httpd
config file in the usual place: /etc/httpd/conf/httpd.conf
1. Set robots.txt to disallow all: /var/www/html (see the sketch after this list)
2. Link data dir(s) under: /var/www/html
3. Configure config file to allow links and to show directories
4. Show long file names...add to config file: IndexOptions FancyIndexing SuppressDescription NameWidth=*
5. start: apachectl start
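For step 1, the robots.txt in /var/www/html is just:
User-agent: *
Disallow: /
For step 3, the directory settings amount to something like the following in httpd.conf (a sketch; adjust the access-control syntax for the local httpd version):
<Directory "/var/www/html">
    Options Indexes FollowSymLinks
    IndexOptions FancyIndexing SuppressDescription NameWidth=*
</Directory>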
pdftotext
Downloads from: https://www.xpdfreader.com/download.html
Current version: 4.00; Released: 2017 Aug 10
1. Downloaded 64-bit Linux XpdfReader; executed: XpdfReader-linux64-4.00.01.run; unpacked and cp xpdf to /usr/local/bin
2. Downloaded 64-bit Linux Xpdf tools; unpacked and cp bin64/* to /usr/local/bin
3. Downloaded language support packages: Arabic, Chinese/simplified, Chinese/traditional, Cyrillic, Greek, Hebrew, Japanese, Korean, Latin2, Thai and Turkish; unzipped them all, cat all add-to-xpdfrc >> tmp_xpdfrc and cp all to /usr/local/share/xpdf
4. cat xpdf-tools-linux-4.00/doc/sample-xpdfrc tmp_xpdfrc >> /usr/local/etc/xpdfrc
NOTE: We found that pdftotext was not correctly reading the xpdfrc file in this location. We found no differences in extracted text when we removed the xpdfrc file and when we had it there. We did find a difference, especially in CJK PDFs, when we specified the xpdfrc file from the commandline with the -cfg option.
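For example (the output file name is arbitrary):
pdftotext -cfg /usr/local/etc/xpdfrc input.pdf output.txt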
ffmpeg
1. sudo yum install epel-release
2. sudo yum localinstall --nogpgcheck https://download1.rpmfusion.org/free/el/rpmfusion-free-release-7.noarch.rpm https://download1.rpmfusion.org/nonfree/el/rpmfusion-nonfree-release-7.noarch.rpm
3. sudo yum install ffmpeg ffmpeg-devel
Other data
See ApacheTikaHtmlEncodingStudy for a description of gathering data for TIKA-2038. See CommonCrawl3 for a description of refreshing data for TIKA-2750. See ComparisonTikaAndPDFToText201811 for a comparison of text extracted from PDFs by Apache Tika/Apache PDFBox and pdftotext.
Security
Review bad logins or bad httpd requests:
sudo tail /var/log/secure
sudo vi /var/log/httpd/access_log*
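A quick way to summarize repeat offenders (a sketch; the awk field position can vary with the log format, and on Ubuntu the equivalent log is /var/log/auth.log):
sudo grep "Failed password" /var/log/secure | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn | head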
Block ips
sudo vi /etc/hosts.deny
ALL: xxx.xx.xx.xxx
sudo systemctl restart sshd