

Notes on setting up the Tika Virtual Machine (hosted by Rackspace)

See TIKA-1331

install software

  1. yum install java-1.7.0-openjdk-devel git svn httpd emacs ant

2. curl -s get.gvmtool.net | bash

3. source "/root/.gvm/bin/gvm-init.sh"

4. gvm install groovy

5. mv groovy to /usr/share/groovy

6. added /usr/share/groovy/current/bin to personal path via .bashrc

7. Maven:

    1. wget http://www.eng.lsu.edu/mirrors/apache/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.zip
    2. unzip apache-maven-3.6.0-bin.zip
    3. mv apache-maven-3.6.0/ /opt/maven
    4. ln -s /opt/maven/bin/mvn /usr/bin/mvn
    5. nano /etc/profile.d/maven.sh
       add these lines:
       #!/bin/bash
       MAVEN_HOME=/opt/maven
       PATH=$MAVEN_HOME/bin:$PATH
       export PATH MAVEN_HOME
       export CLASSPATH=.
    6. chmod +x /etc/profile.d/maven.sh
    7. . /etc/profile.d/maven.sh

8. ExifTool: follow directions here
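The Maven profile step above can also be done without opening an editor. The following is a minimal sketch, not the procedure that was actually used on the VM: `write_maven_profile` is a hypothetical helper name, and the target path is passed in as an argument so the function can be tried against a scratch file first.

```shell
# Sketch only: write the Maven profile script non-interactively.
# write_maven_profile is a hypothetical helper; $1 is the file to create
# (on the VM that would be /etc/profile.d/maven.sh, written as root).
write_maven_profile() {
    # The quoted 'EOF' keeps $MAVEN_HOME and $PATH literal in the output file.
    cat > "$1" <<'EOF'
#!/bin/bash
MAVEN_HOME=/opt/maven
PATH=$MAVEN_HOME/bin:$PATH
export PATH MAVEN_HOME
export CLASSPATH=.
EOF
    chmod +x "$1"
}
```

Usage on the VM would be `write_maven_profile /etc/profile.d/maven.sh`, followed by `. /etc/profile.d/maven.sh` to load it into the current shell.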

config/admin stuff

nano /etc/ssh/sshd_config

  • disallow root to log in
  • add allowed users to AllowUsers
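A quick way to confirm both sshd_config changes are in place is to grep for the two directives. This is a rough sketch under the assumption that root login is disabled with `PermitRootLogin no`; `check_sshd_config` is a hypothetical helper, and it only looks at uncommented lines.

```shell
# Sketch only: report whether an sshd_config disables root login and
# defines an AllowUsers list. check_sshd_config is a hypothetical helper;
# $1 is the config file to inspect (normally /etc/ssh/sshd_config).
check_sshd_config() {
    cfg="$1"
    status=0
    # sshd keywords are case-insensitive, hence -i.
    if ! grep -Eiq '^[[:space:]]*PermitRootLogin[[:space:]]+no([[:space:]]|$)' "$cfg"; then
        echo "PermitRootLogin no is not set"
        status=1
    fi
    if ! grep -Eiq '^[[:space:]]*AllowUsers[[:space:]]+[^[:space:]]' "$cfg"; then
        echo "AllowUsers is not set"
        status=1
    fi
    return "$status"
}
```

Running `sshd -t` as root before restarting sshd additionally checks the file for syntax errors, which helps avoid locking yourself out.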

Install, configure and run fail2ban. Thanks to tecmint.

yum install fail2ban

vi /etc/fail2ban/jail.conf

systemctl start fail2ban
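The notes above edit jail.conf directly. An alternative is to put overrides in /etc/fail2ban/jail.local, which fail2ban reads after jail.conf. The sketch below assumes a fail2ban version where the SSH jail is named `sshd`; `write_jail_local` is a hypothetical helper, and the numeric values are illustrative, not the settings used on this VM.

```shell
# Sketch only: write a minimal sshd jail override.
# write_jail_local is a hypothetical helper; $1 is the file to create
# (on the VM that would be /etc/fail2ban/jail.local, written as root).
# maxretry/findtime/bantime are example values, not the VM's settings.
write_jail_local() {
    cat > "$1" <<'EOF'
[sshd]
enabled = true
maxretry = 5
findtime = 600
bantime = 3600
EOF
}
```

After restarting fail2ban, `fail2ban-client status sshd` should list the jail, and `systemctl enable fail2ban` makes the service start at boot.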

opening port 9998 for TIKA-1301

To open port 9998:
firewall-cmd --zone=public --add-port=9998/tcp --permanent

firewall-cmd --reload
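To confirm the port is open after the reload, the output of `firewall-cmd --zone=public --list-ports` can be checked. A small sketch, where `port_listed` is a hypothetical helper that does an exact match so that 9998/tcp is not confused with, say, 998/tcp:

```shell
# Sketch only: succeed if port spec $1 (e.g. 9998/tcp) appears as a whole
# entry in the space-separated list $2. port_listed is a hypothetical helper.
port_listed() {
    for p in $2; do
        [ "$p" = "$1" ] && return 0
    done
    return 1
}
```

Usage: `port_listed 9998/tcp "$(firewall-cmd --zone=public --list-ports)" && echo open`.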

permission management

  1. adduser <user> sudo

2. passwd <user>

3. groupadd <admingroup>

4. usermod -g <admingroup> <user>

5. modify /etc/ssh/sshd_config to add user to AllowUsers

6. systemctl restart sshd.service (restarts sshd on RHEL 7)

mkdirs

/public/corpora/govdocs1 ...

prep govdocs1

  1. cp zipfilelist /public/corpora/govdocs1/archive

2. wget -i zipfilelist.txt

3. (go get some coffee)

4. cd govdocs1/scripts

5. groovy unzip.groovy 0

6. (go get some more coffee)

7. groovy rmBugged.groovy

prep nsfpolardata

  1. scp -r <user>@nsfpolardata.dyndns.org:/usr/local/ndeploy/data/AcadisCrawl .

2. (go get some coffee)

3. scp -r <user>@nsfpolardata.dyndns.org:/usr/local/ndeploy/data/AcadisCrawl2 .

4. (go get some coffee)

5. scp -r <user>@nsfpolardata.dyndns.org:/home/mattmann/polar-data/nutch_trunk/runtime/local/bin/crawlId .

6. (go get some coffee)

7. cd /data1/public/archives/nsf-polar-data/

8. export NUTCH_OPTS="-Xmx8192m -XX:MaxPermSize=8192m"

9. ./bin/nutch dump -outputDir out -segment /data1/public/archives/nsf-polar-data/acadis/AcadisCrawl/segments/

10. ./bin/nutch dump -outputDir out2 -segment /data1/public/archives/nsf-polar-data/acadis/AcadisCrawl2/segments/

11. ./bin/nutch dump -outputDir out3 -segment /data1/public/archives/nsf-polar-data/nasa-amd/crawlId/segments/

add more disk

From the Rackspace website, add a block storage volume and attach it to the server.

  1. mkfs.ext3 /dev/xvdb

2. mkdir /data1

3. mount /dev/xvdb /data1

4. nano /etc/fstab
add these lines (e.g.):
/dev/xvdb /data1 ext3 defaults 1 2

/dev/xvdc /data2 ext3 defaults 1 2

4b. When you wreck the fstab file and can't log into your system after a hard reboot and you are in recovery mode:

blkid to figure out the drive types

mount -t ext3 /dev/xvdb1 /mnt to mount the system

4c. Before you hit 4b, try mount -fav to see if there are any errors in your fstab file.
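In addition to `mount -fav`, a simple field-count check catches the most common fstab typo, a missing or extra column. This is a sketch only: `check_fstab_fields` is a hypothetical helper, and it checks nothing beyond the number of fields (fstab entries have four to six, since the last two are optional).

```shell
# Sketch only: flag fstab lines with a suspicious number of fields.
# check_fstab_fields is a hypothetical helper; $1 is the file to check.
# It prints each offending line and returns non-zero if any are found.
check_fstab_fields() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }   # skip blank lines and comments
        NF < 4 || NF > 6 { printf "line %d: %s\n", NR, $0; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Usage: `check_fstab_fields /etc/fstab && mount -fav`.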

httpd

config file in the usual place : /etc/httpd/conf/httpd.conf

  1. Add a robots.txt that disallows all crawlers under: /var/www/html

2. Link data dir(s) under: /var/www/html

3. Edit the config file to follow symlinks and to show directory listings

4. Show long file names...add to config file: IndexOptions FancyIndexing SuppressDescription NameWidth=*

5. start: apachectl start
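The httpd steps above can be sketched as follows. Both function names are hypothetical, and the `<Directory>` block is an assumption about how "allow links and show directories" maps onto Apache directives (`Options +Indexes +FollowSymLinks`); the actual httpd.conf on the VM may differ.

```shell
# Sketch only. write_robots_txt and print_index_conf are hypothetical helpers.

# Write a robots.txt that asks all crawlers to stay out.
# $1 is the document root (on the VM: /var/www/html).
write_robots_txt() {
    printf 'User-agent: *\nDisallow: /\n' > "$1/robots.txt"
}

# Print an httpd.conf fragment that follows symlinks, shows directory
# listings, and does not truncate long file names.
print_index_conf() {
    cat <<'EOF'
<Directory "/var/www/html">
    Options +Indexes +FollowSymLinks
    IndexOptions FancyIndexing SuppressDescription NameWidth=*
</Directory>
EOF
}
```

After changing the config, `apachectl configtest` checks the syntax before a restart.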

pdftotext

Downloads from: https://www.xpdfreader.com/download.html

Current version: 4.00; Released: 2017 Aug 10

  1. Downloaded 64-bit Linux XpdfReader; executed: XpdfReader-linux64-4.00.01.run; unpacked and cp xpdf to /usr/local/bin

2. Downloaded 64-bit Linux Xpdf tools; unpacked and cp bin64/* to /usr/local/bin

3. Downloaded language support packages: Arabic, Chinese/simplified, Chinese/traditional, Cyrillic, Greek, Hebrew, Japanese, Korean, Latin2, Thai and Turkish; unzipped them all, cat all add-to-xpdfrc >> tmp_xpdfrc and cp all to /usr/local/share/xpdf

4. cat xpdf-tools-linux-4.00/doc/sample-xpdfrc tmp_xpdfrc >> /usr/local/etc/xpdfrc

NOTE: pdftotext did not appear to read the xpdfrc file from this location: the extracted text was identical with and without the file in place. Specifying the xpdfrc file on the command line with the -cfg option did change the output, especially for CJK PDFs.
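Given the note about -cfg, one option is to always pass the config file explicitly. A minimal sketch, assuming the Xpdf build of pdftotext (the -cfg option is specific to Xpdf); `extract_text` and the `PDFTOTEXT` override variable are hypothetical names, and `-enc UTF-8` is an added assumption about the desired output encoding.

```shell
# Sketch only: run pdftotext with an explicit xpdfrc.
# extract_text is a hypothetical wrapper:
#   $1 = xpdfrc path, $2 = input PDF, $3 = output text file.
# PDFTOTEXT (hypothetical variable) can point at another binary or a stub.
extract_text() {
    "${PDFTOTEXT:-pdftotext}" -cfg "$1" -enc UTF-8 "$2" "$3"
}
```

Usage: `extract_text /usr/local/etc/xpdfrc in.pdf out.txt`.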

ffmpeg

  1. sudo yum install epel-release

2. sudo yum localinstall --nogpgcheck https://download1.rpmfusion.org/free/el/rpmfusion-free-release-7.noarch.rpm https://download1.rpmfusion.org/nonfree/el/rpmfusion-nonfree-release-7.noarch.rpm

3. sudo yum install ffmpeg ffmpeg-devel

Other data

See ApacheTikaHtmlEncodingStudy for a description of gathering data for TIKA-2038. See CommonCrawl3 for a description of refreshing data for TIKA-2750. See ComparisonTikaAndPDFToText201811 for a comparison of text extracted from PDFs by Apache Tika/Apache PDFBox and pdftotext.


Security

Review bad logins or bad httpd requests:

sudo tail /var/log/secure
sudo vi /var/log/httpd/access_log*
Block IPs:
sudo vi /etc/hosts.deny
ALL: xxx.xx.xx.xxx
sudo systemctl restart sshd
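When reviewing /var/log/secure, a per-IP count of failed logins makes it easier to pick candidates for hosts.deny. This is a rough sketch that assumes the usual sshd message format ("Failed password for ... from <ip> port ..."); `failed_logins_by_ip` is a hypothetical helper.

```shell
# Sketch only: count "Failed password" entries per source IP, most
# frequent first. failed_logins_by_ip is a hypothetical helper; it reads
# the files given as arguments, or stdin if none are given.
failed_logins_by_ip() {
    awk '
        /Failed password/ {
            # The source address is the word right after "from".
            for (i = 1; i < NF; i++)
                if ($i == "from") { count[$(i + 1)]++; break }
        }
        END { for (ip in count) print count[ip], ip }
    ' "$@" | sort -rn
}
```

Usage: `sudo cat /var/log/secure | failed_logins_by_ip | head`.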