Pushpull is deprecated... see

Introduction

OODT's PushPull component framework provides a client architecture for accessing a range of remote resources. The component pulls files from remote resources and pushes them to local ones, and it is typically used together with the CAS Crawler component. An example use case is pulling data products from a remote FTP service and pushing them to a local staging area, from which the CAS Crawler then ingests them into the File Manager.

Download and Install

  1. Download a released tarball/zip from the Downloads page (http://oodt.apache.org/download)
  2. Uncompress it
  3. cd into the apache-oodt-{version} folder
  4. Run mvn package

Maven has now downloaded the required dependencies into your local Maven repository (~/.m2) and built the component distributions. Next, deploy PushPull to your local machine or to another server.

Deployment

  1. cd into pushpull/target
  2. Copy the tarball (cas-pushpull-{version}-dist.tar.gz) to your deployment location
  3. Untar the tarball. You will have a folder named cas-pushpull-{version} with the following directory structure:

/bin
/etc
/lib
/logs
/policy

Configuration

Basic Configuration

These configuration steps must be completed to get the Push/Pull framework set up. They are required for even the most basic installations. Deployment-specific configuration is covered in the next section.

This documentation assumes that the environment variable CAS_PP_HOME has been set to the directory where you untarred the pushpull component. Several configuration properties require a full file path. Either replace [CAS_PP_HOME] with a value that is applicable to your deployment, or export that environment variable and use the following config as written.

Each of the following subsections names the path of a file that needs to be edited, followed by a block showing the changes to make.

[CAS_PP_HOME]/etc/push_pull_framework.properties

line
21   #external configuration files
22   org.apache.oodt.cas.pushpull.config.external.properties.files=[CAS_PP_HOME]/etc/default.properties

35   # ingester filemgr url
36   org.apache.oodt.cas.filemgr.url=

61   #protocolfactory specification for protocol types
62   org.apache.oodt.cas.pushpull.config.protocolfactory.info.files=[CAS_PP_HOME]/policy/ProtocolFactoryInfo.xml

69   #parser to retrievalmethod map
70   org.apache.oodt.cas.pushpull.config.parser.info.files=[CAS_PP_HOME]/policy/ParserToRetrievalMethodMap.xml
71
72   #unique metadata element info
73   org.apache.oodt.cas.pushpull.config.type.detection.file=[CAS_PP_HOME]/policy/mimetypes.xml
74
75   #directory below which all data file will be downloaded to
76   org.apache.oodt.cas.pushpull.data.files.base.staging.area=[CAS_PP_HOME]/staging
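If you prefer literal paths over exporting the environment variable, the placeholder can be substituted mechanically. The following is a minimal Python sketch of that substitution, not part of OODT: the helper name expand_placeholder and the directory /usr/local/oodt/cas-pushpull are assumptions made for illustration.

```python
def expand_placeholder(text, home, name="CAS_PP_HOME"):
    """Replace every [NAME] marker in text with the deployment directory."""
    # Drop a trailing slash so "[CAS_PP_HOME]/etc" does not become "//etc".
    return text.replace("[" + name + "]", home.rstrip("/"))


# Two of the properties shown above, still carrying the placeholder.
sample = (
    "org.apache.oodt.cas.pushpull.config.external.properties.files="
    "[CAS_PP_HOME]/etc/default.properties\n"
    "org.apache.oodt.cas.pushpull.data.files.base.staging.area="
    "[CAS_PP_HOME]/staging\n"
)

# Example location only (an assumption); use your own deployment directory.
expanded = expand_placeholder(sample, "/usr/local/oodt/cas-pushpull/")
print(expanded)
```

To apply this to the real file, read [CAS_PP_HOME]/etc/push_pull_framework.properties into a string, pass it through the helper, and write the result back after making a backup copy.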

Specific Configuration(s)

Because the possible combinations of protocols and remote data archives are practically limitless, the following list of examples is NOT exhaustive; it is intended to give you working starting points. Each configuration begins with a summary of the problem being solved, followed by the config/setup needed to solve it.

Examples

Example of Connecting to a Remote FTP Server to Retrieve All *.he5 Files

Connection Protocol: FTP
Root Path: ftp://l4ftl01.larc.nasa.gov/TES/TL2CO2N.005/
Password Required: NO
Download (All or Subset)?: All

Examples of full path to where the data resides on the FTP server:

ftp://l4ftl01.larc.nasa.gov/TES/TL2CO2N.005/2004.09.20/TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5
ftp://l4ftl01.larc.nasa.gov/TES/TL2CO2N.005/2005.05.21/TES-Aura_L2-CO2-Nadir_r0000002931_F06_08.he5

[CAS_PP_HOME]/policy/mimetypes.xml

Within the mimetypes.xml file we need to map a filename pattern (glob or regex) to a custom mimetype. Below we have 3 mimetypes: the first 2 are defaults in pushpull, and the 3rd is a custom one based on the file naming of our desired remote HDF-5 files.

<mime-info>
    <mime-type type="metadata/cas_pushpull">
        <glob pattern="*.info.tmp"/>
    </mime-type>
    <mime-type type="metadata/cas_metadata">
        <glob pattern="*.cas"/>
        <glob pattern="*.met"/>
    </mime-type>
    <mime-type type="product/TESLevel2CO2">
        <_comment>Level 2 - CO2 Retrievals from TES</_comment>
        <glob pattern="TES-Aura_L2-CO2-Nadir_r\d{10}\w{2}\d{2}\w\d{2}\.he5" isregex="true"/>
    </mime-type>
</mime-info>
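Before deploying a regex pattern it is worth checking it against a few real file names. The sketch below does this with Python's re module as a stand-in for the Java regex engine that pushpull runs on; the two engines agree on the \d and \w classes for ASCII names like these. Whether pushpull requires the pattern to match the whole name is an assumption here. Note that \w also matches the underscore, which is why \w{2} covers the "_F" part of the name.

```python
import re

# Pattern copied from the <glob isregex="true"> element above.
PATTERN = re.compile(r"TES-Aura_L2-CO2-Nadir_r\d{10}\w{2}\d{2}\w\d{2}\.he5")

names = [
    "TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5",
    "TES-Aura_L2-CO2-Nadir_r0000002931_F06_08.he5",
    # A temporary metadata file that should NOT be typed as a product.
    "TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5.info.tmp",
]

# fullmatch requires the whole name to match, with nothing left over.
results = {name: PATTERN.fullmatch(name) is not None for name in names}

for name, ok in results.items():
    print(("match     " if ok else "no match  ") + name)
```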

[CAS_PP_HOME]/etc/examples/ExternalSourcesFiles/ExternalSources.xml

Purpose: This file contains a list of external data sources, such as FTP servers. The alias attribute of each login element is what the RemoteSpecs.xml file refers to. The file is located in the etc/examples folder and contains several good examples that you can tailor to your application. I have removed all unused sources to make sure I don't download files I don't want. The host attribute of a source does not contain the URI prefix (ftp://, http://) and has NO trailing slash; the login type takes care of the prefix.

<sources>
    <source host="l4ftl01.larc.nasa.gov">
        <login type="ftp" alias="TESL2CO2">
            <username>anonymous</username>
            <password>user@host.com</password>
        </login>
    </source>
</sources>
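The two host rules above (no URI prefix, no trailing slash) are easy to get wrong when copying a URL from a browser. As a rough sketch, the following Python check parses the file with the standard library and reports violations. The function name host_problems is hypothetical and not part of OODT.

```python
import xml.etree.ElementTree as ET

# The ExternalSources.xml content from above, inlined for the example.
EXTERNAL_SOURCES = """
<sources>
    <source host="l4ftl01.larc.nasa.gov">
        <login type="ftp" alias="TESL2CO2">
            <username>anonymous</username>
            <password>user@host.com</password>
        </login>
    </source>
</sources>
"""


def host_problems(xml_text):
    """Return one message per host that has a URI prefix or a trailing slash."""
    problems = []
    for source in ET.fromstring(xml_text).iter("source"):
        host = source.get("host", "")
        if "://" in host:
            problems.append(host + ": remove the URI prefix")
        if host.endswith("/"):
            problems.append(host + ": remove the trailing slash")
    return problems


print(host_problems(EXTERNAL_SOURCES))  # an empty list means no problems found
```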

[CAS_PP_HOME]/etc/examples/RemoteSpecsFiles/RemoteSpecs.xml

Purpose: This file first references the aliases listed in the ExternalSources.xml file from the previous section. You can then define one or more daemons. Each daemon's alias attribute must match an alias listed in ExternalSources.xml so that the daemon knows where to look for files. The propInfo and propFiles elements tell the daemon exactly which directories and files to retrieve. We will need to create an XML file called TESL2CO2.xml and place it in the directory named by the propInfo dir attribute. For simplicity I have kept the alias, propFiles name and staging area the same (TESL2CO2). The period attribute on the runInfo tag sets the sleep/wait time for the daemon. The default is 3 minutes, but you may want to adjust this later in production.

<remoteSpecs>
    <aliasSpecs>
        <aliasSpec file="[CAS_PP_HOME]/etc/examples/ExternalSourcesFiles/ExternalSources.xml"/>
    </aliasSpecs>

    <daemons>
        <daemon alias="TESL2CO2" active="yes">
            <runInfo firstRunDateTime="2011-12-01T00:00:00Z" period="3m" runOnReboot="yes"/>
            <propInfo dir="[CAS_PP_HOME]/etc/examples/DirStructXmlParserFiles">
                <propFiles regExp="TESL2CO2\.xml" parser="org.apache.oodt.cas.pushpull.filerestrictions.parsers.DirStructXmlParser"/>
            </propInfo>
            <dataInfo stagingArea="TESL2CO2" deleteFromServer="no"/>
        </daemon>
    </daemons>
</remoteSpecs>
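Because a daemon alias has to match a login alias in ExternalSources.xml, a quick cross-check of the two files can catch typos before the daemon is started. This is a minimal Python sketch under that assumption. The helper names and the NOT_DEFINED alias are made up for the example, and the XML is trimmed to the attributes that matter.

```python
import xml.etree.ElementTree as ET

EXTERNAL_SOURCES = """
<sources>
    <source host="l4ftl01.larc.nasa.gov">
        <login type="ftp" alias="TESL2CO2"/>
    </source>
</sources>
"""

REMOTE_SPECS = """
<remoteSpecs>
    <daemons>
        <daemon alias="TESL2CO2" active="yes"/>
        <!-- hypothetical daemon with a deliberately undefined alias -->
        <daemon alias="NOT_DEFINED" active="yes"/>
    </daemons>
</remoteSpecs>
"""


def login_aliases(external_sources_xml):
    """Collect every alias declared on a <login> element."""
    root = ET.fromstring(external_sources_xml)
    return {login.get("alias") for login in root.iter("login")}


def unknown_daemon_aliases(remote_specs_xml, known):
    """List daemon aliases that have no matching login alias."""
    root = ET.fromstring(remote_specs_xml)
    return [d.get("alias") for d in root.iter("daemon") if d.get("alias") not in known]


unknown = unknown_daemon_aliases(REMOTE_SPECS, login_aliases(EXTERNAL_SOURCES))
print(unknown)  # prints ['NOT_DEFINED']
```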

[CAS_PP_HOME]/etc/examples/DirStructXmlParserFiles/TESL2CO2.xml

Purpose: This file tells pushpull how to parse the remote directory structure. In this example the starting_path is static for all of our remote file paths, but below it there are dynamic folders named in a YYYY.MM.DD format. We therefore use a simple regex so that pushpull descends into each dated subfolder, and a second regex to pick out the files we want by name.
Within the examples/DirStructXmlParserFiles there are several different examples to learn from.

<root>
    <dirstruct starting_path="/TES/TL2CO2N.005">
        <nofiles/>
        <dir name="\d{4}\.\d{2}\.\d{2}"> <!-- regex matching '2004.09.20' -->
            <nodirs/>
            <!-- regex matching TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5 -->
            <file name="TES-Aura_L2-CO2-Nadir_r\d{10}\w{2}\d{2}\w\d{2}\.he5"/>
        </dir>
    </dirstruct>
</root>
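To see which remote paths a file like this would select, you can dry-run it offline. The Python sketch below is only an approximation of the idea, not the DirStructXmlParser itself. It assumes one level of dir elements, whole-name regex matching, and that nofiles and nodirs mean nothing else is selected at that level. The function name selected is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# The TESL2CO2.xml content from above (raw string keeps the backslashes).
DIRSTRUCT = r"""
<root>
    <dirstruct starting_path="/TES/TL2CO2N.005">
        <nofiles/>
        <dir name="\d{4}\.\d{2}\.\d{2}">
            <nodirs/>
            <file name="TES-Aura_L2-CO2-Nadir_r\d{10}\w{2}\d{2}\w\d{2}\.he5"/>
        </dir>
    </dirstruct>
</root>
"""


def selected(path, xml_text=DIRSTRUCT):
    """Return True if path is starting_path/<matching dir>/<matching file>."""
    struct = ET.fromstring(xml_text).find("dirstruct")
    start = struct.get("starting_path").rstrip("/")
    if not path.startswith(start + "/"):
        return False
    parts = path[len(start) + 1:].split("/")
    if len(parts) != 2:  # only folder/file pairs are modeled in this sketch
        return False
    folder, filename = parts
    for d in struct.findall("dir"):
        if re.fullmatch(d.get("name"), folder) and any(
            re.fullmatch(f.get("name"), filename) for f in d.findall("file")
        ):
            return True
    return False


base = "/TES/TL2CO2N.005"
for p in [
    base + "/2004.09.20/TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5",
    base + "/2004.09.20/notes.txt",
    base + "/README.txt",
]:
    print(selected(p), p)
```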

Launching the PushPull Daemon

Within $CAS_PP_HOME/bin there is a shell script named pushpull that launches the PushPull daemon process. You will either need to edit the pushpull file directly to make the proper adjustments or export 2 environment variables. The following steps assume that we are starting the daemon with the configs listed above.

  1. cd $CAS_PP_HOME/bin
  2. Choose one of the two options below:
    1. Export 2 env vars (CAS_PP_RESOURCES and DAEMONLAUNCHER_PORT)
    2. Replace CAS_PP_RESOURCES and DAEMONLAUNCHER_PORT in the script with static values

      line
      25   ${JAVA_HOME}/bin/java \
      26   -cp ${LIB_DEPS} -Dcom.sun.management.jmxremote \
      27   -Djava.util.logging.config.file=../etc/logging.properties \
      28   -Djavax.net.ssl.trustStore=${CAS_PP_RESOURCES}/jssecacerts \
      29   org.apache.oodt.cas.pushpull.daemon.DaemonLauncher \
      30   --rmiRegistryPort ${DAEMONLAUNCHER_PORT} \
      31   --propertiesFile ${CAS_PP_RESOURCES}/push_pull_framework.properties \
      32   --remoteSpecsFile ${CAS_PP_RESOURCES}/examples/RemoteSpecsFiles/RemoteSpecs.xml
      
      # You can leave this file unchanged by merely exporting the following env vars (bash shell)
      
      export CAS_PP_RESOURCES=$CAS_PP_HOME/etc
      export DAEMONLAUNCHER_PORT=9012
      
      # Or you can hard-code the values in the script and skip the env vars
      line
      25   ${JAVA_HOME}/bin/java \
      26   -cp ${LIB_DEPS} -Dcom.sun.management.jmxremote \
      27   -Djava.util.logging.config.file=${CAS_PP_HOME}/etc/logging.properties \
      28   -Djavax.net.ssl.trustStore=${CAS_PP_HOME}/etc/jssecacerts \
      29   org.apache.oodt.cas.pushpull.daemon.DaemonLauncher \
      30   --rmiRegistryPort 9012 \
      31   --propertiesFile ${CAS_PP_HOME}/etc/push_pull_framework.properties \
      32   --remoteSpecsFile ${CAS_PP_HOME}/etc/examples/RemoteSpecsFiles/RemoteSpecs.xml
      
  3. ./pushpull

That should be about it. Given this config, the daemon should start up on port 9012.
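A quick way to confirm that something is listening on the configured port is a plain TCP connection attempt. The Python sketch below does only that. An open port shows that a process accepted the connection, not that downloads are working, so treat it as a first sanity check. The function name is_listening is hypothetical.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 9012 is the DAEMONLAUNCHER_PORT value used in this guide.
    print("daemon port open:", is_listening("localhost", 9012))
```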

FAQ Section

Pushpull keeps re-downloading files I have ingested. How can I prevent PushPull from repeatedly downloading products?

1. You need a File Manager that pushpull can query to see whether the product has already been ingested into the archive. Set its URL in push_pull_framework.properties:

# ingester filemgr url
org.apache.oodt.cas.filemgr.url=http://localhost:9000

2. Then update the <dataInfo> element within each <daemon> block of RemoteSpecs.xml and set queryElement="Filename". If you have multiple daemons configured, you will have to update each one.

<dataInfo stagingArea="MOD09GA-NRT" deleteFromServer="no" queryElement="Filename"/>
</daemon>
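With several daemons configured it is easy to miss one. As a rough sketch, the Python snippet below lists the daemons whose dataInfo element has no queryElement attribute. The helper name and the second daemon's alias and staging area are invented for the example, and the XML is trimmed to the relevant elements.

```python
import xml.etree.ElementTree as ET

REMOTE_SPECS = """
<remoteSpecs>
    <daemons>
        <daemon alias="TESL2CO2" active="yes">
            <dataInfo stagingArea="TESL2CO2" deleteFromServer="no" queryElement="Filename"/>
        </daemon>
        <!-- hypothetical second daemon that still lacks queryElement -->
        <daemon alias="EXAMPLE2" active="yes">
            <dataInfo stagingArea="EXAMPLE2" deleteFromServer="no"/>
        </daemon>
    </daemons>
</remoteSpecs>
"""


def daemons_missing_query_element(remote_specs_xml):
    """Return the aliases of daemons whose <dataInfo> lacks queryElement."""
    missing = []
    for daemon in ET.fromstring(remote_specs_xml).iter("daemon"):
        info = daemon.find("dataInfo")
        if info is None or not info.get("queryElement"):
            missing.append(daemon.get("alias"))
    return missing


missing = daemons_missing_query_element(REMOTE_SPECS)
print(missing)  # prints ['EXAMPLE2']
```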

No data files are downloaded to my staging directory after running the ./pushpull script. What should I do?

1. Make sure there are in fact qualifying data files on the remote FTP server.

2. The problem may be caused by protocol issues in the PushPull FTP plugin in use, so try one of the other PushPull FTP plugins. For details, refer to OODT Push Pull Plugins.