Introduction
OODT's PushPull component framework provides a client architecture for accessing an array of remote resources. This component is used to pull from remote resources and push to local ones. It is typically used in conjunction with the CAS Crawler component. An example use case would be pulling data products from a remote FTP service and pushing them to a local staging area, from which the CAS Crawler then ingests them into the File Manager.
Download and Install
- Download a released tarball/zip from the Downloads page (http://oodt.apache.org/download)
- Uncompress it
- cd into the apache-oodt-{version} folder
- mvn package
Maven has now downloaded the required artifacts into your local Maven repository (~/.m2) and built the component tarballs. Next comes deployment to your local machine or to another server.
Deployment
- cd into pushpull/target
- Copy the tarball (cas-pushpull-{version}-dist.tar.gz) to your deployment location
- Untar the tarball and you will have a folder named cas-pushpull-{version} with the following directory structure:
/bin
/etc
/lib
/logs
/policy
Configuration
Basic Configuration
These configuration steps must be completed to get the Push/Pull framework set up. They are required for even the most basic installations. We will cover deployment-specific configuration in the next section.
This documentation assumes that the environment variable CAS_PP_HOME is set to the directory where you untarred the pushpull component. Several configuration properties require a full file path, so either replace [CAS_PP_HOME] with the value that applies to your deployment, or export that environment variable and use the following config.
The following subsections give the path to each file that needs to be edited. Each path is followed by a block showing the changes to make.
[CAS_PP_HOME]/etc/push_pull_framework.properties
#external configuration files
org.apache.oodt.cas.pushpull.config.external.properties.files=[CAS_PP_HOME]/etc/default.properties

# ingester filemgr url
org.apache.oodt.cas.filemgr.url=

#protocolfactory specification for protocol types
org.apache.oodt.cas.pushpull.config.protocolfactory.info.files=[CAS_PP_HOME]/policy/ProtocolFactoryInfo.xml

#parser to retrievalmethod map
org.apache.oodt.cas.pushpull.config.parser.info.files=[CAS_PP_HOME]/policy/ParserToRetrievalMethodMap.xml

#unique metadata element info
org.apache.oodt.cas.pushpull.config.type.detection.file=[CAS_PP_HOME]/policy/mimetypes.xml

#directory below which all data file will be downloaded to
org.apache.oodt.cas.pushpull.data.files.base.staging.area=[CAS_PP_HOME]/staging
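The config blocks in this guide use [CAS_PP_HOME] as a placeholder for the install directory. A minimal sketch of swapping that placeholder for a real path is shown here; the helper name expand_home and the path /usr/local/oodt/cas-pushpull are made up for illustration and are not part of pushpull.

```python
import os

PLACEHOLDER = "[CAS_PP_HOME]"


def expand_home(text, home=None):
    """Replace every [CAS_PP_HOME] placeholder with a concrete directory.

    If no directory is passed in, fall back to the CAS_PP_HOME environment
    variable (a KeyError is raised when it has not been exported).
    """
    if home is None:
        home = os.environ["CAS_PP_HOME"]
    # Drop a trailing slash so "[CAS_PP_HOME]/etc" does not become "//etc".
    return text.replace(PLACEHOLDER, home.rstrip("/"))


# Hypothetical install location, used only for this example.
line = "org.apache.oodt.cas.pushpull.data.files.base.staging.area=[CAS_PP_HOME]/staging"
print(expand_home(line, "/usr/local/oodt/cas-pushpull"))
```

The same function can be run over the whole contents of a properties or XML file before saving it to your deployment.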
Specific Configuration(s)
Due to the limitless combinations of protocols and remote data archives, the following list of examples is NOT exhaustive; it is intended to give you working examples. Each configuration begins with a summary of the problem being solved, followed by the config/setup needed to solve it.
Examples
Example of Connecting to a Remote FTP Server to Retrieve All *.he5 Files
Connection Protocol: FTP
Root Path: ftp://l4ftl01.larc.nasa.gov/TES/TL2CO2N.005/
Password Required: NO
Download (All or Subset)?: All
Examples of full path to where the data resides on the FTP server:
ftp://l4ftl01.larc.nasa.gov/TES/TL2CO2N.005/2004.09.20/TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5
ftp://l4ftl01.larc.nasa.gov/TES/TL2CO2N.005/2005.05.21/TES-Aura_L2-CO2-Nadir_r0000002931_F06_08.he5
[CAS_PP_HOME]/policy/mimetypes.xml
Within the mimetypes.xml file we need to map a filename pattern (regex or not) to a custom mimetype. Below we have 3 mimetypes: the first 2 are defaults in pushpull, and the 3rd is a custom one based on the file naming of our desired HDF-5 remote files.
<mime-info>
  <mime-type type="metadata/cas_pushpull">
    <glob pattern="*.info.tmp"/>
  </mime-type>
  <mime-type type="metadata/cas_metadata">
    <glob pattern="*.cas"/>
    <glob pattern="*.met"/>
  </mime-type>
  <mime-type type="product/TESLevel2CO2">
    <_comment>Level 2 - CO2 Retrievals from TES</_comment>
    <glob pattern="TES-Aura_L2-CO2-Nadir_r\d{10}\w{2}\d{2}\w\d{2}\.he5" isregex="true"/>
  </mime-type>
</mime-info>
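Before starting the daemon it can be worth checking that the custom regex really matches the remote filenames. The sketch below tries the pattern against the two example filenames from this page. It uses Python's re module rather than the Java regex engine pushpull runs on, and it assumes the pattern must match the whole filename; for a pattern this simple the two engines should agree, but treat it as a sanity check only.

```python
import re

# Pattern copied from the product/TESLevel2CO2 glob (isregex="true").
TES_PATTERN = re.compile(r"TES-Aura_L2-CO2-Nadir_r\d{10}\w{2}\d{2}\w\d{2}\.he5")


def is_tes_product(filename):
    """Return True when the whole filename matches the custom pattern."""
    return TES_PATTERN.fullmatch(filename) is not None


examples = [
    "TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5",
    "TES-Aura_L2-CO2-Nadir_r0000002931_F06_08.he5",
    "README.txt",
]
for name in examples:
    print(name, is_tes_product(name))
```

Note that the underscores in names like _F06_09 are matched by \w, which is why the pattern has no literal underscore after the run number.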
[CAS_PP_HOME]/etc/examples/ExternalSourcesFiles/ExternalSources.xml
Purpose: This file contains a list of External Data Sources such as FTP servers. The login.alias attribute will be used within the RemoteSpecs.xml file. This file is located in the etc/examples folder and contains several great examples that you can tailor to your application. I have removed all unused ExternalSources to make sure I don't download files I don't want. The source.host does not contain the URI prefix (ftp://, http://) and there is NO trailing slash. The login.type takes care of the prefix.
<sources>
  <source host="l4ftl01.larc.nasa.gov">
    <login type="ftp" alias="TESL2CO2">
      <username>anonymous</username>
      <password>user@host.com</password>
    </login>
  </source>
</sources>
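The host rules above (no URI prefix, no trailing slash) are easy to get wrong when editing by hand. Here is a small sketch that parses an ExternalSources.xml string and reports the base URL each alias would point at. The function name check_sources is my own, and joining login.type and host into a URL is only an illustration of the rule, not a copy of how pushpull builds its connections internally.

```python
import xml.etree.ElementTree as ET

EXTERNAL_SOURCES = """<sources>
  <source host="l4ftl01.larc.nasa.gov">
    <login type="ftp" alias="TESL2CO2">
      <username>anonymous</username>
      <password>user@host.com</password>
    </login>
  </source>
</sources>"""


def check_sources(xml_text):
    """Return {alias: base_url}; raise ValueError on a malformed host."""
    result = {}
    for source in ET.fromstring(xml_text).iter("source"):
        host = source.get("host", "")
        # The host must be a bare hostname: no scheme, no trailing slash.
        if "://" in host or host.endswith("/"):
            raise ValueError(f"host must be a bare hostname: {host!r}")
        for login in source.iter("login"):
            result[login.get("alias")] = f"{login.get('type')}://{host}"
    return result


print(check_sources(EXTERNAL_SOURCES))
```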
[CAS_PP_HOME]/etc/examples/RemoteSpecsFiles/RemoteSpecs.xml
Purpose: This file first references the aliases listed in the ExternalSources.xml file from the previous section. You can then define one or more daemons. Each daemon.alias must be listed in ExternalSources.xml so the daemon knows where to look for files. The propInfo and propFiles elements tell the daemon exactly which directories and files to retrieve. We will need to create an XML file called TESL2CO2.xml and place it in the propInfo.dir location. For simplicity I have kept the alias, propFiles and staging area the same (TESL2CO2). The period attribute on the runInfo tag sets the sleep/wait time for the daemon. The default is 3 minutes, but you may want to adjust this later in production.
<remoteSpecs>
  <aliasSpecs>
    <aliasSpec file="[CAS_PP_HOME]/etc/examples/ExternalSourcesFiles/ExternalSources.xml"/>
  </aliasSpecs>
  <daemons>
    <daemon alias="TESL2CO2" active="yes">
      <runInfo firstRunDateTime="2011-12-01T00:00:00Z" period="3m" runOnReboot="yes"/>
      <propInfo dir="[CAS_PP_HOME]/etc/examples/DirStructXmlParserFiles">
        <propFiles regExp="TESL2CO2\.xml" parser="org.apache.oodt.cas.pushpull.filerestrictions.parsers.DirStructXmlParser"/>
      </propInfo>
      <dataInfo stagingArea="TESL2CO2" deleteFromServer="no"/>
    </daemon>
  </daemons>
</remoteSpecs>
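Since every daemon alias has to exist in ExternalSources.xml, a quick cross-check of the two files can catch a mistyped alias before the daemon is launched. This is a sketch only: missing_aliases is a made-up helper, and it takes the two files as XML strings rather than following the aliasSpec file path itself.

```python
import xml.etree.ElementTree as ET


def missing_aliases(remote_specs_xml, external_sources_xml):
    """Return daemon aliases that have no matching login alias."""
    known = {
        login.get("alias")
        for login in ET.fromstring(external_sources_xml).iter("login")
    }
    daemons = [
        daemon.get("alias")
        for daemon in ET.fromstring(remote_specs_xml).iter("daemon")
    ]
    return [alias for alias in daemons if alias not in known]


sources = '<sources><source host="l4ftl01.larc.nasa.gov"><login type="ftp" alias="TESL2CO2"/></source></sources>'
specs = '<remoteSpecs><daemons><daemon alias="TESL2CO2" active="yes"/></daemons></remoteSpecs>'
print(missing_aliases(specs, sources))  # an empty list means every alias resolves
```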
[CAS_PP_HOME]/etc/examples/DirStructXmlParserFiles/TESL2CO2.xml
Purpose: This file tells pushpull how to parse the remote directory structure. In this example the starting_path is static for all of our remote file paths, but below it there are dynamic folders named in a YYYY.MM.DD format. We match those folders with a simple regex, so pushpull will descend into each matching subfolder and pull out the files whose names match a second regex.
Within the examples/DirStructXmlParserFiles there are several different examples to learn from.
<root>
  <dirstruct starting_path="/TES/TL2CO2N.005">
    <nofiles/>
    <dir name="\d{4}\.\d{2}\.\d{2}">
      <!-- regex matching '2004.09.20' -->
      <nodirs/>
      <!-- regex matching TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5 -->
      <file name="TES-Aura_L2-CO2-Nadir_r\d{10}\w{2}\d{2}\w\d{2}\.he5"/>
    </dir>
  </dirstruct>
</root>
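To see which remote paths this structure is meant to select, the sketch below applies the same starting path and the two regexes to a few candidate paths. It does not reproduce DirStructXmlParser. It assumes that <nofiles/> means no files are taken directly under the starting path and that <nodirs/> means no deeper subfolders are followed, and it uses Python's re module in place of Java regexes.

```python
import re

STARTING_PATH = "/TES/TL2CO2N.005"
DIR_RE = re.compile(r"\d{4}\.\d{2}\.\d{2}")
FILE_RE = re.compile(r"TES-Aura_L2-CO2-Nadir_r\d{10}\w{2}\d{2}\w\d{2}\.he5")


def is_selected(remote_path):
    """Return True for starting_path/<date dir>/<matching file> only."""
    prefix = STARTING_PATH + "/"
    if not remote_path.startswith(prefix):
        return False
    parts = remote_path[len(prefix):].split("/")
    if len(parts) != 2:
        # Either a file directly under the starting path or a deeper subfolder.
        return False
    directory, filename = parts
    return bool(DIR_RE.fullmatch(directory) and FILE_RE.fullmatch(filename))


candidates = [
    "/TES/TL2CO2N.005/2004.09.20/TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5",
    "/TES/TL2CO2N.005/TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5",
    "/TES/TL2CO2N.005/2004.09.20/notes.txt",
]
for path in candidates:
    print(path, is_selected(path))
```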
Launching the PushPull Daemon
Located within $CAS_PP_HOME/bin there is a shell script that you can use to launch the PushPull daemon process. You will either need to edit the pushpull file directly to make the proper adjustments or export 2 environment variables. The following steps will assume that we are starting the daemon to run using the configs listed above.
- cd $CAS_PP_HOME/bin
- Choose one of the two options below:
- Export 2 env vars
- Replace CAS_PP_RESOURCES and DAEMONLAUNCHER_PORT with static values
[CAS_PP_HOME]/bin/pushpull (lines 25-32)
${JAVA_HOME}/bin/java \
    -cp ${LIB_DEPS} -Dcom.sun.management.jmxremote \
    -Djava.util.logging.config.file=../etc/logging.properties \
    -Djavax.net.ssl.trustStore=${CAS_PP_RESOURCES}/jssecacerts \
    org.apache.oodt.cas.pushpull.daemon.DaemonLauncher \
    --rmiRegistryPort ${DAEMONLAUNCHER_PORT} \
    --propertiesFile ${CAS_PP_RESOURCES}/push_pull_framework.properties \
    --remoteSpecsFile ${CAS_PP_RESOURCES}/examples/RemoteSpecsFiles/RemoteSpecs.xml

# You can leave this file unchanged by merely exporting the following env vars (bash shell)
export CAS_PP_RESOURCES=$CAS_PP_HOME/etc
export DAEMONLAUNCHER_PORT=9012

# Or you can always use this config and not setup env vars
${JAVA_HOME}/bin/java \
    -cp ${LIB_DEPS} -Dcom.sun.management.jmxremote \
    -Djava.util.logging.config.file=${CAS_PP_HOME}/etc/logging.properties \
    -Djavax.net.ssl.trustStore=${CAS_PP_HOME}/etc/jssecacerts \
    org.apache.oodt.cas.pushpull.daemon.DaemonLauncher \
    --rmiRegistryPort 9012 \
    --propertiesFile ${CAS_PP_HOME}/etc/push_pull_framework.properties \
    --remoteSpecsFile ${CAS_PP_HOME}/etc/examples/RemoteSpecsFiles/RemoteSpecs.xml
- ./pushpull
That should be about it. The daemon should start up on port 9012 (given this config).
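If the daemon fails to start, one thing worth ruling out is a wrong CAS_PP_RESOURCES value. The sketch below lists the files the launcher script expects under that directory and reports any that are missing. The helper names are hypothetical, and the file list is taken only from the script excerpt on this page, so adjust it if your script differs.

```python
import os


def launcher_inputs(resources_dir):
    """Paths that bin/pushpull derives from CAS_PP_RESOURCES."""
    return {
        "--propertiesFile": os.path.join(
            resources_dir, "push_pull_framework.properties"),
        "--remoteSpecsFile": os.path.join(
            resources_dir, "examples", "RemoteSpecsFiles", "RemoteSpecs.xml"),
        "trustStore": os.path.join(resources_dir, "jssecacerts"),
    }


def missing_inputs(resources_dir):
    """Return the expected paths that are not regular files."""
    return [
        path for path in launcher_inputs(resources_dir).values()
        if not os.path.isfile(path)
    ]


# Falls back to a relative "etc" directory when the variable is not exported.
resources = os.environ.get("CAS_PP_RESOURCES", "etc")
for path in missing_inputs(resources):
    print("missing:", path)
```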
FAQ Section
Pushpull keeps re-downloading files I have ingested. How can I prevent PushPull from repeatedly downloading products?
1. You will need a File Manager that pushpull can query to see whether the product has already been ingested into the archive.
# ingester filemgr url
org.apache.oodt.cas.filemgr.url=http://localhost:9000
2. Then update the <dataInfo> element within the <daemon> block of the RemoteSpecs.xml file and set queryElement="Filename". If you have multiple daemons configured, you will have to configure each one.
<dataInfo stagingArea="MOD09GA-NRT" deleteFromServer="no" queryElement="Filename"/>
</daemon>
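With several daemons it is easy to miss one. A sketch for listing the daemons in a RemoteSpecs.xml string whose <dataInfo> element has no queryElement attribute follows; the function name is made up for this example, and the second alias in the sample is invented.

```python
import xml.etree.ElementTree as ET


def daemons_without_query_element(remote_specs_xml):
    """Return aliases of daemons whose dataInfo lacks queryElement."""
    missing = []
    for daemon in ET.fromstring(remote_specs_xml).iter("daemon"):
        data_info = daemon.find("dataInfo")
        if data_info is None or data_info.get("queryElement") is None:
            missing.append(daemon.get("alias"))
    return missing


# "OTHER" is a hypothetical second daemon used only to show a miss.
sample = """<remoteSpecs>
  <daemons>
    <daemon alias="MOD09GA-NRT" active="yes">
      <dataInfo stagingArea="MOD09GA-NRT" deleteFromServer="no" queryElement="Filename"/>
    </daemon>
    <daemon alias="OTHER" active="yes">
      <dataInfo stagingArea="OTHER" deleteFromServer="no"/>
    </daemon>
  </daemons>
</remoteSpecs>"""
print(daemons_without_query_element(sample))
```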
No data file is downloaded to my staging directory after running the ./pushpull script. What should I do?
1. Make sure the remote FTP server actually contains data files that match your configuration.
2. The problem may be caused by protocol issues in the PushPull FTP plugin you are using, so try one of the other PushPull FTP plugins. For details, refer to OODT Push Pull Plugins.
2 Comments
Luca Cinquini
Hi Cameron,
this is an excellent guide, thanks for taking the time to write it. I assume it will become the PushPull User Guide on the OODT Apache site.
I think there is only one piece of information missing: how to start the daemon with the script provided in the bin directory. This will also explain how the DaemonLauncher takes as input the RemoteSpecs.xml file, which in turn references the other two files in the examples directory. At first I was having trouble figuring out how these XML files would be loaded at startup, and it turns out it's from the bin/ script.
thanks again, this is great.
Luca
Cameron Goodale
Hey Luca,
Sorry about the long delay in fixing up the How to Launch the Daemon section, but I just added it in today.
I am not sure where to configure the amount of time the pushpull daemon will sleep and I have been opening files all over the place. If someone else knows off the top of their head, then I will gladly add it to the Launch section. Hopefully this addition will help new users get up to speed.
Thank you for providing your feedback, I appreciate it.