Introduction
I wrote this article because I want to execute File Manager queries from Python code. The most obvious place to integrate Python code with the File Manager is at the client side of the XML-RPC interface.
I wanted to achieve my goals without needing to delve into the details of the Java code to figure the basics out.
The following is assumed:
- You have File Manager up and running.
- You have a catalog that you can query. Specifically I assume that you've ingested a blah.txt file (for getting a blah.txt file ingested, see OODT Filemgr User Guide).
- You have tcpdump (or WireShark) installed.
- You have python, ipython and the python xmlrpclib module installed (xmlrpclib should come as part of the default python installation).
Getting Started
It's possible to work out what is happening on the interface by looking at the content of the TCP packets.
So I wanted to dump the TCP packets to a file while the QueryTool (or a similar client) is communicating with the File Manager.
The sequence to capture the TCP packets is as follows:
- Start the File Manager
- Start tcpdump and listen on the loopback interface (lo0); you may need to tweak the tcpdump command to filter out other traffic.
- Execute the query_tool command
- Stop the tcpdump capture
- Analyse the tcp dump
Here's the command sequence I executed:
$ cd /usr/local/oodt/cas-filemgr/bin
$ ./filemgr start
$ cd ${HOME}
$ sudo tcpdump -pnXs0 -i lo0 -w tcpdump.output
# Run the query - see the query command below
$ ^C
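The tcpdump invocation above captures all loopback traffic. As a rough sketch of the "filter out other traffic" tweak, the snippet below builds the same command with a capture filter appended. The helper name `build_tcpdump_command` and the `tcp port 9000` filter are assumptions for illustration (the port is taken from the File Manager URL used in this article). Note that `lo0` is the loopback name on macOS/BSD; on Linux it is typically `lo`.

```python
import shlex


def build_tcpdump_command(interface="lo0", port=9000, output="tcpdump.output"):
    """Build the tcpdump argument list used above, plus a port filter.

    The trailing "tcp port N" expression is an assumed filter; adjust it
    to match the port your File Manager listens on.
    """
    return [
        "sudo", "tcpdump", "-pnXs0",
        "-i", interface,
        "-w", output,
        "tcp", "port", str(port),
    ]


# Print the command so it can be pasted into a shell.
print(shlex.join(build_tcpdump_command()))
```

The printed command can be run in place of the unfiltered one; stop it with Ctrl-C as before.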
The query command that I executed:
$ ./query_tool --url http://localhost:9000 --sql \
-query "SELECT CAS.ProductReceivedTime,CAS.ProductName,CAS.ProductId,ProductType,\
ProductStructure,Filename,FileLocation,MimeType \
FROM GenericFile WHERE Filename='blah.txt'" -sortBy 'CAS.ProductReceivedTime' \
-outputFormat '$CAS.ProductReceivedTime,$CAS.ProductName,$CAS.ProductId,$ProductType,\
$ProductStructure,$Filename,$FileLocation,$MimeType'
Analysing the TCP dump
By looking at the captured packets, I was able to discover the methods that were called on the interface. According to the XML-RPC specification, method calls have a methodCall element and method responses have a methodResponse element. Since I'm interested in the method calls, I ran the following command to find all the <methodCall> </methodCall> elements and their content:
strings tcpdump.output | grep methodCall
There are two methods that are called on the interface:
- filemgr.isAlive
- filemgr.complexQuery
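The strings/grep step above can also be approximated in Python. This is a minimal sketch, not part of the original workflow: it scans raw bytes for methodName elements and lists the distinct names. The function name `find_method_names` and the sample bytes are hypothetical; with a real capture you would read the file written by tcpdump instead. A request that is split across several packets may not appear as one contiguous run of bytes in the capture file, so treat this as a quick check only.

```python
import re


def find_method_names(capture):
    """Return distinct XML-RPC method names in order of first appearance."""
    names = []
    for match in re.finditer(rb"<methodName>([^<]+)</methodName>", capture):
        name = match.group(1).decode("ascii", errors="replace")
        if name not in names:
            names.append(name)
    return names


# Hypothetical stand-in for the bytes of a capture file. With a real
# capture, use: capture = open("tcpdump.output", "rb").read()
sample_capture = (
    b"\x00\x01POST / HTTP/1.1\r\n\r\n"
    b"<?xml version=\"1.0\"?><methodCall>"
    b"<methodName>filemgr.isAlive</methodName><params/></methodCall>"
    b"\x02\x03"
    b"<methodCall><methodName>filemgr.complexQuery</methodName>"
    b"<params></params></methodCall>"
)

print(find_method_names(sample_capture))
```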
The filemgr.complexQuery method is called with parameters. To work out the parameters, I copied the XML methodCall element into an editor and tidied it up a bit. Specifically, I used TextMate and its XML bundle's Tidy command.
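If you do not have TextMate to hand, the standard library can do a comparable tidy-up. The sketch below uses `xml.dom.minidom` to indent a methodCall; the helper name `tidy_xml` and the sample string are made up for illustration, and the exact layout will differ from what TextMate's Tidy command produces.

```python
import xml.dom.minidom


def tidy_xml(raw_xml):
    """Return an indented copy of a well-formed XML string."""
    return xml.dom.minidom.parseString(raw_xml).toprettyxml(indent="  ")


# Hypothetical, shortened methodCall as it might be copied from the capture.
sample = (
    "<methodCall><methodName>filemgr.isAlive</methodName>"
    "<params></params></methodCall>"
)

print(tidy_xml(sample))
```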
Running a Query from Python
Now to test these queries from an ipython shell.
In Python a <struct> element maps to a dictionary and an <array> element maps to a list. You can use the xmlrpclib.loads() function to decode the XML-RPC methodCall. Be aware that the captured methodCall might contain some backslashes, i.e. '\'. You'll need to delete these backslashes in a text editor before you paste the XML string in as a parameter to the loads function.
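To make the struct-to-dictionary and array-to-list mapping concrete, here is a small sketch that decodes a methodCall. It uses Python 3, where xmlrpclib was renamed to `xmlrpc.client`; under Python 2 the same function is available from xmlrpclib. The XML below is a shortened, hand-written example rather than the captured call, and `decode_method_call` is a hypothetical helper. It strips every backslash, which is only safe when no parameter value legitimately contains one.

```python
import xmlrpc.client

# Shortened, hand-written methodCall: one struct with a string member
# and an array member.
SAMPLE_CALL = """<?xml version="1.0"?>
<methodCall>
  <methodName>filemgr.complexQuery</methodName>
  <params>
    <param>
      <value>
        <struct>
          <member>
            <name>sortByMetKey</name>
            <value><string>CAS.ProductReceivedTime</string></value>
          </member>
          <member>
            <name>reducedProductTypeNames</name>
            <value>
              <array>
                <data>
                  <value><string>GenericFile</string></value>
                </data>
              </array>
            </value>
          </member>
        </struct>
      </value>
    </param>
  </params>
</methodCall>
"""


def decode_method_call(raw_xml):
    """Strip stray backslashes, then decode into (params, method_name)."""
    cleaned = raw_xml.replace("\\", "")
    return xmlrpc.client.loads(cleaned)


params, method_name = decode_method_call(SAMPLE_CALL)
print(method_name)                     # filemgr.complexQuery
print(type(params[0]).__name__)        # dict
print(params[0]["reducedProductTypeNames"])  # ['GenericFile']
```

The decoded `params` tuple holds one dictionary, which has the same shape as the query dictionary passed to filemgr.complexQuery.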
Here is my ipython session and the commands that I executed:
In [1]: import xmlrpclib
In [2]: fmrpc = xmlrpclib.ServerProxy('http://localhost:9000')
In [3]: fmrpc.filemgr.isAlive()
Out[3]: True
In [4]: query = {}
In [5]: query['reducedMetadata'] = ['CAS.ProductReceivedTime', 'CAS.ProductName', 'CAS.ProductId', 'ProductType', 'ProductStructure', 'Filename', 'FileLocation', 'MimeType']
In [6]: query['reducedProductTypeNames'] = ['GenericFile']
In [7]: query['sortByMetKey'] = 'CAS.ProductReceivedTime'
In [8]: query['toStringResultFormat'] = '$CAS.ProductReceivedTime,$CAS.ProductName,$CAS.ProductId,$ProductType,$ProductStructure,$Filename,$FileLocation,$MimeType'
In [9]: query['criteria'] = [{'elementName': 'Filename', 'class': 'org.apache.oodt.cas.filemgr.structs.TermQueryCriteria', 'elementValue': 'blah.txt'}]
In [10]: fmrpc.filemgr.complexQuery(query)
Out[10]:
[{'metadata': {'CAS.ProductId': ['a00616c6-f0c2-11e0-baf4-65c684787732'],
               'CAS.ProductName': ['blah.txt'],
               'CAS.ProductReceivedTime': ['2011-10-07T10:59:12.031+02:00'],
               'FileLocation': ['/var/kat/archive/data/blah.txt'],
               'Filename': ['blah.txt'],
               'MimeType': ['text/plain'],
               'ProductStructure': ['Flat'],
               'ProductType': ['GenericFile']},
  'product': {'id': 'a00616c6-f0c2-11e0-baf4-65c684787732',
              'name': 'blah.txt',
              'references': [],
              'structure': 'Flat',
              'transferStatus': 'RECEIVED',
              'type': {'description': 'The default product type for any kind of file.',
                       'id': 'urn:oodt:GenericFile',
                       'name': 'GenericFile',
                       'repositoryPath': 'file:///var/kat/archive/data',
                       'typeExtractors': [{'className': 'org.apache.oodt.cas.filemgr.metadata.extractors.CoreMetExtractor',
                                           'config': {'elementNs': 'CAS',
                                                      'elements': 'ProductReceivedTime,ProductName,ProductId',
                                                      'nsAware': 'true'}},
                                          {'className': 'org.apache.oodt.cas.filemgr.metadata.extractors.examples.MimeTypeExtractor',
                                           'config': {}},
                                          {'className': 'org.apache.oodt.cas.filemgr.metadata.extractors.examples.FinalFileLocationExtractor',
                                           'config': {'replace': 'true'}}],
                       'typeHandlers': [],
                       'typeMetadata': {'ProductType': ['GenericFile']},
                       'versionerClass': 'org.apache.oodt.cas.filemgr.versioning.BasicVersioner'}},
  'toStringFormat': '$CAS.ProductReceivedTime,$CAS.ProductName,$CAS.ProductId,$ProductType,$ProductStructure,$Filename,$FileLocation,$MimeType'}]
In [11]:
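The interactive session can be folded into a small script. The following is a sketch under a few assumptions: it targets Python 3 (`xmlrpc.client` instead of xmlrpclib), and the helper names `build_filename_query`, `file_locations` and `query_by_filename` are invented here. The dictionary keys and the criteria class name are copied from the session above; nothing beyond what the session shows is known about the server's behaviour.

```python
import xmlrpc.client

# Metadata keys requested in the session above.
RETURN_KEYS = [
    "CAS.ProductReceivedTime", "CAS.ProductName", "CAS.ProductId",
    "ProductType", "ProductStructure", "Filename", "FileLocation",
    "MimeType",
]


def build_filename_query(filename, product_type="GenericFile"):
    """Build the complexQuery struct for an exact Filename match."""
    return {
        "reducedMetadata": list(RETURN_KEYS),
        "reducedProductTypeNames": [product_type],
        "sortByMetKey": "CAS.ProductReceivedTime",
        "toStringResultFormat": ",".join("$" + key for key in RETURN_KEYS),
        "criteria": [{
            "class": "org.apache.oodt.cas.filemgr.structs.TermQueryCriteria",
            "elementName": "Filename",
            "elementValue": filename,
        }],
    }


def file_locations(results):
    """Collect the FileLocation values from a complexQuery result list."""
    locations = []
    for result in results:
        locations.extend(result["metadata"].get("FileLocation", []))
    return locations


def query_by_filename(url, filename):
    """Send the query to a running File Manager (not called in this sketch)."""
    proxy = xmlrpc.client.ServerProxy(url)
    return proxy.filemgr.complexQuery(build_filename_query(filename))


# Trimmed copy of the result shown in the session, used instead of a live call.
sample_results = [{
    "metadata": {
        "Filename": ["blah.txt"],
        "FileLocation": ["/var/kat/archive/data/blah.txt"],
    },
}]
print(file_locations(sample_results))
```

With a File Manager running, `file_locations(query_by_filename('http://localhost:9000', 'blah.txt'))` would be the equivalent of steps In [2] to In [10].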