How to Write an Output Connector - Apache Connectors Framework

Farzad

I'd like to create an output connector where I can detect duplicate files (documents) in the repository. My thought is to store a hashsum value for each document, SHA-512, and store that in a table and set it as a primary key. Subsequent insert with the same value would be rejected indicating a duplicate.

I read connectors are allowed to have their own tables. What would the table name be? I issue SQL over JDBC to modify table data? How do you establish a connection to the db? Any other thoughts about my approach?

Thanks!

Permalink

Karl Wright
The best answer I can give right at the moment is to look at what the web connector does for tables. You will find that it creates two of them. Follow the same conventions and you should be OK. Naming is up to you, but obviously a good practice is to use a prefix that is likely to keep your table name from colliding with other tables from other connectors, or from future versions of the framework. If or when there is a ManifoldCF book, be assured that this topic will be covered in depth.
- Permalink
- Oct 13, 2010
- Delete comments
1. Farzad
  Thanks for the pointer, didn't know I'd be naming them. Any info on setting up the project in Eclipse to build out connectors? I tried using the build file and it didn't work. I also did it a project checkout and that just created a general project. Should I just pull the source I need and resolve linkage issues, is that the way?
  Permalink
  
  Oct 13, 2010
  
  Delete comments
  1. Karl Wright
    You are on your own as far as Eclipse is concerned. The conditional nature of connector builds does not readily lend itself to an Eclipse-style single project.
    
    One route would be to add your connector to the main build.xml, which will then automatically register it in connectors.xml in the example. If you tried this already I am puzzled as to why it did not work; it's the standard way new connectors get added.
    
    Permalink
    
    Oct 13, 2010
    
    Delete comments
    1. Farzad
      
      I'm just getting started to write an output connector. I've got nothing to add yet. I'll interogate the build.xml file to figure out how to setup the source relationships in Eclipse.
      
      Permalink
      
      Oct 13, 2010
      
      Delete comments

Farzad

Confused a bit with NullOutputConnector source. So it seems, the only three methods you need to implement are: addOrReplaceDocument, getOutputDescription, and removeDocument. I see a method called getSession that does nothing and it is being called in addOrReplaceDocument and removeDocument methods of NullOutputConnector. Is that just empty code? Is it save to assume, connect, disconnect are also not needed based on getSession? Was my assumption true about needing to implement only the three methods. At this point, I just want to create code that identifies duplicates. Not worried about security just yet.

Permalink

Karl Wright
getSession() is empty code only in the null output connector. Normally that's where you'd set up your connection to your output target server. The code was left structured that way to demonstrate the proper way of setting up connection sessions. You can't do the actual connection in the connect() method because (as you may or may not have noticed), there is no way to throw an exception from that method.

If you are simply identifying duplicates, then no session is needed, and you can remove connect(), disconnect(), getSession(), etc.
- Permalink
- Oct 14, 2010
- Delete comments
1. Farzad
  So I managed to get my connector built and running. The goals was to create a table in the database and see it populate with some unique string, my name I encountered two problems, the job never ends, my file set was 9. It said 9 processed, but also 6 active, and never showed an end date. Also I thought I created a table in the db called "filedata", but when I browse the database with phpPgAdmin, I don't see that under Schemas\public\Tables. Any thoughts? I will admit I'm struggling to grasp the flow. If there are other reading materials or links you think will remove the fog, please share.
  
  I upload my dupfinder connector source directory, as well as the modules\build.xml file to http://www.farzad.net/manifoldcf/.
  Permalink
  
  Oct 14, 2010
  
  Delete comments
  1. Karl Wright
    I took a quick glance. In order for your table to get created, you need to implement the install() and uninstall() methods of the connector. See the web connector for an example. These methods should call your appropriate table manager methods.
    
    BEFORE you do this you want to unregister your output connector using the appropriate "unregisteroutput" command via shell script. If you don't then when you start up Manifold you will get an exception, because it's trying to tear down a table that doesn't yet exist. It is at registration time that the install() method will be called.
    
    Karl
    
    Permalink
    
    Oct 14, 2010
    
    Delete comments
    1. Farzad
      
      I got an error trying to unregister, I posted that under "How to build and deploy ManifoldCF" where it talks about issuing commands.
      
      I got around the problem by dropping the table and doing a clean build. Now the table is created, however it is empty, no rows. dataManager.insertData is getting called inside addOrReplaceDocument.
      
      The job status says Documents=9, Active=6, and Processed=9 and the job doesn't end. I am returning DOCUMENTSTATUS_ACCEPTED. I start the job manually.
      
      I traced the calls and I see where deinstall and install are called initially. When I start the job, for some reason addOrReplaceDocument is called 45 times, and getOutputDescription is called 48 times. I only have 9 docs, why so many times?
      
      Permalink
      
      Oct 15, 2010
      
      Delete comments
      1. Karl Wright
        
        addOrReplaceDocument is only called when a document is indexed. But if your code throws an exception, the framework code will repeat the process. I suggest you check the output log to see the exception.
        
        Permalink
        
        Oct 15, 2010
        
        Delete comments
        
        Farzad
        
        You were right. My dataManager was not initialized, so it was throwing NullPointer exceptions. After that I had a primary key violation, and I liked how the UI reported that back. Finally, I was using a counter for the primary key and needed to make it static because of the multiple instances of the connector, so that it incremented correctly.
        
        One interesting thing I learnt was that MCF doesn't reprocess the same item twice. I had a failed job that processed two item before stopping. Those two items were not sent again to newer jobs for the same file system path. Only after I deleted the jobs, and created a new one that they were sent again.
        
        Permalink
        
        Oct 18, 2010
        
        Delete comments
        
        Karl Wright
        
        Yes, that's because it is an incremental crawler. You can make it resend everything to your output connection by clicking the link in the output connection's view screen titled something like "recrawl everything for this connection" or some such.
        
        Permalink
        
        Oct 19, 2010
        
        Delete comments

Farzad

Error definging multiple primary keys. So I'm using the org.apache.manifoldcf.core.interfaces.ColumnDescription class to setup the db tables. If I specify two columns to be primary keys then I get the following error. It makes it sound like I can have multiple keys if I define the table correctly, ie. for table XX not allowed, as if allowed for table YY. I didn't see any documentation to help me understand it. Any ideas?

C:\Program Files\Apache\apache-acf\dupfinder\example>java -jar start.jar >> output.txt
Configuration file successfully read
Successfully unregistered all output connectors
Successfully unregistered all authority connectors
Successfully unregistered all repository connectors
Successfully registered output connector 'org.apache.manifoldcf.agents.output.solr.SolrConnector'
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: ERROR: multiple primary keys for table "filedata" are not allowed
at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:421)
at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:449)
at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1091)
at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:167)
at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:586)
at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performCreate(DBInterfacePostgreSQL.java:255)
at org.apache.manifoldcf.core.database.BaseTable.performCreate(BaseTable.java:111)
at org.apache.manifoldcf.agents.output.dupfinder.DataManager.install(DataManager.java:57)
at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.install(DupFinderConnector.java:98)
at org.apache.manifoldcf.agents.interfaces.OutputConnectorFactory.install(OutputConnectorFactory.java:51)
at org.apache.manifoldcf.agents.outputconnmgr.OutputConnectorManager.registerConnector(OutputConnectorManager.java:169)
at org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner.main(ManifoldCFJettyRunner.java:348)
Caused by: org.postgresql.util.PSQLException: ERROR: multiple primary keys for table "filedata" are not allowed
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1548)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1316)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:191)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:452)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:337)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:329)
at org.apache.manifoldcf.core.database.Database.execute(Database.java:526)
at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:381)

C:\Program Files\Apache\apache-acf\dupfinder\example>

Permalink

Farzad
I was looking at the code for DBInterfacePostgreSQL in org.apache.manifoldcf.core.database and looks like it is not setup to handle multiple primary keys. Aren't primary keys suppose to be listed together as a comma separated list?

protected static void appendDescription(StringBuffer queryBuffer, String columnName, ColumnDescription cd, boolean forceNull)
{
queryBuffer.append(columnName);
queryBuffer.append(' ');
queryBuffer.append(mapType(cd.getTypeString()));
if (forceNull || cd.getIsNull())
queryBuffer.append(" NULL");
else
queryBuffer.append(" NOT NULL");
if (cd.getIsPrimaryKey())
queryBuffer.append(" PRIMARY KEY");
if (cd.getReferenceTable() != null)

Unknown macro: { queryBuffer.append(" REFERENCES "); queryBuffer.append(cd.getReferenceTable()); queryBuffer.append('('); queryBuffer.append(cd.getReferenceColumn()); queryBuffer.append(") ON DELETE"); if (cd.getReferenceCascade()) queryBuffer.append(" CASCADE"); else queryBuffer.append(" RESTRICT"); }

}
- Permalink
- Oct 18, 2010
- Delete comments
1. Karl Wright
  The ManifoldCF database abstraction is, of course, a limited one. You can't do everything that you can do in every kind of database there is out there. However, you *can* effectively define the equivalent of a multiple primary key relationship simply by doing the following:
  
  (1) create your columns, labeling NONE of them as "primary key".
  
  (2) create a unique index on the columns you want to be your multiple primary key.
  
  In PostgreSQL this does the same thing as a multiple primary key.
  Permalink
  
  Oct 19, 2010
  
  Delete comments
  1. Farzad
    Having the index would not allow me to use the insert operation to detect the dups. The algorithm is using two columns one for the hashsum, and another for the dup number, where it is incremented with each duplicate. The algorithm would:
    
    1) attempt an insert with hashsum X and dupnum = 1
    2) if I get 23505 SQL exception, then I'll find the largest dup number, and try another insert with one plus the largest.
    
    What do you think of this? Noticed there is a method already called performAlter, however it only operates on columns, add, delete, etc. What do you think if I added another method to allow adding primary key constraints, allow to perform this alter "ALTER TABLE mytable ADD CONSTRAINT _pk PRIMARY KEY (col1, col2, ...).
    
    Permalink
    
    Oct 20, 2010
    
    Delete comments
    1. Karl Wright
      
      Like I said before, creating a unique index creates exactly the same Postgresql constraints as having multiple primary keys. The same (or a very similar) SQL error is returned when the constraint is violated.
      
      Permalink
      
      Oct 21, 2010
      
      Delete comments
      1. Farzad
        
        You were right : ) Here is what you get, when I added:
        
        ArrayList list = new ArrayList();
        list.add(hashsum);
        list.add(dupnum);
        addTableIndex(true, list);
        
        Error: ERROR: duplicate key value violates unique constraint "i1287662986761" Detail: Key (hashsum, dupnum)=(C46875547F6B97BAC41132F6F8A057CC10060FBC69B5B26428D6D561E00AE1F1B1C8BBD5664FC4C94E95A5AC31045C3EAA8AE11DB19A697CC410F3EC9E233D38, 1) already exists.
        
        So I don't see how to isolate the primary key violation. All I see in ManifoldCFException is DATABASE_TRANSACTION_ABORT, which is what I get when I call getErrorCode(). How do I know I have a primary key violation vs other transaction errors?
        
        Permalink
        
        Oct 21, 2010
        
        Delete comments
        
        Karl Wright
        
        The only other transaction error that DATABASE_TRANSACTION_ABORT can represent is deadlock. So unless you are in a transaction you should not see that one.
        
        Permalink
        
        Oct 21, 2010
        
        Delete comments
        
        Farzad
        
        I got it working, thanks. A problem that took a lot of time was the column names. I defined all mine in upper case. For some reason, they are converted to lower case. I was getting an exception thrown, because I was looking for a column value using the original handle.
        
        So is this a bug? If not, I think it should be documented to use lowercase only for db column names.
        
        Another problem I hit was perfroming a query. It seems that I can only pass the value using an array list. Here is my first call that didn't work.
        
        IResultSet result = performQuery("SELECT * FROM " + getTableName() + " WHERE " + hashsum + "=" + hashsumVal, null, null, null);
        
        It gives an error the column name doesn't exist, where it is using the hashsumVal as the column name. See below. I then used the ArrayList method and it worked. Is there something I did wrong in my first attempt?
        
        ArrayList list = new ArrayList();
        list.add(hashsumVal);
        IResultSet result = performQuery("SELECT * FROM " + getTableName() + " WHERE " + hashsum + "=?", list, null, null);
        
        Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: ERROR: column "c46875547f6b97bac41132f6f8a057cc10060fbc69b5b26428d6d561e00ae1f" does not exist
        org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: ERROR: column "c46875547f6b97bac41132f6f8a057cc10060fbc69b5b26428d6d561e00ae1f" does not exist
        at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:421)
        at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:465)
        at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1091)
        at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
        at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:167)
        at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:754)
        at org.apache.manifoldcf.core.database.BaseTable.performQuery(BaseTable.java:229)
        at org.apache.manifoldcf.agents.output.dupfinder.DataManager.insertData(DataManager.java:116)
        at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:74)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1424)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:409)
        at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:304)
        at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1586)
        at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275)
        at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
        Caused by: org.postgresql.util.PSQLException: ERROR: column "c46875547f6b97bac41132f6f8a057cc10060fbc69b5b26428d6d561e00ae1f" does not exist
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1548)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1316)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:191)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:452)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:337)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:329)
        at org.apache.manifoldcf.core.database.Database.execute(Database.java:526)
        at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:381)
        
        Permalink
        
        Oct 21, 2010
        
        Delete comments

Farzad

When is the currentContext handle in BaseConnector valid? I tried to add an ID object in the constructor, so I can track the debug statements and the example wouldn't even start.

Object id = currentContext.get("id");
if (id == null) {
	currentContext.save("id", new Integer(idNum));
	idNum++;
}

Here is the error message

Configuration file successfully read
Exception in thread "main" java.lang.ClassCastException: java.lang.NullPointerException cannot be cast to org.apache.manifoldcf.core.interfaces.ManifoldCFException
at org.apache.manifoldcf.agents.interfaces.OutputConnectorFactory.getConnectorNoCheck(OutputConnectorFactory.java:149)
at org.apache.manifoldcf.agents.interfaces.OutputConnectorFactory.deinstall(OutputConnectorFactory.java:60)
at org.apache.manifoldcf.agents.outputconnmgr.OutputConnectorManager.unregisterConnector(OutputConnectorManager.java:199)
at org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner.main(ManifoldCFJettyRunner.java:256)

Permalink

Farzad
So I figured out an answer, please verify it is correct. It seems thread context is set when a crawl is happening which makes sense. I overwrote the setThreadContext method and assigned an id. It works, is this the right way?
public void setThreadContext(IThreadContext threadContext) { super.setThreadContext(threadContext); if (threadContext != null) { Object id = currentContext.get("id"); if (id == null) { currentContext.save("id", new Integer(idNum)); idNum++; } System.out.println( Thread.currentThread().getStackTrace()[1].getMethodName() + ", id=" + "[" + currentContext.get("id") + "]"); } }
- Permalink
- Oct 21, 2010
- Delete comments
1. Farzad
  While there are no jobs running, right after a restart, and I click on Job Status in the UI, the framework seems to be calling the setThreadContext and followed by the clearThreadContext every so often. Sometimes, it uses an old thread, sometimes a new thread. Is there something I'm not doing to say the connector is done or not active?
  Permalink
  
  Oct 21, 2010
  
  Delete comments
2. Karl Wright
  The thread context is set whenever the connector instance is grabbed by a thread, and must be forgetten when the connector instance is released by that thread. Furthermore, connector instances are pooled, so there is no guarantee that the same instance will be used for subsequent operations within the same crawl.
  
  If you are trying to use the thread context and finding it to be null, you are by definition using it incorrectly.
  Permalink
  
  Oct 23, 2010
  
  Delete comments

Farzad

Any thoughts on what this means and how it can happen? Thanks!

Thread[Worker thread '62',5,main]: invalidateKeys: 1287700759421: org.apache.manifoldcf.core.cachemanager.CacheManager@104b5ae: 
Transaction hash = {1287700759415=org.apache.manifoldcf.core.cachemanager.CacheManager$CacheTransactionHandle@33c78b}
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Bad transaction ID!
	at org.apache.manifoldcf.core.cachemanager.CacheManager.invalidateKeys(CacheManager.java:613)
	at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:175)
	at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:167)
	at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:586)
	at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performInsert(DBInterfacePostgreSQL.java:133)
	at org.apache.manifoldcf.core.database.BaseTable.performInsert(BaseTable.java:76)
	at org.apache.manifoldcf.agents.output.dupfinder.DataManager.insertData(DataManager.java:121)
	at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:78)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1424)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:409)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:304)
	at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1586)
	at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275)
	at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
Exception tossed: Bad transaction ID!
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Bad transaction ID!
	at org.apache.manifoldcf.core.cachemanager.CacheManager.invalidateKeys(CacheManager.java:613)
	at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:175)
	at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:167)
	at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:586)
	at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performInsert(DBInterfacePostgreSQL.java:133)
	at org.apache.manifoldcf.core.database.BaseTable.performInsert(BaseTable.java:76)
	at org.apache.manifoldcf.agents.output.dupfinder.DataManager.insertData(DataManager.java:121)
	at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:78)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1424)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:409)
	at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:304)
	at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1586)
	at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275)
	at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)

Permalink

Farzad

This one also pops up for some reason too and then the crawl really gets slow, or in this case MCF just got locked up. Had to stop and restart the service to even terminate the job.

Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: ERROR: 
current transaction is aborted, commands ignored until end of transaction block
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: ERROR: current 
transaction is aborted, commands ignored until end of transaction block
 at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:421)
 at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:449)
 at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1091)
 at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
 at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:167)
 at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:586)
 at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performInsert(DBInterfacePostgreSQL.java:133)
 at org.apache.manifoldcf.core.database.BaseTable.performInsert(BaseTable.java:76)
 at org.apache.manifoldcf.agents.output.dupfinder.DataManager.insertData(DataManager.java:121)
 at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:78)
 at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1424)
 at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:409)
 at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:304)
 at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1586)
 at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275)
 at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
 at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
Caused by: org.postgresql.util.PSQLException: ERROR: current transaction is aborted, commands ignored until end of transaction block
 at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1548)
 at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1316)
 at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:191)
 at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:452)
 at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:351)
 at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:305)
 at org.apache.manifoldcf.core.database.Database.execute(Database.java:566)
 at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:381)

Permalink

Karl Wright
It sounds like you are performing transactions within your connector and not structuring them properly. All transactions MUST be structured using the following paradigm, which you should see in the web connector classes I pointed you towards earlier:

begin transaction

try

{

...

}

catch (any exceptions)

{

signalrollback

rethrow exception

}

finally

{

endtransaction

}
- Permalink
- Oct 23, 2010
- Delete comments
1. Farzad
  Thanks for your answer. I tried that way first and ran into problems. I have to execute multiple txns depending if a dup is found. So I just tried two try catch blocks using a flag, instead of the nested version I first tried. I'm getting these errors now:
  
  Thread[Worker thread '44',5,main]: startTransaction: org.apache.manifoldcf.core.cachemanager.CacheManager@bc83c3: Transaction hash = {} org.apache.manifoldcf.core.interfaces.ManifoldCFException: Illegal parent transaction ID: 1287833203886 at org.apache.manifoldcf.core.cachemanager.CacheManager.startTransaction(CacheManager.java:687) at org.apache.manifoldcf.core.database.Database.beginTransaction(Database.java:204) at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.beginTransaction(DBInterfacePostgreSQL.java:995) at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.beginTransaction(DBInterfacePostgreSQL.java:966) at org.apache.manifoldcf.core.database.BaseTable.beginTransaction(BaseTable.java:258) at org.apache.manifoldcf.agents.output.dupfinder.DataManager.insertData(DataManager.java:120) at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:78) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1424) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:409) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:304) at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1586) at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275) at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423) at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585) Thread[Worker thread '21',5,main]: invalidateKeys: 1287833215335: org.apache.manifoldcf.core.cachemanager.CacheManager@1dbe8b: Transaction hash = {1287833215337=org.apache.manifoldcf.core.cachemanager.CacheManager$CacheTransactionHandle@1a7f162, 1287833197104=org.apache.manifoldcf.core.cachemanager.CacheManager$CacheTransactionHandle@c8b88f} org.apache.manifoldcf.core.interfaces.ManifoldCFException: Bad transaction ID! at org.apache.manifoldcf.core.cachemanager.CacheManager.invalidateKeys(CacheManager.java:613) at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:175) at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:167) at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:586) at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performInsert(DBInterfacePostgreSQL.java:133) at org.apache.manifoldcf.core.database.BaseTable.performInsert(BaseTable.java:76) at org.apache.manifoldcf.agents.output.dupfinder.DataManager.insertData(DataManager.java:122) at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:78) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1424) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:409) at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:304) at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1586) at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275) at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423) at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
  
  Here is the code I'm executing. Also what is invalidate keys used for on performInsert, the second parm?
  
  beginTransaction(); try { performInsert(map, null); } catch (ManifoldCFException e) { // According the Karl, the only two reasons a ManifoldCF exception is thrown // after a performInsert is either constraint violation or dead lock situation. // He continued to say that unless you are in the middle of transaction, you will // only encounter the constraint violation. signalRollback(); if (e.getErrorCode() == ManifoldCFException.DATABASE_TRANSACTION_ABORT) { isDuplicate = true; } else { // We got a different error code, that needs to be addressed. throw e; } } catch (Error e) { signalRollback(); throw e; } finally { endTransaction(); } if (isDuplicate) { System.out.println("[" + currentContext.get("id") + "] Duplicate found, retrying with newDupNum"); ArrayList list = new ArrayList(); list.add(hashsumVal); IResultSet result = performQuery("SELECT max(" + dupnum + ") FROM " + getTableName() + " WHERE " + hashsum + "=?", list, null, null); if (result.getRowCount() == 1) { System.out.println("[" + currentContext.get("id") + "] " + "Found the highest dup number, result set contains " + result.getRowCount() + " row"); IResultRow row = result.getRow(0); Integer oldDupNum = (Integer) row.getValue("max"); int newDupNum = oldDupNum.intValue() + 1; System.out.println("[" + currentContext.get("id") + "] " + "oldDupNum=" + oldDupNum + ", newDupNum=" + newDupNum); map.put(dupnum, new Integer(newDupNum)); beginTransaction(); try { performInsert(map, null); } catch (ManifoldCFException f) { signalRollback(); throw f; } catch (Error f) { signalRollback(); throw f; } finally { endTransaction(); } } else { // This case happens when either no rows or more than one row is returned for the // query. It should never happen, because we are looking for a max and the fact // that the initial insert failed says there is at least one row with a value. } }
  Permalink
  
  Oct 23, 2010
  
  Delete comments
  1. Karl Wright
    (1) I think you misunderstood. There is no need for a transaction. In fact, I believe I said, "as long as the insert is OUTSIDE of a transaction, there is no ambiguity".
    
    (2) What is the cache key you are using for your table queries? It seems to be null. So an invalidation cache key of null is appropriate for the insertion.
    
    Permalink
    
    Oct 23, 2010
    
    Delete comments
    1. Farzad
      
      Thanks for the re-clarification in item 1. As far as cache keys, the CookieManager example didn't use one during its performInsert so I didn't either. However going back through the code, I see the ICacheManager object that it sets up in the constructor. It then gets an ICacheHandle using "COOKIES_" + <db column name> and before the endTransaction() it calls ICacheManager.invalidateKeys passing the ICacheHandle.
      
      I don't understand what this doing or its purpose to apply it to my case. Is cache keys a general concept I'm missing?
      
      Permalink
      
      Oct 23, 2010
      
      Delete comments
      1. Karl Wright
        
        I would suggest buying the book when it comes out before you attempt to work with caching. Right now you are effectively querying with caching disabled, and that should be ok for your purposes.
        
        Permalink
        
        Oct 23, 2010
        
        Delete comments
        
        Farzad
        
        Fair enough, if caching is disabled then why am I getting these exceptions. When these exceptions are thrown, the crawler becomes unstable, stops, and can't even abort the job without issuing a shutdown (ctrl-c). Also there was a database reset exception as well, look at the last post I made on the 21st.
        
        Permalink
        
        Oct 23, 2010
        
        Delete comments
        
        Karl Wright
        
        I cannot debug your code for you. You will have to do that yourself. You have seen that the web connector performs updates and inserts and queries all over the place. That connector works just fine without any of the kinds of errors you are reporting, so you must be doing something incorrect.
        
        A database reset occurs when there is any kind of database error, including bad queries or other syntactical errors on your part. I strongly suggest debugging your code.
        
        Permalink
        
        Oct 23, 2010
        
        Delete comments
        
        Farzad
        
        The web connector uses a cachemanager that I'm not and agree don't need it. I'll dig more. It crawls 7K or so before crashing, so if I'm doing something wrong, why wouldn't it fail from the start or much earlier. Perhaps the framework doesn't like it if you have db tables without a cachemanager. I'll look more still, was just looking for ideas, not debug my code
        
        Permalink
        
        Oct 23, 2010
        
        Delete comments
        
        Karl Wright
        
        I'd suggest looking very carefully at how you maintain your manager table reference from the main connector class. If you look at WebConnector, you will note that it goes out of its way to avoid keeping around any IThreadContext object, or anything that was "made" using it, beyond the scope of the setThreadContext(something)/setThreadContext(null) interval. If you do wind up inadvertantly persisting something, you've effectively linked it across threads, which will cause obviously unpredictable results.
        
        Permalink
        
        Oct 23, 2010
        
        Delete comments
        
        Farzad
        
        And if something is linked across threads, the cache manager is needed to retrieve the objects to operate on when needed, since I don't have I one, I get an error.
        
        When is the book coming out?
        
        Permalink
        
        Oct 24, 2010
        
        Delete comments
        
        Karl Wright
        
        It should start to be available through the Manning Early Access Program starting in about March or April.
        
        Permalink
        
        Oct 24, 2010
        
        Delete comments
        
        Farzad
        
        I seem to had two major problems. I was caching the thread context, I cleaned them all up. The other one is a race condition problem. Two threads are trying to do an insert, both fail, they do the second query and find the same max number. The last txn will fail. So I made the second part of my algorithm synchronized. Same thing happen on the first insert. About to make the whole insert method synchronized. Before doing that I was wondering if you have other suggestion about solving this problem. It seems making it synchronized defeats having multiple threads. My other choice would be to keep another table with the highest dup numbers and lock it out during a write operation. Thoughts?
        
        Permalink
        
        Oct 24, 2010
        
        Delete comments
        
        Karl Wright
        
        I would use the IDFactory.make() method to generate a unique identifier. See the class org.apache.manifoldcf.core.interfaces.IDFactory.
        
        Permalink
        
        Oct 24, 2010
        
        Delete comments
        
        Farzad
        
        Unique ID for what? Synchronized didn't work, cause I don't have the multiple threads one object problem. I used a retry loop to solve the problem.
        
        It crawled 47093 files before reporting this error. I think there is something not going well in the db layer. This exception didn't even get to my code.
        
        Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: ERROR: unexpected chunk number 2 (expected 0) for toast value 34938 in pg_toast_2619
        org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: ERROR: unexpected chunk number 2 (expected 0) for toast value 34938 in pg_toast_2619
        at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:421)
        at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:449)
        at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1091)
        at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
        at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:167)
        at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:754)
        at org.apache.manifoldcf.crawler.jobs.JobManager.processParentHashSet(JobManager.java:3609)
        at org.apache.manifoldcf.crawler.jobs.JobManager.calculateAffectedRestoreCarrydownChildren(JobManager.java:3578)
        at org.apache.manifoldcf.crawler.jobs.JobManager.finishDocuments(JobManager.java:3503)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:591)
        Caused by: org.postgresql.util.PSQLException: ERROR: unexpected chunk number 2 (expected 0) for toast value 34938 in pg_toast_2619
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1548)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1316)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:191)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:452)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:351)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:255)
        at org.apache.manifoldcf.core.database.Database.execute(Database.java:552)
        at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:381)
        
        Permalink
        
        Oct 25, 2010
        
        Delete comments
        
        Karl Wright
        
        That problem seems to be a bug in PostgreSQL itself - or maybe the JDBC driver is incompatible with your version. What version of Postgres are you running?
        
        Permalink
        
        Oct 25, 2010
        
        Delete comments
        
        Farzad
        
        PostgreSQL 9.0, the only that had a 64 bit windows driver. I could down grade my OS and install the older version, or could use me as guinea pig to test this one out.
        
        Permalink
        
        Oct 25, 2010
        
        Delete comments
        
        Farzad
        
        I just checked and PostgreSQL is running in 32 bit on the OS, not 64.
        
        Permalink
        
        Oct 25, 2010
        
        Delete comments
        
        Karl Wright
        
        You may need to download the compatible JDBC driver from Postgresql also, and replace the one in dist/examples/lib with the new one. Or, you could replace the one in modules/lib and rebuild - up to you.
        
        Permalink
        
        Oct 25, 2010
        
        Delete comments

Anonymous

Link is broken.

Permalink

Child pages

46 Comments

Anonymous