This page has been deprecated!
Go to http://manifoldcf.apache.org/release/trunk/en_US/writing-output-connectors.html instead.
46 Comments
Farzad
I'd like to create an output connector that can detect duplicate files (documents) in the repository. My thought is to compute a SHA-512 hashsum for each document and store it in a table as the primary key. A subsequent insert with the same value would then be rejected, indicating a duplicate.
I read that connectors are allowed to have their own tables. What would the table name be? Do I issue SQL over JDBC to modify table data? How do you establish a connection to the db? Any other thoughts about my approach?
Thanks!
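A minimal sketch of the hashing step described above, assuming the document content is available as an InputStream. The class and method names (HashSum, sha512Hex) are hypothetical and not part of ManifoldCF; only java.security.MessageDigest is a real API here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of ManifoldCF.
final class HashSum {
  private HashSum() {}

  /** Streams the content through SHA-512 and returns 128 uppercase hex characters. */
  static String sha512Hex(InputStream in) {
    MessageDigest md;
    try {
      md = MessageDigest.getInstance("SHA-512");
    } catch (NoSuchAlgorithmException e) {
      // Not expected with a standard JDK; surface it as a configuration error.
      throw new IllegalStateException(e);
    }
    byte[] buf = new byte[8192];
    try {
      int n;
      while ((n = in.read(buf)) != -1) {
        md.update(buf, 0, n);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    StringBuilder sb = new StringBuilder(128);
    for (byte b : md.digest()) {
      sb.append(String.format("%02X", b));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    byte[] doc = "hello".getBytes(StandardCharsets.UTF_8);
    System.out.println(sha512Hex(new ByteArrayInputStream(doc)));
  }
}
```

Two documents with identical bytes produce the same string, so a uniqueness constraint on that column is enough to flag the second one.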
Karl Wright
The best answer I can give right at the moment is to look at what the web connector does for tables. You will find that it creates two of them. Follow the same conventions and you should be OK. Naming is up to you, but obviously a good practice is to use a prefix that is likely to keep your table name from colliding with other tables from other connectors, or from future versions of the framework. If or when there is a ManifoldCF book, be assured that this topic will be covered in depth.
Farzad
Thanks for the pointer; I didn't know I'd be naming them. Any info on setting up the project in Eclipse to build connectors? I tried using the build file and it didn't work. I also did a project checkout, and that just created a general project. Should I just pull the source I need and resolve the linkage issues? Is that the way?
Karl Wright
You are on your own as far as Eclipse is concerned. The conditional nature of connector builds does not readily lend itself to an Eclipse-style single project.
One route would be to add your connector to the main build.xml, which will then automatically register it in connectors.xml in the example. If you tried this already I am puzzled as to why it did not work; it's the standard way new connectors get added.
Farzad
I'm just getting started writing an output connector. I've got nothing to add yet. I'll interrogate the build.xml file to figure out how to set up the source relationships in Eclipse.
Farzad
I'm a bit confused by the NullOutputConnector source. It seems the only three methods you need to implement are addOrReplaceDocument, getOutputDescription, and removeDocument. I see a method called getSession that does nothing, and it is called in the addOrReplaceDocument and removeDocument methods of NullOutputConnector. Is that just empty code? Is it safe to assume that connect and disconnect are also not needed, based on getSession? Is my assumption correct that I need to implement only the three methods? At this point, I just want to create code that identifies duplicates. I'm not worried about security just yet.
Karl Wright
getSession() is empty code only in the null output connector. Normally that's where you'd set up your connection to your output target server. The code was left structured that way to demonstrate the proper way of setting up connection sessions. You can't do the actual connection in the connect() method because (as you may or may not have noticed), there is no way to throw an exception from that method.
If you are simply identifying duplicates, then no session is needed, and you can remove connect(), disconnect(), getSession(), etc.
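The connect()/getSession() split described above can be sketched roughly as follows. Every name here (LazySessionConnector, ConnectorException, Session, the "url" parameter) is a hypothetical stand-in rather than the ManifoldCF API; the sketch only illustrates deferring the real connection to a method that is able to report failure.

```java
import java.util.HashMap;
import java.util.Map;

// All names below are hypothetical stand-ins, not the ManifoldCF API.
final class LazySessionConnector {
  /** Checked exception standing in for the framework's own exception type. */
  static final class ConnectorException extends Exception {
    ConnectorException(String msg) { super(msg); }
  }

  /** Stand-in for a live connection to the output target. */
  static final class Session {
    final String url;
    Session(String url) { this.url = url; }
  }

  private Map<String, String> params;
  private Session session;

  /** Declares no exception, so it only records the configuration. */
  void connect(Map<String, String> configParams) {
    this.params = new HashMap<>(configParams);
    this.session = null;
  }

  /** Opens the session on first use, where a failure can be reported. */
  Session getSession() throws ConnectorException {
    if (session == null) {
      String url = (params == null) ? null : params.get("url");
      if (url == null || url.isEmpty()) {
        throw new ConnectorException("Missing 'url' parameter");
      }
      session = new Session(url);
    }
    return session;
  }

  /** Drops the session and the recorded configuration. */
  void disconnect() {
    session = null;
    params = null;
  }
}
```

Methods such as addOrReplaceDocument would call getSession() first, so a bad configuration surfaces as an exception on the first document rather than being silently swallowed at connect time.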
Farzad
So I managed to get my connector built and running. The goal was to create a table in the database and see it populate with some unique string, my name. I encountered two problems. First, the job never ends: my file set was 9 files, and the status said 9 processed but also 6 active, and it never showed an end date. Second, I thought I created a table in the db called "filedata", but when I browse the database with phpPgAdmin, I don't see it under Schemas\public\Tables. Any thoughts? I will admit I'm struggling to grasp the flow. If there are other reading materials or links you think will remove the fog, please share.
I uploaded my dupfinder connector source directory, as well as the modules\build.xml file, to http://www.farzad.net/manifoldcf/.
Karl Wright
I took a quick glance. In order for your table to get created, you need to implement the install() and deinstall() methods of the connector. See the web connector for an example. These methods should call your appropriate table manager methods.
BEFORE you do this you want to unregister your output connector using the appropriate "unregisteroutput" command via shell script. If you don't then when you start up Manifold you will get an exception, because it's trying to tear down a table that doesn't yet exist. It is at registration time that the install() method will be called.
Karl
Farzad
I got an error trying to unregister, I posted that under "How to build and deploy ManifoldCF" where it talks about issuing commands.
I got around the problem by dropping the table and doing a clean build. Now the table is created, however it is empty, no rows. dataManager.insertData is getting called inside addOrReplaceDocument.
The job status says Documents=9, Active=6, and Processed=9 and the job doesn't end. I am returning DOCUMENTSTATUS_ACCEPTED. I start the job manually.
I traced the calls and I see where deinstall and install are called initially. When I start the job, for some reason addOrReplaceDocument is called 45 times, and getOutputDescription is called 48 times. I only have 9 docs, why so many times?
Karl Wright
addOrReplaceDocument is only called when a document is indexed. But if your code throws an exception, the framework code will repeat the process. I suggest you check the output log to see the exception.
Farzad
You were right. My dataManager was not initialized, so it was throwing NullPointer exceptions. After that I had a primary key violation, and I liked how the UI reported that back. Finally, I was using a counter for the primary key and needed to make it static because of the multiple instances of the connector, so that it incremented correctly.
One interesting thing I learnt was that MCF doesn't reprocess the same item twice. I had a failed job that processed two items before stopping. Those two items were not sent again to newer jobs for the same file system path. Only after I deleted the jobs and created a new one were they sent again.
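On the static counter mentioned above: because several connector instances can run on different worker threads, a plain static int could still hand out the same value twice. A hedged sketch of a thread-safe variant follows; RowIds and drawConcurrently are hypothetical names used only for illustration, and the counter restarts at 1 whenever the JVM restarts, so it only suits throwaway test tables.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: one counter per JVM, shared by every connector instance.
final class RowIds {
  private static final AtomicLong NEXT = new AtomicLong(1);

  private RowIds() {}

  /** Atomically returns the next id; safe to call from any thread. */
  static long next() {
    return NEXT.getAndIncrement();
  }

  /** Demo helper: draws ids from several threads and returns the distinct values seen. */
  static Set<Long> drawConcurrently(int threads, int perThread) {
    Set<Long> seen = ConcurrentHashMap.newKeySet();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        for (int j = 0; j < perThread; j++) {
          seen.add(next());
        }
      });
      workers[i].start();
    }
    for (Thread t : workers) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return seen;
  }
}
```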
Karl Wright
Yes, that's because it is an incremental crawler. You can make it resend everything to your output connection by clicking the link in the output connection's view screen titled something like "recrawl everything for this connection" or some such.
Farzad
Error defining multiple primary keys. I'm using the org.apache.manifoldcf.core.interfaces.ColumnDescription class to set up the db tables. If I specify two columns as primary keys, I get the following error. The message makes it sound as though I could have multiple keys if I defined the table correctly, i.e. "for table XX not allowed", as if it were allowed for table YY. I didn't see any documentation to help me understand it. Any ideas?
C:\Program Files\Apache\apache-acf\dupfinder\example>java -jar start.jar >> output.txt
Configuration file successfully read
Successfully unregistered all output connectors
Successfully unregistered all authority connectors
Successfully unregistered all repository connectors
Successfully registered output connector 'org.apache.manifoldcf.agents.output.solr.SolrConnector'
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: ERROR: multiple primary keys for table "filedata" are not allowed
at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:421)
at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:449)
at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1091)
at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:167)
at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:586)
at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performCreate(DBInterfacePostgreSQL.java:255)
at org.apache.manifoldcf.core.database.BaseTable.performCreate(BaseTable.java:111)
at org.apache.manifoldcf.agents.output.dupfinder.DataManager.install(DataManager.java:57)
at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.install(DupFinderConnector.java:98)
at org.apache.manifoldcf.agents.interfaces.OutputConnectorFactory.install(OutputConnectorFactory.java:51)
at org.apache.manifoldcf.agents.outputconnmgr.OutputConnectorManager.registerConnector(OutputConnectorManager.java:169)
at org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner.main(ManifoldCFJettyRunner.java:348)
Caused by: org.postgresql.util.PSQLException: ERROR: multiple primary keys for table "filedata" are not allowed
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1548)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1316)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:191)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:452)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:337)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:329)
at org.apache.manifoldcf.core.database.Database.execute(Database.java:526)
at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:381)
C:\Program Files\Apache\apache-acf\dupfinder\example>
Farzad
I was looking at the code for DBInterfacePostgreSQL in org.apache.manifoldcf.core.database, and it looks like it is not set up to handle multiple primary keys. Aren't primary keys supposed to be listed together as a comma-separated list?
protected static void appendDescription(StringBuffer queryBuffer, String columnName, ColumnDescription cd, boolean forceNull)
{
  queryBuffer.append(columnName);
  queryBuffer.append(' ');
  queryBuffer.append(mapType(cd.getTypeString()));
  if (forceNull || cd.getIsNull())
    queryBuffer.append(" NULL");
  else
    queryBuffer.append(" NOT NULL");
  if (cd.getIsPrimaryKey())
    queryBuffer.append(" PRIMARY KEY");
  if (cd.getReferenceTable() != null)
    ...
}
Karl Wright
The ManifoldCF database abstraction is, of course, a limited one. You can't do everything that you can do in every kind of database there is out there. However, you *can* effectively define the equivalent of a multiple primary key relationship simply by doing the following:
(1) create your columns, labeling NONE of them as "primary key".
(2) create a unique index on the columns you want to be your multiple primary key.
In PostgreSQL this does the same thing as a multiple primary key.
Farzad
Having the index would not allow me to use the insert operation to detect the dups. The algorithm is using two columns one for the hashsum, and another for the dup number, where it is incremented with each duplicate. The algorithm would:
1) attempt an insert with hashsum X and dupnum = 1
2) if I get 23505 SQL exception, then I'll find the largest dup number, and try another insert with one plus the largest.
What do you think of this? I noticed there is already a method called performAlter, but it only operates on columns (add, delete, etc.). What do you think about adding another method that allows adding primary key constraints, so that it could perform an alter such as "ALTER TABLE mytable ADD CONSTRAINT _pk PRIMARY KEY (col1, col2, ...)"?
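The two-step algorithm in the comment above can be sketched against an in-memory stand-in for the table. DupTable and its methods are hypothetical; the HashSet plays the role of a uniqueness constraint over (hashsum, dupnum), and a false return from insert corresponds to the database raising SQLSTATE 23505 (unique_violation).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory stand-in for the "filedata" table; not ManifoldCF code.
final class DupTable {
  // Models a uniqueness constraint over the pair (hashsum, dupnum).
  private final Set<String> rows = new HashSet<>();

  /** Returns false where the database would raise SQLSTATE 23505. */
  boolean insert(String hashsum, int dupnum) {
    return rows.add(hashsum + ":" + dupnum);
  }

  /** Equivalent of SELECT MAX(dupnum) for one hashsum; 0 when no row exists. */
  int maxDupnum(String hashsum) {
    String prefix = hashsum + ":";
    int max = 0;
    for (String row : rows) {
      if (row.startsWith(prefix)) {
        max = Math.max(max, Integer.parseInt(row.substring(prefix.length())));
      }
    }
    return max;
  }

  /** Steps 1 and 2 from the comment; returns the dupnum that was stored. */
  int record(String hashsum) {
    if (insert(hashsum, 1)) {
      return 1; // first sighting of this content
    }
    int next = maxDupnum(hashsum) + 1;
    if (!insert(hashsum, next)) {
      // Only reachable if another writer claimed the same number in between.
      throw new IllegalStateException("dupnum " + next + " already taken");
    }
    return next;
  }

  public static void main(String[] args) {
    DupTable table = new DupTable();
    System.out.println(table.record("AA")); // first copy
    System.out.println(table.record("AA")); // a duplicate
  }
}
```

A returned dupnum greater than 1 means the document is a duplicate of something already seen.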
Karl Wright
Like I said before, creating a unique index creates exactly the same Postgresql constraints as having multiple primary keys. The same (or a very similar) SQL error is returned when the constraint is violated.
Farzad
You were right : ) Here is what you get, when I added:
ArrayList list = new ArrayList();
list.add(hashsum);
list.add(dupnum);
addTableIndex(true, list);
Error: ERROR: duplicate key value violates unique constraint "i1287662986761" Detail: Key (hashsum, dupnum)=(C46875547F6B97BAC41132F6F8A057CC10060FBC69B5B26428D6D561E00AE1F1B1C8BBD5664FC4C94E95A5AC31045C3EAA8AE11DB19A697CC410F3EC9E233D38, 1) already exists.
So I don't see how to isolate the primary key violation. All I see in ManifoldCFException is DATABASE_TRANSACTION_ABORT, which is what I get when I call getErrorCode(). How do I know I have a primary key violation vs other transaction errors?
Karl Wright
The only other transaction error that DATABASE_TRANSACTION_ABORT can represent is deadlock. So unless you are in a transaction you should not see that one.
Farzad
I got it working, thanks. A problem that took a lot of time was the column names. I defined all mine in upper case. For some reason, they are converted to lower case. I was getting an exception thrown, because I was looking for a column value using the original handle.
So is this a bug? If not, I think it should be documented to use lowercase only for db column names.
Another problem I hit was performing a query. It seems that I can only pass the value using an array list. Here is my first call, which didn't work.
IResultSet result = performQuery("SELECT * FROM " + getTableName() + " WHERE " + hashsum + "=" + hashsumVal, null, null, null);
It gives an error saying the column doesn't exist, and it is using the hashsumVal as the column name. See below. I then used the ArrayList method and it worked. Is there something I did wrong in my first attempt?
ArrayList list = new ArrayList();
list.add(hashsumVal);
IResultSet result = performQuery("SELECT * FROM " + getTableName() + " WHERE " + hashsum + "=?", list, null, null);
Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: ERROR: column "c46875547f6b97bac41132f6f8a057cc10060fbc69b5b26428d6d561e00ae1f" does not exist
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: ERROR: column "c46875547f6b97bac41132f6f8a057cc10060fbc69b5b26428d6d561e00ae1f" does not exist
at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:421)
at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:465)
at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1091)
at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:167)
at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:754)
at org.apache.manifoldcf.core.database.BaseTable.performQuery(BaseTable.java:229)
at org.apache.manifoldcf.agents.output.dupfinder.DataManager.insertData(DataManager.java:116)
at org.apache.manifoldcf.agents.output.dupfinder.DupFinderConnector.addOrReplaceDocument(DupFinderConnector.java:74)
at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1424)
at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:409)
at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:304)
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1586)
at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:275)
at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
Caused by: org.postgresql.util.PSQLException: ERROR: column "c46875547f6b97bac41132f6f8a057cc10060fbc69b5b26428d6d561e00ae1f" does not exist
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1548)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1316)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:191)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:452)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:337)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:329)
at org.apache.manifoldcf.core.database.Database.execute(Database.java:526)
at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:381)
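A likely explanation for the failed first attempt above: the hash value is concatenated into the SQL text without single quotes, so PostgreSQL parses it as an identifier (a column name) rather than a string literal. PostgreSQL also folds unquoted identifiers to lower case, which matches the lower-cased name in the error and is probably why the upper-case column names came back lower-cased as well. The sketch below only builds the two SQL strings to show the difference; HashQuery and its methods are hypothetical names, and the table and column names are taken from the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that only builds SQL text; it does not talk to a database.
final class HashQuery {
  private HashQuery() {}

  /** The failing form: the value lands in the SQL text as a bare token. */
  static String concatenated(String table, String column, String value) {
    return "SELECT * FROM " + table + " WHERE " + column + "=" + value;
  }

  /** The working form: the SQL text carries a placeholder instead of the value. */
  static String parameterized(String table, String column) {
    return "SELECT * FROM " + table + " WHERE " + column + "=?";
  }

  /** The value travels separately, so the driver treats it as data. */
  static List<Object> params(String value) {
    List<Object> list = new ArrayList<>();
    list.add(value);
    return list;
  }

  public static void main(String[] args) {
    String hash = "C46875547F6B";
    // The hash sits where the parser expects a column name or expression.
    System.out.println(concatenated("filedata", "hashsum", hash));
    // The hash is not part of the SQL text at all.
    System.out.println(parameterized("filedata", "hashsum") + "  " + params(hash));
  }
}
```

Passing the value through the parameter list also avoids having to escape it by hand.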
Farzad
When is the currentContext handle in BaseConnector valid? I tried to add an ID object in the constructor, so I can track the debug statements and the example wouldn't even start.
Configuration file successfully read
Exception in thread "main" java.lang.ClassCastException: java.lang.NullPointerException cannot be cast to org.apache.manifoldcf.core.interfaces.ManifoldCFException
at org.apache.manifoldcf.agents.interfaces.OutputConnectorFactory.getConnectorNoCheck(OutputConnectorFactory.java:149)
at org.apache.manifoldcf.agents.interfaces.OutputConnectorFactory.deinstall(OutputConnectorFactory.java:60)
at org.apache.manifoldcf.agents.outputconnmgr.OutputConnectorManager.unregisterConnector(OutputConnectorManager.java:199)
at org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner.main(ManifoldCFJettyRunner.java:256)
Farzad
So I figured out an answer; please verify it is correct. It seems the thread context is set when a crawl is happening, which makes sense. I overrode the setThreadContext method and assigned an id there. It works. Is this the right way?
Farzad
While there are no jobs running, right after a restart, and I click on Job Status in the UI, the framework seems to be calling the setThreadContext and followed by the clearThreadContext every so often. Sometimes, it uses an old thread, sometimes a new thread. Is there something I'm not doing to say the connector is done or not active?
Karl Wright
The thread context is set whenever the connector instance is grabbed by a thread, and must be forgotten when the connector instance is released by that thread. Furthermore, connector instances are pooled, so there is no guarantee that the same instance will be used for subsequent operations within the same crawl.
If you are trying to use the thread context and finding it to be null, you are by definition using it incorrectly.
Farzad
Any thoughts on what this means and how it can happen? Thanks!
Farzad
This one also pops up for some reason too and then the crawl really gets slow, or in this case MCF just got locked up. Had to stop and restart the service to even terminate the job.
Karl Wright
It sounds like you are performing transactions within your connector and not structuring them properly. All transactions MUST be structured using the following paradigm, which you should see in the web connector classes I pointed you towards earlier:
begin transaction
try
{
  ...
}
catch (any exceptions)
{
  signalrollback
  rethrow exception
}
finally
{
  endtransaction
}
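The paradigm above could be expressed in Java roughly as follows. This is a sketch only: the Tx interface, the Work callback, and RecordingTx are hypothetical, with method names chosen to mirror the pseudocode; check the web connector's table classes for the real calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper that mirrors the begin / rollback / end structure.
final class TxPattern {
  private TxPattern() {}

  /** Hypothetical transaction handle; names follow the pseudocode above. */
  interface Tx {
    void beginTransaction();
    void signalRollback();
    void endTransaction();
  }

  /** The database work to run inside the transaction. */
  interface Work {
    void run() throws Exception;
  }

  static void inTransaction(Tx tx, Work work) throws Exception {
    tx.beginTransaction();
    try {
      work.run();
    } catch (Exception | Error e) {
      // Mark the transaction for rollback, then let the caller see the failure.
      tx.signalRollback();
      throw e;
    } finally {
      // Always runs, so the transaction is never left open.
      tx.endTransaction();
    }
  }

  /** Test double that records the order of calls. */
  static final class RecordingTx implements Tx {
    final List<String> calls = new ArrayList<>();
    public void beginTransaction() { calls.add("begin"); }
    public void signalRollback() { calls.add("rollback"); }
    public void endTransaction() { calls.add("end"); }
  }
}
```

The important property is that endTransaction runs on every path, and signalRollback runs before it whenever the body fails.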
Farzad
Thanks for your answer. I tried that way first and ran into problems. I have to execute multiple txns depending if a dup is found. So I just tried two try catch blocks using a flag, instead of the nested version I first tried. I'm getting these errors now:
Here is the code I'm executing. Also, what is the second parameter of performInsert, the invalidate keys, used for?
Karl Wright
(1) I think you misunderstood. There is no need for a transaction. In fact, I believe I said, "as long as the insert is OUTSIDE of a transaction, there is no ambiguity".
(2) What is the cache key you are using for your table queries? It seems to be null. So an invalidation cache key of null is appropriate for the insertion.
Farzad
Thanks for the re-clarification in item 1. As far as cache keys, the CookieManager example didn't use one during its performInsert so I didn't either. However going back through the code, I see the ICacheManager object that it sets up in the constructor. It then gets an ICacheHandle using "COOKIES_" + <db column name> and before the endTransaction() it calls ICacheManager.invalidateKeys passing the ICacheHandle.
I don't understand what this is doing or how its purpose applies to my case. Are cache keys a general concept I'm missing?
Karl Wright
I would suggest buying the book when it comes out before you attempt to work with caching. Right now you are effectively querying with caching disabled, and that should be ok for your purposes.
Farzad
Fair enough, but if caching is disabled then why am I getting these exceptions? When these exceptions are thrown, the crawler becomes unstable and stops, and I can't even abort the job without issuing a shutdown (ctrl-c). There was also a database reset exception; look at the last post I made on the 21st.
Karl Wright
I cannot debug your code for you. You will have to do that yourself. You have seen that the web connector performs updates and inserts and queries all over the place. That connector works just fine without any of the kinds of errors you are reporting, so you must be doing something incorrect.
A database reset occurs when there is any kind of database error, including bad queries or other syntactical errors on your part. I strongly suggest debugging your code.
Farzad
The web connector uses a cachemanager, which I'm not using, and I agree I don't need it. I'll dig more. It crawls 7K or so documents before crashing, so if I'm doing something wrong, why wouldn't it fail from the start or much earlier? Perhaps the framework doesn't like it if you have db tables without a cachemanager. I'll keep looking; I was just looking for ideas, not for someone to debug my code.
Karl Wright
I'd suggest looking very carefully at how you maintain your manager table reference from the main connector class. If you look at WebConnector, you will note that it goes out of its way to avoid keeping around any IThreadContext object, or anything that was "made" using it, beyond the scope of the setThreadContext(something)/setThreadContext(null) interval. If you do wind up inadvertently persisting something, you've effectively linked it across threads, which will cause obviously unpredictable results.
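A rough sketch of the scoping rule described above: anything built from the thread context is created when the context is set and dropped when it is cleared. ScopedConnector, ThreadContext, and TableManager are hypothetical stand-ins for the connector, IThreadContext, and a table manager class; the setThreadContext/clearThreadContext method names follow the ones mentioned earlier in this thread.

```java
// Hypothetical stand-ins; not the ManifoldCF classes themselves.
final class ScopedConnector {
  /** Stand-in for IThreadContext. */
  static final class ThreadContext {
    final String owner = Thread.currentThread().getName();
  }

  /** Stand-in for a table manager that is built from a thread context. */
  static final class TableManager {
    final ThreadContext context;
    TableManager(ThreadContext context) { this.context = context; }
  }

  private ThreadContext currentContext;
  private TableManager manager;

  /** Called when a thread grabs this pooled instance. */
  void setThreadContext(ThreadContext context) {
    currentContext = context;
    manager = new TableManager(context); // rebuilt for every borrowing thread
  }

  /** Called when the thread releases the instance: forget everything. */
  void clearThreadContext() {
    manager = null;
    currentContext = null;
  }

  /** Fails fast instead of reusing an object tied to another thread. */
  TableManager manager() {
    if (manager == null) {
      throw new IllegalStateException("no thread context is set");
    }
    return manager;
  }
}
```

Keeping the manager in a static field, or creating it once in the constructor, would tie it to whichever thread happened to build it.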
Farzad
And if something is linked across threads, the cache manager is needed to retrieve the objects to operate on when needed; since I don't have one, I get an error.
When is the book coming out?
Karl Wright
It should start to be available through the Manning Early Access Program starting in about March or April.
Farzad
I seem to have had two major problems. First, I was caching the thread context; I have cleaned all of that up. The other is a race condition. Two threads try to do an insert and both fail; they then run the second query and find the same max number, so the later transaction fails. I made the second part of my algorithm synchronized, and then the same thing happened on the first insert. I am about to make the whole insert method synchronized. Before doing that, I was wondering if you have other suggestions for solving this problem. It seems that making it synchronized defeats the purpose of having multiple threads. My other choice would be to keep another table with the highest dup numbers and lock it during a write operation. Thoughts?
Karl Wright
I would use the IDFactory.make() method to generate a unique identifier. See the class org.apache.manifoldcf.core.interfaces.IDFactory.
Farzad
Unique ID for what? Synchronized didn't work, because I don't have the multiple-threads-one-object problem. I used a retry loop to solve the problem.
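A sketch of what such a retry loop might look like, under the assumption that the only shared state is the table itself and that its uniqueness constraint rejects a repeated (hashsum, dupnum) pair atomically. RetryingDupRecorder is hypothetical, and a concurrent set stands in for the database table; the actual code from this thread was not posted.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical simulation: a concurrent set plays the role of the unique index.
final class RetryingDupRecorder {
  private final Set<String> rows = ConcurrentHashMap.newKeySet();

  /** Atomic, like an INSERT guarded by a uniqueness constraint. */
  private boolean insert(String hashsum, int dupnum) {
    return rows.add(hashsum + ":" + dupnum);
  }

  /** Largest dupnum currently stored for this hashsum, or 0. */
  private int maxDupnum(String hashsum) {
    String prefix = hashsum + ":";
    int max = 0;
    for (String row : rows) {
      if (row.startsWith(prefix)) {
        max = Math.max(max, Integer.parseInt(row.substring(prefix.length())));
      }
    }
    return max;
  }

  /** Re-reads the maximum and tries again whenever another thread wins the race. */
  int record(String hashsum, int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      int candidate = maxDupnum(hashsum) + 1;
      if (insert(hashsum, candidate)) {
        return candidate;
      }
    }
    throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
  }

  /** Demo: several threads record the same hash; returns the number of rows stored. */
  static int demo(int threads, int perThread) {
    RetryingDupRecorder table = new RetryingDupRecorder();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        for (int j = 0; j < perThread; j++) {
          table.record("ABC123", threads * perThread);
        }
      });
      workers[i].start();
    }
    for (Thread t : workers) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return table.rows.size();
  }
}
```

No lock is held across the read and the insert; correctness rests on the insert being rejected when the number is already taken, at the cost of occasional extra round trips under contention.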
It crawled 47093 files before reporting this error. I think there is something not going well in the db layer. This exception didn't even get to my code.
Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: ERROR: unexpected chunk number 2 (expected 0) for toast value 34938 in pg_toast_2619
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: ERROR: unexpected chunk number 2 (expected 0) for toast value 34938 in pg_toast_2619
at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:421)
at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:449)
at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1091)
at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:167)
at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:754)
at org.apache.manifoldcf.crawler.jobs.JobManager.processParentHashSet(JobManager.java:3609)
at org.apache.manifoldcf.crawler.jobs.JobManager.calculateAffectedRestoreCarrydownChildren(JobManager.java:3578)
at org.apache.manifoldcf.crawler.jobs.JobManager.finishDocuments(JobManager.java:3503)
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:591)
Caused by: org.postgresql.util.PSQLException: ERROR: unexpected chunk number 2 (expected 0) for toast value 34938 in pg_toast_2619
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1548)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1316)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:191)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:452)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:351)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:255)
at org.apache.manifoldcf.core.database.Database.execute(Database.java:552)
at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:381)
Karl Wright
That problem seems to be a bug in PostgreSQL itself - or maybe the JDBC driver is incompatible with your version. What version of Postgres are you running?
Farzad
PostgreSQL 9.0, the only one that had a 64-bit Windows driver. I could downgrade my OS and install the older version, or you could use me as a guinea pig to test this one out.
Farzad
I just checked and PostgreSQL is running in 32 bit on the OS, not 64.
Karl Wright
You may need to download the compatible JDBC driver from Postgresql also, and replace the one in dist/examples/lib with the new one. Or, you could replace the one in modules/lib and rebuild - up to you.
Anonymous
Link is broken.