38 Comments
Anonymous
Here is a simple batch script for the DISCO command to retrieve all the sharepoint WSDL files.
------------
@echo off
REM
REM with Visual Studio 2005 installed, this is disco's location
REM
set DISCOCMD="C:\Program Files\Microsoft Visual Studio 8\SDK\v2.0\Bin\disco.exe"
REM
REM Adjust to a service account or Active Directory account that has permission
REM to access the SharePoint server.
REM
set USERNAME=
set PASSWORD=
set DOMAIN=
REM
REM SharePoint server base URL.
REM
set SPURL=http://SHAREPOINTURLHERE/_vti_bin
REM
REM output directory for WSDL files...make sure this directory exists.
REM
set OUTPUT=C:\temp\disco
REM
REM give some status.
REM
echo %DISCOCMD%
echo %USERNAME%
echo %SPURL%
echo %OUTPUT%
REM
REM
REM
set CMD=%DISCOCMD% /o:%OUTPUT% /u:%USERNAME% /p:%PASSWORD% /d:%DOMAIN% %SPURL%
FOR %%A IN ( Permissions Lists Dspsts usergroup versions webs ) DO %CMD%/%%A.asmx
------------
Karl Wright
There's also a tool that Microsoft ships with Visual Studio that allows you to do roughly the same thing, called "disco.exe".
Anonymous
I checked out the code from https://svn.apache.org/repos/asf/incubator/lcf/trunk and ran a successful build with ant (didn't include any proprietary stuff).
But I can't find a modules/dist/processes/define directory.
Did I miss a step during the installation? Can someone please give me a hint?
Karl Wright
If you didn't include any connectors that require defines, that directory won't be generated at all, I believe. The connector that generates most defines is the jcifs one, so if you didn't include that you are getting the expected results.
Anonymous
Oops, it seems I misunderstood the instructions; thanks.
But I'm still stuck with setting up the LCF... I created the DB and installed the agent.
When I go to http://localhost:8080/lcf-crawler-ui/ after starting the agent, I always receive a NullPointerException:
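org.apache.jasper.JasperException: java.lang.NullPointerException
root cause
java.lang.NullPointerException
at org.apache.lcf.core.system.Logging.newLogger(Logging.java:152)
at org.apache.lcf.core.system.Logging.initializeLoggers(Logging.java:86)
at org.apache.lcf.authorities.system.Logging.initializeLoggers(Logging.java:40)
at org.apache.lcf.authorities.system.LCF.initializeEnvironment(LCF.java:50)
at org.apache.lcf.crawler.system.LCF.initializeEnvironment(LCF.java:80)
at org.apache.jsp.index_jsp._jspService(index_jsp.java:111)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
...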
Karl Wright
It looks like you have not properly configured the LCF logging. See the instructions pertaining to properties.ini, and the sample logging.ini.
This kind of discussion is probably best moved to connectors-user@incubator.apache.org.
Farzad
How do you issue the commands? I need to create the db and init the schema. I'm using a Windows environment; do I open a command prompt and ...
Thanks!
Karl Wright
After you build, look under dist/processes/script and you will see scripts designed to execute commands. They require certain environment variables (as described above): JAVA_HOME and MCF_HOME. There's an executecommand.bat for Windows users.
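For example (using the install and JDK paths that come up later in this thread; substitute your own locations, and the command class and arguments are just placeholders):
set JAVA_HOME=C:\Program Files\Java\jdk1.5.0_22
set MCF_HOME=C:\Program Files\Apache\apache-acf\example
cd C:\Program Files\Apache\apache-acf\processes\script
executecommand.bat <command-class> <arguments>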
Farzad
I just rebuilt my dist because of the project name change; I think I was using old jars when trying the new commands. Also, I'm just using the default connectors and don't have any additional defines. I'm still having trouble using executecommand.bat. Here is what I'm trying; is this a syntax issue? postgres is the dbuser name. I assume the command will prompt for the password.
C:\Program Files\Apache\apache-acf\processes\script>executecommand.bat -Dorg.apache.manifoldcf.core.DBCreate=postgres
Farzad
What I get is the java command information, just as if I had issued an incorrect command.
Karl Wright
You want:
executecommand.bat org.apache.manifoldcf.core.DBCreate postgres
Have you tried using the quick-start instead? It may save you a lot of time.
cd dist/example
<java>/bin/java -jar start.jar
You can run this with postgresql too; you simply need to change properties.xml in dist/example to change the database implementation class.
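As a sketch, that change amounts to one property in dist/example/properties.xml (the property name is quoted from memory, so verify it against the file; the class name is the same DBInterfacePostgreSQL that appears in the stack traces in this thread):
<property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>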
Farzad
Ran into a problem with the quick-start and thought I might have better luck if I manually set up the system. Maybe you can shed some light on the quick-start problem. Here is what happened: after running start.jar, I went to the crawler UI and configured a null output and a file system repository connector. I created a job pointing to a file share \\host\share and started the job. After a few seconds I ran into the error message below in the job status panel. It said 60 docs found, 9 active, and 52 processed. Any ideas as to why I'm seeing this?
Error: A lock could not be obtained due to a deadlock, cycle of locks and waiters is: Lock : ROW, INGESTSTATUS, (1,57) Waiting XID :
, APP, DELETE FROM ingeststatus WHERE urihash=? AND dockey!=? AND connectionname=? Granted XID :
Lock : ROW, INGESTSTATUS, (1,55) Waiting XID :
, APP, INSERT INTO ingeststatus (id,changecount,dockey,lastversion,firstingest,connectionname,authorityname,urihash,lastoutputversion,lastingest,docuri) VALUES (?,?,?,?,?,?,?,?,?,?,?) Granted XID :
. The selected victim is XID : 6293.
Thanks!
Karl Wright
I don't know what this is, other than it is clearly a deadlock that Derby is encountering.
My advice to you is the following:
(1) Create a ticket under JIRA: https://issues.apache.org/jira , project name ManifoldCF.
(2) Give me a few minutes to look at the code, but if you cannot wait, you *can* use the quick-start with Postgresql. But if you have a few minutes, wait for my response.
Karl Wright
The code is not executing inside transaction at this point, so it is puzzling how Derby could wind up throwing persistent locks across statements, which is what you need to do to cause deadlock to occur. The only possibility I can think of is that there's a bug in the ManifoldCF Derby database implementation class that's causing transactions to persist. The only other alternative is a bug in Derby itself.
I can find this if it happens enough to trigger the deadlock regularly. How often does this happen to you? I have never seen it before myself.
In any case, please do open a Jira ticket, since this is certainly the wrong forum for problems of this kind.
Karl
Farzad
I opened a ticket, https://issues.apache.org/jira/browse/CONNECTORS-111. I did a bit more experimenting, and it seems to happen only the first time I run, i.e. the first time after a new install. I restarted the job and it ended successfully.
Farzad
Any thought as to why I get the error message below when trying to unregister an output connector?
C:\Program Files\Apache\apache-cf\dupfinder>.\processes\script\executecommand.bat org.apache.manifoldcf.agents.UnRegisterOutput DupFinderConnector
The system cannot find the path specified.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/manifoldcf/agents/UnRegisterOutput
I echo'd the command and it is constructed as:
"C:\Program Files\Java\jdk1.5.0_22\bin\java" "-Dorg.apache.manifoldcf.configfile=C:\Program Files\Apache\apache-acf\dupfinder\example\properties.xml" -classpath "." org.apache.manifoldcf.agents.UnRegisterOutput DupFinderConnector
My MCF_HOME is set to C:\Program Files\Apache\apache-cf\dupfinder\example and I'm using a PostgreSQL db instead of embedded Derby.
Karl Wright
Do not try to set the classpath yourself. That's what the script does.
The script is designed to work in the context where there is a "processes" subdirectory that contains all the necessary jars. If the directory you are pointing at doesn't have that setup, you will need to change things if you want to use the script.
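In other words, run the script from the built dist directory whose processes subdirectory actually contains the jars, roughly like this (paths are illustrative):
cd C:\Program Files\Apache\apache-acf
processes\script\executecommand.bat org.apache.manifoldcf.agents.UnRegisterOutput DupFinderConnector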
Farzad
The updated JDBC driver was the last issue. I successfully crawled 280,225 items; thanks for all your help! Now I'm trying to set up a faster configuration and improve performance. Can I get access to the 30,000-item sample set you used to benchmark? I just got my system upgraded with 8 GB of RAM, and my disks are 7200 RPM. I'd like to be able to compare apples to apples.
Farzad
I'm not getting good performance. The system has 8 GB of RAM, two 500 GB disk drives rated at 7200 RPM, and Windows 7 64-bit. It improved from the old 5/sec to 9/sec with these upgrades. I'd like to get the 31/sec you mentioned. Any thoughts on how to go about this? I don't know if sharing your sample set is problematic or not; do you want me to pull together a set and share that with you? I have some 17 million .eml files that I can use. I'd appreciate any thoughts or suggestions.
Karl Wright
My test set is the lucene/solr trunk. Check it out using svn and crawl that, including the .svn directories, and see how fast it is for you.
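A checkout along these lines should reproduce the test set (the repository path below is from memory and may have changed):
svn checkout https://svn.apache.org/repos/asf/lucene/dev/trunk lucene-solr-trunk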
Farzad
The trunk of Solr has 14,752 files and 9,269 folders. The job completed in 24 minutes and 11 seconds, or 1451 seconds, so I'm getting a rate of 16.6 items/sec. If I count only the files, the rate is 10.2 files/sec. Did you use the total count or the file count?
Is there a tool we can both use to compare the systems? Do you have ManifoldCF, the database, and the appserver running off the same disk? I first had the data on the same disk as ManifoldCF, then I moved it to a network drive, and the crawl time went up by 35 seconds.
My only goal at this point is to achieve your results on my system.
Karl Wright
My overall count was larger, because my solr and lucene had been compiled and built. I counted both folders and files in my docs/second calculations. So my system was performing about 2x as fast as yours.
I have everything on the same disk on this system - nothing fancy. It's a Dell Vostro tower, all standard hardware, Windows Vista. I believe I posted processor and memory info earlier.
Farzad
Any ideas what this error means and how to fix it? Still performance testing and tuning ...
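Exception tossed: Error getting connection
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Error getting connection
at org.apache.manifoldcf.core.database.ConnectionFactory.getConnection(ConnectionFactory.java:104)
at org.apache.manifoldcf.core.database.Database.internalTransactionBegin(Database.java:230)
at org.apache.manifoldcf.core.database.Database.synchronizeTransactions(Database.java:217)
at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1079)
at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:167)
at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:586)
at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.beginTransaction(DBInterfacePostgreSQL.java:1001)
at org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:3286)
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1848)
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1421)
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1487)
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1505)
at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1520)
at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:231)
at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
Caused by: org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserved for non-replication superuser connections
...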
Karl Wright
The key error is: "Caused by: org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserved for non-replication superuser connections".
You've run out of PostgreSQL connections. How many worker threads do you have? What's the maximum number of connections your PostgreSQL is configured for? The defaults for these values are 30 and 100, respectively - you must have changed one or both of these to get this error.
Farzad
I don't see the worker threads in the conf file. The only thing I saw was autovacuum_max_workers, which is commented out. My max_connections = 400, and I created output and repo connections in MCF that have 100 connections per JVM, so 100 for output and 100 for repo.
Karl Wright
The worker threads are in ManifoldCF's properties.xml, not postgresql.conf.
Farzad
The instructions on this page say to set the max db connections to 400. And I just found it easier to add a 0 to the default 10 in the throttling tab of the connectors.
Karl Wright
The problem is that you also need to increase the amount of shared memory buffers for PostgreSQL when you increase the max number of connections. There is a formula in the postgresql.conf file.
If you haven't completely followed the instructions and haven't actually increased the number of worker threads, however, you do not need to set the max number of connections to anything other than their default of 100.
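The two postgresql.conf knobs in question look roughly like this (the values are only placeholders; the actual formula is in the comments of postgresql.conf itself):
max_connections = 100      # PostgreSQL default; raise only if you also raise ManifoldCF's worker threads
shared_buffers = 256MB     # illustrative; must grow along with max_connections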
Farzad
I didn't see any instructions talking about the need to increase the number of worker threads. Where is that listed?
Something doesn't add up. Reading the conf, each connection costs 400 bytes and each lock costs 270 bytes. The default maximum is 10 locks per connection.
10 x 400 x 270 = 1,080,000
400 x 400 = 160,000
Total shared memory needed = 1,240,000 bytes, which is about 1.2 MB; we are setting the shared memory to 1024 MB, almost 1000x more. Why would there be a problem?
The other angle is that I configured PostgreSQL to allow 400 connections, but MCF is set to 30 worker threads, not even close.
Something is missing
Karl Wright
You do not need to increase the number of worker threads, unless you have a machine capable of massive parallelism, in which case I could have sworn the instructions said it was OK to increase the number of worker threads to around 100. If you do that, you should have a postgresql max connections parameter of 300 or 400.
The instructions for the postgresql parameters are specific to some extent to PostgreSQL 8.x. I don't know what they should be on 9.x. I can only tell you that your symptom is occurring because ManifoldCF is trying to grab more connections than PostgreSQL is willing to give. Since ManifoldCF uses a connection pool, it can only grab one connection per thread.
The only way this would be violated is if a connector was written that uses connection-specific database tables and doesn't properly return handles to the pool after grabbing them. Is there a possibility that this is occurring?
Farzad
I'm using the default Null Output and File System connectors for this test, running on an 8-processor system with 8 GB of RAM and 10,000 RPM drives.
So I seem to be in a pickle: I have 30 worker threads and 400 allowed db connections, and I'm still getting this error. I'm going to set the db back down to 100 and see what happens.
I uploaded my configs to http://www.farzad.net/manifoldcf in case I overlooked something.
Oh, the other thing, this happens around doc count of 60,000. Have you tested with a very large test set, perhaps 250,000 or 500,000?
Karl Wright
It's been a while since this was done with the ManifoldCF code base, but the MetaCarta code base on which it is built regularly crawls 5 million or more without any issues of this kind. So the possibilities are:
(1) Misconfiguration of PostgreSQL
(2) Something about PostgreSQL 9.x.
(3) Something in the ManifoldCF code that got broken sometime in the last few months.
The obvious cross-check is for someone else (probably me) to do a large crawl while using a properly configured PostgreSQL 8.x, and see where that leads. Unfortunately I probably won't have time to do that until next week, unless I can squeeze in a few moments tonight.
Farzad
In the meantime, I'm going to install 8.x to reduce the variables and see what happens.
Farzad
It failed again, so I think I've ruled out 1 and 2. I uninstalled 9.0, rebooted, and installed 8.3.12 (and also reverted back to the original JDBC driver). I didn't make a single change to any of the configs, since I have not changed Manifold's worker count. It failed after 63,953 documents found, 35,355 active, and 29,174 processed.
If you have any other ideas for me to try, I'd be more than happy to. At this point, I think we are looking at number 3. I am running the example setup with the provided connectors, nothing custom.
Karl Wright
I've looked at the default pool size. Without anything in properties.xml, the default is 200 handles, which is bigger than PostgreSQL's default of 100. (It is, however, reasonable if you set the PostgreSQL maximum connections to 400.)
What I'd like you to try is to set the parameter "org.apache.manifoldcf.database.maxhandles" in your properties.xml to a value of "50". This is below PostgreSQL's default, so you should not run out of postgresql handles. I'm trying the same thing right now.
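In properties.xml that is one line, for example:
<property name="org.apache.manifoldcf.database.maxhandles" value="50"/>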
Farzad
Well, my system is still running, but very slowly. It's been almost 24 hours and it has only crawled 243,297 items. This is on the 8-processor system with 10K disks (120 MB reads, 200 MB writes). I'll let it finish; should I try increasing connections and workers on the next go-around?
Karl Wright
It is clear then that your attempt to set up PostgreSQL with 400 database handles did not actually succeed, or my recommendation would not have helped.
The performance is still very poor compared with my very cheap system, but your disks now look reasonably quick. So let's try to figure out the problem.
(1) The default of 30 threads sounds low for your system. I'd up this to 100.
(2) You don't want the maximum connections to be a bottleneck, either on the repository connection side or on the output connection side. Set the max connections for both to 105.
(3) Configure your PostgreSQL to have at least 200 database handles available. I know you tried to do this already, but for some reason your configuration did not work.
(4) Set your properties.xml maximum database connections parameter to be 105, so that's not a bottleneck either.
(5) You may want to give the JVM more memory than the default. Perhaps you are garbage collecting too much. If you are still using the quick-start, just add appropriate -Xmx and -Xms options. I'd start with 1024MB.
If none of this helps, then we can figure out what the bottleneck is by getting a Java thread dump while the crawler is active. How you do this depends on what operating system you are using. But it should be possible from that thread dump to get an idea where all the threads are waiting. Post it to connectors-user@incubator.apache.org.
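To make points (1) and (4) concrete, the properties.xml entries would look something like the following (the worker-thread property name here is from memory, so check it against your properties.xml before relying on it):
<property name="org.apache.manifoldcf.crawler.threads" value="100"/>
<property name="org.apache.manifoldcf.database.maxhandles" value="105"/>
For the thread dump, jstack <pid> works if your JDK ships it; on Windows, pressing Ctrl+Break in the console window that launched the quick-start also prints a dump to that console.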
Thanks,
Karl
Farzad
Made the changes; the system is still having problems, running but very slowly. I started the example with "java -Xms512m -Xmx1024m -jar start.jar". I got two Java thread dumps at 40K and 80K documents and emailed them along with my config files to connectors-user@incubator.apache.org.