This page has been deprecated!
Go to http://incubator.apache.org/connectors/how-to-build-and-deploy.html instead.


38 Comments

  1. Anonymous

    Here is a simple batch script that uses the DISCO command to retrieve all the SharePoint WSDL files.
    ------------

    @echo off

    REM
    REM with Visual Studio 2005 installed, this is disco's location
    REM
    set DISCOCMD="C:\Program Files\Microsoft Visual Studio 8\SDK\v2.0\Bin\disco.exe"

    REM
    REM adjust to a service account or Active Directory account that has
    REM permission to access the SharePoint server.
    REM
    set USERNAME=
    set PASSWORD=
    set DOMAIN=

    REM
    REM SharePoint server base URL.
    REM
    set SPURL=http://SHAREPOINTURLHERE/_vti_bin

    REM
    REM output directory for WSDL files...make sure this directory exists.
    REM
    set OUTPUT=C:\temp\disco

    REM
    REM give some status.
    REM
    echo %DISCOCMD%
    echo %USERNAME%
    echo %SPURL%
    echo %OUTPUT%

    REM
    REM build the base disco command, then fetch each web service's WSDL.
    REM
    set CMD=%DISCOCMD% /o:%OUTPUT% /u:%USERNAME% /p:%PASSWORD% /d:%DOMAIN% %SPURL%

    FOR %%A IN ( Permissions Lists Dspsts usergroup versions webs ) DO %CMD%/%%A.asmx

    ------------
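    For reference, a hypothetical run (the script name getwsdls.bat is made up here; the output directory must exist first):

    C:\> mkdir C:\temp\disco
    C:\> getwsdls.bat

    Each .asmx endpoint listed in the FOR loop should yield a corresponding .wsdl (and .disco) file under C:\temp\disco.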

    1. There's also a tool that Microsoft ships with Visual Studio that allows you to do roughly the same thing, called "disco.exe".

  2. Anonymous

    I checked out the code from https://svn.apache.org/repos/asf/incubator/lcf/trunk and ran a successful build with ant (didn't include any proprietary stuff).
    But I can't find a modules/dist/processes/define directory.

    Did I miss a step during the installation? Can someone please give me a hint?

    1. If you didn't include any connectors that require defines, that directory won't be generated at all, I believe. The connector that generates most defines is the jcifs one, so if you didn't include that you are getting the expected results. (wink)

      1. Anonymous

        Oops, it seems I misunderstood the instruction, thanks.
        But I'm still stuck with setting up the LCF...  I created the DB and installed the agent.
        When I go to http://localhost:8080/lcf-crawler-ui/ after starting the agent, I always receive a NullPointerException:

        type Exception report
        
        message
        
        description The server encountered an internal error () that prevented it from  fulfilling this request.
        
        exception
        
        org.apache.jasper.JasperException: java.lang.NullPointerException
        	org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:491)
        	org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:419)
        	org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
        	org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
        	javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
        
        root cause
        
        java.lang.NullPointerException
        	org.apache.lcf.core.system.Logging.newLogger(Logging.java:152)
        	org.apache.lcf.core.system.Logging.initializeLoggers(Logging.java:86)
        	org.apache.lcf.authorities.system.Logging.initializeLoggers(Logging.java:40)
        	org.apache.lcf.authorities.system.LCF.initializeEnvironment(LCF.java:50)
        	org.apache.lcf.crawler.system.LCF.initializeEnvironment(LCF.java:80)
        	org.apache.jsp.index_jsp._jspService(index_jsp.java:111)
        	org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
        	javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
        	org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:377)
        	org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
        	org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
        	javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
        
        Logfile:
        
         org.apache.catalina.core.StandardWrapperValve invoke
         FATAL: Servlet.service() for servlet jsp threw exception
         java.lang.NullPointerException
         at org.apache.lcf.core.system.Logging.newLogger(Logging.java:152)
         at  org.apache.lcf.core.system.Logging.initializeLoggers(Logging.java:86)
         at  org.apache.lcf.authorities.system.Logging.initializeLoggers(Logging.java:40)
         at  org.apache.lcf.authorities.system.LCF.initializeEnvironment(LCF.java:50)
         at org.apache.lcf.crawler.system.LCF.initializeEnvironment(LCF.java:80)
         at org.apache.jsp.index_jsp._jspService(index_jsp.java:111)
         at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
         at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
         at  org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:377)
         at  org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
         at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
         at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
         at  org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
         at  org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
         at  org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
         at  org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
         at  org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
         at  org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
         at  org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
         at  org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
         at  org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
         at  org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
         at  org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
         at java.lang.Thread.run(Thread.java:619)
        
        1. It looks like you have not properly configured the LCF logging. See the instructions pertaining to properties.ini, and the sample logging.ini.
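          As a rough sketch, a minimal logging.ini in the usual log4j properties format might look like this (the appender name and log path here are illustrative, not the shipped sample):

          log4j.rootLogger=WARN, MAIN
          log4j.appender.MAIN=org.apache.log4j.RollingFileAppender
          log4j.appender.MAIN.File=logs/lcf.log
          log4j.appender.MAIN.layout=org.apache.log4j.PatternLayout
          log4j.appender.MAIN.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %m%n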

          This kind of discussion is probably best moved to connectors-user@incubator.apache.org.

  3. How do you issue the commands? I need to create the DB and init the schema. I'm using a Windows environment; do I open a command prompt and ...

    Thanks!

    1. After you build, look under dist/processes/script and you will see scripts designed to execute commands.  They require certain environment variables (as described above): JAVA_HOME and MCF_HOME.  There's an executecommand.bat for Windows users.
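      A hypothetical Windows session (all paths are illustrative):

      set JAVA_HOME=C:\Program Files\Java\jdk1.5.0_22
      set MCF_HOME=C:\mcf\dist\example
      cd C:\mcf\dist\processes\script
      executecommand.bat org.apache.manifoldcf.core.DBCreate postgres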

      1. I just rebuilt my dist because of the project name change; I think I was using old jars when trying the new commands. Also, I'm just using the default connectors and don't have any additional defines. I'm still having trouble using executecommand.bat. Here is what I'm trying; is this a syntax issue? postgres is the DB user name. I assume the command will prompt for the password.

        C:\Program Files\Apache\apache-acf\processes\script>executecommand.bat -Dorg.apache.manifoldcf.core.DBCreate=postgres

        1. What I get is the java command usage information, just as if I had issued an incorrect command.

          1. You want:

            executecommand.bat org.apache.manifoldcf.core.DBCreate postgres

            Have you tried using the quick-start instead?  It may save you a lot of time.

            cd dist/example

            <java>/bin/java -jar start.jar

            You can run this with postgresql too; you simply need to change properties.xml in dist/example to change the database implementation class.
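            A sketch of the relevant properties.xml entries (the superuser entries are assumptions; check the shipped properties.xml for the exact property names):

            <property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
            <property name="org.apache.manifoldcf.dbsuperusername" value="postgres"/>
            <property name="org.apache.manifoldcf.dbsuperuserpassword" value="postgres"/>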

            1. Ran into a problem with the quick-start and thought I might have better luck if I manually set up the system. Maybe you can shed some light on the quick-start problem. Here is what happened: after running start.jar, I went to the crawler UI and configured a null output and a file system repo connector. I created a job pointing to a file share \\host\share and started the job. After a few seconds I ran into the error message below in the job status panel. It said 60 docs found, 9 active, and 52 processed. Any ideas as to why I'm seeing this?

              Error: A lock could not be obtained due to a deadlock, cycle of locks and waiters is:

              Lock : ROW, INGESTSTATUS, (1,57) Waiting XID : {6293, X}, APP, DELETE FROM ingeststatus WHERE urihash=? AND dockey!=? AND connectionname=? Granted XID : {6305, X}

              Lock : ROW, INGESTSTATUS, (1,55) Waiting XID : , APP, INSERT INTO ingeststatus (id,changecount,dockey,lastversion,firstingest,connectionname,authorityname,urihash,lastoutputversion,lastingest,docuri) VALUES (?,?,?,?,?,?,?,?,?,?,?) Granted XID : . The selected victim is XID : 6293.

              Thanks!

              1. I don't know what this is, other than it is clearly a deadlock that Derby is encountering.

                My advice to you is the following:

                (1) Create a ticket under JIRA: https://issues.apache.org/jira , project name ManifoldCF.

                (2) Give me a few minutes to look at the code, but if you cannot wait, you *can* use the quick-start with Postgresql.  But if you have a few minutes, wait for my response.

              2. The code is not executing inside a transaction at this point, so it is puzzling how Derby could wind up holding persistent locks across statements, which is what is needed for a deadlock to occur. The only possibility I can think of is that there's a bug in the ManifoldCF Derby database implementation class that's causing transactions to persist. The only other alternative is a bug in Derby itself.

                I can find this if it happens enough to trigger the deadlock regularly. How often does this happen to you?  I have never seen it before myself.

                In any case, please do open a Jira ticket, since this is certainly the wrong forum for problems of this kind.

                Karl

                1. I opened a ticket: https://issues.apache.org/jira/browse/CONNECTORS-111. I did a bit more experimenting, and it seems to happen only the first time I run after a new install. I restarted the job and it ended successfully.

  4. Any thoughts as to why I get the error message below when trying to unregister an output connector?

    C:\Program Files\Apache\apache-cf\dupfinder>.\processes\script\executecommand.bat org.apache.manifoldcf.agents.UnRegisterOutput DupFinderConnector
    The system cannot find the path specified.
    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/manifoldcf/agents/UnRegisterOutput

    I echoed the command, and it is constructed as:

    "C:\Program Files\Java\jdk1.5.0_22\bin\java" "-Dorg.apache.manifoldcf.configfile=C:\Program Files\Apache\apache-acf\dupfinder\example\properties.xml" -classpath "." org.apache.manifoldcf.agents.UnRegisterOutput DupFinderConnector

    My MCF_HOME is set to C:\Program Files\Apache\apache-cf\dupfinder\example and I'm using a PostgreSQL DB instead of the embedded Derby.

    1. Do not try to set the classpath yourself.  That's what the script does.

      The script is designed to work in the context where there is a "processes" subdirectory that contains all the necessary jars.  If the directory you are pointing at doesn't have that setup, you will need to change things if you want to use the script.
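      For example, with a layout like the following (the dist path is illustrative), the script would be run from the directory that contains processes\:

      cd C:\mcf\dist
      processes\script\executecommand.bat org.apache.manifoldcf.agents.UnRegisterOutput DupFinderConnector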

  5. The updated JDBC driver was the last issue. I successfully crawled 280,225 items; thanks for all your help! Now I'm trying to set up a faster configuration and improve performance. Can I get access to the 30,000-item sample set you used to benchmark? I just got my system upgraded with 8 GB of RAM, and my disks are 7200 RPM. I'd like to be able to compare apples to apples.

    1. Not getting good performance. The system has 8 GB of RAM, two 500 GB disk drives rated at 7200 RPM, and Windows 7 64-bit. It improved from the old 5/sec to 9/sec with these upgrades. I'd like to get the 31/sec you mentioned. Any thoughts on how to go about this? I don't know if sharing your sample set is problematic or not; do you want me to pull together a set and share that with you? I have some 17 million .eml files that I can use. I'd appreciate any thoughts or suggestions.

      1. My test set is the lucene/solr trunk.  Check it out using svn and crawl that, including the .svn directories, and see how fast it is for you.
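        For example (the trunk URL is from memory; adjust to whatever svn path is current):

        svn checkout http://svn.apache.org/repos/asf/lucene/dev/trunk lucene-solr-trunk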

        1. The trunk of Solr has 14,752 files and 9,269 folders. The job completed in 24 minutes and 11 seconds, or 1451 seconds. I'm getting a rate of 16.6 items/sec. If I count only the files, the rate is 10.2 files/sec. Did you use the total count or the file count?

          Is there a tool we can both use to compare the systems? Do you have ManifoldCF, the database, and the app server running off the same disk? I first had the data on the same disk as Manifold; then I moved it to a network drive, and the crawl time went up by 35 seconds.

          My only goal at this point is to achieve your results on my system.

          1. My overall count was larger, because my Solr and Lucene had been compiled and built. I counted both folders and files in my docs/second calculations. So my system was performing about 2x as fast as yours.

            I have everything on the same disk on this system - nothing fancy. It's a Dell Vostro tower, all standard hardware, Windows Vista. I believe I posted processor and memory info earlier.

  6. Any ideas what this error means and how to fix it? Still performance testing and tuning ...

    Exception tossed: Error getting connection
    org.apache.manifoldcf.core.interfaces.ManifoldCFException: Error getting connection
    	at org.apache.manifoldcf.core.database.ConnectionFactory.getConnection(ConnectionFactory.java:104)
    	at org.apache.manifoldcf.core.database.Database.internalTransactionBegin(Database.java:230)
    	at org.apache.manifoldcf.core.database.Database.synchronizeTransactions(Database.java:217)
    	at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1079)
    	at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
    	at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:167)
    	at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performModification(DBInterfacePostgreSQL.java:586)
    	at org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.beginTransaction(DBInterfacePostgreSQL.java:1001)
    	at org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:3286)
    	at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1848)
    	at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1421)
    	at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1487)
    	at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1505)
    	at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1520)
    	at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:231)
    	at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
    	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
    Caused by: org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserved for non-replication superuser connections
    	at org.postgresql.core.v3.ConnectionFactoryImpl.readStartupMessages(ConnectionFactoryImpl.java:469)
    	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:112)
    	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:66)
    	at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:125)
    	at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:30)
    	at org.postgresql.jdbc3g.AbstractJdbc3gConnection.<init>(AbstractJdbc3gConnection.java:22)
    	at org.postgresql.jdbc4.AbstractJdbc4Connection.<init>(AbstractJdbc4Connection.java:30)
    	at org.postgresql.jdbc4.Jdbc4Connection.<init>(Jdbc4Connection.java:24)
    	at org.postgresql.Driver.makeConnection(Driver.java:393)
    	at org.postgresql.Driver.connect(Driver.java:267)
    	at java.sql.DriverManager.getConnection(Unknown Source)
    	at java.sql.DriverManager.getConnection(Unknown Source)
    	at com.bitmechanic.sql.ConnectionPool.createDriverConnection(ConnectionPool.java:468)
    	at com.bitmechanic.sql.ConnectionPool.getConnection(ConnectionPool.java:407)
    	at com.bitmechanic.sql.ConnectionPoolManager.connect(ConnectionPoolManager.java:442)
    	at java.sql.DriverManager.getConnection(Unknown Source)
    	at java.sql.DriverManager.getConnection(Unknown Source)
    	at org.apache.manifoldcf.core.database.ConnectionFactory.getConnectionWithRetries(ConnectionFactory.java:144)
    	at org.apache.manifoldcf.core.database.ConnectionFactory.getConnection(ConnectionFactory.java:90)
    	... 16 more
    
    1. The key error is: "Caused by: org.postgresql.util.PSQLException: FATAL: remaining connection slots are reserved for non-replication superuser connections".

      You've run out of PostgreSQL connections.  How many worker threads do you have?  What's the maximum number of connections your PostgreSQL is configured for?  The defaults for these values are 30 and 100, respectively - you must have changed one or both of these to get this error.
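      A quick way to check the PostgreSQL side, assuming you have psql access:

      psql -U postgres -c "SHOW max_connections;"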

      1. I don't see the worker threads in the conf file. The only thing I saw was autovacuum_max_workers, which is commented out. My max_connections = 400, and I create output and repo connectors in MCF that have 100 connections per JVM, so 100 for output and 100 for repo.

        1. The worker threads are in ManifoldCF's properties.xml, not postgresql.conf.
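          For example, a sketch of the entry (the property name here is an assumption; consult the shipped properties.xml for the exact spelling):

          <property name="org.apache.manifoldcf.crawler.threads" value="30"/>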

      2. The instructions on this page say to set the max DB connections to 400. And I just found it easier to add a 0 to the default 10 in the throttling tab of the connectors.

        1. The problem is that you also need to increase the amount of shared memory buffers for PostgreSQL when you increase the max number of connections.  There is a formula in the postgresql.conf file.
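          For example, a postgresql.conf sketch (values illustrative; derive the actual shared_buffers requirement from the formula in the file's comments):

          max_connections = 400
          shared_buffers = 1024MB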

          If you haven't completely followed the instructions and haven't actually increased the number of worker threads, however, you do not need to set the max number of connections to anything other than their default of 100.

          1. I didn't see any instructions talking about the need to increase the number of worker threads. Where is that listed?

            Something doesn't add up. Reading the conf, each connection costs 400 bytes and each lock costs 270 bytes; the default maximum is 10 locks per connection.

            10 x 400 x 270 = 1,080,000
            400 x 400 = 160,000

            Total shared memory needed = 1,240,000 bytes, which is about 1.2 MB. We are setting the shared memory to 1024 MB, almost a thousand times more. Why would there be a problem?

            The other angle: I configured PostgreSQL to allow 400 connections, but MCF is set to 30 worker threads, not even close.

            Something is missing (smile)

            1. You do not need to increase the number of worker threads unless you have a machine capable of massive parallelism. I could have sworn the instructions said it was OK to increase the number of worker threads to around 100 in that case. If you do that, you should have a PostgreSQL max connections parameter of 300 or 400.

              The instructions for the PostgreSQL parameters are specific to some extent to PostgreSQL 8.x. I don't know what they should be on 9.x. I can only tell you that your symptom is occurring because ManifoldCF is trying to grab more connections than PostgreSQL is willing to give. Since ManifoldCF uses a connection pool, it can grab at most one connection per thread.

              The only way this would be violated is if a connector was written that uses connection-specific database tables and doesn't properly return handles to the pool after grabbing them. Is there a possibility that this is occurring?

              1. Using the default Null Output and File System connectors for this test. I'm running on an 8-processor system with 8 GB of RAM and 10,000 RPM drives.

                So I seem to be in a pickle: I have 30 worker threads and 400 allowed DB connections, yet I'm still getting this error. I'm going to set the DB back down to 100 and see what happens.

                I uploaded my configs to http://www.farzad.net/manifoldcf in case I overlooked something.

                Oh, the other thing: this happens around a doc count of 60,000. Have you tested with a very large test set, perhaps 250,000 or 500,000?

                1. It's been a while since this was done with the ManifoldCF code base, but the MetaCarta code base on which it is built regularly crawls 5 million or more without any issues of this kind.  So the possibilities are:

                  (1) Misconfiguration of PostgreSQL

                  (2) Something about PostgreSQL 9.x.

                  (3) Something in the ManifoldCF code that got broken sometime in the last few months.

                  The obvious cross-check is for someone else (probably me) to do a large crawl while using a properly configured PostgreSQL 8.x, and see where that leads.  Unfortunately I probably won't have time to do that until next week, unless I can squeeze in a few moments tonight.

                  1. In the meantime, I'm going to install 8.x to reduce the variables and see what happens.

                    1. It failed again, so I think I ruled out (1) and (2). I uninstalled 9.0, rebooted, and installed 8.3.12 (also reverted back to the original JDBC driver). I didn't make a single change to any of the configs, since I have not changed Manifold's worker count. It failed after 63,953 documents found, 35,355 active, and 29,174 processed.

                      If you have any other ideas for me to try, I'd be more than happy to try them. At this point, I think we are looking at number 3. I am running the example setup with the provided connectors, nothing custom.

                      1. I've looked at the default pool size.  Without anything in the properties.xml, the default is 200 handles, which is bigger than PostgreSQL's default of 100.  (It is, however, reasonable if you set PostgreSQL maximum connections to 400.)

                        What I'd like you to try is to set the parameter "org.apache.manifoldcf.database.maxhandles" in your properties.xml to a value of "50".  This is below PostgreSQL's default, so you should not run out of PostgreSQL handles.  I'm trying the same thing right now.
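                        In properties.xml that would look like:

                        <property name="org.apache.manifoldcf.database.maxhandles" value="50"/>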

                          1. Well, my system is still running, but very slowly. It's been almost 24 hours and it has only crawled 243,297 items. This is on the 8-processor system with 10K disks (120 MB reads and 200 MB writes). I'll let it finish; should I try increasing connections and workers on the next go-around?

                          1. It is clear then that your attempt to set up PostgreSQL with 400 database handles did not actually succeed, or my recommendation would not have helped.

                            The performance is still very poor compared with my very cheap system, but your disks now look reasonably quick.  So let's try to figure out the problem.

                            (1) The default of 30 threads sounds low for your system.  I'd up this to 100.

                            (2) You don't want the maximum connections to be a bottleneck, either on the repository connection side or on the output connection side.  Set the max connections for both to 105.

                            (3) Configure your PostgreSQL to have at least 200 database handles available.  I know you tried to do this already, but for some reason your configuration did not work.

                            (4) Set your properties.xml maximum database connections parameter to be 105, so that's not a bottleneck either.

                            (5) You may want to give the JVM more memory than the default.  Perhaps you are garbage collecting too much.  If you are still using the quick-start, just add appropriate -Xmx and -Xms options.  I'd start with 1024MB.  (A consolidated sketch of all of these settings follows below.)
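                            Putting (1) through (5) together, a consolidated sketch (the thread-count property name is an assumption; check your properties.xml):

                            properties.xml:
                              <property name="org.apache.manifoldcf.crawler.threads" value="100"/>
                              <property name="org.apache.manifoldcf.database.maxhandles" value="105"/>

                            postgresql.conf:
                              max_connections = 200

                            quick-start launch:
                              java -Xms1024m -Xmx1024m -jar start.jar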

                            If none of this helps, then we can figure out what the bottleneck is by getting a Java thread dump while the crawler is active.  How you do this depends on what operating system you are using.  But it should be possible from that thread dump to get an idea where all the threads are waiting.  Post it to connectors-user@incubator.apache.org.

                            Thanks,
                            Karl

                            1. Made the changes; the system is still having problems, running but very slow. I started the example with "java -Xms512m -Xmx1024m -jar start.jar". Got two Java thread dumps at 40K and 80K, and emailed them along with my config files to connectors-user@incubator.apache.org.