Frequently asked questions

Security model

Q. What exactly are the ACCESS_TOKEN and DENY_TOKEN values that are sent to an output connector, and presumably stored in the index?

A. The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary strings that represent a contract between a ManifoldCF authority connection and the ManifoldCF repository connection that picks up the documents (from wherever). These tokens thus have no real meaning outside of ManifoldCF. You must regard them as opaque.

The contract, however, states that if you use the ManifoldCF authority service to obtain tokens for an authenticated user, you will get back a set that is CONSISTENT with the tokens that were attached to the documents ManifoldCF sent to Solr for indexing in the first place. So you don't have to worry about it, and that's kind of the idea. You can thus imagine the following flow:

1. Use ManifoldCF to fetch documents and send them to Solr
2. When searching, use the ManifoldCF authority service to get the desired user's access tokens
3. Either filter the results, or modify the query, to be sure the access tokens all match up properly
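To make step 2 concrete, here is a minimal Java sketch of asking the authority service for a user's tokens over HTTP. The /mcf/UserACLs servlet path and the TOKEN: line format are taken from the UserACLs examples further down this page; the host name, the missing error handling, and the class scaffolding are assumptions you would adapt to your own deployment.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

public class UserTokensSketch {
  // Fetch the access tokens for one authenticated user from the ManifoldCF
  // authority service. Response lines look like "TOKEN:myAD:S-23-64-12345".
  public static List<String> fetchTokens(String userName) throws Exception {
    URL url = new URL("http://localhost/mcf/UserACLs?username="
        + URLEncoder.encode(userName, "UTF-8"));
    List<String> tokens = new ArrayList<String>();
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.startsWith("TOKEN:"))
          tokens.add(line.substring("TOKEN:".length()));
      }
    } finally {
      in.close();
    }
    return tokens;
  }
}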

For the AD authority, the ManifoldCF access tokens consist, in part, of the user's SIDs. For other authorities, the access tokens are wildly different. You really don't want to know what's in them, since determining that is the job of the ManifoldCF authority.

ManifoldCF is not, by the way, joined at the hip with AD. However, in practice, most enterprises in the world use some form of AD single sign-on for their web applications, and even if they're using some repository with its own idea of security, there's a mapping between the AD users and the repository's users. Doing that mapping is also the job of the ManifoldCF authority for that repository.

Q. What is the relationship between stored data (documents) and authority access/deny attributes? Do you have any examples of what an access_token value might contain?

A. Documents have access/deny attributes; authorities simply provide the list of tokens that belong to an authenticated user. Thus, there's no access/deny for an authority; that's attached to the document (as it is in real-world repositories).

Let's run a quick example, using Active Directory and a Windows file system. Suppose that you have a directory with documents in it, call it DirectoryA, and the directory allows read access to the following SIDs:

S-123-456-76890
S-23-64-12345

These SIDs correspond to Active Directory groups; let's call them Group1 and Group2, respectively.

DirectoryB also has documents in it, and those documents have just the SID S-123-456-76890 attached, because only Group1 can read its contents.

Now, pretend that someone has created a ManifoldCF Active Directory authority connection (in the ManifoldCF UI), which is called "myAD", and this connection is set up to talk to the governing AD domain controller for this Windows file system. We now know enough to describe the document indexing process:

  • Each file in DirectoryA will have the following _ALLOW_TOKEN_document attributes inside Solr: "myAD:S-123-456-76890" and "myAD:S-23-64-12345".
  • Each file in DirectoryB will have the following _ALLOW_TOKEN_document attribute inside Solr: "myAD:S-123-456-76890".

Now, suppose that a user (let's call him "Peter") is authenticated with the AD domain controller. Peter belongs to Group2, so his SIDs are (say):

S-1-1-0 (the 'everyone' SID)
S-323-999-12345 (his own personal user SID)
S-23-64-12345 (the SID he gets because he belongs to Group2)

We want to look up the documents in the search index that he can see. So, we ask the ManifoldCF authority service what his tokens are, and we get back:

"myAD:S-1-1-0", "myAD:S-323-999-12345", and "myAD:S-23-64-12345"

The documents we should return in his search are the ones that match his search criteria, AND that have at least one of his tokens among their ALLOW tokens, MINUS any that have one of his tokens among their DENY tokens (there aren't any DENY tokens involved in this example). So only files that have one of his three tokens as an ALLOW attribute would be returned.
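As an illustration of that matching rule, here is a minimal Java sketch (not the actual ManifoldCF Solr plugin) that turns Peter's tokens into a Lucene-syntax filter clause. The allow_token_document and deny_token_document field names match those used elsewhere on this page; everything else is illustrative scaffolding.

import java.util.Arrays;
import java.util.List;

public class SecurityFilterSketch {
  // Build a filter that matches any ALLOW token the user holds, and excludes
  // any document carrying one of the user's tokens as a DENY token.
  public static String buildFilter(List<String> userTokens) {
    StringBuilder allow = new StringBuilder();
    for (String token : userTokens) {
      if (allow.length() > 0) allow.append(" OR ");
      allow.append("allow_token_document:\"").append(token).append("\"");
    }
    StringBuilder deny = new StringBuilder();
    for (String token : userTokens) {
      deny.append(" AND -deny_token_document:\"").append(token).append("\"");
    }
    return "(" + allow + ")" + deny;
  }

  public static void main(String[] args) {
    // Peter's three tokens, from the example above.
    System.out.println(buildFilter(Arrays.asList(
        "myAD:S-1-1-0", "myAD:S-323-999-12345", "myAD:S-23-64-12345")));
  }
}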

Note that what we are attempting to do in this case is enforce AD's security with the search results we present. There is no need to define a whole new security mechanism, because AD already has one that people use.

Q. Do the ManifoldCF authority connections authenticate users?

A. The authority connectors don't perform authentication at this time. In fact, ManifoldCF has nothing to do with authentication at all - just authorization. It is almost never the case that somebody wants to provide multiple credentials in order to be able to see their results. Most enterprises that have multiple repositories authenticate against AD and then map AD user names to repository user names in order to access those repositories. For a pure-Java authentication solution, we are currently recommending JAAS plus Sun's Kerberos 5 login module (com.sun.security.auth.module.Krb5LoginModule) for handling the "authenticate against AD" case, which covers some 95%+ of the real-world authentication needed out there. We may have more complete recommendations in the future.
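For the record, a bare-bones sketch of that JAAS-plus-Krb5LoginModule approach might look like the following. The "ADLogin" configuration entry name is an assumption, and you would still need a JAAS login configuration listing Krb5LoginModule, plus Kerberos realm/KDC settings (krb5.conf or system properties), for this to actually work.

import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.NameCallback;
import javax.security.auth.callback.PasswordCallback;
import javax.security.auth.callback.UnsupportedCallbackException;
import javax.security.auth.login.LoginContext;
import javax.security.auth.login.LoginException;

public class ADAuthenticationSketch {
  // Attempt to authenticate a user against AD via Kerberos/JAAS.
  public static boolean authenticate(final String user, final String password) {
    try {
      LoginContext lc = new LoginContext("ADLogin", new CallbackHandler() {
        public void handle(Callback[] callbacks) throws UnsupportedCallbackException {
          for (Callback cb : callbacks) {
            if (cb instanceof NameCallback)
              ((NameCallback)cb).setName(user);
            else if (cb instanceof PasswordCallback)
              ((PasswordCallback)cb).setPassword(password.toCharArray());
            else
              throw new UnsupportedCallbackException(cb);
          }
        }
      });
      lc.login();  // throws LoginException if the credentials are rejected
      return lc.getSubject() != null;
    } catch (LoginException e) {
      return false;
    }
  }
}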

Q. I have a question regarding how multiple identifiers for a given user are handled in the authority service. Let's say that I want to get the access tokens for the user John Smith from all the authority connectors defined in ManifoldCF. Let's say that John is known as john.smith in AD, known as j.smith in Documentum, and so on. If I'm not wrong, the only parameter used to identify a user in the authority service is "username". I'm wondering how user id reconciliation is performed inside the authority service in that case. Is there something done about that, or is it work that should be performed externally?

A. The user name mapping is the job of the individual authority. So, for example, the Documentum authority would be responsible for any user name mapping that would need to be done prior to looking up the tokens for that user within Documentum, and the LiveLink authority needs to do something similar for mapping to LiveLink user names.

It turns out that most enterprises that have coexisting repositories of disparate kinds make an effort to keep their user name spaces consistent across these repositories. Otherwise, enterprise-wide single sign-on would be impossible. In the cases where the convention for mapping is ad hoc (e.g. LiveLink), the authority connectors included with ManifoldCF were built with a simple regular-expression-based mapping feature, which you get to configure right in the crawler UI as part of defining the authority connection.
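As a purely hypothetical illustration of such a mapping, suppose (as in the question above) that a user is john.smith in AD but j.smith elsewhere. A regular-expression rule of the kind an authority might apply could look like this in Java; the actual match/replace syntax is whatever the connector's UI accepts, so treat all names here as illustrative.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UserNameMappingSketch {
  public static void main(String[] args) {
    // Capture first name and surname from an AD-style name like
    // john.smith@example.com, then rebuild it as j.smith.
    Pattern match = Pattern.compile("^([^.]+)\\.([^@]+)@.*$");
    Matcher m = match.matcher("john.smith@example.com");
    if (m.matches()) {
      String mappedName = m.group(1).charAt(0) + "." + m.group(2);
      System.out.println(mappedName);  // prints "j.smith"
    }
  }
}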

Many repository companies also have added AD synchronization features as their products have matured. Documentum is one such repository, where the repository software establishes a feature for operating with AD. For those repositories, we did not add a mapping function, because it would typically be unnecessary if the repository integrator followed the recommended best practices for deploying that repository.

Q. I don't like the idea of storing document access tokens in an index. What happens if/when you want to add explicit user access to some group of documents? (i.e. not via a group)

A. In ManifoldCF, you would change the permissions on the appropriate resource, and then run your ManifoldCF job again to update those permissions. Since ManifoldCF is an incremental crawler, it is smart enough to re-index only those documents whose permissions have changed, which makes this a fairly fast operation on most repositories. Also, in my experience, this is a relatively infrequent kind of situation, and most enterprises are pretty resilient to a reasonable delay in getting document permissions updated in an index.

However, if this is still a concern, remember that your main alternative is to go directly to the repository for every document as you filter a result set. That's slow in most situations. Performance might be improved with caching, but only if you knew that the same results would be returned for multiple queries. So no solution is perfect.

Q. I don't like the idea of storing document access tokens in an index. What happens if you need to revoke a user's rights, or change a user's group affinity?

A. The access tokens for a user are obtained from the authorities in real time, so there is no delay. Only access tokens attached to documents require a job run to be changed.

Q. I don't like the idea of storing Active Directory SIDs in an index. They might be changed.

A. Once again, this is a very infrequent occurrence, and when it does happen, ManifoldCF is well equipped to handle the re-indexing in the most efficient way possible.

Q. How has ManifoldCF (the example configuration) performed, and on what kind of hardware?

A. The example, running on Derby, has not had performance tests run against it. The example running with PostgreSQL 8.3 on a Dell laptop with disk encryption is capable of doing a file system crawl at 35 documents/second. A real server will, of course, run significantly faster. At MetaCarta, we discovered that the repository being crawled was almost always the bottleneck; the only exceptions were RSS and web crawls.
On a crawl that is executing optimally, the system will be CPU-bound. If you are seeing low rates of CPU utilization, it may mean you have inadequate disk performance. There are also known bugs in Derby that result in the Derby database deadlocking and recovering, which also leads to very poor system utilization.

Q. How do I use PostgreSQL with the quick-start example?

A. First, install PostgreSQL, and remember the database superuser name and password (the name is usually "postgres"). Then, change the properties.xml file in the following way:
Change:
<property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfaceDerby"/>
to:
<property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
Add:
<property name="org.apache.manifoldcf.dbsuperusername" value="postgres"/>
<property name="org.apache.manifoldcf.dbsuperuserpassword" value="*******"/>

Then, start the quick-start example normally, and everything should initialize properly.

Q. How do you configure Eclipse to build the ManifoldCF project?

A. Here are the steps using Eclipse 3.4:

  1. Install Subclipse for Eclipse 3.x, follow the steps from http://subclipse.tigris.org/servlets/ProjectProcess?pageID=p4wYuA
  2. In Eclipse, switch to the "SVN Repository Exploring" perspective
  3. Add a new SVN repository using the URL http://svn.apache.org/repos/asf/incubator/lcf/trunk
  4. Right click on the svn repo and select "Check Out"
    • You may want to change the default name from trunk to ManifoldCF. If you don't change the name, Eclipse will ask for a project type; pick General/Project.
  5. Wait for the source to extract
  6. Switch to Java Perspective and right click on the project that was added (referred to as MCF in the rest of the steps) and select "Properties"
  7. Select "Builders" and click New
  8. Select "Ant Builder" and click Ok
  9. Give your builder a name, like ManifoldCF Ant Builder
  10. In the "Buildfile" section, press the "Browse Workspace" button
  11. Select the MCF project, drill down to "modules" subfolder and select "build.xml" file then press Ok
  12. In the "Base Directory" section, press the "Browse Workspace" button
  13. Expand the MCF project and select "modules" then press Ok
  14. Note that you can further configure the different targets if you wish, for clean, regular, and auto builds
  15. Press Ok in the "Edit launch configuration properties" to complete the Eclipse configuration
  16. Make sure you have the system variable JAVA_HOME pointing to your JDK; you also need the JDK bin directory listed in your path so that javadoc will work
  17. Now you can issue "Project/Build Project" and watch the console for the ant output

The build will also run through the JUnit tests, which increases the build time. For those who like to do incremental builds as they code, you may want to configure a "build" target without the final unit tests ("run-tests" in the "all" target), which reduces build time from 5 minutes to 1.

Q. What is the proper setting for number of worker threads?

A. The number of worker threads, number of delete threads, number of expiration threads, database pool size, and maximum number of database handles (in PostgreSQL) are related as follows for the Quick Start:

(num_worker + num_delete + num_expiration + 10) < database_pool_size <= maximum_database_handles - 2

The formula is somewhat different if you have multiple ManifoldCF processes, e.g. if you are running the crawler separately from the web applications. In that case you need to add up ALL the processes, because each of them will have its own pool of the designated size:

database_pool_size * num_processes <= maximum_database_handles - 2

The overall idea is that you don't run out of database handles in the pool (which can even cause ManifoldCF to deadlock), and that you don't run out of real database handles either (which will cause a database error that stops your jobs). The "2" adjustment is simply so that you can get into the database with tools like psql while ManifoldCF is running, and do things like vacuuming.

The first four values are all properties you can (and should!) set in properties.xml. They are described in the "how to build and deploy" document on the site. The last value requires you to configure the database (probably PostgreSQL). There are also general instructions for doing that in "how to build and deploy".

The relationship between worker threads and all of the other kinds depends on your usage. Generally, though, 10 expiration threads and 10 deletion threads are fine, since they do less of the overall work involved.
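For example, in a single-process Quick Start, the following illustrative settings satisfy the formula above (30 + 10 + 10 + 10 = 60 < 70 <= 98, assuming PostgreSQL's max_connections is set to 100). The worker-thread and maxhandles property names appear elsewhere in this FAQ; the delete- and expiration-thread property names are my recollection of those in the "how to build and deploy" document, so verify them there:

<property name="org.apache.manifoldcf.crawler.threads" value="30"/>
<property name="org.apache.manifoldcf.crawler.deletethreads" value="10"/>
<property name="org.apache.manifoldcf.crawler.expirethreads" value="10"/>
<property name="org.apache.manifoldcf.database.maxhandles" value="70"/>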

Q. How can I use the Quick Start example with PostgreSQL?

A. All you have to do is edit the quick start's properties.xml file as follows:

  1. Change the property "org.apache.manifoldcf.databaseimplementationclass" to have a value of "org.apache.manifoldcf.core.database.DBInterfacePostgreSQL".
  2. Add a property "org.apache.manifoldcf.dbsuperusername" that has a value that is the name of your PostgreSQL super user.
  3. Add a property "org.apache.manifoldcf.dbsuperuserpassword" that has a value that is the password for your PostgreSQL super user.
  4. Change the property "org.apache.manifoldcf.crawler.threads" to have a value consistent with your PostgreSQL configuration.
  5. Change the property "org.apache.manifoldcf.database.maxhandles" to have a value consistent with your PostgreSQL configuration.

Then, just run the Quick Start normally, and it will create the database instance within PostgreSQL instead of within Derby.

Q. How can I connect to the Derby instance when the Quick Start is running?

A. Sometimes it is very useful to be able to look into the Derby database while the ManifoldCF Quick Start is active. All you have to do to set this up is as follows:

  1. Start Quick Start using "java -Dderby.drda.startNetworkServer=true -jar start.jar".
  2. Start the Derby ij tool from the same directory, using "java -cp lib\derbyclient.jar;lib\derbytools.jar org.apache.derby.tools.ij", or the Unix equivalent.
  3. In ij, connect to the database using the command "connect 'jdbc:derby://localhost:1527/dbname';".

The Derby ij command will then let you perform whatever query you like.
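For example, once connected you might list the tables and peek at the crawler's queue. The jobqueue table and its status column are assumptions about ManifoldCF's internal schema, so check the output of SHOW TABLES first:

ij> show tables;
ij> select status, count(*) from jobqueue group by status;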

Supported Documentation Platforms

Q. Is there support planned for the Atlassian suite? (Confluence, JIRA, Crucible, Bamboo)

A. This is one of a class of questions, namely "are you currently planning to add a connector for X". Open source software is like pot-luck; the more you bring to it, the more you'll get out of it. ManifoldCF is designed to make it straightforward to write new connectors, and contributions of all sorts are strongly encouraged. Even if you aren't sure you can develop a full connector on your own, folks involved with the project are happy to help you. There is also a book being written, ManifoldCF in Action, which has as one of its goals getting people to the point of being able to write their own connectors. Parts of it are available already - you can check it out here: http://www.manning.com/wright

To answer the specific question, connectors have been requested for the following:

  1. Atlassian
  2. CMIS
  3. Oracle with OLS
  4. The generic Content Management Java API spec, JSR whatever-it-is
  5. Enhanced SharePoint 2010, with site discovery

The only one I'm aware of that is being worked on right now is SharePoint 2010 with site discovery. No current plans exist to implement connectors of the other stripes, because none of the committers have access to such systems at this time. If you need such a connector, and you have access to such a system, you are strongly encouraged to join the connectors-user list and post your query there, and then maybe we can work out a development plan.

Solr integration

Q. How do I extract and index the contents of documents such as MS Word using Solr 1.4.1? I'm getting a lazy loading error.

A. There are a couple of bugs related to Solr 1.4.1 which make it difficult to parse such documents. Before you try the workarounds below, you should read more about the ExtractingRequestHandler in order to understand the underlying technology. This handler uses Apache Tika to extract content from a broad variety of document formats.

Solr 1.4.1 ships with version 0.4 of Tika, which will only allow you to extract the document's metadata, not the content itself. Therefore you need to upgrade Tika to version 0.8.

Generally, you have two options. You may get the latest version of Solr from trunk, which at the time of writing ships with Tika 0.8, but this is not recommended if you plan to use Solr in a production environment. Alternatively, you may download the following branch instead, which also includes version 0.8 of Tika:
http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.4/
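For example, a checkout of that branch would look like this (translating the viewvc URL above into the corresponding repository URL, which is an assumption worth verifying):

svn checkout http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/ solr-branch-1.4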

In either case it is recommended to download the latest version from trunk anyway, since you need some of its updated libraries even if you choose the branch version (download both versions in that case).

If you chose to use the latest version from trunk, you're done. Otherwise you need to complete a few more steps. First, step into the contrib/extraction directory and type ant in order to build the Solr Cell jar file. Then copy it to a dedicated directory intended for the external libraries Solr requires, for example <solr_home>/lib. Remember to specify this folder in your solrconfig.xml file so Solr knows where to look for external libraries; you will find sufficient information about this inside the configuration file, and an illustrative entry is sketched below.
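For example, a solrconfig.xml entry of roughly this shape points Solr at such a folder (the dir value is illustrative; make it match wherever you actually put the jars):

<lib dir="/path/to/solr_home/lib" />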

Finally, you need updated Tika dependencies such as PDFBox. The latest version from trunk should include sufficiently updated libraries, so just copy all the jar files located in the contrib/extraction/lib folder into your external library folder.

If you also need to specify different date formats as described in the ExtractingRequestHandler documentation, you must install the following patch as well:
https://issues.apache.org/jira/secure/attachment/12434831/SOLR-1756.patch


Comments

  1. How has ManifoldCF performed (the example configuration) on what kind of hardware?

    1. I'm running on a Dell mini tower with an Intel Core 2 @ 1.8 GHz and 2 GB of RAM, and not even coming close to 35 docs/second. It has Windows XP SP3 as the OS. In fact the JVM just crashed, and I opened a bug report with Sun. What were the specs on the Dell laptop? I'm trying to figure out the problem with my system.

      Second question: switching the example from Derby to Postgres. I see where in properties.xml I need to change the value of org.apache.manifoldcf.databaseimplementationclass from DBInterfaceDerby to DBInterfacePostgreSQL. I also need to supply the username and password. I keep getting ManifoldCFException: Error getting connection. What am I missing? Is there documentation for this that I missed?

      <property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
      <property name="org.apache.manifoldcf.database.username" value="postgres"/>
      <property name="org.apache.manifoldcf.database.password" value="*******"/>

      Thanks!

      1. This is a Dell M2400 with a dual-core 2.8GHz CPU and 3.5GB RAM. Remember that for the benchmark I was using PostgreSQL. You also probably will want to increase the max connections parameter for both your repository connection and output connection to something > 30.

        Your database parameters look correct. Can you log into psql using the credentials you supplied?

        Karl

        1. Actually, I misspoke.

          The jetty example does the following, which works fine under Derby but will fail under PostgreSQL:

          ManifoldCF.createSystemDatabase(tc,"","");

          The database parameters supplied are the credentials it will create the primary database user with, not the superuser credentials.

          I am going to add an optional parameter pair to deal with this in the example. Stay tuned.
          Karl

          1. The new parameters are:
            public static final String databaseSuperuserName = "org.apache.manifoldcf.dbsuperusername";
            public static final String databaseSuperuserPassword = "org.apache.manifoldcf.dbsuperuserpassword";

            You can set these in your properties.xml file, or alternatively use -D switches.

            1. Thanks, it worked. I built yesterday and ran the code. I'm getting 5 docs/sec on my system, a Dell mini tower with an Intel Core 2 @ 1.8 GHz and 2 GB of RAM, Windows XP SP3. Is there something I can do to improve performance? It seems too low, since this is the only thing running. I'm running with the PostgreSQL DB, not Derby.

              1. I will try to run a new batch of performance tests tonight or tomorrow, and get back to you either way.

                1. I'm not sure what yet the problem might be under PostgreSQL - I haven't yet tried that.
                  However, I did try to do a performance run under Derby. Derby runs well for a time but then stalls - it seems to wind up with that same internal deadlock I coded around before. It doesn't error out at that point anymore, but since it stalls for an entire minute before it detects the deadlock, you lose a lot of time and performance as a result.

                  I'm not quite sure what to do about this issue with Derby. Maybe I can lower the deadlock timeout threshold; I'll have to think about that for a while. I'll look at PostgreSQL next.

                  1. With PostgreSQL, a somewhat different test set than I used in May, and with a no-doubt much more fragmented disk, I am getting some 17 documents/second here, now, doing a file-system crawl to a null output. Which is 1/2 what I saw in May.

                    This had the following special postgresql settings:
                    (1) 100 max connection handles
                    (2) 256MB shared buffers (which may well have been overkill, but that's what my PostgreSQL setup had)

                    Connection/job settings:
                    (1) 100 max connections for both the repository and output connections.
                    (2) Hop filters set to "never delete unreachable documents".

                    System was pretty near totally I/O bound during execution, which leads me to believe that, since the system was brand-new in May, disk fragmentation was a major factor. I will try to run a benchmark where the database is on a different disk than the files being crawled, maybe today.

                    1. Another difference I just discovered: the test set I just used has a significant percentage of large files (>1 MB), perhaps 25% or so, while the test set in May was all machine-generated with smaller files (25K), so that too would account for some significant disk-related performance differences.

                      So I think it's fair to say that, if you are seeing 5 docs/second, another thing you should check is whether you are crawling off the same disk your database is on, and how fast those files can be retrieved by any means.

                      I'm about to try the same crawl with a Dell Tower that has a reasonably fast disk, stay tuned.

                    2. FWIW, my Dell tower (a Vostro 220), which has a similar 2.80GHz dual-core processor but a much faster disk, clocks the same test at 31 docs/second, this time almost totally CPU-bound. So my guess is your system's disk performance is very poor for some reason.

                      1. I am crawling on a different system than where the files are. My disk performance is very poor, 40MB/sec. I have another system with 300MB/sec; I'm moving the agent there to see the difference. I ran NBench from http://www.acnc.com/04_02_02.html, if you are curious. It is described as an NT tool, but it ran fine under XP, Vista, and 7.

                        1. You really need to run your database on a fast disk if you want decent performance from MCF.

                          1. I'm going to get a faster disk. Just curious how you define fast: disk, controller, etc.?

                          2. How big was the file share you crawled? I have 280,000 files spread across a lot of directories. It starts out at 29-31 docs/sec, but as it crawls it gets slower. For example, at 98,000 it was doing 31 docs a second; at 203,000 it is doing 16 docs a second. So I'm just curious how long you have been able to sustain the ~30 docs/sec.

                            1. The sample I used was some 30,000 documents.

                              Several effects come into play for larger, more extended crawls.  PostgreSQL accumulates "dead tuples" over time, which impact performance.  There is a procedure for cleaning this up, which I believe is documented in the "Build and Deploy" page, involving a VACUUM FULL operation.

                              Second, if you use PostgreSQL's configuration out of the box, you are likely getting a background VACUUM operation starting at some point during your crawl.  This background-process vacuum is insufficient to keep up with dead tuple accumulation and only serves to slow things down.  So turn "autovacuum" to OFF.  This is also mentioned in the build-and-deploy page.

                              Third, ManifoldCF itself periodically asks PostgreSQL to reindex data, which can have an overall impact on performance.  It performs this activity every 100,000 inserts/modifies to the queue.  That is obviously more than the size of the crawl I ran.

                              Hope this answers your question.

                              1. I should also mention that, in tests done at MetaCarta, the maintenance activities performed on PostgreSQL were sufficient to restore performance to its original level, no matter how large the queue becomes. The only issue is degradation vs. the time taken to perform the maintenance. This is, of course, a tradeoff.

                                Also please bear in mind that the maintenance schedule that we eventually arrived at was specific to the particular usage patterns of MetaCarta clients, so your mileage will vary. One ticket that we should probably create is to allow the internally-tracked changes, e.g. the reindex time, to be set by property. I'll create that ticket now.

                                1. Thanks for the details; it helps paint the picture. Autovacuum was already off when I ran. So it must be the reindexing. From what I read, the dead tuples impact storage, not performance, so it might make sense to reindex at the end of the job as opposed to on a count.

                                  1. I know that's what PostgreSQL documentation says, but our experience was different.

  2. Can you please explain the use and format of this line?

    public static final String _rcsid = "@(#)$Id: NullConnector.java 988245 2010-08-23 18:39:35Z kwright $";

    I see where _rcsid is being used to generate html or java script.

    What is 988245?

    What is Z after the time stamp?

    Is $Id or @(#) replaced with anything?

    1. This is functionality of svn.  The $Id$ string is expanded by svn when the source file is checked out; it's simply used for tracking.  In the expanded form, 988245 is the Subversion revision in which the file last changed, the timestamp is the time of that change (the Z indicates UTC, "Zulu" time), and kwright is the committer.  The @(#) prefix is the classic SCCS "what" marker and is not replaced; only the text between the $Id and closing $ delimiters is.  There should be NO functional use of _rcsid in any code, anywhere.
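      For what it's worth, keyword expansion only happens on files whose svn:keywords property includes "Id"; enabling it looks like this (a standard svn command, shown purely for illustration):

        svn propset svn:keywords "Id" NullConnector.java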

  3. Anonymous

    How do I configure Tomcat inside Eclipse to get ManifoldCF running? In particular, which '-D commands' do I have to use?
    A short step-by-step instruction would be helpful.

    regards Johannes

    1. Have you read the "how-to-build-and-deploy" page? It's on the ManifoldCF site, under Documentation. Here's the link:

      http://incubator.apache.org/connectors/how-to-build-and-deploy.html

      1. Anonymous

        First, thx for the fast reply!
        Sure, I have read the "howTo". But if I build Manifold and follow the steps:

          • Deploy the war files in <MCF_HOME>/web/war to your application server.
          • Set the starting environment variables for your app server to include the -D commands found in <MCF_HOME>/web/define. The -D commands should be of the form "-D<file name>=<file contents>".

        There is no such directory "define", so I can't guess which -D commands I have to use.

        Also, these steps "only" describe how to use Tomcat without Eclipse. In my understanding, I have to build Manifold manually each time inside Eclipse, and after that I can restart Tomcat. Is there an easy way to get an automatic build each time I start Tomcat inside Eclipse?
        I'm used to having an Eclipse web project and adding it to Tomcat so that deployment is done automatically. Maybe there is a web-based SVN checkout?

        Sorry if my questions sound trivial, but I'm a bit confused about how to set up Manifold and Tomcat within Eclipse correctly.

        1. The instructions also say, "if any", with respect to the define directory. If the directory is not present, then there are no defines from it.

          You will have to include a define statement to point at the configuration area, but that is it.

  4. Anonymous

    Hello Everybody,

    I am very impressed by this project. It is the best crawler that I have tried.
    I would like to make it run inside Eclipse, in order to understand more clearly how the framework works, and to use the debug mode (if possible).

    Can someone explain to me how to configure Eclipse for a development environment?

    Thanks

    1. The flattery is very welcome. But it will need to be someone more Eclipse-savvy than me who answers the Eclipse question.

      On the other hand, your reference to "debug mode" is interesting. I suppose you mean the logging options that enable logging for various subsystems? These are all set as properties in the properties.xml file. There are a number I remember offhand, such as:

      org.apache.manifoldcf.db
      org.apache.manifoldcf.threads
      org.apache.manifoldcf.perf

      There are quite a few others; you can find the complete list by looking in the following classes:

      org.apache.manifoldcf.core.system.Logging
      org.apache.manifoldcf.agents.system.Logging
      org.apache.manifoldcf.crawler.system.Logging
      org.apache.manifoldcf.authorities.system.Logging

      Set these to "DEBUG" if you want debugging info. Their default setting is "WARN".

      Karl

  5. My girlfriend said, "you must tell me when you like a meal I prepared"... and I think she is right.
    It is important to tell people whether or not you like what they made, so they can improve.
    I come from PHP and the C world... I used to write code (Ada) for spacecraft.

    I am quite new to Java + JSP and Eclipse, and I would like to understand how the framework works. I am looking for how to start in this environment I don't know yet. Moreover, I am not familiar with Eclipse; I have most often used Notepad or vi.

    If you have suggestions about getting started with this project, that would be very nice.

    Nicolas

    1. There's quite a bit of online documentation for both Java and JSP. A judicious online search will find you a lot of resources. There are also numerous books, such as "Java in a Nutshell", and no doubt similarly concise books on JSP.

      As far as ManifoldCF is concerned, have you looked at the online documentation? There's quite a bit there:

      http://incubator.apache.org/connectors/

      If you are mainly interested in the ManifoldCF framework internals, there's a book coming out, from Manning Publishing, called ManifoldCF in Action. Early parts of the book should be available soon for a reasonable fee via the Manning Early Access Program. I cannot say precisely when this will appear, but it could be in as little as two weeks.

      Thanks,
      Karl

      1. I am interested in the ManifoldCF framework internals. I read Lucene in Action and Tika in Action; I hope the book will be like those two.

  6. Sometimes a job never ends; is that normal? For example, I was crawling the website http://www.cpa-gso.com. The schedule type was "Scan every document once".
    But after some crawling, the status of the job was always Running, and Documents/Active/Processed didn't change (389, 98, 299).
    Moreover, if I try to abort the job, it never aborts.

    Have I done something wrong?

    1. Some questions:
      First, are you using a single-process deployment, or a multi-process deployment?
      Second, is there anything in the manifoldcf.log file?
      Third, are you using Derby as the database, or PostgreSQL?

      Thanks,
      Karl

      1. Hello again

        1) single-process deployment
        2) there is nothing in manifoldcf.log
        3) Derby is used.

        I try it on 2 OS (Win7 and Ubuntu)
        I am using the Release Candidate 8

        I can give you access to the server if it can help.

        Thanks,
        Nicolas

        1. It would be best to use the connectors-user list for discussions of this kind.

          The most likely cause is a Derby deadlock. Derby has a number of oddities, and it can actually deadlock against itself on occasion. My suggestion would be to try the same crawl with PostgreSQL 8.4.x as the database. But before you do that, can you look at a Simple History report, to see what the last event that was recorded was? It could also be a case of the site you are crawling responding to some of the document requests with some signal that tells the Web connector that it should retry later.

          Meanwhile, I'd be happy to try your crawl here, using Postgresql, if you can view your job, and include a snapshot of that view.

          Thanks,
          Karl

        2. I've kicked off a test run here with PostgreSQL and the standard throttling parameters. It's taking a while, of course, due to the throttling. But I'll let you know what happens.

          1. With ManifoldCF-0.1-incubating, RC8, running with PostgreSQL, the job completed as follows:

            Start test | Done | Sun Jan 30 16:45:40 EST 2011 | Sun Jan 30 17:18:43 EST 2011 | 384 | 0 | 384

            Karl

              1. When I tried to go to "Simple History", an error occurred. It seems to be a Derby database problem, as expected. The error message begins with (manifoldcf.log):

                org.apache.jasper.JasperException: An exception occurred processing JSP page /resultreport.jsp at line 766
                763: BucketDescription idBucket = new BucketDescription(reportBucketDesc,false);
                764: BucketDescription resultBucket = new BucketDescription(reportResultDesc,false);
                765:
                766: IResultSet set = connMgr.genHistoryResultCodes(reportConnection,criteria,sortOrder,resultBucket,idBucket,startRow,rowCount+1);
                767:
                768: %>
                769: <input type="hidden" name="clickcolumn" value=""/>
                Stacktrace: ...

              I will try the same test with PostgreSQL. 

                1. I also tried starting a crawl with Derby, and so far it seems OK. I don't get any errors yet when I look at the simple history. So I'm not quite sure what's going wrong for you.

                There is a derby.log file which might have some clues in it.

                  1. It is strange.
                  I ran my test on 2 OSes (Windows and Ubuntu Server 10.04). The Ubuntu one was a fresh install.
                  I tried to crawl 2 different web sites.
                  And I have the same result: the crawl is stuck after 5 minutes of crawling.

                  Maybe it is the JDK version?
                  Maybe my crawling parameters are not right? (but I used the defaults)

                  derby.log is empty too...

                  Thanks for the help.
                  Nicolas

                  PS: I just clicked on abort, and I have had the following picture for 10 minutes.

                  1. My Derby crawl did not finish either. Even worse, it locked up badly enough so that I had to stop and restart the crawler process to get into the UI. So clearly there's a database issue of some kind.

                    You can always safely restart the process; when you do that, the job should complete aborting. But that's immaterial, because Derby seems to be deadlocking internally in some way during the crawl.

                    I strongly urge that you give up on Derby and use PostgreSQL. The performance is better and it won't deadlock in this way.

                    Meanwhile, I'm going to try and figure out what Derby thinks it's doing when it hangs in this way.

                    1. Yes, I had the same issue (I had to stop and restart the crawler process to get into the UI)

                    2. Derby has a huge number of threads waiting to be able to modify or update a number of table rows. Meanwhile, one thread seems to be alive and is updating hop count data.

                      My suspicion is that there is indeed a bug in Derby, but it's the hopcount logic's intensive use of database access that exacerbates the problem. I suspect that jobs that do not deal with hopcount at all will not fail readily.

                      In any case, you can easily change the Quick-Start's properties.xml file to use PostgreSQL instead. The online build-and-deploy doc explains how.

                        1. It turns out someone may have seen this before, but I could never reproduce it. Thanks to you, I have a better grasp on the conditions needed to make this happen. See CONNECTORS-100; you will likely want to modify your crawl parameters somewhat to avoid this issue.

                          The two parameters to consider changing are:

                          (1) Exclusion of images. It's the shared images that are causing the performance degradation.

                          (2) Disabling hop-count tracking, if you aren't using it. This can be done by clicking the bottom radio button on the hop count tab of the job.

                          Thanks!

  7. hello again,

    I have configured ManifoldCF to work with PostgreSQL.

    <property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfacePostgreSQL"/>
    <property name="org.apache.manifoldcf.dbsuperusername" value="bahout"/>
    <property name="org.apache.manifoldcf.dbsuperuserpassword" value="*******"/>

    but after sudo java -jar start.jar

    I have the following error:

    Configuration file successfully read
    org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: ERROR: syntax error at or near "$1"
    Configuration file successfully read
    org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: ERROR: syntax error at or near "$1"
    at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:461)...

    I suppose it means that the database has not been created, but why?

    (I can create a database manually.)
    Thanks

    Nicolas

    1. I've seen this before, and it is supposedly fixed in RC8. But apparently that is incorrect.

      The $1 is something internal to PostgreSQL, and probably represents a bug.
      The error you are seeing appears intermittently; I've been able to create the database without issue, but sometimes it does not seem to work.

      The easy way around it is to create the PostgreSQL user yourself. Use psql -U <superusername> to get into the master database, and do the following:

      create user manifoldcf password 'local_pg_passwd';
      \q

      After that, ManifoldCF should start just fine.

      1. I believe I've fixed this latest problem on trunk, although since it has failed intermittently in the past, I cannot be certain.

        I'm debating whether to put together an RC9, or just document the need to create the user by hand in manifoldcf-0.1-incubating.

        1. I've reopened CONNECTORS-148 to track this problem.

          1. Everything is working for me, thanks.
            I created the database as you suggested.

            Nicolas

  8. Hello everybody. I am trying to understand the framework. I have tried to use the Java virtual machine debug options, in order to run in debug mode.
    I can see the threads running, but I get the following error: Source not found.
    Does someone know how to solve this issue?

    Thanks

  9. Hi Karl,
    I'm back in the saddle with the project. First off, is there another location for asking these questions, since the project was moved to Apache?

    My main question: I tried running the crawler pointing to a Postgres DB on another system, and it seems to ignore it. I know I was using the right file, because if I modified the password param it would hang. I was using the following in the properties.xml file:

    <property name="org.apache.manifoldcf.postgresql.hostname" value="valaddev"/>

    From a command line, I can ping the system named "valaddev". I also modified the pg_hba.conf file to allow connections. I used PGAdmin III installed on the working machine to test connecting to the remote db running on valaddev. Am I doing something wrong, or is it a bug?

    Thanks!

    1. I would join the list connectors-user@incubator.apache.org, and ask your questions there. This is a wiki page, after all.

      I'm curious as to why you think the property is being ignored. What is the symptom? What is the version of PostgreSQL? If it's new enough, you may need to update the PostgreSQL driver in ./lib to get it to work.

      Karl

      1. I'm part of the connectors-user email group. I was looking for a place where knowledge can grow without me keeping the emails. Is there such a place?

        1. The mail lists are in fact kept around; you can use Google to find old posts. Try googling "ManifoldCF eclipse" to see what I mean.

          1. I found the root links; this is nice. You might want to add these to the FAQ. Do you know if there is a way to view snippets of the messages without having to click on each one?

            http://www.mail-archive.com/connectors-user@incubator.apache.org/index.html

            http://www.mail-archive.com/connectors-dev@incubator.apache.org/index.html

            http://www.mail-archive.com/general@incubator.apache.org/index.html

            1. I think this would best be added to the "mail.html" page, which
              describes the mail lists and how to sign up for them.

              Please feel free to open a jira ticket accordingly.

      2. The symptom is that if I specify a wrong system name, instead of barfing, it assumes localhost. If I specify a correct system name, it does the same thing: it uses localhost.

        I'm using PostgreSQL 8.4, and it looks like I updated the driver on 10/25/2010. I opened the jar, but couldn't figure out how to tell what version I had.

        1. The other possibility is that you just have old software. The change that permitted non-localhost postgresql servers was committed to trunk about two months ago, and I don't think it's in any released version yet. See CONNECTORS-159.

  10. Judging by some of the recent posts, there seems to be an interest in figuring out how to run ManifoldCF in Eclipse. Well, I'm going to do just that; I got tired of debugging with print statements (smile)

    My questions have to do with running ManifoldCF outside the example dir (Jetty). When I deploy the war files in Tomcat and try to bring up the Manifold UI, it complains about a missing "C:\lcf\properties.xml" file. First off, why is it looking in C:\lcf? Is that a parm somewhere? All I did was drop the three war files into the webapps dir.

    Secondly, when I created the lcf dir and copied the properties.xml file along with connectors.xml and the connector jar files, the UI main page comes up. When I click on "List Output Connections", it takes a long time, then it comes back with a page that has the title bar on top reading "Document Ingestion" and a blank page down below. I think this has to do with the agent process not running; I could not figure out the exact steps to run it. I'm on a Windows platform. When you say run the command "org.apache.manifoldcf.agents.AgentRun", what does that mean? I'm at a command prompt, looking at the directory content, where I see "example, lib, processes, and web".

    Any thoughts? Thanks!

    1. I just clarified the "how-to-build-and-deploy.html" page this past week to spell this out more clearly. Specifically, you need a -D switch when you start Tomcat to tell it where to find the properties.xml file. Second, for running a command, there's a script called executecommand.sh or executecommand.bat that accepts the command class name and command arguments.

      The symptom you are seeing is not consistent with a not-running agents process, I'm afraid. It sounds like something else, maybe a misconfigured synchronization directory.
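      For example, combining the script and the class name mentioned above, starting the agents process might look like this on Windows (run from wherever the executecommand script lives in your layout; the path is an assumption):

        processes\executecommand.bat org.apache.manifoldcf.agents.AgentRun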

  11. Anonymous

    Can the NULL output connector log details of everything it receives?

    I've got a job set up using the SharePoint repository connector and the NULL output connector, and I want to see details of what is being sent from SharePoint, but the manifoldcf.log doesn't contain much information.

    1. The best way to see what is going on is to look at the Simple History report in the crawler UI. All indexing attempts are available, as well as fetch activities from SharePoint.

      If you want to have the decisions of all the connectors logged, you can turn on connector logging in logging.ini. Set org.apache.manifoldcf.connectors=DEBUG. This will generate a lot of output, however, so you won't want to have this on for a production system.

      1. Anonymous

        Perfect, thanks; this is producing detailed logs.

  12. Anonymous

    I came across the ManifoldCF security search component and query parser here:

    https://svn.apache.org/repos/asf/incubator/lcf/integration/solr-3.x/trunk/

    This looks great, and should make handling security so much easier!

    The readme says this:

    There are two ways to hook up security to Solr in this package. The first is using a Query Parser plugin. The second is using a Search Component.

    But there is no explanation of what the pros and cons are of each - how should I choose which one to use?

    1. Part of the reason we supply both is because the set of pros and cons is so complex it is hard to even state it. I expect you will do best by assessing performance of each solution in your own environment. Or, you can ask for help from the Solr/Lucene people, via java-user@lucene.apache.org, but you might get contradictory advice. (wink)

  13. Anonymous

    I've got ManifoldCF configured with a SharePoint repository connector and a Solr output connector.

    I'm using the ManifoldCF security search component to authorise users for search results.

    The problem is that I have 2 Active Directory domains that I need to use for authorisation (one is for internal users, the other for extranet users). So I've set up two Active Directory authority connections, named 'Internal AD' and 'External AD'.

    But I can't see how to get this working, as I can only specify a single authority for the SharePoint repository connector.

    If I select one of the AD authorities as the authority for the SharePoint repository connector, then allow_token_document is always prefixed with the name of that authority, regardless of the domain the user/group actually belongs to. This isn't going to work with the ManifoldCF authority service, UserACLs, which prefixes SIDs with the name of the authority the SID belongs to.

    If I select 'None (Global Authority)' as the authority for the SharePoint repository connector then allow_token_document is not prefixed with the authority name, but of course those returned by the ManifoldCF authority service, UserACLs, are still prefixed with the authority name.

    I guess I could modify the ManifoldCF authority service, UserACLs, to take an extra parameter that would alter the behaviour so it doesn't prefix SIDs with the authority name... but I'd rather not be modifying the source if I can help it. Is there some way to achieve what I'm after?

    Hope this all makes sense (smile)

    1. First, for questions of this complexity you would be better advised to post to the connectors-user@incubator.apache.org list instead of dropping a comment in this FAQ. It's easy to sign up; the web site tells you how. See http://incubator.apache.org/connectors for more details.

      A quick answer, though, is that ManifoldCF's authority connector is not currently implemented to handle multiple related domains. You certainly don't want to try implementing a multi-domain solution any other way, either. But this is exactly the kind of improvement the team would be interested in implementing.

      My suggestion is therefore to create a ticket in Apache jira (at https://issues.apache.org) describing your domain setup in some detail, especially how the domains relate to one another, and the discussion can move to comments for that ticket.

      1. Anonymous

        Thanks, I've opened Jira ticket CONNECTORS-460

  14. Anonymous

    Hello,

    is there, or will there be a connector or something to crawl Lotus Notes documents including the security information?

    Or does somebody know how to implement such a crawler or where to start?

    1. We've gotten requests for a Lotus Notes connector in the past. We need someone with a test and development Notes setup. If you think you have such a setup, please create a ticket for a Lotus Notes connector and we'll try to work collaboratively to develop such a connector.

      Thanks!

  15. Anonymous

    Hello,
    I have been trying to connect ManifoldCF to Solr. I have a file system on a remote server, protected by Active Directory.
    I have configured a Manifold job to import only a part of the documents under the file system. In fact, I do the importing process from a folder which only contains 2 documents, in order to make it easier to see what is happening and draw conclusions. Afterwards the documents are output to the Solr server.
    I have created a request handler called "selectManifold" to "connect" Manifold and Solr. Then I call it via http://[host]:8080/solr/selectManifold?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=&AuthenticatedUserName=prueba1@lcg.test.tm. When doing this, Tomcat's log (catalina.out) writes this:

    oct 31, 2012 2:40:33 PM org.apache.solr.mcf.ManifoldCFSearchComponent prepare
    Información: Trying to match docs for user 'prueba1@lcg.test.tm'
    oct 31, 2012 2:40:33 PM org.apache.solr.mcf.ManifoldCFSearchComponent getAccessTokens
    Información: For user 'prueba1@lcg.test.tm', saw authority response AUTHORIZED:Auth+active+directory+para+el+file+system (this one is the active directory I'm currently using for the job)
    oct 31, 2012 2:40:33 PM org.apache.solr.mcf.ManifoldCFSearchComponent getAccessTokens
    Información: For user 'prueba1@lcg.test.tm', saw authority response AUTHORIZED:ad (this one isn't)
    oct 31, 2012 2:40:33 PM org.apache.solr.core.SolrCore execute
    Información: [] webapp=/solr path=/selectManifold params={explainOther=&fl=*,score&indent=on&start=0&q=*:*&hl.fl=&wt=&fq=&version=2.2&rows=10&AuthenticatedUserName=prueba1@lcg.test.tm} hits=0 status=0 QTime=183

    So, it effectively connects and gets my user's tokens. In fact, if I go to http://[host]/mcf/UserACLs?username=prueba1@lcg.test.tm, this is the result:
    AUTHORIZED:Auth+active+directory+para+el+file+system
    TOKEN:active_dir:S-1-5-32-545
    TOKEN:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1111
    TOKEN:active_dir:S-1-5-21-2039231098-2614715072-2050932820-513
    TOKEN:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1113
    TOKEN:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1110
    TOKEN:active_dir:S-1-5-21-2039231098-2614715072-2050932820-1107
    TOKEN:active_dir:S-1-1-0
    AUTHORIZED:ad
    TOKEN:ad:S-1-5-32-545
    TOKEN:ad:S-1-5-21-2039231098-2614715072-2050932820-1111
    TOKEN:ad:S-1-5-21-2039231098-2614715072-2050932820-513
    TOKEN:ad:S-1-5-21-2039231098-2614715072-2050932820-1113
    TOKEN:ad:S-1-5-21-2039231098-2614715072-2050932820-1110
    TOKEN:ad:S-1-5-21-2039231098-2614715072-2050932820-1107
    TOKEN:ad:S-1-1-0

    Moreover, if I go to http://host:8080/solr/admin/schema.jsp and search for the allow_token_document field, it says that active_dir:S-1-5-21-2039231098-2614715072-2050932820-1110
    (which appeared in the list of UserACLs) has frequency 2 (remember I only have 2 documents indexed).

    And still, when I call http://host:8080/solr/selectManifold?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl=&AuthenticatedUserName=prueba1@lcg.test.tm, it says no result has been found. Do you know why this could be?

    One final thing: when I call http://130.177.44.21:8080/solr/select?indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&wt=&explainOther=&hl.fl= with the default handler (that is, without Manifold), it gives me a result with the 2 documents I indexed.

    Sorry for the long post but I wanted you to have all the data.

    Pablo

    1. First of all, this is not the appropriate place to engage ManifoldCF committers and other users for help in diagnosing problems. I'd sign up for the user@manifoldcf.apache.org list if I were you, and post this question there.

      But here's a hint: the query that the ManifoldCF Solr plugin creates uses the attribute names you configure for it. If those aren't right, the query won't be right either. See if you can dump the full query; I'm not sure exactly how you tell Solr you want to do that, but I am sure there's a way.

      1. Anonymous

        Thank you for your response. I understand I have to send an email to user@manifoldcf.apache.org

  16. Anonymous

    Hi there,
    I'm trying to connect ManifoldCF to an internal wiki at my company. The ManifoldCF wiki connector supplies a username and password field for the wiki API; however, at my company, a username and password are required to connect to the Apache server running the wiki site, and after that authentication takes place, those credentials are passed on to the wiki API.

    So, essentially, I need a way to have ManifoldCF pass my Windows credentials on when trying to make its connection. Using the API login fields does not work.

    Any tips?

    1. This also would be a good candidate for discussion on the user@manifoldcf.apache.org list.

      You will have to clarify exactly how Apache handles your credentials. Is it using Basic Auth? or session authentication? or something else? If the technology used is straightforward enough we'd be happy to consider an enhancement to the wiki connector.

      1. I tried emailing user@manifoldcf.apache.org, but it bounced back saying I needed to be a subscribed user. How would I go about subscribing?

        We use the Kerberos Module for Apache (http://modauthkerb.sourceforge.net/index.html) (AuthType Kerberos). My understanding based on that linked documentation is that this module does use Basic Auth to communicate with the browser.

        Anything we can try?

        1. To subscribe to the user@manifoldcf.apache.org email list, send a blank email to user-subscribe@manifoldcf.apache.org