Apache Solr Documentation


*** As of June 2017, the latest Solr Ref Guide is located at https://lucene.apache.org/solr/guide ***



Preventing duplicate or near-duplicate documents from entering an index, or tagging documents with a signature/fingerprint for duplicate field collapsing, can be achieved efficiently with a low-collision or fuzzy hash algorithm. Solr natively supports de-duplication techniques of this type via the Signature class and allows for the easy addition of new hash/signature implementations. The following Signature implementations are available:

| Method | Description |
| --- | --- |
| MD5Signature | 128-bit hash used for exact duplicate detection. |
| Lookup3Signature | 64-bit hash used for exact duplicate detection. Much faster than MD5 and smaller to index. |
| TextProfileSignature | Fuzzy hashing implementation from Apache Nutch for near-duplicate detection. It is tunable but works best on longer text. |

Other, more sophisticated algorithms for fuzzy/near hashing can be added later.

Adding the de-duplication process changes the allowDups setting so that it applies to an update Term (the signatureField in this case) rather than to the unique field Term. The signatureField could of course be the unique field, but generally you want the unique field to be unique. When a document is added, a signature is automatically generated and attached to the document in the specified signatureField.

Configuration Options

There are two places in Solr to configure de-duplication: in solrconfig.xml and in schema.xml.

In solrconfig.xml

The SignatureUpdateProcessorFactory has to be registered in solrconfig.xml as part of an Update Request Processor Chain, as in this example:

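<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>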

The SignatureUpdateProcessorFactory takes several properties:

| Parameter | Default | Description |
| --- | --- | --- |
| signatureClass | org.apache.solr.update.processor.Lookup3Signature | A Signature implementation for generating a signature hash. The fully qualified class name of the implementation must be specified. The available options are described above; the associated class names are org.apache.solr.update.processor.Lookup3Signature, org.apache.solr.update.processor.MD5Signature, and org.apache.solr.update.processor.TextProfileSignature. |
| fields | all fields | The fields to use to generate the signature hash, as a comma-separated list. By default, all fields on the document will be used. |
| signatureField | signatureField | The name of the field used to hold the fingerprint/signature. The field should be defined in schema.xml. |
| enabled | true | Enable/disable de-duplication processing. |
| overwriteDupes | true | If true, when a document exists that already matches this signature, it will be overwritten. |
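As an illustration of how these properties combine, the following is a minimal sketch of a chain that tags documents with a near-duplicate fingerprint in a separate field instead of overwriting them. The chain name "dedupe-fuzzy", the signature field name "signature", and the source field "content" are assumptions made for this example and do not come from the guide.

```xml
<!-- Sketch only: "dedupe-fuzzy", "signature", and "content" are hypothetical names. -->
<updateRequestProcessorChain name="dedupe-fuzzy">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Store the fingerprint in its own field rather than in the unique key -->
    <str name="signatureField">signature</str>
    <!-- Do not overwrite existing documents that share the same signature -->
    <bool name="overwriteDupes">false</bool>
    <!-- Hash only the long text field; fuzzy hashing works best on longer text -->
    <str name="fields">content</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With a configuration along these lines, the field named in signatureField would also need to be defined in schema.xml.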

In schema.xml

If you are using a separate field for storing the signature, you must have it indexed:

<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />

Be sure to change your update handlers to use the defined chain, as below:

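<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
  ...
</requestHandler>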

(This example assumes you have other sections of your request handler defined.)

The update processor can also be specified per request with a parameter of update.chain=dedupe.
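To make the per-request form concrete, here is a minimal sketch of an XML update message that could be posted to the /update handler with update.chain=dedupe appended to the request URL. The field names follow the name,features,cat list used in the example chain; the field values are invented for illustration.

```xml
<!-- Sketch only: the field values below are hypothetical. -->
<add>
  <doc>
    <field name="name">Example Widget</field>
    <field name="features">waterproof</field>
    <field name="cat">hardware</field>
  </doc>
  <!-- Same values in the hashed fields as the document above -->
  <doc>
    <field name="name">Example Widget</field>
    <field name="features">waterproof</field>
    <field name="cat">hardware</field>
  </doc>
</add>
```

Assuming the chain hashes exactly these three fields, both documents should receive the same signature, and the overwriteDupes setting then determines whether a later document replaces an earlier one.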


10 Comments

  1. In the table, the default value for signatureField is signatureField. Should the default be "id"?

    1. That would seem to be in agreement with the example XML, at least. But putting the computed signature into a field called "id" seems like it would be prone to overwriting the manually supplied unique ID key for a document. If "id" really is the default field name for the signature, it seems like that should be changed.

      I looked in org/apache/solr/update/processor/SignatureUpdateProcessorFactory.java in the source distribution for Solr 5.5, and the default name for "signatureField" really is "signatureField". This makes no sense (why include the word "field" as part of the field name?), but it does appear to be correct.

  2. Based on the email thread, I think the following text should be somewhere in the guide:

    Atomic updates are processed as part of the DistributedUpdateProcessor (so they execute on the leader and work with optimistic concurrency), but that means that if you have the SignatureUpdateProcessorFactory configured before the DistributedUpdateProcessorFactory, it could compute a signature based on the raw doc you send (with the update commands) instead of the "real" doc with the updates applied.

    For a situation where you want the signatureField to *be* the uniqueKey, you more or less have to put SignatureUpdateProcessorFactory before DistributedUpdateProcessorFactory.

  3. It would be great if you updated the examples so they were sufficient to be implemented simply by following their general form.

    Adding:

    <processor class="solr.RunUpdateProcessorFactory"/>

    To the example updateRequestProcessorChain would be a great start (this line is required for the chain to actually have any effect).


    Similarly, <requestHandler name="/update" > throws an error for being a requestHandler without a class parameter, and the example should be replaced with one which is valid.

    1. Good feedback, Alex, thank you. I fixed the examples a bit; there's probably some more to do, but they are more useable now.

  4. This is as much a genuine question that I'm currently struggling with as it is a suggestion for improving this particular slice of the Solr documentation pie.

    Suppose the configuration settings being used still add separate documents to the index that are considered to be duplicates (i.e., same value for the signature field). How would one construct a query to exclude documents that are considered duplicates of other results? How would one construct a query to include the duplicate documents in search results?

    I think it would be helpful to include this information in this page – or at least to allude to it, and if possible to link to some other page that explains what to do. If there is a first class query concept of "exclude/include duplicates," I haven't been able to find it. This is a good opportunity for Solr to illustrate that it has parity with other non-open source search solutions in this regard.

    1. Apparently, setting a filter query to do field collapsing as outlined in "Collapse and Expand Results", using the field containing the signature for the collapse target field, is what's required. I think it would be very helpful to make mention of this in this article.

  5. I think this documentation is misleading based on the functionality I'm observing. The way I read this page, I would expect not to get any duplicate documents in the index; however, this does not seem to be the case unless I specify the signatureField as the <uniqueKey> in schema.xml.

  6. I followed the example above, but I'm still getting duplicates.

    Added the following to solrconfig.xml:

    <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">id</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

    ...

    <requestHandler name="/update" class="solr.UpdateRequestHandler" >
    <lst name="defaults">
    <str name="update.chain">dedupe</str>
    </lst>
    </requestHandler>

    Also added the following to schema.xml

    <field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />

    Am I missing something?

  7. Is there any way to remove duplicates based on the status of the record?