Preventing duplicate or near-duplicate documents from entering an index, or tagging documents with a signature/fingerprint for duplicate field collapsing, can be achieved efficiently with a low-collision or fuzzy hash algorithm. Solr natively supports de-duplication techniques of this type via the <Signature> class and allows for the easy addition of new hash/signature implementations. A Signature can be implemented in several ways:
128-bit hash used for exact duplicate detection.
64-bit hash used for exact duplicate detection; much faster than MD5 and smaller to index.
Fuzzy hashing implementation from Nutch for near-duplicate detection. It is tunable but works best on longer text.
Other, more sophisticated algorithms for fuzzy/near hashing can be added later.
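As a sketch of how the three descriptions above map to configuration, the fragment below lists the implementations that ship with Solr. The class names (MD5Signature, Lookup3Signature, TextProfileSignature in the org.apache.solr.update.processor package) are stated here as an assumption based on the stock Solr distribution; the text above does not name them. Exactly one signatureClass entry would be active at a time.

```xml
<!-- Choose exactly one signatureClass inside the SignatureUpdateProcessorFactory definition. -->

<!-- 128-bit MD5 hash: exact duplicate detection -->
<str name="signatureClass">org.apache.solr.update.processor.MD5Signature</str>

<!-- 64-bit Lookup3 hash: exact duplicate detection, faster and smaller to index -->
<!-- <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str> -->

<!-- Fuzzy hash from Nutch: near-duplicate detection, best on longer text -->
<!-- <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str> -->
```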
Enabling the de-duplication process changes the allowDups setting so that it applies to an update Term (with signatureField in this case) rather than to the unique field Term. The signatureField could itself be the unique field, but generally you want the unique field to be unique. When a document is added, a signature is automatically generated and attached to the document in the specified signatureField.
There are two places in Solr to configure de-duplication: in solrconfig.xml and in schema.xml.
The SignatureUpdateProcessorFactory must be registered in solrconfig.xml as part of an Update Request Processor Chain.
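A minimal sketch of such a chain might look like the following. The chain name `dedupe`, the `signature` field, and the `name,features,cat` field list are illustrative assumptions rather than required values. The property names are those accepted by Solr's stock SignatureUpdateProcessorFactory.

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <!-- enabled: turn de-duplication processing on or off -->
    <bool name="enabled">true</bool>
    <!-- signatureField: field that holds the generated signature (hypothetical name) -->
    <str name="signatureField">signature</str>
    <!-- overwriteDupes: overwrite an existing document that matches this signature -->
    <bool name="overwriteDupes">true</bool>
    <!-- fields: comma-separated source fields for the hash (hypothetical names) -->
    <str name="fields">name,features,cat</str>
    <!-- signatureClass: full classpath of the Signature implementation -->
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <!-- Standard processors that log the update and then apply it to the index -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The signature processor is placed before RunUpdateProcessorFactory so that the signature is attached to the document before it is written to the index.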
The SignatureUpdateProcessorFactory takes several properties:
A Signature implementation for generating a signature hash. The full classpath of the implementation must be specified. The available options are described above; the associated classpaths to use are:
The fields to use to generate the signature hash, given as a comma-separated list. By default, all fields on the document are used.
The name of the field used to hold the fingerprint/signature. The field should be defined in schema.xml.
Enable/disable de-duplication processing.
|overwriteDupes||true||If true, when a document exists that already matches this signature, it will be overwritten.|
If you are using a separate field for storing the signature you must have it indexed:
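For instance, assuming the signature is stored in a field named `signature` (a hypothetical name, which must match the signatureField value in solrconfig.xml) and that a `string` field type is defined as in the stock schema, the declaration might look like this:

```xml
<!-- Signature field: must be indexed so that existing signatures can be looked up -->
<field name="signature" type="string" stored="true" indexed="true" multiValued="false" />
```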
Be sure to change your update handlers to use the defined chain, as below:
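A sketch of such a handler follows, assuming Solr 4 or later (where the handler class is solr.UpdateRequestHandler and the chain is selected with the update.chain parameter) and a chain named `dedupe`, which is an illustrative name:

```xml
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <!-- Route every update sent to this handler through the de-duplication chain -->
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>
```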
(This example assumes you have other sections of your request handler defined.)
The update processor can also be specified per request with a parameter of