Multi-Lingual Support in Nutch
17 June 2005
The goal of this proposal is to provide a solution for multi-lingual support in Nutch. Multi-lingual support means being able to use a language-specific Analyzer during both document analysis and query parsing.
This behaviour is configured using the standard
plugin.includes property of the Nutch configuration file. For instance, to add French, English and German analysis support for both document analysis and query parsing, just add the corresponding analysis plugins in the Nutch configuration file:
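The example snippet was lost from this page. As a sketch, assuming the analysis plugins are named analysis-fr, analysis-en and analysis-de (check the actual plugin ids shipped with your Nutch distribution), enabling them through the plugin.includes property might look like:

```xml
<!-- Sketch only: the analysis-* plugin ids and the rest of the value are
     assumptions; keep your existing plugin list and append the analysis
     plugins to it. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(text|html)|index-basic|query-basic|analysis-(fr|en|de)</value>
</property>
```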
If no language-specific analyzer plugin is specified in the configuration, then the default
NutchAnalyzer implementation (i.e.
NutchDocumentAnalyzer) will be used for document analysis (as in the current implementation), and no language-dependent analysis will be performed on the query.
Nutch Analysis Plugin
NutchAnalyzer Extension Point
NutchAnalyzer defines a new Nutch extension point. It is in fact an abstract class that extends the Lucene
Analyzer class, so that Lucene analyzers can be easily integrated as
NutchAnalyzer extensions. Each extension should define the attribute "lang" in order to be associated with a particular language code.
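No code survives here; the sketch below shows the shape described above. The Lucene Analyzer is stubbed out so the example is self-contained, and FrenchAnalyzer is a hypothetical extension, not an actual Nutch class:

```java
import java.io.Reader;

// Stand-in for org.apache.lucene.analysis.Analyzer, so this sketch compiles alone.
abstract class Analyzer {
    public abstract Object tokenStream(String fieldName, Reader reader);
}

// The proposed extension point: an abstract class (not a Java interface),
// precisely so that it can extend Lucene's Analyzer.
abstract class NutchAnalyzer extends Analyzer {
}

// A hypothetical plugin extension; its plugin.xml would declare lang="fr".
class FrenchAnalyzer extends NutchAnalyzer {
    public Object tokenStream(String fieldName, Reader reader) {
        return null; // a real implementation would build a French token stream
    }
}

class ExtensionPointSketch {
    public static void main(String[] args) {
        // any NutchAnalyzer extension is usable wherever Lucene expects an Analyzer
        Analyzer analyzer = new FrenchAnalyzer();
        System.out.println(analyzer instanceof NutchAnalyzer);
    }
}
```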
The AnalyzerFactory class is responsible for instantiating the
NutchAnalyzer implementation to use, depending on the configured Nutch analysis plugins and a specified language code.
The AnalyzerFactory policy for finding the
NutchAnalyzer extension to use is to return the first one that matches the specified language code. If none is found, then the default
NutchDocumentAnalyzer is returned.
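A sketch of that lookup policy (string names stand in for actual analyzer instances, and the method names are assumptions, not the real Nutch API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AnalyzerFactorySketch {
    // registered extensions in plugin-discovery order: language code -> analyzer
    private final Map<String, String> extensions = new LinkedHashMap<>();
    private static final String DEFAULT = "NutchDocumentAnalyzer";

    void register(String lang, String analyzerName) {
        // "return the first one that matches": later registrations
        // for the same language code are ignored
        extensions.putIfAbsent(lang, analyzerName);
    }

    String analyzerFor(String lang) {
        // fall back to the default document analyzer when nothing matches
        return extensions.getOrDefault(lang, DEFAULT);
    }

    public static void main(String[] args) {
        AnalyzerFactorySketch factory = new AnalyzerFactorySketch();
        factory.register("fr", "FrenchAnalyzer");
        factory.register("fr", "AlternateFrenchAnalyzer"); // ignored: not first
        System.out.println(factory.analyzerFor("fr"));
        System.out.println(factory.analyzerFor("ja"));
    }
}
```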
Language Code Source
The impact on the current code needed to add multi-lingual capabilities for document analysis is very light. In the part of the
IndexSegment class that adds a document to the index, the call to the
IndexWriter must be changed so that the writer is invoked with the analyzer appropriate for the document's language.
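The original before/after lines were lost from this page. The sketch below uses stub classes (not the real Lucene types) to show the shape of the change: old Lucene exposed an addDocument overload taking a per-document Analyzer, so the call site can select the analyzer from the document's language instead of relying on the writer-wide default.

```java
// Stub stand-ins for the Lucene classes, so the sketch is self-contained.
class Analyzer {
    final String name;
    Analyzer(String name) { this.name = name; }
}

class Document {
    private final String lang;
    Document(String lang) { this.lang = lang; }
    String get(String field) { return "lang".equals(field) ? lang : null; }
}

class IndexWriter {
    private final Analyzer defaultAnalyzer = new Analyzer("NutchDocumentAnalyzer");

    // old call site: one analyzer for every document
    String addDocument(Document doc) {
        return addDocument(doc, defaultAnalyzer);
    }

    // new call site: the analyzer is chosen per document
    String addDocument(Document doc, Analyzer analyzer) {
        return "indexed with " + analyzer.name;
    }
}

class IndexingSketch {
    public static void main(String[] args) {
        IndexWriter writer = new IndexWriter();
        Document doc = new Document("fr");
        // in the proposal, this analyzer would come from the AnalyzerFactory
        Analyzer perDoc = new Analyzer("FrenchAnalyzer");
        System.out.println(writer.addDocument(doc));         // old behaviour
        System.out.println(writer.addDocument(doc, perDoc)); // new behaviour
    }
}
```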
(Note by KurosakaTeruhiko) This seems to have been implemented in Nutch 0.8; the corresponding lines now live elsewhere in the indexing code, as the
IndexSegment class itself no longer exists in Nutch 0.8.
The NutchDocumentAnalyzer class must implement the NutchAnalyzer extension point.
(Note by KurosakaTeruhiko) This has been done in Nutch 0.8.
Language Code Source
The language-specific query analysis is based on a
lang attribute, as for the document analysis. But in this case the
lang attribute must be retrieved from the front-end using the following policy:
1. Use an optional lang attribute provided by the search interface.
2. If no such attribute is provided by the search interface, then use the browser language.
3. Try to identify the query language using the LanguageIdentifier plugin. (Note by KurosakaTeruhiko: this probably won't work well because queries are usually too short to tell the language. "chat" can be English or French, for example. What language is "Euro"?)
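The fallback chain above can be sketched as follows; the class, method and parameter names are assumptions, and only steps 1 and 2 are implemented, since step 3 would call out to the language identifier:

```java
class QueryLanguageSketch {
    /**
     * requestLang: the optional lang attribute from the search interface;
     * browserLang: derived from the browser's Accept-Language header.
     * Either may be null. Returns null when neither is available, leaving
     * step 3 (language identification on the query text) as a last,
     * unreliable resort.
     */
    static String resolve(String requestLang, String browserLang) {
        if (requestLang != null && !requestLang.isEmpty()) return requestLang; // 1.
        if (browserLang != null && !browserLang.isEmpty()) return browserLang; // 2.
        return null;                                                           // 3.
    }

    public static void main(String[] args) {
        System.out.println(resolve("fr", "de")); // fr
        System.out.println(resolve(null, "de")); // de
        System.out.println(resolve(null, null)); // null
    }
}
```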
The query analysis requires more code modifications than the document analysis. The first impact on the searcher code is the ability to get the
lang attribute from the front-end; this is done by adding a dedicated method for it.
This method then uses the
AnalyzerFactory to retrieve the analyzer to use for parsing the query terms. The
NutchAnalysis class, which is only used as a query parser, will be renamed to
QueryParser (as in Lucene), and will be very similar to the Lucene
QueryParser, providing a
parse method that takes an
Analyzer as a parameter.
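The actual signature was lost from this page. A sketch of what such a parse method would look like, with stub Analyzer and Query types standing in for the real Lucene/Nutch classes:

```java
// Stubs so the sketch compiles on its own.
class Analyzer {
    final String lang;
    Analyzer(String lang) { this.lang = lang; }
}

class Query {
    final String text;
    final Analyzer analyzer;
    Query(String text, Analyzer analyzer) { this.text = text; this.analyzer = analyzer; }
}

class QueryParserSketch {
    // mirrors Lucene's QueryParser.parse: the caller passes the analyzer
    // that the AnalyzerFactory selected for the query language
    static Query parse(String queryString, Analyzer analyzer) {
        // a real parser would tokenize queryString through the analyzer here
        return new Query(queryString, analyzer);
    }

    public static void main(String[] args) {
        Query q = parse("chat", new Analyzer("fr"));
        System.out.println(q.text + " parsed with lang=" + q.analyzer.lang);
    }
}
```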
(Note by AlessandroGasparini) In Nutch 0.8.1 this is already implemented in the
(Note by AlessandroGasparini) The specified analyzers perform language-specific stemming, so you can search for "retrieval" and get all the documents in which the stem "retriev" appears. In my configuration I used the LanguageIdentifier and the analysis-(language) plugins for both the crawler and the web application. I've attached my nutch-site.xml to show the complete plugin configuration.
The use of stemming needs to be carefully evaluated for a given corpus and target audience. There are a few aspects to be considered:
- stemming is language-specific; if the corpus is mixed-language, even if we detect the language of each document and apply a proper stemmer, we still face the problem of correctly identifying the language of the query (and applying the correct stemmer to the query - see above).
- stemming is likely to increase recall, but also likely to reduce precision. It is more useful for morphologically rich languages, and also for smaller collections (where users prefer to receive any results, even if less precise, over receiving no results whatsoever - thus trading precision for recall). For larger corpora consisting of mostly mono-lingual documents, stemming usually doesn't improve the quality of results.
- there is usually more than one stemmer implementation for a given language, each giving different results in terms of precision/recall on a given corpus. Sometimes an aggressive, iterative stemmer (such as the Porter stemmer) may give worse results than a light custom stemmer that only conflates singular/plural forms.
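To illustrate the last point, a light stemmer that only conflates singular/plural forms can be a few lines (a toy sketch for English, not a production stemmer), whereas an iterative stemmer such as Porter's would also strip derivational suffixes (e.g. "retrieval" to "retriev"):

```java
// Toy English plural stemmer: conflates plural and singular forms only,
// and leaves derivational suffixes like "-al" or "-ation" untouched.
class LightStemmer {
    static String stem(String word) {
        if (word.endsWith("ies") && word.length() > 4)
            return word.substring(0, word.length() - 3) + "y"; // queries -> query
        if (word.endsWith("es") && word.length() > 3)
            return word.substring(0, word.length() - 2);       // boxes -> box
        if (word.endsWith("s") && !word.endsWith("ss") && word.length() > 3)
            return word.substring(0, word.length() - 1);       // documents -> document
        return word;                                           // retrieval -> retrieval
    }

    public static void main(String[] args) {
        System.out.println(LightStemmer.stem("documents"));
        System.out.println(LightStemmer.stem("queries"));
        System.out.println(LightStemmer.stem("retrieval"));
    }
}
```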