This page lists the features and architectural changes that will be implemented in Nutch 2.X. It is important to recognize:
Offload URL filtering, URL normalization, URL state management, and perhaps deduplication to crawler-commons \[http://code.google.com/p/crawler-commons/\]. We should coordinate our efforts and share code freely so that other projects (Bixo, Heritrix, Droids) can contribute to this shared pool of functionality, much as Tika does for the common need of parsing complex formats. |
--Externalize functionality to crawler-commons project \[http://code.google.com/p/crawler-commons/\] starting with robots handling-- |