This page is designed to provide a list of the features and architectural changes that will be implemented in Nutch 2.X. It is important to recognize:
Offload URL filtering and URL normalization, URL state management, and perhaps deduplication to [http://code.google.com/p/crawler-commons/]. We should coordinate our efforts and share code freely so that other projects (Bixo, Heritrix, Droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats.
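To make concrete the kind of functionality that could live in such a shared library, here is a minimal sketch of combined URL filtering and normalization. It is not the Nutch or crawler-commons API: the class name `UrlNormalizerSketch` and the method `normalize` are hypothetical, and the chosen rules (accept only http/https, lowercase scheme and host, drop default ports, resolve dot segments, strip the fragment) are assumptions picked for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical illustration only; not part of Nutch or crawler-commons.
final class UrlNormalizerSketch {

    // Returns a normalized URL, or null if the URL is malformed or filtered out.
    static String normalize(String url) {
        try {
            // URI.normalize() resolves "." and ".." path segments.
            URI uri = new URI(url.trim()).normalize();
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) {
                return null; // relative or opaque URI (e.g. mailto:)
            }
            scheme = scheme.toLowerCase(Locale.ROOT);
            host = host.toLowerCase(Locale.ROOT);

            // Filtering step (assumed rule): keep only http and https.
            if (!scheme.equals("http") && !scheme.equals("https")) {
                return null;
            }

            // Drop the port when it is the default for the scheme.
            int port = uri.getPort();
            if ((scheme.equals("http") && port == 80)
                    || (scheme.equals("https") && port == 443)) {
                port = -1;
            }

            // Use raw (still percent-encoded) components to avoid re-encoding.
            String path = uri.getRawPath();
            if (path == null || path.isEmpty()) {
                path = "/";
            }

            StringBuilder sb = new StringBuilder();
            sb.append(scheme).append("://").append(host);
            if (port != -1) {
                sb.append(':').append(port);
            }
            sb.append(path);
            if (uri.getRawQuery() != null) {
                sb.append('?').append(uri.getRawQuery());
            }
            // The fragment is intentionally omitted.
            return sb.toString();
        } catch (URISyntaxException e) {
            return null; // malformed input is rejected
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://Example.COM:80/a/./b/../c?x=1#frag"));
    }
}
```

Returning null for rejected URLs keeps filtering and normalization in a single call, which is one possible design; a shared library might equally expose them as separate, pluggable steps.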
-Externalize functionality to crawler-commons project [http://code.google.com/p/crawler-commons/] starting with robots handling-