Protocol implementations allow Nutch to use different protocols (ftp, http, file, etc.) to fetch documents. They are implemented as plugins, which allows users
- to activate only the required protocol implementations, e.g. to block file:// access by simply keeping protocol-file deactivated
- to choose among alternative implementations of the same protocol scheme: Nutch has multiple implementations of the http/https protocol scheme, each plugin focusing on different features
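As a rough illustration of both points, the fragment below sketches how a single protocol plugin might be activated. It assumes the usual Nutch convention of selecting plugins through the plugin.includes regular expression in conf/nutch-site.xml. The non-protocol plugins listed in the value are illustrative assumptions and should be adapted to the actual setup.

```xml
<!-- conf/nutch-site.xml (sketch, not a complete configuration) -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Exactly one HTTP/HTTPS protocol plugin is listed (here protocol-okhttp).
         protocol-file and protocol-ftp are not matched by the expression,
         so they stay deactivated and file:// and ftp:// URLs are not fetched.
         All non-protocol plugin ids below are assumptions for illustration. -->
    <value>protocol-okhttp|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Switching to a different HTTP/HTTPS implementation would then only require replacing the protocol plugin id in this expression.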
HTTP/HTTPS protocol plugins
Simple (no third-party dependencies) but error-tolerant HTTP/HTTPS protocol implementation (HTTP 1.0 and 1.1).
HTTP/HTTPS protocol based on Apache HttpClient, optionally with Basic, Digest and NTLM authentication schemes, form/POST authentication and support for proxy servers. See HttpAuthenticationSchemes and HttpPostAuthentication.
HTTP/HTTPS protocol based on OkHttp, supports
- HTTP/1.1 or HTTP/2 (property http.useHttp2)
- usage of proxy servers
- efficient connection reuse via a configurable connection pool (NUTCH-2896 and PR#697)
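The http.useHttp2 property mentioned above could be set in conf/nutch-site.xml roughly as follows. This is a minimal sketch: it assumes the property is a boolean switch and that the OkHttp-based plugin is the active HTTP/HTTPS protocol plugin. The value shown is an assumption, not a recommendation.

```xml
<!-- conf/nutch-site.xml (sketch) -->
<configuration>
  <property>
    <name>http.useHttp2</name>
    <!-- Assumed boolean: true would allow HTTP/2 where the server supports it,
         false would keep the plugin on HTTP/1.1. -->
    <value>true</value>
  </property>
</configuration>
```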
Browser-based HTTP/HTTPS protocol plugins
Nutch provides a couple of protocol plugins that fetch content indirectly, through an intermediate web browser controlled via the Selenium browser automation library.
file:// access – protocol-file
ftp:// access – protocol-ftp
Samba – protocol-smb
(under development, see NUTCH-2856)