Protocol implementations allow Nutch to use different protocols (ftp, http, file, etc.) to fetch documents. They are implemented as plugins, which allows users to
- activate only the required protocol implementations, e.g. block file:// access by simply keeping protocol-file deactivated
- choose between alternative implementations of the same protocol scheme: Nutch has multiple implementations of the http/https protocol scheme, each plugin focusing on different features
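As a minimal sketch of how this selection is typically expressed, the fragment below assumes the plugins are activated through the plugin.includes property in conf/nutch-site.xml, whose value is a regular expression matched against plugin names. The exact list of non-protocol plugins shown here is an illustrative assumption and should be adapted to the local setup.

```xml
<!-- conf/nutch-site.xml (sketch; the plugin list is an assumption, adapt it to your setup) -->
<property>
  <name>plugin.includes</name>
  <!-- protocol-okhttp is chosen as the only http/https implementation;
       protocol-file is not listed, so file:// URLs are not fetched -->
  <value>protocol-okhttp|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Swapping protocol-okhttp for protocol-http or protocol-httpclient in the value would select a different implementation of the same scheme.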
HTTP/HTTPS protocol plugins
protocol-http
Simple (no third-party dependencies) but error-tolerant HTTP/HTTPS protocol implementation (HTTP 1.0 and 1.1).
protocol-httpclient
HTTP/HTTPS protocol implementation based on Apache HttpClient, optionally with Basic, Digest and NTLM authentication schemes, form/post authentication and support for proxy servers. See HttpAuthenticationSchemes and HttpPostAuthentication.
protocol-okhttp
HTTP/HTTPS protocol implementation based on okhttp. It supports
- HTTP/1.1 or HTTP/2 (property http.useHttp2)
- usage of proxy servers
- efficient connection reuse through a configurable connection pool (NUTCH-2896 and PR#697)
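The http.useHttp2 property mentioned above could be set as in the following sketch. It assumes the property takes a boolean value and that protocol-okhttp is the activated http/https plugin; check nutch-default.xml of the installed version for the actual default and description.

```xml
<!-- conf/nutch-site.xml (sketch; assumes a boolean value and that protocol-okhttp is active) -->
<property>
  <name>http.useHttp2</name>
  <value>true</value>
</property>
```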
Browser-based HTTP/HTTPS protocol plugins
Nutch provides a couple of protocol plugins which fetch content indirectly, through an intermediate web browser controlled via the Selenium browser automation library.
protocol-selenium
See README.
protocol-interactiveselenium
See README.
protocol-htmlunit
file:// access – protocol-file
ftp:// access – protocol-ftp
Samba – protocol-smb
(under development, see NUTCH-2856)