bin/nutch fetch

The fetcher logs to stderr with fetcher output codes.

called java class


command line options

bin/nutch fetch [-verbose] <dir>


config file options

Our HTTP 'User-Agent' request header.


The agent strings we'll look for in robots.txt files, comma-separated, in decreasing order of precedence.
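As a rough illustration of "decreasing order of precedence", the sketch below walks a comma-separated agent list and returns the first entry that has a group in an already-parsed robots.txt. The function name, the dictionary layout, the exact-match comparison, and the fallback to the `*` group are all assumptions made for this example; they are not taken from the Nutch source.

```python
def pick_robots_rules(agent_names, rules_by_agent):
    """Return (agent, rules) for the highest-precedence agent found.

    agent_names    -- agent strings, highest precedence first
    rules_by_agent -- hypothetical parsed robots.txt: agent -> disallowed paths
    """
    # Compare agent names case-insensitively (a simplifying assumption).
    lowered = {agent.lower(): rules for agent, rules in rules_by_agent.items()}
    for name in agent_names:
        key = name.strip().lower()
        if key in lowered:
            return key, lowered[key]
    # Assumed fallback: use the wildcard group when no listed agent matches.
    return "*", lowered.get("*", [])


agents = "ExampleBot,Nutch".split(",")
robots = {"*": ["/private/"], "nutch": ["/tmp/"]}
print(pick_robots_rules(agents, robots))
```

Here `ExampleBot` has no group, so the lower-precedence `Nutch` entry is chosen ahead of the wildcard group.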


Further description of our bot. This text is used in the User-Agent header, where it appears in parentheses after the agent name.


A URL to advertise in the User-Agent header. This will appear in parentheses after the agent name.

An email address to advertise in the HTTP 'From' request header and User-Agent header.


A version string to advertise in the User-Agent header.
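To show how the agent name, version, description, URL, and email settings could combine into one header value, here is a minimal sketch. The `name/version (description; url; email)` layout, the separator characters, and the function name `build_user_agent` are assumptions for illustration only; the page states merely that the description and URL appear in parentheses after the agent name.

```python
def build_user_agent(name, version="", description="", url="", email=""):
    """Assemble a User-Agent string from the individual settings."""
    agent = name
    if version:
        agent += "/" + version
    # Keep only the optional parts that were actually configured.
    extras = [part for part in (description, url, email) if part]
    if extras:
        agent += " (" + "; ".join(extras) + ")"
    return agent


print(build_user_agent(
    "ExampleBot", "0.1", "A test crawler",
    "http://example.com/bot.html", "bot@example.com",
))
```

All values passed in the usage line are placeholders.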


The default network timeout, in milliseconds.


The default length limit for downloaded content, in bytes. Content longer than this is truncated.


If true, the fetcher will attempt to use HTTP version 1.1 and gzip encoding.


The number of seconds the fetcher will delay between successive requests to the same server.
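The per-server delay can be pictured as a small bookkeeping structure that remembers when each server may next be contacted. This is only a sketch of the idea: the class name, its methods, and the use of caller-supplied timestamps are hypothetical and do not describe the fetcher's actual implementation.

```python
class PolitenessClock:
    """Track the earliest time each server may be contacted again."""

    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self.next_allowed = {}  # server -> earliest time of the next request

    def wait_time(self, server, now):
        """Seconds to wait before the next request to this server."""
        return max(0.0, self.next_allowed.get(server, now) - now)

    def record_request(self, server, now):
        """Note that a request was just sent to this server."""
        self.next_allowed[server] = now + self.delay_seconds


clock = PolitenessClock(delay_seconds=5.0)
clock.record_request("www.example.com", now=100.0)
print(clock.wait_time("www.example.com", now=102.0))  # 3.0
print(clock.wait_time("other.example.org", now=102.0))  # 0.0
```

The delay applies per server, so a request to a different server is not held back.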


The number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection).


The number of OutputThreads to use. When adjusting this, remember that each thread could be holding a raw page, its DOM structure, plaintext, and extracted links in memory.


Controls how often the fetcher will dump progress statistics to the logs, in minutes.


The maximum number of unfetched requests to queue in memory.


The maximum number of completed (but unwritten) requests to queue in memory before throttling the fetcher.

The maximum number of distinct servers that may be referenced by queued requests.


The minimum number of robots.txt files to cache for inactive servers.


The maximum number of URLs that may be queued at once for a single host.


When there are fewer than this many servers in the fetcher's active queues, each server's queue of URLs will be pruned to fetcher.lowservers.maxurls.


See description of fetcher.lowservers.threshold.
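The interaction between fetcher.lowservers.threshold and fetcher.lowservers.maxurls can be sketched as follows. The function name and the dictionary-of-lists queue layout are hypothetical, and keeping the first URLs of each queue is an assumption; the page does not say which URLs survive pruning.

```python
def prune_queues(queues, threshold, max_urls):
    """Prune per-server URL queues when few servers remain active.

    queues    -- hypothetical layout: server -> list of queued URLs
    threshold -- stands in for fetcher.lowservers.threshold
    max_urls  -- stands in for fetcher.lowservers.maxurls
    """
    if len(queues) >= threshold:
        return queues  # enough active servers, nothing to prune
    # Assumption: the URLs at the front of each queue are the ones kept.
    return {server: urls[:max_urls] for server, urls in queues.items()}


queues = {"a.example": ["/1", "/2", "/3", "/4"], "b.example": ["/x"]}
print(prune_queues(queues, threshold=3, max_urls=2))
```

With two active servers and a threshold of three, each queue is cut to at most two URLs.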


The maximum number of times the fetcher will attempt to get a page that has encountered recoverable errors.


The maximum number of redirects the fetcher will follow when trying to fetch a page.

The maximum number of consecutive failures, excluding 404 errors, to allow on a given server before declaring it dead (note: each failure will have had up to fetcher.retry.max retries).

The maximum fetch error rate, excluding 404s, to allow for a given server before declaring it dead. Note: errors include transient issues, and multiple retries contribute to the score (so, getting the first page on the 3rd try gives you a .66 "failerr.rate").

The minimum number of requests to issue to a host before the error-rate check is applied. At least this many requests will be issued before declaring a host dead due to error rate. Note: this setting does not affect!
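The error-rate rule and the minimum-request threshold can be worked through with a short sketch. The function names, the `"ok"`/`"fail"`/`"404"` outcome labels, leaving 404s out of both the numerator and the denominator, and using a strict "greater than" comparison are all assumptions made for this example.

```python
def error_rate(attempts):
    """Fraction of failed attempts, with 404s left out of the count."""
    counted = [a for a in attempts if a != "404"]  # assumption: 404s ignored
    if not counted:
        return 0.0
    return sum(1 for a in counted if a == "fail") / len(counted)


def dead_by_error_rate(attempts, max_rate, min_requests):
    """Declare a host dead only after enough requests have been issued."""
    if len(attempts) < min_requests:
        return False  # too few requests to judge the host
    return error_rate(attempts) > max_rate  # assumption: strict comparison


# Getting the first page on the 3rd try: two failures, then one success.
attempts = ["fail", "fail", "ok"]
print(round(error_rate(attempts), 2))
print(dead_by_error_rate(attempts, max_rate=0.5, min_requests=10))
```

The first value is two failures out of three attempts, the figure the text quotes as .66. The second call returns False because only three requests have been issued, fewer than the assumed minimum of ten.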


Filename which contains list of hostnames we shouldn't fetch from.


Whether to use "long messages" in the trace portion of the logged output (if set to false, terse messages will be used).


Whether to log successful fetches in the trace log.


Whether to log 404/Not Found errors in the trace log.


How often, in seconds, throttling behavior should be readjusted based on current bandwidth usage. Set to -1 to disable throttling.


The desired amount of bandwidth the fetcher should use (aside from DNS and TCP overhead), in kbits/s. Set to -1 to disable throttling. Note: This is not a cap but a target for bandwidth usage over time.


The number of threads that should be active initially.
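One way to picture a bandwidth target (as opposed to a cap) is a periodic adjustment that nudges the number of active threads toward the target rate. The sketch below is speculative: the page does not say how throttling is implemented, so the function names, the one-thread step size, and the 1 kbit = 1000 bits conversion are assumptions.

```python
def measured_kbits_per_second(bytes_downloaded, interval_seconds):
    """Convert bytes fetched during one interval to kbits/s."""
    return bytes_downloaded * 8 / 1000 / interval_seconds


def adjust_threads(active, max_threads, measured, target):
    """Return the thread count to use for the next interval."""
    if target < 0:
        return max_threads  # throttling disabled: no thread is held back
    if measured > target and active > 1:
        return active - 1   # above target: back off by one thread
    if measured < target and active < max_threads:
        return active + 1   # below target: allow one more thread
    return active


rate = measured_kbits_per_second(bytes_downloaded=125000, interval_seconds=10)
print(rate)  # 100.0
print(adjust_threads(active=4, max_threads=10, measured=150.0, target=100.0))  # 3
```

Because the count is corrected only once per interval, usage can overshoot the target for a while, which matches the description of a target rather than a hard cap.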

MatthiasJaekle - 13 Mar 2004
