Readseg is an alias for org.apache.nutch.segment.SegmentReader
This class is similar to readdb in that it dumps the contents of a segment. There are three ways we can use this class:
1st Usage: bin/nutch readseg -dump <segment_dir> <output> [general options]
-dump: Dumps content of a <segment_dir> as a text file to <output>.
[general options]: General options are provided below.
2nd Usage: bin/nutch readseg -list (<segment_dir1> ... | -dir <segments>) [general options]
-list: This arguement lists a synopsis of segments in specified directories, or all segments in a directory <segments>, and prints details of them to System.out.
<segment_dir1> ...: This should be a list of the paths for individual segment directories to process.
-dir <segments>: Should be a path to a directory that contains multiple segments.
[general options]: General options are provided below.
3rd Usage: bin/nutch readseg -get <segment_dir> <keyValue> [general options]
-get: This arguement gets a specified record from a segment, and prints it on System.out.
<segment_dir>: Path to the segment directory.
<keyValue>: This should be the value of the key (url) we wish to retreive specific information about. N.B. It is essential to put "double-quotes" around strings with spaces.
[general options]: General options are provided below.
-nocontent: Pass this to ignore the content directory.
-nofetch: To ignore the crawl_fetch directory.
-nogenerate: To ignore the crawl_generate directory.
-noparse: To ignore the crawl_parse directory.
-noparsedata: To ignore the parse_data directory.
-noparsetext: To ignore the parse_text directory.