Crawler and IR Bibliography
This list should be considered a must-read for anyone working on the LARM crawling part. Add entries if you think they meet this criterion; if you do, also add a link to CiteSeer where applicable.
Add summaries of the papers at will!
Crawling Strategies and Techniques
- Cho (2002) Parallel Crawlers
- Cho et al. (1998) Efficient Crawling Through URL Ordering Compares different orders in which a portion of the web can be crawled, among them depth-first and breadth-first crawls, and crawls with pages ordered by PageRank. Of these, the PageRank order does best, with breadth-first second.
- Fiedler, Hammer (1999) Using the Web Efficiently: Mobile Crawlers
- Cho (2002) Crawling the Web: Discovery and Maintenance of Large-Scale Web Data (PDF) Junghoo Cho's Ph.D. thesis. Subsumes the contents of Cho's other crawler papers (e.g. URL Ordering).
- Broder et al. (2003) Efficient URL Caching for World Wide Web Crawling From the abstract: "The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms (random replacement, static cache, LRU, and CLOCK) and theoretical limits (clairvoyant caching and infinite cache). We performed about 1,800 simulations of these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon."
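As an illustration of one of the practical algorithms compared in the paper, here is a minimal sketch of an LRU cache for "have we already seen this URL?" checks, with hit/miss counters for measuring the hit rate. The class and method names are our own, not from the paper:

```python
from collections import OrderedDict

class LRUUrlCache:
    """Fixed-size LRU cache of recently seen URLs, with hit/miss counters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def seen(self, url):
        """Return True if the URL was in the cache; record it either way."""
        if url in self.cache:
            self.cache.move_to_end(url)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.cache[url] = True
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used
        return False
```

A crawler would consult such a cache before probing its much larger disk-based table of visited URLs; the paper's point is that even a modest cache absorbs most duplicate-URL checks because link targets repeat heavily within a crawl.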
Focused Crawling
(I haven't reviewed these yet)
- Bergmark et al. (2002) Focused Crawls, Tunneling, and Digital Libraries
- Chakrabarti, van den Berg, Dom (1999) Focused crawling: a new approach to topic-specific Web resource discovery
- Menczer et al. (2002) Topic-Driven Crawlers: Machine Learning Issues
- Chau, Chen () Personalized and Focused Web Spiders
Descriptions of Existing Crawlers
- Shkapenyuk, Suel (2001) Design and Implementation of a High-Performance Distributed Web Crawler
- Heydon, Najork (2000) High Performance Web-Crawling
- Heydon, Najork (1999) Mercator: A Scalable, Extensible Web Crawler (PDF)
- Boldi et al. (2002) UbiCrawler: A Scalable Fully Distributed Web Crawler
Crawler Implementations
Offline (Books)
- Witten, Moffat, Bell: Managing Gigabytes
- Baeza-Yates, Ribeiro-Neto: Modern Information Retrieval
- Chakrabarti, Mining the Web: Analysis of Hypertext and Semi-Structured Data (includes a chapter on crawlers)
Structure and Dynamics of the Web
Page Structure
- Pirolli, Pitkow, Rao (1996) Silk from a Sow's Ear: Extracting Usable Structures from the Web Experiments on how web pages may be classified, e.g. into home pages, personal pages, etc.
Web Structure
- Searching the Workplace Web Implications for search engines in Intranets
- Kumar et al. (2000) The Web as a graph
- Broder et al. () Graph structure in the web Summarizes the in- and out-degree distributions of web pages and the sizes of strongly and weakly connected components (all following scale-free power-law distributions), and introduces the notion of the web graph forming a "bow tie" with several parts of roughly equal size: the giant SCC, IN nodes, OUT nodes, and so-called Tendrils. It concludes that the diameter of the giant SCC is at least 28, and that for about 75% of randomly chosen page pairs there is no directed path from one page to the other.
- Dill, Kumar et al. (2001) Self-Similarity in the Web Shows that the web exhibits self-similarity at several levels of granularity, and that these structural properties are robust and fault-tolerant.
- Guillaume, Latapy (2002) The Web Graph: an Overview (not reviewed)
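The bow-tie picture in Broder et al. is built on the web graph's strongly connected components. For a toy graph these can be computed with, e.g., Kosaraju's two-pass algorithm; a minimal sketch (the five-node "bow tie" in the test below is illustrative, not data from the paper):

```python
def sccs(graph):
    """Strongly connected components via Kosaraju's algorithm.
    graph: dict node -> list of out-neighbors."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    seen = set()

    def dfs(start, adj, out):
        # iterative DFS appending visited nodes to `out` in post-order
        stack = [(start, iter(adj.get(start, ())))]
        seen.add(start)
        while stack:
            node, it = stack[-1]
            pushed = False
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj.get(v, ()))))
                    pushed = True
                    break
            if not pushed:
                out.append(node)
                stack.pop()

    # 1st pass: record DFS finish order on the original graph
    order = []
    for u in nodes:
        if u not in seen:
            dfs(u, graph, order)

    # 2nd pass: DFS on the reversed graph in decreasing finish order;
    # each tree found is one strongly connected component
    rev = {u: [] for u in nodes}
    for u, vs in graph.items():
        for v in vs:
            rev[v].append(u)
    seen.clear()
    components = []
    for u in reversed(order):
        if u not in seen:
            comp = []
            dfs(u, rev, comp)
            components.append(comp)
    return components
```

In bow-tie terms, the largest component returned is the giant SCC; nodes that can reach it but are not in it form IN, and nodes reachable from it form OUT.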
Accessing the Web Graph
- Bharat, Broder, Henzinger (1998) The Connectivity Server: Fast Access to Linkage Information on the Web
- Suel, Yuan (2001) Compressing the Graph Structure of the Web
- Randall et al. (2001) The Link Database: Fast Access to Graphs of the Web
- Cooper, Frieze (2001) A general model of web graphs (not reviewed)
Dynamics
- A large-scale study of the evolution of Web pages
- Cho, Garcia-Molina (2001) Estimating Frequency of Change
Mining the Hypertext
Classics
- Kleinberg (1997) Authoritative Sources in a Hyperlinked Environment Defines two scores for a web page: a "hub" score and an "authority" score. Hubs are pages pointing to good authorities, and good authorities are pages linked to by many good hubs. This was the first major idea for using link structure to rank pages, but some disadvantages (e.g. the fact that much of the computation has to be done at query time, and that the algorithm is prone to a "topic drift" phenomenon) meant that PageRank (below) became more popular.
- Brin, Page (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine Describes properties of the (original) Google search engine (and also contains some notes about its crawler). Introduces a measure of page quality, "PageRank", which is computed by iteratively summing the weighted in-links of each page. The measure has some interesting properties, e.g. it equals the probability of a page being visited during a random walk over the web graph. It is used today in some form by most major search engines as one factor for ranking search results, and has also been used for ordering a crawler's queue (see above, URL Ordering).
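Both link-based scoring schemes above can be sketched as simple power iterations on a toy graph. A minimal, unoptimized sketch (the graphs, damping factor, and iteration counts are illustrative, not taken from either paper):

```python
import math

def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank.
    links: dict node -> list of out-neighbors; all nodes appear as keys."""
    nodes = list(links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            targets = links[u] or nodes  # dangling page: spread mass evenly
            share = d * pr[u] / len(targets)
            for v in targets:
                new[v] += share
        pr = new
    return pr

def normalize(scores):
    norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
    return {u: x / norm for u, x in scores.items()}

def hits(links, iters=50):
    """Kleinberg's hub/authority iteration; returns (hub, auth) dicts."""
    nodes = set(links) | {v for vs in links.values() for v in vs}
    hub = {u: 1.0 for u in nodes}
    for _ in range(iters):
        # a good authority is linked to by many good hubs ...
        auth = normalize({u: sum(hub[w] for w in nodes if u in links.get(w, ()))
                          for u in nodes})
        # ... and a good hub points to many good authorities
        hub = normalize({u: sum(auth[v] for v in links.get(u, ()))
                         for u in nodes})
    return hub, auth
```

For example, in the graph {"A": ["B", "C"], "B": ["C"], "C": ["A"]}, page C ends up with the highest PageRank because both A and B link to it. Note the contrast with HITS: PageRank is query-independent and can be precomputed, which is one reason it scaled better in practice.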
More about PageRank
Other
Index Maintenance
- Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation
- Dynamic Maintenance of Web Indexes Using Landmarks
Miscellaneous
- Brandman et al. (2000) Crawler-Friendly Web Servers (not reviewed yet)
- McLearn (2002) Autonomous Cooperating Web Crawlers (not reviewed yet)
- Gupta, Campbell (2000) Internet Search Engine Freshness by Web Server Help (not reviewed yet)
Exploiting Web Structure
- Glover et al. (2002a) Using Web Structure for Classifying and Describing Web Pages From the abstract: "The structure of the web is increasingly being used to improve organization, search, and analysis of information on the web. For example, Google uses the text in citing documents (documents that link to the target document) for search. We analyze the relative utility of document text, and the text in citing documents near the citation, for classification and description. Results show that the text in citing documents, when available, often has greater discriminative and descriptive power than the text in the target document itself. The combination of evidence from a document and citing documents can improve on either information source alone. Moreover, by ranking words and phrases in the citing documents according to expected entropy loss, we are able to accurately name clusters of web pages, even with very few positive examples. Our results confirm, quantify, and extend previous research using web structure in these areas, introducing new methods for classification and description of pages."
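The "expected entropy loss" ranking mentioned in the abstract is the information gain of a word as a class predictor. A minimal sketch over a toy labeled corpus, with documents represented as word sets (all names and data below are illustrative, not from the paper):

```python
import math

def entropy(pos, neg):
    """Binary entropy, in bits, of a positive/negative count pair."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_entropy_loss(docs, labels, word):
    """Information gain of `word` for predicting the class labels.
    docs: list of word sets; labels: parallel list of 0/1 labels."""
    n = len(docs)
    pos_total = sum(labels)
    pos_with = sum(1 for d, y in zip(docs, labels) if y and word in d)
    neg_with = sum(1 for d, y in zip(docs, labels) if not y and word in d)
    pos_without = pos_total - pos_with
    neg_without = (n - pos_total) - neg_with
    with_n = pos_with + neg_with
    # entropy before seeing the word, minus expected entropy after
    prior = entropy(pos_total, n - pos_total)
    cond = (with_n / n) * entropy(pos_with, neg_with) \
         + ((n - with_n) / n) * entropy(pos_without, neg_without)
    return prior - cond
```

Words with high expected entropy loss (a perfect separator scores the full prior entropy, 1 bit for a balanced corpus) are good candidates for naming a cluster of pages.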
Other Bibliographies
Further Reading...