Crawler and IR Bibliography
This list should be considered a must-read for anyone working on the LARM crawling part. Add entries if you think they meet this criterion; if you do, also add a link to CiteSeer where applicable.
Add summaries of the papers at will!
Crawling Strategies and Techniques
- Cho (2002) Parallel Crawlers
- Cho et al. (1998) Efficient Crawling Through URL Ordering Compares different orders in which a portion of the web can be crawled, among them depth-first and breadth-first crawls, and crawls with pages ordered by PageRank. Of these, the PageRank order does best, with breadth-first second.
- Fiedler, Hammer (1999) Using the Web Efficiently: Mobile Crawlers
- Cho (2002) Crawling the Web: Discovery and Maintenance of Large-Scale Web Data (PDF) Junghoo Cho's Ph.D. thesis. Subsumes the contents of Cho's other crawler papers (e.g. URL Ordering).
- Broder et al. (2003) Efficient URL Caching for World Wide Web Crawling From the abstract: "The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms (random replacement, static cache, LRU, and CLOCK) and theoretical limits (clairvoyant caching and infinite cache). We performed about 1,800 simulations of these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon."
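As an illustration of one of the practical algorithms compared in the paper, here is a minimal sketch of an LRU cache for "have we already seen this URL?" checks, with hit/miss counters for measuring the hit rate. The class and method names are our own, not from the paper:

```python
from collections import OrderedDict

class LRUUrlCache:
    """Fixed-size LRU cache of recently seen URLs, with hit/miss counters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def seen(self, url):
        """Return True if the URL was in the cache; record it either way."""
        if url in self.cache:
            self.cache.move_to_end(url)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.cache[url] = True
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used
        return False
```

A crawler would consult such a cache before probing its much larger disk-based table of visited URLs; the paper's point is that even a modest cache absorbs most duplicate-URL checks because link targets repeat heavily within a crawl.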
Focused Crawling
(I haven't reviewed these yet)
- Bergmark et al. (2002) Focused Crawls, Tunneling, and Digital Libraries
- Chakrabarti, van den Berg, Dom (1999) Focused crawling: a new approach to topic-specific Web resource discovery
- Menczer et al. (2002) Topic-Driven Crawlers: Machine Learning Issues
- Chau, Chen () Personalized and Focused Web Spiders
Descriptions of Existing Crawlers
- Shkapenyuk, Suel (2001) Design and Implementation of a High-Performance Distributed Web Crawler
- Heydon, Najork (2000) High Performance Web-Crawling
- Heydon, Najork (1999) Mercator: A Scalable, Extensible Web Crawler (PDF)
- Boldi et al. (2002) UbiCrawler: A Scalable Fully Distributed Web Crawler
Crawler Implementations
Offline (Books)
- Witten, Moffat, Bell: Managing Gigabytes
- Baeza-Yates, Ribeiro-Neto: Modern Information Retrieval
- Chakrabarti, Mining the Web: Analysis of Hypertext and Semi-Structured Data (includes a chapter on crawlers)
Structure and Dynamics of the Web
Page Structure
- Pirolli, Pitkow, Rao (1996) Silk from a Sow's Ear: Extracting Usable Structures from the Web Experiments on how web pages may be classified, e.g. into home pages, personal pages, etc.
Web Structure
- Searching the Workplace Web Implications for search engines in Intranets
- Kumar et al. (2000) The Web as a graph
- Broder et al. () Graph structure in the web Summarizes the in- and out-degree distributions of web pages and the sizes of strongly and weakly connected components (all following scale-free power-law distributions), and introduces the notion of the web graph forming a "bow tie" with several parts of roughly equal size: the giant SCC, IN nodes, OUT nodes, and so-called Tendrils. It concludes that the diameter of the giant SCC is at least 28, and that for about 75% of randomly chosen page pairs there is no directed path from one page to the other.
- Dill, Kumar et al. (2001) Self-Similarity in the Web Shows that the web exhibits self-similarity at several levels of granularity, and that these structural properties are robust and fault-tolerant.
- Guillaume, Latapy (2002) The Web Graph: an Overview (not reviewed)
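The bow-tie picture in Broder et al. is built on the web graph's strongly connected components. For a toy graph these can be computed with, e.g., Kosaraju's two-pass algorithm; a minimal sketch (the five-node "bow tie" in the test below is illustrative, not data from the paper):

```python
def sccs(graph):
    """Strongly connected components via Kosaraju's algorithm.
    graph: dict node -> list of out-neighbors."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    seen = set()

    def dfs(start, adj, out):
        # iterative DFS appending visited nodes to `out` in post-order
        stack = [(start, iter(adj.get(start, ())))]
        seen.add(start)
        while stack:
            node, it = stack[-1]
            pushed = False
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj.get(v, ()))))
                    pushed = True
                    break
            if not pushed:
                out.append(node)
                stack.pop()

    # 1st pass: record DFS finish order on the original graph
    order = []
    for u in nodes:
        if u not in seen:
            dfs(u, graph, order)

    # 2nd pass: DFS on the reversed graph in decreasing finish order;
    # each tree found is one strongly connected component
    rev = {u: [] for u in nodes}
    for u, vs in graph.items():
        for v in vs:
            rev[v].append(u)
    seen.clear()
    components = []
    for u in reversed(order):
        if u not in seen:
            comp = []
            dfs(u, rev, comp)
            components.append(comp)
    return components
```

In bow-tie terms, the largest component returned is the giant SCC; nodes that can reach it but are not in it form IN, and nodes reachable from it form OUT.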
Accessing the Web Graph
- Bharat, Broder, Henzinger (1998) The Connectivity Server: Fast Access to Linkage Information on the Web
- Suel, Yuan (2001) Compressing the Graph Structure of the Web
- Randall et al. (2001) The Link Database: Fast Access to Graphs of the Web
- Cooper, Frieze (2001) A general model of web graphs (not reviewed)
Dynamics
- A large-scale study of the evolution of Web pages
- Cho, Garcia-Molina (2001) Estimating Frequency of Change
Mining the Hypertext
Classics
- Kleinberg (1997) Authoritative Sources in a Hyperlinked Environment Defines two scores for a web page: a "hub" score and an "authority" score. Hubs are pages pointing to good authorities, and good authorities are pages linked to by many good hubs. This was the first major idea for using link structure to rank pages, but some disadvantages (e.g. the fact that much of the computation has to be done at query time, and that the algorithm is prone to a "topic drift" phenomenon) meant that PageRank (below) became more popular.
- Brin, Page (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine Describes properties of the (original) Google search engine (and also contains some notes about its crawler). Introduces a measure of page quality, "PageRank", which is computed by iteratively summing the weighted in-links of each page. The measure has some interesting properties, e.g. it equals the probability of a page being visited during a random walk over the web graph. It is used today in some form by most major search engines as one factor for ranking search results, and has also been used for ordering a crawler's queue (see above, URL Ordering).
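Both link-based scoring schemes above can be sketched as simple power iterations on a toy graph. A minimal, unoptimized sketch (the graphs, damping factor, and iteration counts are illustrative, not taken from either paper):

```python
import math

def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank.
    links: dict node -> list of out-neighbors; all nodes appear as keys."""
    nodes = list(links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            targets = links[u] or nodes  # dangling page: spread mass evenly
            share = d * pr[u] / len(targets)
            for v in targets:
                new[v] += share
        pr = new
    return pr

def normalize(scores):
    norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
    return {u: x / norm for u, x in scores.items()}

def hits(links, iters=50):
    """Kleinberg's hub/authority iteration; returns (hub, auth) dicts."""
    nodes = set(links) | {v for vs in links.values() for v in vs}
    hub = {u: 1.0 for u in nodes}
    for _ in range(iters):
        # a good authority is linked to by many good hubs ...
        auth = normalize({u: sum(hub[w] for w in nodes if u in links.get(w, ()))
                          for u in nodes})
        # ... and a good hub points to many good authorities
        hub = normalize({u: sum(auth[v] for v in links.get(u, ()))
                         for u in nodes})
    return hub, auth
```

For example, in the graph {"A": ["B", "C"], "B": ["C"], "C": ["A"]}, page C ends up with the highest PageRank because both A and B link to it. Note the contrast with HITS: PageRank is query-independent and can be precomputed, which is one reason it scaled better in practice.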
More about PageRank
Other
Index Maintenance
- Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation
- Dynamic Maintenance of Web Indexes Using Landmarks
Miscellaneous
- Brandman et al. (2000) Crawler-Friendly Web Servers (not reviewed yet)
- McLearn (2002) Autonomous Cooperating Web Crawlers (not reviewed yet)
- Gupta, Campbell (2000) Internet Search Engine Freshness by Web Server Help (not reviewed yet)
Exploiting Web Structure
- Glover et al. (2002a) Using Web Structure for Classifying and Describing Web Pages From the abstract: "The structure of the web is increasingly being used to improve organization, search, and analysis of information on the web. For example, Google uses the text in citing documents (documents that link to the target document) for search. We analyze the relative utility of document text, and the text in citing documents near the citation, for classification and description. Results show that the text in citing documents, when available, often has greater discriminative and descriptive power than the text in the target document itself. The combination of evidence from a document and citing documents can improve on either information source alone. Moreover, by ranking words and phrases in the citing documents according to expected entropy loss, we are able to accurately name clusters of web pages, even with very few positive examples. Our results confirm, quantify, and extend previous research using web structure in these areas, introducing new methods for classification and description of pages."
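The "expected entropy loss" ranking mentioned in the abstract is the information gain of a word as a class predictor. A minimal sketch over a toy labeled corpus, with documents represented as word sets (all names and data below are illustrative, not from the paper):

```python
import math

def entropy(pos, neg):
    """Binary entropy, in bits, of a positive/negative count pair."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_entropy_loss(docs, labels, word):
    """Information gain of `word` for predicting the class labels.
    docs: list of word sets; labels: parallel list of 0/1 labels."""
    n = len(docs)
    pos_total = sum(labels)
    pos_with = sum(1 for d, y in zip(docs, labels) if y and word in d)
    neg_with = sum(1 for d, y in zip(docs, labels) if not y and word in d)
    pos_without = pos_total - pos_with
    neg_without = (n - pos_total) - neg_with
    with_n = pos_with + neg_with
    # entropy before seeing the word, minus expected entropy after
    prior = entropy(pos_total, n - pos_total)
    cond = (with_n / n) * entropy(pos_with, neg_with) \
         + ((n - with_n) / n) * entropy(pos_without, neg_without)
    return prior - cond
```

Words with high expected entropy loss (a perfect separator scores the full prior entropy, 1 bit for a balanced corpus) are good candidates for naming a cluster of pages.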
Other Bibliographies
Further Reading...