Crawling a country: Better strategies than breadth-first for web page ordering

Baeza-Yates, R.; Castillo, C.; Marin, M.; Rodríguez, A.

Keywords: search, graphs, world, strategy, scheduling, policies, web, wide, ordering, engines, page, crawlers, PageRank, Breadth-first, Crawling

Abstract

This article compares several page ordering strategies for Web crawling under several metrics. The objective of these strategies is to download the most "important" pages "early" during the crawl. As the coverage of modern search engines is small compared to the size of the Web, and it is impossible to index all of the Web for both theoretical and practical reasons, it is relevant to index at least the most important pages. We use data from actual Web pages to build Web graphs and execute a crawler simulator on those graphs. As the Web is very dynamic, crawling simulation is the only way to ensure that all the strategies considered are compared under the same conditions. We propose several page ordering strategies that are more efficient than breadth-first search and strategies based on partial PageRank calculations.
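
The evaluation idea in the abstract (replay different orderings on a fixed graph and see how quickly each one accumulates "important" pages) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy graph `TOY_GRAPH`, the function names, the use of full-graph PageRank as the importance measure, and the backlink-count ordering, which is a simple stand-in and not necessarily one of the strategies the article proposes.

```python
from collections import deque

# Hypothetical toy graph: page -> list of pages it links to.
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["e", "d"],
    "c": ["d"],
    "d": ["a"],
    "e": ["d"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on the full graph, used here as 'importance'."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = graph[v]
            # Pages without out-links spread their score over all pages.
            targets = out if out else nodes
            share = damping * rank[v] / len(targets)
            for w in targets:
                new[w] += share
        rank = new
    return rank

def crawl_breadth_first(graph, seed):
    """Download pages in the order they were first discovered."""
    order, seen, queue = [], {seed}, deque([seed])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in graph[page]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

def crawl_backlink_count(graph, seed):
    """Download next the frontier page with the most links seen so far
    from already-downloaded pages (illustrative stand-in strategy)."""
    order, done = [], set()
    frontier = {seed: 0}  # page -> links seen from downloaded pages
    while frontier:
        # Ties are broken alphabetically so the run is deterministic.
        page = max(sorted(frontier), key=frontier.get)
        del frontier[page]
        done.add(page)
        order.append(page)
        for target in graph[page]:
            if target not in done:
                frontier[target] = frontier.get(target, 0) + 1
    return order

def cumulative_importance(order, rank):
    """Total importance collected after each download."""
    total, curve = 0.0, []
    for page in order:
        total += rank[page]
        curve.append(total)
    return curve
```

Because both orderings run on the same static graph with the same seed, their cumulative-importance curves can be compared position by position, which is the kind of like-for-like comparison that a live crawl of a changing Web cannot guarantee.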

More information

Journal title: 1604-2004: SUPERNOVAE AS COSMOLOGICAL LIGHTHOUSES
Publisher: ASTRONOMICAL SOC PACIFIC
Publication date: 2005
Start page: 864
End page: 872
URL: http://www.scopus.com/inward/record.url?eid=2-s2.0-77953053635&partnerID=q2rCbXpz