inetbot web crawler
 inetbot - Distributed Web Crawling

This project addresses the problem of efficient distributed web crawling by using the bandwidth of home computers all over the world. These computers (clients) retrieve web sites, detect changes in them, and send to a server only the information that has changed or is needed to keep an index up to date. The clients form a kind of P2P network by establishing connections to one another. Over this network they communicate with each other and coordinate web crawling fully automatically, so only a very small amount of communication with the server is necessary. Building such a distributed web crawling system involves:
  • the construction of efficient routing protocols in rather unstable networks
  • the development of compression strategies
  • the development of "intelligent" clients
  • the construction of efficient and near-optimal communication networks
  • the design and implementation of a client, which might be a plugin for Internet Explorer or a standalone program that runs in the background and gathers web sites periodically
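As an illustration of the change detection described above, here is a minimal sketch of how a client might decide whether a fetched page needs to be reported to the server. It is not taken from the inetbot client: the class name `ChangeDetector`, the use of SHA-256 content digests, and the shape of the update record are all assumptions made for this example.

```python
import hashlib
from typing import Dict, Optional


def content_digest(body: bytes) -> str:
    """Return a SHA-256 hex digest of the page body."""
    return hashlib.sha256(body).hexdigest()


class ChangeDetector:
    """Hypothetical client-side state: the last digest seen for each URL."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}

    def check(self, url: str, body: bytes) -> Optional[dict]:
        """Return an update record for the server, or None if the page is unchanged."""
        digest = content_digest(body)
        if self._seen.get(url) == digest:
            return None  # unchanged: nothing is sent to the server
        self._seen[url] = digest
        # Assumed record layout; a real client would likely compress the body.
        return {"url": url, "digest": digest, "body": body}


detector = ChangeDetector()
first = detector.check("http://example.com/", b"<html>v1</html>")   # new page
second = detector.check("http://example.com/", b"<html>v1</html>")  # unchanged
third = detector.check("http://example.com/", b"<html>v2</html>")   # modified
```

In this sketch only the first and third fetches produce an update, so a page that has not changed costs the server no bandwidth. Comparing digests detects any byte-level difference; a real client would probably need to ignore irrelevant variation such as timestamps or rotating advertisements.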
The main reason for our effort is the exponential growth of the internet. Keeping an index of billions of web sites up to date requires crawling a very large number of web sites continuously, which means downloading massive amounts of data. From the server's point of view, distributed web crawling saves considerable resources, because the bandwidth and storage of home computers take the place of the server's own.

The index will still be maintained by the server, so that effective ranking functions can be implemented and the response time to an arbitrary query is kept to a minimum.

Copyright © 2005 inetbot   -   All rights reserved