Category Archives: crawling

a crawler using wget and xargs

How long would it take to crawl a billion pages using wget and xargs? We’re on a quest to write a scalable web crawler, with the goal of downloading a billion pages a week. We’ve calculated that to download a billion pages in a week we need to sustain [...]
Posted in crawling | 10 Comments
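The post's exact figures are elided, but the back-of-the-envelope arithmetic behind "a billion pages a week" presumably resembles the following sketch (the one-page-per-second-per-worker figure is an illustrative assumption, not a number from the post):

```python
# Throughput needed to fetch a billion pages in one week.
PAGES = 1_000_000_000
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604,800 seconds

pages_per_second = PAGES / SECONDS_PER_WEEK
print(f"{pages_per_second:.0f} pages/second sustained")  # ~1653 pages/second

# Assumption: if each wget worker averages one page per second, a
# parallel runner like `xargs -P` would need on the order of ~1,650
# concurrent workers spread across the agent-machines.
workers_needed = round(pages_per_second)
print(workers_needed)
```

However the per-worker fetch rate shakes out in practice, the required aggregate rate of roughly 1,650 pages per second is fixed by the one-week deadline.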

Desirable Properties for a Web Crawler

I aim to build a web crawler that can download a billion pages in a week. Below are some desirable properties any web crawler should have. Scalability: The web is enormous and continually growing, so a crawler should scale linearly with the number of agent-machines that are added to the system. This allows us to add [...]
Posted in crawling | 2 Comments