I am doing a webcrawler and using threads to download pages.
The first limiting factor to the performance of my program is the bandwidth, I can never download more pages that it can get.
The second thing is what I interested. I am using threads to download many pages at same time, but as I create more threads, more sharing of processor occurs. Is there some metric/way/class of tests to determine what is the ideal number of threads or if after certain number, the performance doesn't change or decrease?
we've developped a multithreaded parrallel web crawler. Benchmarking troughput is the best way to get ideas on how the beast will handle his job. For a dedicated java server, one thread per core is a base to start, then the I/O comes into play and change.
Performances do decrease after certain number of threads. But it depends on the site you crawl too, on the OS you use, etc. Try to find a site with a merely constant response time to do your first benchmarks (like Google, but take differents services)
With slow websites, higher number of threads tends to compensate i/o blocking