I am working on a PHP news crawler and want to pull RSS feeds from nearly a hundred news websites using wget (version 1.12), capturing the whole RSS feed files into a single directory (no hierarchy) on the local server. The relevant pieces so far:

    wget -O FILE
    $ wget -nH -N -i url-list.txt
    $stat = stat('source.category.type.xml');
    $time = $stat['mtime']; // last modification time
    while read url; do
        wget -nH -N -O nameConvention "$url"
    done < url-list.txt
I suggest staying away from wget for this task; it makes your life really complicated for no reason. PHP is perfectly fine for fetching the downloads.
I would add all URLs to a database (it might be just a text file, as in your case). Then I would use a cronjob to trigger the script.
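A minimal crontab entry for that, assuming the fetcher lives at /var/www/crawler/fetch-feeds.php (path and interval are placeholders, not anything from your setup):

    # run the fetcher every 15 minutes
    */15 * * * * php /var/www/crawler/fetch-feeds.php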
On each run I would check a fixed number of sites and drop their RSS feeds into the folder. With file_put_contents you are good to go, and you keep full control over what to fetch and how to save it.
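Here is a minimal sketch of such a fetcher, assuming the plain url-list.txt from your question, a feeds/ target folder, and a batch of 10 sites per run (all three are free choices):

    <?php
    // fetch-feeds.php - grab a fixed batch of feeds and store them locally
    $urls       = file('url-list.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $batchSize  = 10;                   // fixed number of sites per run
    $offsetFile = 'offset.txt';         // remembers where the previous run stopped
    $offset     = is_file($offsetFile) ? (int) file_get_contents($offsetFile) : 0;

    if (!is_dir('feeds')) {
        mkdir('feeds', 0755, true);     // create the target folder on first run
    }

    $context = stream_context_create(['http' => ['timeout' => 10]]);

    foreach (array_slice($urls, $offset, $batchSize) as $url) {
        $xml = file_get_contents($url, false, $context);
        if ($xml === false) {
            continue;                   // skip unreachable feeds, retry next cycle
        }
        // derive a flat file name from the URL, e.g. "example.com-rss.xml"
        $name = preg_replace('/[^a-z0-9.-]+/i', '-',
            parse_url($url, PHP_URL_HOST) . parse_url($url, PHP_URL_PATH)) . '.xml';
        file_put_contents('feeds/' . $name, $xml);
    }

    // advance the offset, wrapping around at the end of the list
    file_put_contents($offsetFile, ($offset + $batchSize) % max(count($urls), 1));

The offset file is just the simplest way to rotate through the list; a database column would serve the same purpose.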
Then I would use another script that goes over the files and does the parsing. Separating the two scripts from the beginning will help you scale later on.
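The parser can then stay completely independent of the fetching. A sketch, assuming the feeds/ folder from above and plain RSS 2.0 input (Atom feeds would need a second branch):

    <?php
    // parse-feeds.php - read every stored feed and extract item titles/links
    foreach (glob('feeds/*.xml') as $file) {
        $feed = simplexml_load_file($file);
        if ($feed === false) {
            continue;                   // skip files that are not well-formed XML
        }
        foreach ($feed->channel->item as $item) {
            // insert into your database here instead of printing
            printf("%s\t%s\n", (string) $item->title, (string) $item->link);
        }
    }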
For a simple setup, just sorting the files by mtime should do the trick. For a big scale-out, I would use a job queue.
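Sorting by mtime mirrors the stat() check from your question and is only a few lines in PHP, again assuming the feeds/ layout:

    <?php
    // list stored feeds oldest-first, so the stalest feed gets refreshed next
    $files = glob('feeds/*.xml');
    usort($files, function ($a, $b) {
        return filemtime($a) - filemtime($b);
    });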
The overhead in PHP is minimal, while the additional complexity of using wget is a big burden.