Summary
I am working on a PHP news crawler project and want to pull RSS news feeds from nearly a hundred news websites using wget (version 1.12), capturing the whole RSS feed files into a single flat directory (no hierarchy) on a local server. The relevant points are:

- the feed URLs are plain HTML-style paths with no .xml extension
- the local files should follow the naming convention source.category.type.xml
- I need fresh feeds, judged by each feed's <pubDate>
- the list of feed URLs is kept in url-list.txt

Given a url-list.txt like this:

http://source1/path/to/rss1
http://source2/different/path/to/rss2
http://source3/path/to/rss3
.
.
.
http://source100/different/path/to/rss100

I want the local copies to end up named according to my convention, like this:

localfeed/source1.category.type.xml
localfeed/source2.category.type.xml
localfeed/source3.category.type.xml
.
.
.
localfeed/source100.category.type.xml

The category part of the name is something like sport, and the whole list of URLs is fed to wget from url-list.txt.

The wget options I have looked at so far:

-N: timestamping, only re-download when the remote copy is newer
-nc: no-clobber, which cannot be combined with -N
-r and -p: recursive download and page requisites
FILE and FILE.1: without -N or -nc, a repeated download of the same URL is saved as FILE.1 instead of overwriting FILE
-O: wget -O FILE writes all output to FILE, but -O does not work together with -N
-w SECONDS: wait SECONDS between retrievals
-nd: do not create a directory hierarchy
-nH: disable host-prefixed directories
-P PREFIX: set the directory prefix for the downloads
-k: convert links in downloaded documents

What I run at the moment is:

$ wget -nH -N -i url-list.txt

One idea is to put a timestamp into the file name, i.e. source.category.type.timestamp.xml. The alternative is to keep the name source.category.type.xml and read the modification time in PHP with stat:

$stat = stat('source.category.type.xml');
$time = $stat['mtime']; // last modification time

I have also looked at the wget options --trust-server-names and --content-disposition, which affect how the output file name is derived from the server's response.

The remaining option seems to be a loop that names each download explicitly:

while read url; do
    wget -nH -N -O nameConvention "$url"
done < url-list.txt

I suggest staying away from wget for this task, as it makes your life really complicated for no reason. PHP is perfectly fine for fetching the downloads.
I would put all URLs into a database (it might be just a text file, as in your case), then use a cronjob to trigger the script.
On each run I would check a fixed number of sites and put their RSS feeds into the folder. With file_get_contents and file_put_contents you are good to go, and you keep full control over what to fetch and how to save it.
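
A minimal sketch of the kind of fetch script meant here, assuming the url-list.txt and localfeed names from the question, a cron trigger, and a hypothetical localNameFor() helper that you would replace with your real source.category.type mapping (the batch size of 10 is arbitrary):

<?php
// Fetch a fixed batch of feeds per run and store them under the
// local naming convention.

$feedDir   = __DIR__ . '/localfeed';
$batchSize = 10; // feeds per cron run

function localNameFor($url) {
    // Placeholder mapping: derive a crude name from the host only.
    $host = parse_url($url, PHP_URL_HOST);
    return $host . '.category.type.xml';
}

if (!is_dir($feedDir)) {
    mkdir($feedDir, 0755, true);
}

$urls  = file('url-list.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: array();
$batch = array_slice($urls, 0, $batchSize); // a real script would rotate through the list

foreach ($batch as $url) {
    $xml = @file_get_contents($url); // fetch the feed
    if ($xml === false) {
        error_log("fetch failed: $url");
        continue;
    }
    file_put_contents($feedDir . '/' . localNameFor($url), $xml);
}

Rotating through the whole list across runs so that all feeds get refreshed is then ordinary PHP bookkeeping instead of a hunt through wget options.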
Then I would use another script that goes over the downloaded files and does the parsing. Separating the two scripts from the beginning will help you scale later on.
For a simple site, just sorting the files by mtime should do the trick. For a bigger scale-out, I would use a job queue.
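
A sketch of that sorting step, assuming the feeds sit in the localfeed directory from above; the parsing itself is left as a stub (simplexml_load_file is one option):

<?php
$files = glob(__DIR__ . '/localfeed/*.xml');

// Newest first, based on each file's last modification time.
usort($files, function ($a, $b) {
    return filemtime($b) - filemtime($a);
});

foreach ($files as $file) {
    // Parse the feed here, e.g. with simplexml_load_file($file).
}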
The overhead of doing this in PHP is minimal, while the extra complexity of bending wget to this task is a big burden.