Trix Trix - 2 years ago
PHP Question

Multiple wget to save multiple local files with naming convention


I am working on a PHP news-crawler project and want to pull RSS news feeds from nearly a hundred news websites using wget (version 1.12), capturing the whole RSS feed files into one directory (without hierarchy) on the local server, with the following constraints:

  • Some of these websites do not have an RSS feed, so I would have to capture and parse their
    HTML pages instead, but at the beginning I can concentrate on just the XML feeds.

  • All feed files from all websites in one directory.

  • No extra content should be downloaded; any extra content (like images, if any) should remain hosted on the remote server.

  • Performance is important

  • Feed files need to be renamed before saving according to my naming convention, e.g. source.category.type.xml
    (each remote XML URL has its own source, category and type, but not in my naming convention)

  • Some of these feeds do not include a news timestamp (e.g. no pubDate element),
    so I have to choose a good approach to handle news time, one that tolerates slight differences but is robust and always functional.

  • To automate it, I need to run this wget via a cron job on a regular basis



I want this naming convention: source.category.type.xml

Category and type can each have multiple predefined values, ...

What do I have?

At the very first level I should do my wget calls using a list of remote URLs. According to the wget documentation:

  1. url-list.txt should consist of a series of URLs, one per line

  2. When running wget without -N, -nc, -r, or -p, downloading the same file in the same
     directory will result in the original copy of file being preserved and the second copy
     being named file.1

  3. Use of
     wget -O FILE
     is not intended to mean simply "use the name FILE instead of the one in the URL"; it concatenates all downloads into that single file.

  4. Use -N for timestamping (only re-download a file when the remote copy is newer than the local one)

  5. -w SECONDS
     waits the given number of seconds between retrievals

  6. -nd
     tells wget not to create a hierarchy of directories when retrieving recursively. With this option turned on, all files are saved to the current directory, without clobbering (if a name shows up more than once, the filenames get extensions `.n')

  7. -nH
     disables generation of host-prefixed directories (which wget -r creates by default).

  8. -P PREFIX
     sets the directory prefix to PREFIX. The "directory prefix" is the directory where all other files and subdirectories will be saved, i.e. the top of the retrieval tree.

  9. -k
     converts links for offline browsing

    $ wget -nH -N -i url-list.txt
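Putting the options above together, a fuller invocation might look like this (the feeds directory name and the 2-second wait are my assumptions, not from the question):

```shell
# Fetch every feed listed in url-list.txt flat into ./feeds:
# -nd: no directory hierarchy, -nH: no host-prefixed directories,
# -N:  only re-download when the remote copy is newer,
# -w 2: wait 2 seconds between requests, -P feeds: save under ./feeds
wget -nd -nH -N -w 2 -P feeds -i url-list.txt
```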

Issues (with wget, the cron job and PHP):

  1. How to handle news time? Is it better to save a timestamp in the file name, or to fetch the change time using PHP's
    stat()
    function like this:

    $stat = stat('source.category.type.xml');
    $time = $stat['mtime']; // last modification time

    or is there any other idea (one that is always working and robust)?

  2. How to handle file names? I want to save files locally under a distinct convention
    (source.category.type.xml)
    and so I think the
    wget options
    cannot help. I think I should go with a while loop like this:

    while read -r url; do
        wget -nH -N -O nameConvention "$url"
    done < url-list.txt
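One way to sketch both points at once: read a name prefix alongside each URL and embed the fetch time in the local file name. The map-file format, its contents, and the feeds/ directory below are all my assumptions, not part of the question; this is a dry run with the actual fetch commented out.

```shell
#!/bin/sh
# Dry run: each input line pairs a "source.category.type" prefix with a
# feed URL (both hypothetical). A real run would uncomment the wget call.
ts=$(date +%Y%m%d%H%M%S)            # fetch time, embedded in the file name
while read -r prefix url; do
    name="${prefix}.${ts}.xml"
    echo "would fetch $url -> feeds/$name"
    # wget -q -O "feeds/$name" "$url"
done <<'EOF'
bbc.world.rss http://example.com/world/rss.xml
EOF
```

Embedding the timestamp in the name sidesteps the mtime question entirely, since the capture time survives copies and backups that would reset filesystem metadata.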

Answer Source

I suggest staying away from wget for your task, as it makes your life really complicated for no reason. PHP is perfectly fine for fetching the downloads.

I would add all URLs to a database (it might be just a text file, as in your case). Then I would use a cron job to trigger the script. On each run I would check a fixed number of sites and put their RSS feeds into the folder. With file_get_contents and file_put_contents, for example, you are good to go. This gives you full control over what to fetch and how to save it.
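The cron trigger itself can stay minimal; a sketch of a crontab entry (the script path and the 15-minute interval are assumptions):

```shell
*/15 * * * * /usr/bin/php /path/to/fetch-feeds.php
```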

Then I would use another script that goes over the files and does the parsing. Separating the scripts from the beginning will help you scale later on. For a simple site, just sorting the files by mtime should do the trick. For a big scale-out, I would use a job queue.
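The mtime ordering can be checked directly from the shell; a small demonstration (the file names and the feeds/ directory are made up, and `touch -d` assumes GNU coreutils):

```shell
# Create two files with mtimes a day apart, then list newest-first.
mkdir -p feeds
touch -d '2024-01-01 00:00:00' feeds/old.xml
touch -d '2024-01-02 00:00:00' feeds/new.xml
ls -t feeds/   # new.xml is listed first: -t sorts by mtime, newest first
```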

The overhead in PHP is minimal, while the additional complexity of using wget is a big burden.
