PHP Question

Multiple wget to save multiple local files with naming convention

Summary

I am working on a PHP news crawler project and want to pull RSS news feeds from nearly a hundred news websites using wget (version 1.12), capturing the whole RSS feed files into a single directory (without hierarchy) on my local server, with the following requirements:


  • Some of these websites do not have an RSS feed, so I will have to capture and parse their HTML; at the beginning, though, I can concentrate on just the XML feeds.

  • All feed files from all websites in one directory.

  • No extra content should be downloaded. All extra content (like images, if any) should stay hosted on the remote server.

  • Performance is important

  • Feed files need to be renamed before saving according to my convention, like source.category.type.xml (each remote XML URL has its own source, category, and type, but not in my naming convention).

  • Some of these feeds do not include a news timestamp (such as <pubDate>), so I have to choose an approach to handling news time that tolerates slight differences but is robust and always functional.

  • To automate this, I need to run the wget on a regular basis via a cron job.



url-list.txt includes:

http://source1/path/to/rss1
http://source2/different/path/to/rss2
http://source3/path/to/rss3
.
.
.
http://source100/different/path/to/rss100


I want this:

localfeed/source1.category.type.xml
localfeed/source2.category.type.xml
localfeed/source3.category.type.xml
.
.
.
localfeed/source100.category.type.xml


Category and type can have multiple predefined values, like sport, ...
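
For illustration only, the kind of mapping I have in mind could live in a small PHP array; the category and type values below are made-up placeholders, not my real data:

    <?php
    // Hypothetical URL-to-filename mapping (placeholder categories/types).
    $feeds = [
        'http://source1/path/to/rss1'           => 'source1.sport.rss.xml',
        'http://source2/different/path/to/rss2' => 'source2.politics.rss.xml',
        'http://source3/path/to/rss3'           => 'source3.economy.atom.xml',
        // ... one entry per feed, up to source100
    ];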




What do I have?

At the very first level, I should run my wget using a list of remote URLs, according to these wget instructions:


  1. url-list.txt should consist of a series of URLs, one per line.
  2. When running wget without -N, -nc, -r, or -p, downloading the same file in the same directory will result in the original copy of FILE being preserved and the second copy being named FILE.1.

  3. Use of -O, as in wget -O FILE, is not intended to mean simply "use the name FILE instead of the one in the URL"; it writes all of the downloads into that one file.

  4. Use -N for timestamping.

  5. -w SECONDS will wait SECONDS seconds before the next retrieval.

  6. -nd forces wget not to create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory without clobbering (if a name shows up more than once, the filenames will get extensions .n).

  7. -nH disables generation of host-prefixed directories (which -r does by default).

  8. -P PREFIX sets the directory prefix to PREFIX. The "directory prefix" is the directory where all other files and subdirectories will be saved, i.e. the top of the retrieval tree.

  9. -k converts links for offline browsing.

    $ wget -nH -N -i url-list.txt





Issues (with wget, the cron job, and PHP):


  1. How to handle news time? Is it better to save a timestamp in the file names, like source.category.type.timestamp.xml, or to fetch the change time with PHP's stat function, like this:

    $stat = stat('source.category.type.xml');
    $time = $stat['mtime']; // last modification time

    Or is there any other approach that is always robust and functional? (See the sketch after this list.)

  2. How to handle file names? I want to save the files locally under a distinct convention (source.category.type.xml), so I think wget options like --trust-server-names or --content-disposition cannot help. I think I should go with a while loop like this:

    while read url; do
        wget -nH -N -O nameConvention "$url"
    done < url-list.txt
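
To make the comparison in issue 1 concrete, here is a rough PHP sketch of the two timing options; the file names are just placeholders:

    <?php
    // Option A: rely on the local file's modification time after each fetch.
    $file  = 'localfeed/source1.sport.rss.xml';   // placeholder path
    $mtime = filemtime($file);                    // last modification time as a Unix timestamp

    // Option B: embed the fetch time in the file name itself.
    $timestamped = sprintf('localfeed/source1.sport.rss.%d.xml', time());
    copy($file, $timestamped);                    // keeps a time-stamped snapshot alongside the original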


Answer Source

I suggest staying away from wget for your task, as it makes your life really complicated for no reason. PHP is perfectly fine for fetching the downloads.

I would add all URLs into a database (it might just be a text file, like in your case). Then I would use a cron job to trigger the script. On each run I would check a fixed number of sites and put their RSS feeds into the folder. With file_get_contents and file_put_contents, for example, you are good to go. This gives you full control over what to fetch and how to save it.
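
A minimal sketch of such a fetch script, assuming a hypothetical URL-to-filename mapping and a localfeed/ target directory (the names and batch size are placeholders, not a fixed recommendation):

    <?php
    // fetch_feeds.php - run via cron, e.g. every 15 minutes: */15 * * * * php /path/to/fetch_feeds.php
    // Placeholder mapping from feed URL to the source.category.type.xml naming convention.
    $feeds = [
        'http://source1/path/to/rss1'           => 'source1.sport.rss.xml',
        'http://source2/different/path/to/rss2' => 'source2.politics.rss.xml',
        // ...
    ];

    $targetDir  = __DIR__ . '/localfeed';
    $batchSize  = 10;                             // number of sites to check per run
    $offsetFile = $targetDir . '/.offset';        // remembers where the previous run stopped

    $offset = is_file($offsetFile) ? (int) file_get_contents($offsetFile) : 0;
    $batch  = array_slice($feeds, $offset, $batchSize, true);

    foreach ($batch as $url => $localName) {
        $xml = @file_get_contents($url);          // suppress warnings; check the result instead
        if ($xml !== false) {
            file_put_contents($targetDir . '/' . $localName, $xml);
        }
    }

    // Advance the offset, wrapping around once every feed has been visited.
    $offset = ($offset + $batchSize) % max(count($feeds), 1);
    file_put_contents($offsetFile, (string) $offset);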

Then I would use another script that goes over the files and does the parsing. Separating the scripts from the beginning will help you scale later on. For a simple site, just sorting the files by mtime should do the trick. For a big scale-out, I would use a job queue.
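
As a rough outline of that second script, assuming the feeds are well-formed RSS saved under localfeed/ with the source.category.type.xml convention (the processing itself is only sketched):

    <?php
    // parse_feeds.php - walk the saved feeds, least recently modified first, and parse them.
    $files = glob(__DIR__ . '/localfeed/*.xml');

    // Sort by mtime so the oldest snapshots are processed first.
    usort($files, function ($a, $b) {
        return filemtime($a) - filemtime($b);
    });

    foreach ($files as $file) {
        // File names follow source.category.type.xml, so the metadata can be recovered here.
        list($source, $category, $type) = explode('.', basename($file, '.xml'), 3);

        $feed = @simplexml_load_file($file);
        if ($feed === false || !isset($feed->channel->item)) {
            continue;                             // skip invalid XML or feeds without RSS items
        }

        foreach ($feed->channel->item as $item) {
            // Fall back to the file's mtime when the feed provides no <pubDate>.
            $time = isset($item->pubDate) ? strtotime((string) $item->pubDate) : filemtime($file);
            // ... store or process the item here
        }
    }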

The overhead in PHP is minimal, while the additional complexity of using wget is a big burden.
