sc8ing - 2 months ago
HTML Question

Determining where certain text comes from on website

I'm trying to write a bash script that downloads the Photo of the Day from National Geographic, sets it as the desktop background, and puts the description of the picture found on the page in a text file on the desktop. (I'm aware there are scripts out there that do this, but NG recently changed their POTD page and they no longer work.)

I've gotten the picture to download and become the desktop background, but I'm stuck on how to download the image's full description (the one found below the picture on the website, not the shorter version in the metadata in the header). Trouble is, the description doesn't appear in the page that my script downloads with curl (or wget, for that matter). It's obviously there when viewed in the browser, though.

Where is the description text coming from if it's not in the HTML file? How can I download and parse the description, preferably with bash or Python?

Thanks for any help.

Answer

Buried within the HTML for that National Geographic page is the following attribute:

data-platform-endpoint="http://www.nationalgeographic.com/photography/photo-of-the-day/_jcr_content/.gallery.2016-09.json"

The caption that you seek is in the JSON file that this URL points to. The browser fetches that file with JavaScript after the page loads, which is why the text is missing from what curl or wget downloads. For example, in today's version of that JSON file, we find:

"caption":"<p>A giraffe leads a herd of zebras as the animals stampede from a threat unseen. Your Shot photographer Mohammed AlNaser captured this image in Tanzania\u2019s Serengeti National Park. The zebras \u201cemerged from nowhere,\u201d AlNaser writes. \u201cThey were obviously drinking water and something scared them and created a few seconds of a chaos.\u201d<\/p>\n"
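Since the question asked for bash or Python, here is a rough Python sketch of the approach described in the answer: pull the data-platform-endpoint attribute out of the downloaded page, fetch the JSON it points to, and read the caption. Only the attribute name and the "caption" key come from the answer. The layout of the rest of the JSON is not shown there, so the sketch simply searches the whole structure for "caption" keys. The function names, the User-Agent header, and the UTF-8 decoding are assumptions made for illustration.

```python
import html
import json
import re
import urllib.request
from html.parser import HTMLParser


class EndpointFinder(HTMLParser):
    """Remembers the first data-platform-endpoint attribute value seen."""

    def __init__(self):
        super().__init__()
        self.endpoint = None

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-platform-endpoint" and self.endpoint is None:
                self.endpoint = value


def find_endpoint(page_html):
    """Return the JSON endpoint URL embedded in the page, or None."""
    finder = EndpointFinder()
    finder.feed(page_html)
    return finder.endpoint


def find_captions(node):
    """Recursively yield every string stored under a "caption" key.

    The surrounding JSON layout is unknown, so walk all dicts and lists.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "caption" and isinstance(value, str):
                yield value
            else:
                yield from find_captions(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_captions(item)


def caption_to_text(caption_html):
    """Strip tags such as <p> and decode HTML entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", caption_html)).strip()


def fetch(url):
    """Download a URL as text (assumes UTF-8; the header is a guess)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def first_caption(page_url):
    """Return the first caption found via the page's endpoint, or None."""
    endpoint = find_endpoint(fetch(page_url))
    if endpoint is None:
        return None
    data = json.loads(fetch(endpoint))
    return next((caption_to_text(c) for c in find_captions(data)), None)
```

Calling first_caption() with the Photo of the Day page URL and writing the result to a text file would cover the description part of the script. Note that json.loads already turns escapes such as \u2019 and <\/p> into ordinary characters, so no extra unescaping of the JSON is needed. If the JSON holds one entry per day of the month, the first caption found may not be today's, so the selection logic would need adjusting once the real structure has been inspected.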