kkjj kkjj - 2 years ago 88
Python Question

Clean API results to get the headlines of news articles?

I have been having trouble finding a way to pull out specific text info from the Guardian API for my dissertation. I have managed to get all my text onto Python but how do you then clean it to get say, just the headlines of the news articles?

This is a snippet of the API result that I want to pull out info from:

{
"response": {
"status":"ok",
"userTier":"developer",
"total":1869990,
"startIndex":1,
"pageSize":10,
"currentPage":1,
"pages":186999,
"orderBy":"newest",
"results":[
{
"id":"sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
"type":"liveblog",
"sectionId":"sport",
"sectionName":"Sport",
"webPublicationDate":"2016-07-09T13:21:36Z",
"webTitle":"Tour de France 2016: stage eight – live!",
"webUrl":"https://www.theguardian.com/sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
"apiUrl":"https://content.guardianapis.com/sport/live/2016/jul/09/tour-de-france-2016-stage-eight-live",
"isHosted":false
},
{
"id":"sport/live/2016/jul/09/serena-williams-v-angelique-kerber-wimbledon-womens-final-live",
"type":"liveblog",
"sectionId":"sport",
"sectionName":"Sport",
"webPublicationDate":"2016-07-09T13:21:02Z",
"webTitle":"Serena Williams v Angelique Kerber: Wimbledon women's final –
...

Answer Source

Hoping the OP adds the used code to the question.

One solution in python is, that whatever you get (from the methods offered by the requests module?) will be either already deeply nested structures you can well index into or you can easily map it to these structures (via json.loads(the_string_you_displayed).

Sample:

d = json.loads(the_string_you_displayed)
head_line = d['response']['results'][0]['webTitle']

Would give the value into headline that is stored in the first dict found in the results "array" (index 0) of the response entries value. (The question was updated so now, the full path is visible)

in case I read the sample snippet given correctly, and it has been cut during copy and paste here, as the given sample is (as is) invalid JSON.

If the text does not represent a valid JSON text, it will depend on sifting through text via substring or pattern matching and may well be very brittle ...

Update: So assuming the full response structure is stored inside a variable named data:

result_seq = data['response']['results']  # yields a list here
headlines = [result['webTitle'] for result in result_seq]

The last line works like so: This is a list comprehension compactly creating a list from all entries result in the result_seq by picking the value of the key webTitle in each dict.

An explicit for loop like solution picking them all would be:

result_seq = data['response']['results'] 
headlines = []
for result in result_seq:
    headlines.append(result['webTitle'])

This does not check for errors like result dicts without a key webTitle etc. but Python will raise a matching exception, and one can decide, if one likes to wrap the processing inside a try: except block or hope for the best ...

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download