windboy windboy - 6 months ago 17
JSON Question

Getting list of keywords from JSON

I have encountered a problem and I don't understand why it printed out this way.

Below is my code, please forgive me for the bad formatting as I am new to programming, this is to open a text file which has a bunch of keywords

import urllib2
import json

f1 = open('CatList.text')
lines = f1.readlines()

for line in lines:

url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100'

print(url)

json_obj = urllib2.urlopen(url)
data = json.load(json_obj)

#to write the result
f2 = open('SubList.text', 'w')

f2.write(url)

for item in data['query']:

for i in data['query']['categorymembers']:


f2.write((i['title']).encode('utf8')+"\n")


I get the error:

Traceback (most recent call last):
File "Test2.py", line 16, in <module>
json_obj = urllib2.urlopen(url)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 402, in open
req = meth(req)
File "/usr/lib/python2.7/urllib2.py", line 1113, in do_request_
raise URLError('no host given')
urllib2.URLError: <urlopen error no host given>


I am not sure what this error means but I tried this to print the url.

import urllib2
import json

f1 = open('CatList.text')
f2 = open('SubList.text', 'w')
lines = f1.readlines()

for line in lines:

url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100'

print(url)
f2.write(url+'\n')


The results I have gotten were weird (below is part of the result):

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Branches of geography
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography by place
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography awards and competitions
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography conferences
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography education
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Environmental studies
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Exploration
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geocodes
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographers
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographical zones
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geopolitical corridors
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:History of geography
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Land systems
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Landscape
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography-related lists
&cmlimit=100


Notice that the URL is broken up into 2 parts

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography-related lists
&cmlimit=100


instead of

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography-related lists&cmlimit=100


My first question is how can I fix this?

Secondly, is this what that is giving me the error?

My CatList.text is as follows:

Category:Branches of geography
Category:Geography by place
Category:Geography awards and competitions
Category:Geography conferences
Category:Geography education
Category:Environmental studies
Category:Exploration
Category:Geocodes
Category:Geographers
Category:Geographical zones
Category:Geopolitical corridors
Category:History of geography
Category:Land systems
Category:Landscape
Category:Geography-related lists
Category:Lists of countries by geography
Category:Navigation
Category:Geography organizations
Category:Places
Category:Geographical regions
Category:Surveying
Category:Geographical technology
Category:Geography terminology
Category:Works about geography
Category:Geographic images
Category:Geography stubs


Sorry for the long post. I really appreciate your help. Thank you.

Answer

Update the following line

url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100'  

to

url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line.strip()+'&cmlimit=100'  

Your line contains line-feed (\n) characters which will be removed using .strip() which removes white-spaces from both ends of a string.