Sitz Blogz Sitz Blogz - 1 month ago 9
Python Question

urllib2.HTTPError: While Web Scraping a huge list

The web page has a huge list of journal names with other details. I am trying to scrape the table content into dataframe.

#http://www.citefactor.org/journal-impact-factor-list-2015.html

import bs4 as bs
import urllib #Using python 2.7
import pandas as pd

dfs = pd.read_html('http://www.citefactor.org/journal-impact-factor-list-2015.html/', header=0)
for df in dfs:
print(df)
df.to_csv('citefactor_list.csv', header=True)


But I am getting following error .. I did try referring to some already raised questions but could not fix.

Error:

Traceback (most recent call last):
File "scrape_impact_factor.py", line 7, in <module>
dfs = pd.read_html('http://www.citefactor.org/journal-impact-factor-list-2015.html/', header=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 896, in read_html
keep_default_na=keep_default_na)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 733, in _parse
raise_with_traceback(retained)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 727, in _parse
tables = p.parse_tables()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 196, in parse_tables
tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 450, in _build_doc
return BeautifulSoup(self._setup_build_doc(), features='html5lib',
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 443, in _setup_build_doc
raw_text = _read(self.io)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 130, in _read
with urlopen(obj) as url:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/common.py", line 60, in urlopen
with closing(_urlopen(*args, **kwargs)) as f:
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 410, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 448, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 500: Internal Server Error

Answer

A 500 internal server error means something went wrong on the server and therefore is out of your control.

However the problem is that you are using the wrong URL.

If you go to http://www.citefactor.org/journal-impact-factor-list-2015.html/ in your browser you get a 404 not found error. Remove the trailing slash i.e. http://www.citefactor.org/journal-impact-factor-list-2015.html and it will work.