I am trying to scrape the impact factors of journals from a particular website, or from the web in general. I have been searching for something similar, but with no luck so far.
This is the first time I am trying web scraping with Python, and I am looking for the simplest way to do it.
I have a list of ISSNs belonging to journals, and I want to retrieve their impact factor values from the web or from a particular site. The list has more than 50K entries, so searching for the values manually is impractical.
Index,JOURNALNAME,ISSN,Impact Factor 2015,URL,ABBV,SUBJECT
1,4OR-A Quarterly Journal of Operations Research,1619-4500,,,4OR Q J OPER RES,Management Science
2,Aaohn Journal,0891-0162,,,AAOHN J,
3,Aapg Bulletin,0149-1423,,,AAPG BULL,Engineering
4,AAPS Journal,1550-7416,,,AAPS J,Medicine
5,Aaps Pharmscitech,1530-9932,,,AAPS PHARMSCITECH,
6,Aatcc Review,1532-8813,,,AATCC REV,
7,Abdominal Imaging,0942-8925,,,ABDOM IMAGING,
8,Abhandlungen Aus Dem Mathematischen Seminar Der Universitat Hamburg,0025-5858,,,ABH MATH SEM HAMBURG,
9,Abstract and Applied Analysis,1085-3375,,,ABSTR APPL ANAL,Math
10,Academic Emergency Medicine,1069-6563,,,ACAD EMERG MED,Medicine
The column I want to fill in is Impact Factor 2015.
Try this code, which uses Beautiful Soup and urllib2. It looks at the h2 tags and searches for the text 'Journal Impact:', but I will let you decide on the algorithm for extracting the data. The HTML content is parsed into soup, and soup provides an API for extracting it. What I provide is only an example that may work for you.
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup

issn = '0219-5305'
url = 'https://www.researchgate.net/journal/%s_Analysis_and_Applications' % (issn)

# Download the page and parse it
htmlDoc = urllib2.urlopen(url).read()
soup = BeautifulSoup(htmlDoc, 'html.parser')

# Look for the h2 tag that carries the impact factor
for tag in soup.find_all('h2'):
    if 'Journal Impact:' in tag.text:
        value = tag.text
        value = value.replace('Journal Impact:', '')
        value = value.strip(' *')
        print value
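The snippet above is Python 2 (urllib2 and the print statement do not exist in Python 3). If you are on Python 3, a rough equivalent might look like the sketch below. The function names `extract_impact` and `fetch_impact` are my own, and the User-Agent header is an assumption (some sites reject the default one). I have also split the parsing from the download so the parsing can be tried on a saved page.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

MARKER = 'Journal Impact:'


def extract_impact(html):
    """Return the text after 'Journal Impact:' in the first matching h2, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all('h2'):
        text = tag.get_text()
        if MARKER in text:
            # Same cleanup as the original: drop the label, then stray spaces and '*'
            return text.replace(MARKER, '').strip().strip(' *')
    return None


def fetch_impact(url):
    """Download a page and extract the impact value from it (hypothetical helper)."""
    # Assumption: a browser-like User-Agent is sent with the request
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(request, timeout=30) as response:
        html = response.read()
    return extract_impact(html)
```

You would call it as `fetch_impact('https://www.researchgate.net/journal/%s_Analysis_and_Applications' % issn)`. It still depends on the page having an h2 that contains 'Journal Impact:', so check the page source first.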
I think the official Beautiful Soup documentation is pretty good. If you are new to this, I suggest spending an hour on the documentation before you even try to write any code. That hour will save you many more hours later.