Sfinos Sfinos - 1 year ago 86
Python Question

Parse a webpage with BeautifulSoup4

I am trying to parse this webpage from Coursera and download all the visible text from the page. Unfortunately, BeautifulSoup4 doesn't seem to work and I don't know what else to try.

Here is the code:

from bs4 import BeautifulSoup
import urllib2

link = "https://www.coursera.org/course/nlp"
req = urllib2.Request(link, headers={'User-Agent' : "Magic Browser"})
usock = urllib2.urlopen(req)
page = usock.read()

soup = BeautifulSoup(page)

The soup variable doesn't contain any of the text from the webpage. I tried the 'lxml', 'xml' and 'html5lib' parsers, but without any success.

Answer Source

That page loads all its content as JSON over AJAX and builds the page in the browser. The HTML page itself is little more than a vehicle to load the JavaScript, which is why the parsed document contains none of the visible text.

When you view the network activity in the browser development console, three URLs stand out:


Each returns JSON data; most likely you'll find whatever you wanted to scrape in those responses, neatly prepackaged and no HTML scraping required.

Loading the nlp topic information for example is as simple as:

import json
import urllib2

# Fetch the topic information directly; the response body is JSON, not HTML.
link = "https://www.coursera.org/maestro/api/topic/information?topic-id=nlp"
data = json.load(urllib2.urlopen(link))
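
For readers on Python 3, where urllib2 has been split into urllib.request and urllib.parse, the same request could be sketched roughly as follows. The helper names topic_url, load_topic and fake_opener are hypothetical and only for illustration; the URL is the one quoted above, and the sketch does not check whether that endpoint still responds.

```python
import io
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the answer above; it may have changed since.
API_URL = "https://www.coursera.org/maestro/api/topic/information"


def topic_url(topic_id):
    """Build the topic information URL for a topic short name such as 'nlp'."""
    return API_URL + "?" + urlencode({"topic-id": topic_id})


def load_topic(topic_id, opener=urlopen):
    """Fetch and decode the JSON document for one topic.

    `opener` is any callable that takes a URL and returns a file-like
    object with read() and close(); it defaults to urllib's urlopen.
    """
    response = opener(topic_url(topic_id))
    try:
        return json.loads(response.read().decode("utf-8"))
    finally:
        response.close()


def fake_opener(url):
    # Hypothetical stand-in for urlopen so the sketch can be tried offline;
    # it ignores the URL and returns a tiny canned JSON body.
    return io.BytesIO(b'{"short_name": "nlp", "id": 7}')


print(load_topic("nlp", opener=fake_opener)["short_name"])  # prints: nlp
```

Calling load_topic("nlp") without the opener argument performs the real network request. Passing the opener in explicitly is just a convenience for trying the code without network access.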

In an interactive session, import pprint and take a look at the contents:

>>> from pprint import pprint
>>> pprint(data)
{u'about_the_course': u"<p>This course covers a broad range of topics in natural language processing, including word and sentence tokenization, text classification and sentiment analysis, spelling correction, information extraction, parsing, meaning extraction, and question answering, We will also introduce the underlying theory from probability, statistics, and machine learning that are crucial for the field, and cover fundamental algorithms like n-gram language modeling, naive bayes and maxent classifiers, sequence models like Hidden Markov Models, probabilistic dependency and constituent parsing, and vector-space models of meaning.</p>\nWe are offering this course on Natural Language Processing free and online to students worldwide, continuing Stanford's exciting forays into large scale online instruction. Students have access to screencast lecture videos, are given quiz questions, assignments and exams, receive regular feedback on progress, and can participate in a discussion forum. Those who successfully complete the course will receive a statement of accomplishment. Taught by Professors Jurafsky and Manning, the curriculum draws from Stanford's courses in Natural Language Processing. You will need a decent internet connection for accessing course materials, but should be able to watch the videos on your smartphone.&nbsp;<br><br>\n<p><strong>&nbsp;</strong></p>",
 u'about_the_instructor': u'<p>Professors Jurafsky and Manning are the leading natural language processing educators, through their textbooks on natural language processing, speech, and information retrieval.</p>\n<p><img src="http://spark-public.s3.amazonaws.com/nlp/landing/jurafsky.png" class="coursera-instructor-thumb"> <a href="http://www.stanford.edu/~jurafsky/">Dan Jurafsky</a> is Professor of Linguistics and Professor by Courtesy of Computer Science at Stanford University. Dan received his Bachelor\'s degree in Linguistics in 1983 and his Ph.D. in Computer Science in 1992, both from the University of California at Berkeley, and also taught at the University of Colorado, Boulder before joining the Stanford faculty in 2004. He is the recipient of a MacArthur Fellowship and has served on a variety of editorial boards, corporate advisory boards, and program committees. Dan\'s research extends broadly throughout natural language processing as well as its application to the behavioral and social sciences.</p>\n<p><img src="http://spark-public.s3.amazonaws.com/nlp/landing/manning.png" class="coursera-instructor-thumb"> <a href="http://nlp.stanford.edu/~manning/">Christopher Manning</a> is an Associate Professor of Computer Science and Linguistics at Stanford University. Chris received a Bachelors degree and University Medal from the Australian National University and a Ph.D. from Stanford in 1994, both in Linguistics. Chris taught at Carnegie Mellon University and The University of Sydney before joining the Stanford faculty in 1999. He is a Fellow of the American Association for Artificial Intelligence and of the Association for Computational Linguistics, and is one of the most cited authors in natural language processing, for his research on a broad range of statistical natural language topics from tagging and parsing to grammar induction and text understanding.</p>',
 u'categories': [{u'description': u'Our wide range of courses allows students to explore topics from many different fields of study. Sign up for a class today and join our global community of students and scholars!',
                  u'id': 17,
                  u'mailing_list_id': None,
                  u'name': u'Computer Science: Artificial Intelligence',
                  u'short_name': u'cs-ai'}],
 u'category-ids': [u'cs-ai'],
 u'course-ids': [20, 88],
 u'course_format': u'',
 u'course_syllabus': u'<p>The following topics will be covered in the first two weeks:</p>\n<ol>\n<li><b>Introduction and Overview:</b></li>\n<li><b>Basic Text Processing:&nbsp;</b>J+M Chapters 2.1, 3.9; MR+S Chapters 2.1-2.2</li>\n<li><b>Minimum Edit Distance:&nbsp;</b>J+M Chapter 3.11</li>\n<li><b>Language Modeling:&nbsp;</b>J+M Chapter 4</li>\n<li><b>Spelling Correction:</b>&nbsp;J+M Chapters 5.9,&nbsp;<a href="http://norvig.com/spell-correct.html">Peter Norvig (2007) How to Write a Spelling Corrector</a></li>\n</ol>\n<div class="coursera-course-faq"></div>',
 u'courses': [{u'ace_close_date': None,
               u'ace_open_date': None,
               u'ace_semester_hours': None,
               u'ace_track_price_display': None,
               u'active': True,
               u'auth_review_completion_date': u'2010-01-01',
               u'certificate_description': u'Course content included the topics of spelling correction, sentiment analysis, information extraction, syntactic parsing, meaning extraction and question answering, based on underlying theory drawn from probability, statistics, linguistics, and algorithms.',
               u'certificate_ready_user_id': None,
               u'certificates_ready': True,
               u'chegg_session_id': u'',
               u'creator_id': None,
               u'deployed': True,
               u'duration_string': u'8 weeks',
               u'eligible_for_ACE': False,
               u'eligible_for_certificates': True,
               u'eligible_for_signature_track': False,
               u'end_date': None,
               u'end_of_class_emails_sent': u'2010-01-01',
               u'grades_release_date': u'2012-12-06',
               u'grading_policy_distinction': u'N/A',
               u'grading_policy_normal': u'To successfully complete this 8-week online class, students were required to watch 16 hours of lecture, complete 8 problem sets, and code a series of 8 substantial programming assignments in Java or Python while scoring at least 70% of the maximum possible points. ',
               u'home_link': u'https://class.coursera.org/nlp/',
               u'id': 20,
               u'instructors': [775, 68850],
               u'name': u'12-001',
               u'notified_subscribers': True,
               u'proctored_exam_completion_date': None,
               u'signature_track_additional_notes': u'',
               u'signature_track_certificate_combined_signature': u'',
               u'signature_track_certificate_design_id': None,
               u'signature_track_certificate_signature_blurb': u'',
               u'signature_track_close_time': None,
               u'signature_track_last_chance_time': None,
               u'signature_track_last_refund_date': None,
               u'signature_track_open_time': None,
               u'signature_track_price': None,
               u'signature_track_registration_open': False,
               u'signature_track_regular_price': None,
               u'start_date': None,
               u'start_date_string': u'12 March 2012',
               u'start_day': 12,
               u'start_month': 3,
               u'start_year': 2012,
               u'statement_design_id': 8,
               u'status': 0,
               u'textbooks': [],
               u'topic_id': 7,
               u'university_logo': u''},
              {u'ace_close_date': None,
               u'ace_open_date': None,
               u'ace_semester_hours': None,
               u'ace_track_price_display': None,
               u'active': False,
               u'auth_review_completion_date': None,
               u'certificate_description': u'',
               u'certificate_ready_user_id': None,
               u'certificates_ready': False,
               u'chegg_session_id': u'',
               u'creator_id': None,
               u'deployed': True,
               u'duration_string': u'8 weeks',
               u'eligible_for_ACE': False,
               u'eligible_for_certificates': True,
               u'eligible_for_signature_track': False,
               u'end_date': None,
               u'end_of_class_emails_sent': None,
               u'grades_release_date': None,
               u'grading_policy_distinction': u'',
               u'grading_policy_normal': u'',
               u'home_link': u'https://class.coursera.org/nlp-002/',
               u'id': 88,
               u'instructors': [775, 68850],
               u'name': u'002',
               u'notified_subscribers': False,
               u'proctored_exam_completion_date': None,
               u'signature_track_additional_notes': None,
               u'signature_track_certificate_combined_signature': u'',
               u'signature_track_certificate_design_id': None,
               u'signature_track_certificate_signature_blurb': u'',
               u'signature_track_close_time': None,
               u'signature_track_last_chance_time': None,
               u'signature_track_last_refund_date': None,
               u'signature_track_open_time': None,
               u'signature_track_price': None,
               u'signature_track_registration_open': False,
               u'signature_track_regular_price': None,
               u'start_date': None,
               u'start_date_string': u'',
               u'start_day': None,
               u'start_month': None,
               u'start_year': None,
               u'statement_design_id': None,
               u'status': 0,
               u'textbooks': [],
               u'topic_id': 7,
               u'university_logo': None}],
 u'description': u'',
 u'display': True,
 u'estimated_class_workload': u'8-10 hours/week',
 u'faq': u"<ul>\n<li><strong>Will I get a statement of accomplishment after completing this class?</strong>\n<p>Yes. Students who successfully complete the class will receive a statement of accomplishment signed by the instructor.</p>\n</li>\n<li><strong>What is the format of the class?</strong>\n<p>The class will consist of lecture videos, which are broken into small chunks, usually between 8 and 12 minutes each. Some of these may contain integrated quiz questions. There will also be standalone quizzes that are not part of video lectures, and programming assignments.</p>\n</li>\n<li><strong>How much work will I be expected to do in this class?</strong>\n<p>You need to work about 10 hours a week to complete the course.</p>\n</li>\n<ul>\n<li>About 2 hours of video segments each week, containing inline ungraded quiz questions.</li>\n<li>A weekly, graded multiple choice and short answer problem set (about 1 hour to complete).</li>\n<li>A substantial weekly programming assignment (about 6 hours to complete).</li>\n</ul>\n<li>\n<p><strong>Why Study Natural Language Processing?</strong></p>\nNatural language processing is the technology for dealing with our most ubiquitous product: human language, as it appears in emails, web pages, tweets, product descriptions, newspaper stories, social media, and scientific articles, in thousands of languages and varieties. In the past decade, successful natural language processing applications have become part of our everyday experience, from spelling and grammar correction in word processors to machine translation on the web, from email spam detection to automatic question answering, from detecting people's opinions about products or services to extracting appointments from your email. In this class, you'll learn the fundamental algorithms and mathematical models for human language processing and how you can use them to solve practical problems in dealing with language data wherever you encounter it.</li>\n</ul>",
 u'has_full_data': True,
 u'id': 7,
 u'instructor': u'Dan Jurafsky, Professor. Christopher Manning, Associate Professor',
 u'language': u'en',
 u'large_icon': u'https://s3.amazonaws.com/coursera/topics/nlp/large-icon.png',
 u'name': u'Natural Language Processing',
 u'other_description': u'',
 u'photo': u'https://s3.amazonaws.com/coursera/topics/nlp/large-icon.png',
 u'preview_link': u'https://class.coursera.org/nlp/lecture/preview',
 u'recommended_background': u'<p>No background in natural language processing is required. Students will be expected to know a bit of basic probability (know Bayes rule), a bit about vectors and vector spaces (could length normalize a vector), a bit of calculus (know that the derivative of a function is zero at a maximum or minimum of a function), but we will review these concepts as we first use them. You should have reasonable programming ability (know about hash tables and graph data structures), be able to write programs in Java or Python, and have a computer (Windows, Mac or Linux) with internet access.</p>\n<p></p>',
 u'self_service_course_id': None,
 u'short_description': u'In this class, you will learn fundamental algorithms and mathematical models for processing natural language, and how these can be used to solve practical problems.',
 u'short_name': u'nlp',
 u'small_icon': u'https://s3.amazonaws.com/coursera/topics/nlp/small-icon.hover.png',
 u'small_icon_hover': u'https://s3.amazonaws.com/coursera/topics/nlp/small-icon.hover.png',
 u'specializations': [],
 u'subtitle_languages_csv': u'',
 u'suggested_readings': u'<p>We will provide detailed lecture notes of all the technical content, which will be yours to keep after the end of class. Many students do fine just working from the lectures and notes. But others find it very useful to have an accompanying textbook, for reinforcing the core material, as a source of additional exercises, and as a reference for the future.</p>\nTo prepare for the class in advance, you may consider reading through some sections of the textbooks (<a href="http://www.tqlkg.com/click-7115529-10692263?sid=nlp&URL=http://www.chegg.com/textbooks/9780131873216">Jurafsky and Martin, Speech and Language Processing 2nd Edition</a>, and&nbsp;<a href="http://www.tqlkg.com/click-7115529-10692263?sid=nlp&URL=http://www.chegg.com/textbooks/9780521865715/">Manning, Sch\xfctze and Raghavan 2008</a>). Or, if you\'re rusty or not very experienced in either Java or Python, it\'d be great to work through early parts of&nbsp;<a href="http://www.tqlkg.com/click-7115529-10692263?sid=nlp&URL=http://www.chegg.com/textbooks/9780596516499">Bird, Klein and Loper 2009</a>',
 u'target_audience': 1,
 u'translate': False,
 u'universities': [{u'abbr_name': u'Stanford',
                    u'background_color': u'',
                    u'banner': u'',
                    u'china_mirror': 2,
                    u'class_logo': u'https://coursera-university-assets.s3.amazonaws.com/21/9a0294e2bf773901afbfcb5ef47d97/Stanford_Coursera-200x48_RedText_BG.png',
                    u'description': u'The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California on an 8,180-acre (3,310 ha) campus near Palo Alto, California, United States.',
                    u'display': True,
                    u'favicon': u'https://coursera-university-assets.s3.amazonaws.com/dc/581cda352d067023dcdcc0d9efd36e/favicon-stanford.ico',
                    u'home_link': u'http://online.stanford.edu/',
                    u'id': 1,
                    u'landing_page_banner': u'',
                    u'location': u'Palo Alto, CA, United States',
                    u'location_city': u'Palo Alto',
                    u'location_country': u'US',
                    u'location_lat': 37.4418834,
                    u'location_lng': -122.14301949999998,
                    u'location_state': u'CA',
                    u'logo': u'https://coursera-university-assets.s3.amazonaws.com/d8/4c69670e0826e42c6cd80b4a02b9a2/stanford.png',
                    u'mailing_list_id': None,
                    u'name': u'Stanford University',
                    u'partner_type': 1,
                    u'primary_color': u'#8C1515',
                    u'rectangular_logo_svg': u'',
                    u'short_name': u'stanford',
                    u'square_logo': u'',
                    u'square_logo_source': u'',
                    u'square_logo_svg': u'',
                    u'website': u'',
                    u'website_facebook': u'',
                    u'website_twitter': u'',
                    u'website_youtube': u'',
                    u'wordmark': None}],
 u'university-ids': [u'stanford'],
 u'university_logo': u'',
 u'university_logo_st': None,
 u'video': u'Fnr4A0tcU-M',
 u'video_baseurl': u'https://d1a2y8pfnfh44t.cloudfront.net/Fnr4A0tcU-M/',
 u'video_id': u'Fnr4A0tcU-M',
 u'visibility': 0}
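
Several of the values above (about_the_course, course_syllabus, faq and so on) are still HTML fragments, so getting the visible text needs one more step. Below is a minimal Python 3 sketch of that step. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs. BeautifulSoup's get_text() on each fragment should serve the same purpose. The names TextExtractor, html_to_text and text_fields are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(fragment):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    # str.split() with no arguments also splits on non-breaking spaces.
    return " ".join("".join(parser.chunks).split())


def text_fields(data, keys):
    """Map each requested key to its plain text, skipping empty or missing ones."""
    return {key: html_to_text(data[key]) for key in keys if data.get(key)}


# A small dictionary shaped like the response above, for illustration only.
sample = {
    "course_syllabus": "<li><b>Language Modeling:&nbsp;</b>J+M Chapter 4</li>",
    "description": "",
}
print(text_fields(sample, ["course_syllabus", "description"]))
```

With the real response, text_fields(data, ['about_the_course', 'course_syllabus', 'faq']) would give the text of those three sections. Which keys count as "visible text" on the page is a judgment call left to the reader.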