Rrj17 - 3 years ago
Python Question

Loading more content in a webpage and issues writing to a file

I am working on a web scraping project that involves scraping URLs from a website based on a search term, storing them in a CSV file (under a single column), and finally scraping the information from these links and storing it in a text file.

I am currently stuck on two issues.


  1. Only the first few links are scraped; I'm unable to extract links
     from the remaining pages (the website has a "Load more" button), and
     I don't know how to handle the XHR request it triggers in my code.

  2. The second half of the code reads only the last link stored in the
     CSV file, scrapes the respective information, and stores it in a
     text file. It does not go through all the links from the beginning,
     and I can't figure out where I have gone wrong in terms of file
     handling and f.seek(0).

    from pprint import pprint
    import requests
    import lxml
    import csv
    import urllib2
    from bs4 import BeautifulSoup

    def get_url_for_search_key(search_key):
        base_url = 'http://www.marketing-interactive.com/'
        response = requests.get(base_url + '?s=' + search_key)
        soup = BeautifulSoup(response.content, "lxml")
        return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]
        results = soup.findAll('a', {'rel': 'bookmark'})

        for r in results:
            if r.attrs.get('rel') and r.attrs['rel'][0] == 'bookmark':
                newlinks.append(r["href"])

    pprint(get_url_for_search_key('digital advertising'))
    with open('ctp_output.csv', 'w+') as f:
        f.write('\n'.join(get_url_for_search_key('digital advertising')))
        f.seek(0)


    # Reading the CSV file, scraping the respective content and storing it in a .txt file



    with open('ctp_output.csv', 'rb') as f1:
        f1.seek(0)
        reader = csv.reader(f1)

        for line in reader:
            url = line[0]
            soup = BeautifulSoup(urllib2.urlopen(url))

            with open('ctp_output.txt', 'a+') as f2:
                for tag in soup.find_all('p'):
                    f2.write(tag.text.encode('utf-8') + '\n')


Answer

Regarding your second problem, your file mode is wrong: you'll need to change w+ to a+. In addition, your indentation is off; it should look like this:

with open('ctp_output.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)

    for line in reader:
        url = line[0]       
        soup = BeautifulSoup(urllib2.urlopen(url))

        with open('ctp_output.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')

The + suffix will create the file if it doesn't exist. However, w+ truncates the file (erases all of its contents) every time it is opened, which is why only the last link's content survived. a+, on the other hand, will append to the file if it exists, or create it if it does not.
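A quick, self-contained way to see the difference (using a throwaway temp file rather than your CSV):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# 'w+' truncates the file on every open, so only the last write survives
for word in ('first', 'second'):
    with open(path, 'w+') as f:
        f.write(word + '\n')
with open(path) as f:
    print(f.read())  # -> 'second\n'

os.remove(path)

# 'a+' appends on every open, so both writes survive
for word in ('first', 'second'):
    with open(path, 'a+') as f:
        f.write(word + '\n')
with open(path) as f:
    print(f.read())  # -> 'first\nsecond\n'
```

This mirrors what happened in your loop: each pass through the loop reopened ctp_output.txt, and with w+ each reopen would wipe the previous pass's output.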

For your first problem, there's no option but to switch to something that can automate clicking browser buttons, so you'd have to look at Selenium. The alternative is to manually search for that button in the page source, extract the URL from its href (or from the XHR request it fires, visible in your browser's developer tools), and then make a second request. I leave that to you.
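To illustrate the second route, here is a minimal sketch of "find the button, extract the URL, request the next page". It uses only the standard-library html.parser so it runs without hitting the site; the sample markup and the load-more class name are assumptions — you must inspect the real page in your browser's developer tools to find the actual selector and paging URL.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for one page of search results; the
# real class name of the "Load more" link is an assumption.
SAMPLE_HTML = """
<div class="results">
  <a rel="bookmark" href="http://example.com/article-1">Article 1</a>
  <a rel="bookmark" href="http://example.com/article-2">Article 2</a>
  <a class="load-more" href="http://example.com/?s=digital+advertising&amp;page=2">Load more</a>
</div>
"""

class LoadMoreFinder(HTMLParser):
    """Collect bookmark links and the 'Load more' URL in one pass."""

    def __init__(self):
        super().__init__()  # Python 3; use HTMLParser.__init__(self) on Python 2
        self.bookmarks = []
        self.next_page = None

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        if attrs.get('rel') == 'bookmark':
            self.bookmarks.append(attrs['href'])
        elif 'load-more' in attrs.get('class', ''):
            self.next_page = attrs['href']

finder = LoadMoreFinder()
finder.feed(SAMPLE_HTML)
print(finder.bookmarks)   # the scraped article URLs
print(finder.next_page)   # URL to request for the next batch of results
```

In the real scraper you would fetch finder.next_page with requests, feed the returned HTML to a fresh parser, and loop until next_page comes back None. Selenium is only unavoidable when the button fires JavaScript with no usable URL behind it.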
