confusedanalyst - 3 months ago 26
Python Question

Scraping text from multiple web pages in Python

I've been tasked to scrape all the text off of any webpage a certain client of ours hosts. I've managed to write a script that scrapes the text off a single webpage, and you can manually replace the URL in the code each time you want to scrape a different page. But obviously this is very inefficient. Ideally, Python could connect to some list containing all the URLs I need, iterate through it, and print all the scraped text into a single CSV. I've tried to write a "test" version of this by creating a list of 2 URLs and getting my code to scrape both. However, as you can see, my code only keeps the text from the most recent URL in the list and does not hold onto the first page it scraped. I think this is due to a deficiency in my print statement, since it always writes over itself. Is there a way to have everything I scraped held somewhere until the loop goes through the entire list, AND then print everything?

Feel free to totally dismantle my code. I know nothing of computer languages. I just keep getting assigned these tasks and use Google to do my best.

import urllib.request
import re
from bs4 import BeautifulSoup

data_file_name = 'C:\\Users\\confusedanalyst\\Desktop\\python_test.csv'
urlTable = ['url1', 'url2']

def extractText(url):
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')

    ##Extracts all paragraph and header elements from the URL as result lists
    text = soup.find_all("p")
    headers1 = soup.find_all("h1")
    headers2 = soup.find_all("h2")
    headers3 = soup.find_all("h3")

    ##Forces the result lists into str
    text = str(text)
    headers1 = str(headers1)
    headers2 = str(headers2)
    headers3 = str(headers3)

    ##Strips HTML tags and brackets from the extracted strings
    text = text.strip('[]')
    text = re.sub('<[^<]+?>', '', text)

    headers1 = headers1.strip('[]')
    headers1 = re.sub('<[^<]+?>', '', headers1)

    headers2 = headers2.strip('[]')
    headers2 = re.sub('<[^<]+?>', '', headers2)

    headers3 = headers3.strip('[]')
    headers3 = re.sub('<[^<]+?>', '', headers3)

    print_to_file = open(data_file_name, 'w', encoding='utf-8')
    print_to_file.write(text + headers1 + headers2 + headers3)
    print_to_file.close()


for i in urlTable:
    extractText(i)

Answer

Try this: 'w' opens the file with the pointer at the beginning of the file and truncates whatever is already there, so each call to extractText overwrites the last one. You want the pointer at the end of the file:

print_to_file = open(data_file_name, 'a', encoding='utf-8')
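To make the difference concrete, here is a minimal sketch (the demo file names are made up):

```python
# With 'w', every open() truncates the file first, so each pass of the
# loop replaces whatever the previous pass wrote.
for word in ['first', 'second']:
    with open('demo_w.txt', 'w', encoding='utf-8') as f:
        f.write(word + '\n')
# demo_w.txt now holds only 'second\n'.

# With 'a', the pointer starts at the end of the file, so writes accumulate.
for word in ['first', 'second']:
    with open('demo_a.txt', 'a', encoding='utf-8') as f:
        f.write(word + '\n')
# demo_a.txt now holds 'first\nsecond\n'.
```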

Here are all of the different read and write modes, for future reference:

The argument mode points to a string beginning with one of the following
 sequences (Additional characters may follow these sequences.):

 ``r''   Open text file for reading.  The stream is positioned at the
         beginning of the file.

 ``r+''  Open for reading and writing.  The stream is positioned at the
         beginning of the file.

 ``w''   Truncate file to zero length or create text file for writing.
         The stream is positioned at the beginning of the file.

 ``w+''  Open for reading and writing.  The file is created if it does not
         exist, otherwise it is truncated.  The stream is positioned at
         the beginning of the file.

 ``a''   Open for writing.  The file is created if it does not exist.  The
         stream is positioned at the end of the file.  Subsequent writes
         to the file will always end up at the then current end of file,
         irrespective of any intervening fseek(3) or similar.

 ``a+''  Open for reading and writing.  The file is created if it does not
         exist.  The stream is positioned at the end of the file.  Subsequent
         writes to the file will always end up at the then current end of
         file, irrespective of any intervening fseek(3) or similar.
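One caveat with 'a': it never truncates, so text from earlier runs of the script will pile up in the output file. A pattern that avoids that, and matches what the question describes (hold everything until the loop finishes, then print once), is to collect the text for each page in a list and do a single 'w' write at the end. A rough sketch, assuming BeautifulSoup's get_text() in place of the regex tag-stripping:

```python
import urllib.request
from bs4 import BeautifulSoup

def extract_text_from_html(html):
    """Pull the text out of every p/h1/h2/h3 element, in document order."""
    soup = BeautifulSoup(html, 'html.parser')
    return '\n'.join(el.get_text(strip=True)
                     for el in soup.find_all(['p', 'h1', 'h2', 'h3']))

def scrape_all(urls, out_path):
    """Fetch each URL, collect all the text, then write everything once."""
    results = [extract_text_from_html(urllib.request.urlopen(url))
               for url in urls]
    # A single open in 'w' mode truncates exactly once, so text from
    # previous runs never lingers in the output file.
    with open(out_path, 'w', encoding='utf-8') as out:
        out.write('\n'.join(results))

# Usage (placeholder URLs from the question):
# scrape_all(['url1', 'url2'], 'python_test.csv')
```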