John Laudun - 6 months ago
Python Question

Write a series of strings (plus a number) to a line of a CSV

It's not pretty code, but I have some code that grabs a series of strings out of an HTML file: `author`, `title`, `date`, `length`, `text`. I have 2000+ HTML files and I want to go through all of them and write this data to a single CSV file. I know all of this will eventually have to be wrapped in a `for` loop, but before then I am having a hard time understanding how to go from getting these values to writing them to a CSV file. My thinking was to create a list or a tuple first and then write that to a line in a CSV file:

the_file = "/Users/john/Code/tedtalks/test/transcript?language=en.0"
holding = soup(open(the_file).read(), "lxml")
at = holding.find("title").text
author = at[0:at.find(':')]
title = at[at.find(":")+1 : at.find("|") ]
date = re.sub('[^a-zA-Z0-9]',' ', holding.select_one("span.meta__val").text)
length_data = holding.find_all('data', {'class' : 'talk-transcript__para__time'})
(m, s) = ([x.get_text().strip("\n\r")
for x in length_data if re.search(r"(?s)\d{2}:\d{2}",
x.get_text().strip("\n\r"))][-1]).split(':')
length = int(m) * 60 + int(s)
firstpass = re.sub(r'\([^)]*\)', '', holding.find('div', class_ = 'talk-transcript__body').text)
text = re.sub('[^a-zA-Z\.\']',' ', firstpass)
data = ([author].join() + [title] + [date] + [length] + [text])
with open("./output.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for line in data:
        writer.writerow(line)


I can't for the life of me figure out how to get Python to respect the fact that these are strings and should be stored as strings and not as lists of letters. (The `.join()` above is me trying to figure this out.)
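For what it's worth, the letters-in-columns behaviour comes from `csv.writer.writerow()`, which takes one sequence per row and turns each element into a field; a bare string is itself a sequence, so it splits into characters. A minimal sketch (the field values are made up):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)

# writerow() takes one sequence per row; each element becomes one field.
writer.writerow(["Jane Doe", "My Talk", "Jun 2006", 1100, "some text"])

# A bare string is also a sequence, so every letter lands in its own column.
writer.writerow("abc")

print(buf.getvalue())
```

So the fix is to pass writerow() a list (or tuple) whose elements are the whole strings, not to join anything.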

Looking ahead: is it better/more efficient to handle 2000 files this way, stripping them down to what I want and writing one line of the CSV at a time, or is it better to build a data frame in `pandas` and then write that to CSV? (All 2000 files = 160MB, so stripped down the eventual data can't be more than 100MB; no great size here, but looking forward size may eventually become an issue.)

Answer

This will grab all the files and put the data into a CSV; you just need to pass the path to the folder that contains the HTML files and the name of your output file:

import re
import csv
import os

from bs4 import BeautifulSoup

def parse(soup):
    # title and author can each be parsed from their own tags.
    author = soup.select_one("h4.h12.talk-link__speaker").text
    title = soup.select_one("h4.h9.m5").text
    # just need to strip the text from the date string, no regex needed.
    date = soup.select_one("span.meta__val").text.strip()      
    # we want the last time which is the talk-transcript__para__time previous to the footer.
    mn, sec = map(int, soup.select_one("footer.footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text.split(":"))
    length = (mn * 60 + sec)        
    # to ignore (Applause) etc.. we can just pull from the actual text fragment checking for (
    text = " ".join(d.text for d in soup.select("span.talk-transcript__fragment") if not d.text.startswith("("))        
    # clean the text
    text = re.sub('[^a-zA-Z\.\']', ' ', text)
    return author.strip(), title.strip(), date, length, text


def to_csv(pth, out):
    # open the output file; newline="" keeps csv from inserting blank lines.
    with open(out, "w", newline="") as csv_file:
        # create the csv.writer.
        wr = csv.writer(csv_file)
        # write our headers.
        wr.writerow(["author", "title", "date", "length", "text"])
        # get all our html files.
        for html in os.listdir(pth):
            with open(os.path.join(pth, html)) as f:
                # parse the file and write the data to a row.
                wr.writerow(parse(BeautifulSoup(f, "lxml")))


to_csv("./test","output.csv")
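To sanity-check the output, the file can be read back with `csv.DictReader`, which maps the header row onto each record; a short sketch using a hypothetical sample row in the same layout the writer produces:

```python
import csv
import io

# Hypothetical sample mirroring the header plus one row of the output CSV.
sample = "author,title,date,length,text\r\nJane Doe,My Talk,Jun 2006,1100,some text\r\n"

for row in csv.DictReader(io.StringIO(sample)):
    # every field, including length, comes back as a string
    print(row["author"], row["length"])
```

Note that `length` comes back as the string "1100"; convert it with int() if you need to do arithmetic on it later.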