Winterflags - 6 months ago
Python Question

Pandas: Append rows to DataFrame already running through pandas.DataFrame.apply

Brief:

I am using Selenium WebDriver and Pandas for Python 2.7 to build a web scraper that goes to a sequence of URLs and scrapes the URLs on each page. If it finds URLs there, I want them added to the running sequence. How can I do this using pandas.DataFrame.apply?




Code:

import re

import pandas as pd
from selenium import webdriver

driver = webdriver.Firefox()  # any WebDriver will do

df = pd.read_csv("spreadsheet.csv", delimiter=",")

def crawl(use):
    url = use["URL"]
    driver.get(url)
    # Search the page body's text for strings that look like URLs
    element = driver.find_element_by_tag_name("body")
    scraped_urls = re.findall(r"(www.+)", element.text)
    something_else = "foobar"

    # Ideally the scraped_urls list would be unpacked here
    return pd.Series([scraped_urls, something_else])

# axis=1 passes whole rows to crawl, so use["URL"] works
df[["URL", "Something else"]] = df.apply(crawl, axis=1)

df.to_csv("result.csv", sep=",")


The above scraper uses the column "URL" in spreadsheet.csv to navigate to each new URL. It then scrapes all strings on the page that match the regex www.+ to find URLs, and puts the results in the list scraped_urls.

It also gets the string something_else = "foobar".

When it has processed all the cells in "URL", it writes a new file, result.csv.




My problem:

I have had difficulty finding a way to add the scraped URLs in the list scraped_urls to the column "URL", so that they are inserted just below the "active" URL (retrieved with use["URL"]).

If the column in the source spreadsheet looks like this:

["URL"]
"www.yahoo.com"
"www.altavista.com"
"www.geocities.com"


And on www.yahoo.com, the scraper finds these strings via regex:

"www.angelfire.com"
"www.gamespy.com"


I want to add these as rows to the column "URL" below www.yahoo.com, so that they become the next keywords for the scraper to search:

["URL"]
"www.yahoo.com" #This one is done
"www.angelfire.com" #Go here now
"www.gamespy.com" #Then here
"www.altavista.com" #Then here
"www.geocities.com" #...


Is this possible? Can I append rows on the fly to the DataFrame df while it is being run through apply()?

Answer

I don't think there is a way to use apply the way you envision. And even if there were a way:

  • It would most likely require keeping track of how many items have already been iterated over, so you would know where to insert new items into df['URL'].

  • Inserting into the middle of df['URL'] would require copying all the data from the current DataFrame into a new, larger DataFrame (see the sketch just below). Copying the whole DataFrame, potentially once for every row, would make the code unnecessarily slow.
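To make that copying cost concrete, here is a minimal sketch of what a mid-insert would have to do. The insert_below helper is hypothetical, and every call to pd.concat allocates a brand-new frame and copies every existing row:

import pandas as pd

df = pd.DataFrame({"URL": ["www.yahoo.com", "www.altavista.com", "www.geocities.com"]})

# Hypothetical helper: splice new rows in just below row `pos`.
# pd.concat builds a new DataFrame, copying all existing rows.
def insert_below(df, pos, new_urls):
    new_rows = pd.DataFrame({"URL": new_urls})
    return pd.concat([df.iloc[:pos + 1], new_rows, df.iloc[pos + 1:]],
                     ignore_index=True)

df = insert_below(df, 0, ["www.angelfire.com", "www.gamespy.com"])
print(df["URL"].tolist())
# ['www.yahoo.com', 'www.angelfire.com', 'www.gamespy.com',
#  'www.altavista.com', 'www.geocities.com']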

Instead, a simpler, better way is to use a stack. The stack can be implemented with a simple list: push df['URL'] onto the stack, then pop a url off the stack and process it. When new scraped urls are found, push them onto the stack; they become the next items to be popped off:

import re

import pandas as pd
from selenium import webdriver

driver = webdriver.Firefox()  # any WebDriver will do

def crawl(url_stack):
    url_stack = list(url_stack)
    result = []
    while url_stack:
        url = url_stack.pop()
        driver.get(url)
        # Same scraping step as in the question
        element = driver.find_element_by_tag_name("body")
        scraped_urls = re.findall(r"(www.+)", element.text)
        url_stack.extend(scraped_urls)

        something_else = "foobar"
        result.append([url, something_else])
    return pd.DataFrame(result, columns=["URL", "Something else"])

df = pd.read_csv("spreadsheet.csv", delimiter=",")
# Reverse the column so pop() takes the first URL first
df = crawl(df['URL'][::-1])
df.to_csv("result.csv", sep=",")
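As a quick sanity check of the visit order, here is a dry run that stubs out the browser with a hypothetical dict of page links. One detail worth noting: because pop() takes from the end of the list, extending with reversed(scraped_urls) preserves the in-page link order from the example above:

# Dry run, no Selenium: a hypothetical map of page -> links found there
links = {"www.yahoo.com": ["www.angelfire.com", "www.gamespy.com"]}

url_stack = ["www.geocities.com", "www.altavista.com", "www.yahoo.com"]
visited = []
while url_stack:
    url = url_stack.pop()
    visited.append(url)
    # reversed() makes www.angelfire.com pop before www.gamespy.com
    url_stack.extend(reversed(links.get(url, [])))

print(visited)
# ['www.yahoo.com', 'www.angelfire.com', 'www.gamespy.com',
#  'www.altavista.com', 'www.geocities.com']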