Topto - 29 days ago
Python Question

Writing tabular data to a csv file from a webpage

I've written a script in Python to parse some data from a webpage and write it to a csv file via pandas. So far it can parse all the tables available on that page, but when writing to the csv file it only writes the last table. The data are clearly being overwritten on each iteration of the loop. How can I fix this flaw so that my scraper writes the data from all the different tables instead of only the last one? Thanks in advance.

import csv
import requests
from bs4 import BeautifulSoup
import pandas as pd

res = requests.get('').text
soup = BeautifulSoup(res,"lxml")
for table in soup.find_all("table"):
    df = pd.read_html(str(table))[0]

By the way, I expect to write the data to a csv file using pandas only. Thanks again.

Answer Source

You can use read_html, which returns a list of all the DataFrames found in the webpage, and then concat them into one df:

dfs = pd.read_html('')

df = pd.concat(dfs, ignore_index=True)
# rename the auto-generated columns if necessary
d = {'Unnamed: 1':'a', 'Unnamed: 7':'b'}
df = df.rename(columns=d)
print(df.head())
               matchup               a  time (ET)  nat tv  away tv  home tv  \
0          Atlanta ATL       Miami MIA        NaN     NaN      NaN      NaN   
1               LA LAC     Toronto TOR        NaN     NaN      NaN      NaN   
2  Guangzhou Guangzhou  Washington WSH        NaN     NaN      NaN      NaN   
3        Charlotte CHA      Boston BOS        NaN     NaN      NaN      NaN   
4          Orlando ORL     Memphis MEM        NaN     NaN      NaN      NaN   

                           tickets   b  
0  2,401 tickets available from $6 NaN  
1   284 tickets available from $29 NaN  
2  2,792 tickets available from $2 NaN  
3  2,908 tickets available from $6 NaN  
4  1,508 tickets available from $3 NaN  
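The concat-and-rename step can be sketched with made-up DataFrames standing in for what pd.read_html would return (the column values here are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for the list of DataFrames returned by pd.read_html
dfs = [
    pd.DataFrame({"matchup": ["Atlanta ATL"], "Unnamed: 1": ["Miami MIA"]}),
    pd.DataFrame({"matchup": ["LA LAC"], "Unnamed: 1": ["Toronto TOR"]}),
]

# Stack the tables into one DataFrame; ignore_index rebuilds the 0..n-1 row labels
df = pd.concat(dfs, ignore_index=True)

# Give the auto-generated 'Unnamed' column a readable name
df = df.rename(columns={"Unnamed: 1": "a"})
```

With ignore_index=True the row labels run 0 and 1 rather than repeating 0 from each source table.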

And finally use to_csv to write to a file:

df.to_csv("table_item.csv", index=False)
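A quick round trip with a toy DataFrame (made up for illustration) shows that index=False keeps the row labels out of the file, so reading it back reproduces the same columns and rows:

```python
import pandas as pd

# Toy DataFrame standing in for the concatenated tables
df = pd.DataFrame({"matchup": ["Atlanta ATL", "LA LAC"],
                   "tickets": [2401, 284]})

# index=False drops the 0..n-1 row labels from the csv file
df.to_csv("table_item.csv", index=False)

# Reading it back yields the same columns and rows
df2 = pd.read_csv("table_item.csv")
```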


For learning purposes, it is also possible to append each DataFrame to a list and then concat:

res = requests.get('').text
soup = BeautifulSoup(res, "lxml")
dfs = []
for table in soup.find_all("table"):
    # append each parsed table instead of overwriting df
    dfs.append(pd.read_html(str(table))[0])

df = pd.concat(dfs, ignore_index=True)