strahanstoothgap - 1 month ago
Python Question

Python to CSV is splitting string into two columns when I want one

I am scraping a page with BeautifulSoup, and sometimes the contents of a tag include a <br>.

So sometimes it looks like this:

<td class="xyz">
text 1
<br>
text 2
</td>


and sometimes it looks like this:

<td class="xyz">
text 1
</td>


I am looping through this and adding to an output_row list that I eventually add to a list of lists. Whether I see the former format or the latter, I want the text to be in one cell.

I've found a way to detect the <br> tag: td.string comes back as None, and I also know that text 2 always has 'ABC' in it. So:

elif td.string is None:
    if 'ABC' in td.contents[2]:
        new_string = td.contents[0] + ' ' + td.contents[2]
        output_row.append(new_string)
        print(new_string)
    else:
        # this is for another situation and it works fine
        pass


When I print this in a Jupyter Notebook, it shows up as "text 1 text 2" on one line. But when I open my CSV, it is in two different columns. So when td.string has contents (meaning no <br> tag), text 1 shows up in one column, but when I get to the pieces that have a <br> tag, all my data gets shifted.

I'm not sure why it shows up as two different strings (two columns) when I concatenate them before appending them to the list.

I'm writing to file like this:

import csv

with open('C:/location/file.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    #writer.writerow(headers)
    for row in output_rows:
        writer.writerow(row)
# no csv_file.close() needed: the with block closes the file on exit

Answer

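You can handle both cases with get_text(), passing strip=True and a separator. strip=True trims the whitespace around each piece of text, and separator=" " joins the pieces with a single space, so the <br> case and the plain case each yield one clean string: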

from bs4 import BeautifulSoup

dat="""
<table>
    <tr>
        <td class="xyz">
            text 1
            <br>
            text 2
        </td>

        <td class="xyz">
            text 1
        </td>
    </tr>
</table>
"""

soup = BeautifulSoup(dat, 'html.parser')
for td in soup.select("table > tr > td.xyz"):
    print(td.get_text(separator=" ", strip=True))

Prints:

text 1 text 2
text 1
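
To tie this back to the CSV output in the question, here is a minimal sketch that builds each row with get_text() and writes it with csv.writer. It writes to an in-memory io.StringIO buffer instead of the real file so the result can be read back and checked. The helper name build_row and the variable rows_read_back are made up for illustration, and the sample HTML is the same assumed structure as above; the real page may differ.

```python
import csv
import io

from bs4 import BeautifulSoup

html = """
<table>
    <tr>
        <td class="xyz">
            text 1
            <br>
            text 2
        </td>
        <td class="xyz">
            text 1
        </td>
    </tr>
</table>
"""


def build_row(tr):
    # Hypothetical helper: one cleaned string per <td>, so a <br> inside
    # a cell never adds stray whitespace or newlines to the field.
    return [td.get_text(separator=" ", strip=True) for td in tr.find_all("td")]


soup = BeautifulSoup(html, "html.parser")
output_rows = [build_row(tr) for tr in soup.find_all("tr")]

# Write to an in-memory buffer; swap in open(path, "w", newline="") for a file.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerows(output_rows)

# Read the CSV back to check that each <td> ended up in exactly one field.
buffer.seek(0)
rows_read_back = list(csv.reader(buffer))
print(rows_read_back)
```

If the row read back has one field per td, the cell text is going into the CSV as a single value, and any remaining column shift would have to come from somewhere else in the row-building logic.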