theprowler - 1 day ago
HTML Question

Neither split("\n") nor splitlines() works to split a string up

My question is simply: if neither of the above commands splits a string into multiple lines, does that mean that nothing is delimiting the string?

My example is pretty in depth but in short: I have parsed specific data out of an HTML table with BeautifulSoup, but when I go to print the data it is all one messy string instead of a neat table format. I tried converting it to a Pandas DataFrame but still no success. I then tried using the above commands to neaten up the output but those also failed. This all leads me to believe it must in fact be one continuous string with no delimiters (even though obviously in the table they are separate entries).

I would love help with this problem. I am not sure if I'm using the commands wrong, or if my data really is this difficult to work with. Thank you.

My data (and how I expect it should be printed):

desired output

My relevant code:

rows = table.findAll("tr")[1:2]
data = {
'ID' : [],
'Available Quota' : [],
'Live Weight Pounds' : [],
'Price' : [],
'Date Posted' : []
}

for row in rows:
    cols = row.findAll("td")
    data['ID'].append(cols[0].get_text())
    data['Available Quota'].append(cols[1].get_text())
    data['Live Weight Pounds'].append(cols[2].get_text())
    data['Price'].append(cols[3].get_text())
    data['Date Posted'].append(cols[4].get_text())

fishData = pd.DataFrame(data)
#print(fishData)
str1 = ''.join(data['Available Quota'])
#print(type(str1))
#str1.split("\n")
str1.splitlines()
print(str1)


What gets printed:

GOM CODGOM HADDDABSGOM YT

Answer

My guess is that there's some formatting happening inside the table cells that you're throwing away. Supposing that the four lines visible in your table cell are separated by <br> tags, BeautifulSoup will discard that information when you call get_text:

>>> from bs4 import BeautifulSoup
>>> s = 'First line <br />Second line <br />Third line'
>>> soup = BeautifulSoup(s, 'html.parser')
>>> soup.get_text()
u'First line Second line Third line'

As noted over here, you can swap out <br> tags for newlines, which might make your life easier:

>>> for br in soup.find_all("br"):
...     br.replace_with("\n")
>>> soup.get_text()
u'First line \nSecond line \nThird line'
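
If you would rather not modify the tree, get_text() also accepts separator and strip arguments, which is a different route to the same result. Here is a minimal sketch of that; the cell markup is an assumption on my part (I am guessing the four entries are separated by <br> tags), so adjust it to whatever your table actually contains:

```python
from bs4 import BeautifulSoup

# Hypothetical cell markup: four entries separated by <br> tags (an assumption).
html = "<td>GOM COD<br/>GOM HADD<br/>DABS<br/>GOM YT</td>"
cell = BeautifulSoup(html, "html.parser").td

# Default get_text() concatenates the text nodes with nothing between them.
joined = cell.get_text()

# separator= is placed between text nodes; strip=True trims each node first.
separated = cell.get_text(separator="\n", strip=True)
lines = separated.splitlines()

print(repr(joined))
print(lines)
```

If the assumption about the markup holds, joined is the same run-together string you are seeing, while lines holds one entry per visible line of the cell.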

The strings and stripped_strings generators might also be useful here; they return chunks of text which were originally separated by tags:

>>> soup = BeautifulSoup(s, 'html.parser')
>>> list(soup.stripped_strings)
[u'First line', u'Second line', u'Third line']

So, what happens if you do:

data['Available Quota'].extend(cols[1].stripped_strings)

Hopefully, you should have the list you're looking for in data['Available Quota']:

>>> data['Available Quota']
['GOM COD', 'GOM HADD', 'DABS', 'GOM YT']
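
One more thing worth checking in the code from the question: Python strings are immutable, so str1.splitlines() and str1.split("\n") return a new list and leave str1 untouched. Printing str1 afterwards will always show the original string, whether or not it contains newlines. A small sketch using the string from the question (the second string, with newlines, is made up for comparison):

```python
# The string printed in the question; it contains no newline characters.
str1 = "GOM CODGOM HADDDABSGOM YT"

# splitlines() returns a new list; str1 itself is not modified.
parts = str1.splitlines()
print(parts)  # a one-element list, because there is no "\n" to split on

# Hypothetical string with real newlines, for comparison.
with_newlines = "GOM COD\nGOM HADD\nDABS\nGOM YT"
lines = with_newlines.splitlines()
print(lines)
```

So to answer the literal question: if the list that splitlines() returns has only one element, then there really is no newline in the string, but you have to inspect the returned list rather than the original variable to find out.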