David Hancock - 29 days ago
Python Question

Using beautiful soup to get values from cells in rows in tables

Working with the HTML from http://coinmarketcap.com/, I'm trying to create a Python dictionary containing values from the page, for example:

{bitcoin: {Market_cap:'$11,247,442,728', Volume:'$64,668,900'}, ethereum: ....etc}

However, I'm unfamiliar with how the HTML is structured. For some values, like the market cap, the cell (td) holds the data directly, e.g.:

<td class="no-wrap market-cap text-right" data-usd="11247442728.0" data-btc="15963828.0">

$11,247,442,728

</td>


However, for cells like the trading volume, the value is wrapped in a link, so the format is different, e.g.:

<td class="no-wrap text-right">
<a href="/currencies/bitcoin/#markets" class="volume" data-usd="64668900.0" data-btc="91797.5">$64,668,900</a>
</td>


Here is the code I'm working with:

import requests
from bs4 import BeautifulSoup as bs

request = requests.get('http://coinmarketcap.com/')

content = request.content

soup = bs(content, 'html.parser')

table = soup.findChildren('table')[0]

rows = table.findChildren('tr')

for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        print cell.string


This gives a result with loads of white space and missing data.

For each row how can I get the name of the coin?
For each cell, how can I access its value, whether it's a link (<a>) or a regular value?

EDIT:

By changing the for loop to:

for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        print cell.getText().strip().replace(" ", "")


I was able to get the data I want, i.e.:

1
Bitcoin
$11,254,003,178
$704.95
15,964,212
BTC
$63,057,100
-0.11%


However, it would be cool to have the class names for each cell, i.e.:

id: bitcoin
marketcap: 11,254,003,178
etc......

Answer

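You're almost there. Instead of using the cell.string attribute, use cell.getText(). The .string attribute returns None whenever a tag has more than one child node (for example a link surrounded by whitespace), which is why some of your data goes missing, whereas getText() joins all the text nested inside the cell. You probably need to clean the output strings as well to remove excess white space. I've used a regex, but there are a few other options depending on what state your data is in. I've also added a bit of Python 3 compatibility with the print function.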

from __future__ import print_function
import requests
import re

from bs4 import BeautifulSoup as bs

request = requests.get('http://coinmarketcap.com/')

content = request.content

soup = bs(content, 'html.parser')  

table = soup.findChildren('table')[0]

rows = table.findChildren('tr')

for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        cell_content = cell.getText()
        clean_content = re.sub(r'\s+', ' ', cell_content).strip()
        print(clean_content)
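
To see the difference between .string and getText() without hitting the network, here is a minimal sketch that parses the two cell snippets quoted in the question. The surrounding table and tr tags are added only so the fragment parses as a table. It uses get_text() and find_all(), the current names for getText() and findChildren(). It also reads the class and data-usd attributes, which may be one way to get at the class names and raw numbers the question asks about.

```python
from bs4 import BeautifulSoup

# The two cell shapes quoted in the question, wrapped in a table row
# (the <table>/<tr> wrapper is added here for illustration only).
html = """
<table><tr>
<td class="no-wrap market-cap text-right" data-usd="11247442728.0" data-btc="15963828.0">

$11,247,442,728

</td>
<td class="no-wrap text-right">
<a href="/currencies/bitcoin/#markets" class="volume" data-usd="64668900.0" data-btc="91797.5">$64,668,900</a>
</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
market_cap, volume = soup.find_all("td")

# .string is only set when a tag has exactly one child node.
print(repr(market_cap.string))  # the text, whitespace included
print(volume.string)            # None: whitespace, <a>, whitespace

# get_text() joins all nested text, so it handles both shapes.
print(market_cap.get_text().strip())
print(volume.get_text().strip())

# Attributes are read like dictionary keys; "class" comes back as a list.
print(market_cap["class"])
print(market_cap.get("data-usd"))

# For the volume cell the data-* attributes live on the nested link.
link = volume.find("a")
print(link["class"], link.get("data-usd"))
```

Using .get() rather than square brackets returns None instead of raising KeyError when a cell lacks the attribute, which is convenient when looping over cells of mixed shape.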

The table headings are stored in the first row, so you can extract them like so:

headers = [x.getText() for x in rows[0].findChildren('th')]
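
Putting the headers and the cleaned cell text together, a nested dictionary like the one asked for in the question could be built roughly as follows. This is a sketch under assumptions: SAMPLE_HTML is a hypothetical, cut-down table with the same cell shapes as in the question, the column titles ("Name", "Market Cap", "Volume (24h)") are guesses, and the helper names clean and table_to_dict are made up for illustration. The live page may use different headers or markup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample table; header names are assumptions, not the live page.
SAMPLE_HTML = """
<table>
<tr><th>#</th><th>Name</th><th>Market Cap</th><th>Volume (24h)</th></tr>
<tr>
  <td>1</td>
  <td><a href="/currencies/bitcoin/">Bitcoin</a></td>
  <td class="no-wrap market-cap text-right" data-usd="11247442728.0">
    $11,247,442,728
  </td>
  <td class="no-wrap text-right">
    <a href="/currencies/bitcoin/#markets" class="volume" data-usd="64668900.0">$64,668,900</a>
  </td>
</tr>
</table>
"""


def clean(text):
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def table_to_dict(html, key_column="Name"):
    """Map each row's key_column value (lower-cased) to its other columns."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")

    # The first row holds the column titles.
    headers = [clean(th.get_text()) for th in rows[0].find_all("th")]

    result = {}
    for row in rows[1:]:
        values = [clean(td.get_text()) for td in row.find_all("td")]
        if not values:
            continue  # skip rows without data cells
        record = dict(zip(headers, values))
        name = record.pop(key_column).lower()
        result[name] = record
    return result


coins = table_to_dict(SAMPLE_HTML)
print(coins["bitcoin"]["Market Cap"])
```

To run it against the real site, pass requests.get('http://coinmarketcap.com/').content instead of SAMPLE_HTML and set key_column to whatever the name column is actually called there.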