Fruitspunchsamurai - 2 months ago 14
Python Question

Scraping values from a webpage table

I want to create a Python dictionary that maps color names to background colors, using the data in this color dictionary.

What is the best way to access the color name strings and the background color hex values? I want to create a mapping for color name --> hex values, where 1 color name maps to 1 or more hex values.

The following is my code:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://people.csail.mit.edu/jaffer/Color/M.htm')
soup = BeautifulSoup(page.text)


I'm not sure how to specify what to scrape from the table. I've tried the following to get a format that's useful:

soup.td
<td nowrap="" width="175*">abbey</td>


soup.get_text()
"(M)\n td { padding: 0 10px; } \n\n(M) Dictionary of Color Maerz and Paul, Dictionary of Color, 1st ed. \n\nabbey207\nabsinthe [green] 120\nabsinthe yellow105\nacacia101102\nacademy blue173\nacajou43\nacanthe95\nacier109\nackermann's green137\naconite violet223....
.............\nyolk yellow84\nyosemite76\nyucatan5474\nyucca150\nyu chi146\nyvette violet228\n\nzaffre blue 179182\nzanzibar47\nzedoary wash71\nzenith [blue] 199203\nzephyr78\nzinc233265\nzinc green136\nzinc orange5053\nzinc yellow84\nzinnia15\nzulu47\nzuni brown58\n\n"


soup.select('tr td')
[...
<td nowrap="" width="175*">burnt russet</td>,
<td style="background-color:#722F37; color:#FFF" title="16">16</td>,
<td style="background-color:#79443B; color:#FFF" title="43">43
</td>,
<td nowrap="" width="175*">burnt sienna</td>,
<td style="background-color:#9E4732; color:#FFF" title="38">38
</td>,
...]


EDIT:
I want to scrape the strings in the td elements, e.g. "burnt russet", as the color name, together with the hex value from each following td element whose "style" attribute specifies a background color.

I want the dictionary to look as follows:

color_map = {'burnt russet': ['#722F37', '#79443B'], 'burnt sienna': ['#9E4732']}

Answer

The webpage that you are trying to scrape is badly formed HTML. Viewing the page source shows that most rows start with a <tr> followed by one or more <td> elements, all without their closing tags. With BeautifulSoup you should specify an HTML parser explicitly, and in this case you would have to hope that the parser can recover the table structure.

I present a solution that relies on the structured format of the webpage itself. Instead of parsing the webpage as HTML, I use the fact that each color has its own line and each line has a common format.

import re
import requests

page = requests.get('http://people.csail.mit.edu/jaffer/Color/M.htm')
lines = page.text.splitlines()

# Every color row starts with this exact prefix, followed by the color name.
opening = '<tr><td width="175*" nowrap>'
# The first swatch cell begins here, which marks the end of the color name.
ending = '<td title="'
# Capture the '#' plus the six characters after "background-color:".
bg_re = r'style="background-color:(#.{6})'
color_map = {}
for line in lines:
    if line.startswith(opening):
        color_name = line[len(opening):line.find(ending)].strip()
        color_hex = [match.group(1) for match in re.finditer(bg_re, line)]
        # Some colors appear on more than one row, so extend rather than overwrite.
        color_map.setdefault(color_name, []).extend(color_hex)

color_map['burnt russet']
## ['#722F37', '#79443B']

Quick and dirty, but it works.
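
If the exact line layout ever changes, a more tolerant option might be the standard library's event-based html.parser, which reports each start tag as it is seen and so does not care about the missing closing tags. The following is only a sketch under assumptions: SAMPLE is a hypothetical excerpt I wrote to mimic the page's row format (it is not fetched from the site), and ColorTableParser, BG_RE, and the rule "a td without a background color is a name cell" are my own illustrative choices.

```python
import re
from html.parser import HTMLParser

# Hypothetical excerpt modelled on the rows shown in the question; closing
# </td> and </tr> tags are omitted on purpose to mimic the real page.
SAMPLE = '''<table>
<tr><td width="175*" nowrap>burnt russet<td title="16" style="background-color:#722F37; color:#FFF">16<td title="43" style="background-color:#79443B; color:#FFF">43
<tr><td width="175*" nowrap>burnt sienna<td title="38" style="background-color:#9E4732; color:#FFF">38
</table>'''

# Matches the background color inside a style attribute value.
BG_RE = re.compile(r'background-color:\s*(#[0-9A-Fa-f]{6})')


class ColorTableParser(HTMLParser):
    """Collects a mapping of color name -> list of background hex values."""

    def __init__(self):
        super().__init__()
        self.color_map = {}
        self._name_parts = None  # text chunks of the name cell being read
        self._current = None     # color name that the next swatches belong to

    def _finish_name(self):
        # A name cell ends when the next tag starts, since there is no </td>.
        if self._name_parts is None:
            return
        name = ' '.join(''.join(self._name_parts).split())
        self._name_parts = None
        if name:
            self._current = name
            # Repeated names keep their earlier values and gain new ones.
            self.color_map.setdefault(name, [])

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._finish_name()
            self._current = None
        elif tag == 'td':
            self._finish_name()
            style = dict(attrs).get('style') or ''
            match = BG_RE.search(style)
            if match is None:
                # Assumption: a cell without a background color holds a name.
                self._name_parts = []
            elif self._current is not None:
                self.color_map[self._current].append(match.group(1))

    def handle_data(self, data):
        if self._name_parts is not None:
            self._name_parts.append(data)


parser = ColorTableParser()
parser.feed(SAMPLE)
parser.close()
print(parser.color_map)
```

To run it against the real page, pass page.text to parser.feed() instead of SAMPLE. I have only checked it against the sample above, so treat the result on the live page as something to verify.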
