I am trying to scrape http://emojipedia.org/emoji/, but I am not sure what the most efficient way to do so is. The data I want is inside the table with class="emoji_list". I would like to save the contents of each "td" in separate columns. The output should look like the following, where each line represents an emoji:
Col1_Link Col2_emoji Col3_Comment Col4_UTF
soup.findAll('tr', limit=2) won't do much, since it only gets the first two trs on the page. You need to first find all the rows of the table, then extract what you want, which is inside the two tds in each tr:

    import requests
    from bs4 import BeautifulSoup

    url = "http://emojipedia.org/emoji/"
    html = requests.get(url).content
    # Name the parser explicitly to avoid bs4's "no parser specified" warning.
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.emoji-list")

    # Only the first five rows, as a quick check.
    for row in table.find_all("tr")[:5]:
        td1, td2 = row.find_all("td")
        # The first cell's text is "<emoji> <description>"; split on the first whitespace.
        em, desc = td1.text.split(None, 1)
        print(td1.a["href"], em, desc, td2.text)
Another way, which gets the text without splitting, is to take the text from the a tag while excluding the text of its children:

    for row in table.find_all("tr"):
        td1, td2 = row.find_all("td")
        # find(text=True, recursive=False) returns the anchor's first direct
        # text node, so the span's text is not included.
        print(td1.a["href"], td1.a.span.text, td1.a.find(text=True, recursive=False), td2.text)
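If some rows might not have exactly two tds (a header row, for example), unpacking into td1, td2 will raise an error. Here is a rough sketch of a more defensive version, run against an inline sample so it can be tried without hitting the site. The SAMPLE_HTML markup and the extract_rows name are my own assumptions modelled on the structure described above; the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the structure the answer assumes;
# the live page may be laid out differently.
SAMPLE_HTML = """
<table class="emoji-list">
  <tr><th>Emoji</th><th>Codepoints</th></tr>
  <tr>
    <td><a href="/grinning-face/"><span class="emoji">😀</span> Grinning Face</a></td>
    <td>U+1F600</td>
  </tr>
  <tr>
    <td><a href="/red-heart/"><span class="emoji">❤</span> Red Heart</a></td>
    <td>U+2764</td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return (link, emoji, description, codepoint) tuples from the table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.emoji-list")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        tds = tr.find_all("td")
        # Skip header rows or anything that is not a two-cell row with a link.
        if len(tds) != 2 or tds[0].a is None:
            continue
        link = tds[0].a
        emoji = link.span.get_text(strip=True) if link.span else ""
        # Join only the anchor's direct text nodes, leaving out the span's text.
        desc = "".join(link.find_all(string=True, recursive=False)).strip()
        rows.append((link["href"], emoji, desc, tds[1].get_text(strip=True)))
    return rows


for row in extract_rows(SAMPLE_HTML):
    print(*row)
```

To use it on the real page, pass requests.get(url).content instead of SAMPLE_HTML.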
Also, I would stick to using requests over urllib.
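Since the question asks for each td's contents in separate columns, one option is to collect the four values per row as tuples and hand them to the standard csv module instead of printing. This is only a sketch: write_rows and the sample tuple are hypothetical names and data for illustration, and the header reuses the column names from the question.

```python
import csv
import io

# Column names taken from the question's desired output.
HEADER = ("Col1_Link", "Col2_emoji", "Col3_Comment", "Col4_UTF")


def write_rows(rows, fileobj):
    """Write the header plus one CSV line per (link, emoji, comment, utf) tuple."""
    writer = csv.writer(fileobj)
    writer.writerow(HEADER)
    writer.writerows(rows)


# Made-up sample row standing in for the scraped values.
sample = [("/grinning-face/", "😀", "Grinning Face", "U+1F600")]

buf = io.StringIO()
write_rows(sample, buf)
print(buf.getvalue())
```

To write to disk, open the file with open("emojis.csv", "w", newline="", encoding="utf-8") and pass it in place of buf; the file name is just an example.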