Lee Lee - 4 months ago 7

Reading only some columns when using readHTMLTable to get tables from a website

I am trying to read the table from this website.

http://www.databaseolympics.com/games/gamessport.htm?g=1&sp=ATH

The problem is I only want the first column (Event) and the last column (Medal) to be read.
This is my code and result:

library(XML)  # provides readHTMLTable

temp_URL <- 'http://www.databaseolympics.com/games/gamessport.htm?g=1&sp=ATH'
tab <- readHTMLTable(temp_URL, which = 3, colClasses = c('factor', NULL, NULL, NULL, 'factor'))
head(tab)

     Event         Athlete Country    Result  Medal
1 100m Men       Tom Burke     USA      12.0   GOLD
2            Fritz Hofmann     DEU 12.2 est. SILVER
3             Francis Lane     USA      12.6 BRONZE
4          Alajos Szokolyi     HUN 12.6 est. BRONZE
5 400m Men       Tom Burke     USA      54.2   GOLD
6          Herbert Jamison     USA       n/a SILVER


As you can see, it returns all the columns of the table. I read in the R documentation that setting an entry of colClasses to NULL should make R skip that column, but it is not working for me. I realize that once the data is in R, it is easy to create a new data frame with only the desired columns:

tab<-data.frame(tab$Event,tab$Medal)
head(tab)
  tab.Event tab.Medal
1  100m Men      GOLD
2              SILVER
3              BRONZE
4              BRONZE
5  400m Men      GOLD
6              SILVER


I would really like to avoid this extra step and find a way for only the desired data to come into R. The reason is that this page is part of a script that needs to read thousands of pages, and the extra step could be time-consuming when repeated that many times.

Answer

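Use a list instead of a vector for colClasses. c() silently drops NULL elements, so c('factor',NULL,NULL,NULL,'factor') is just a length-2 character vector and no column is marked to be skipped. list() keeps the NULL entries: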

library(XML)  # provides readHTMLTable

temp_URL <- 'http://www.databaseolympics.com/games/gamessport.htm?g=1&sp=ATH'

tab <- readHTMLTable(temp_URL, which = 3, colClasses = list("factor", NULL, NULL, NULL, "factor"), stringsAsFactors = FALSE)

head(tab)
        V1     V2
1 100m Men   GOLD
2          SILVER
3          BRONZE
4          BRONZE
5 400m Men   GOLD
6          SILVER
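
The difference between the two calls can be checked in base R without touching the website. The sketch below compares what c() and list() actually produce. The helper make_col_classes() is hypothetical (it is not part of the XML package) and only illustrates one way to build such a list when the same column selection is reused across many pages, assuming every table has the same number of columns.

```r
# c() flattens its arguments and drops NULL, so no column is marked for skipping.
vec_classes <- c("factor", NULL, NULL, NULL, "factor")
length(vec_classes)   # 2

# list() keeps each NULL as its own element, one entry per table column.
list_classes <- list("factor", NULL, NULL, NULL, "factor")
length(list_classes)  # 5
sapply(list_classes, is.null)

# Hypothetical helper (an assumption for illustration, not an XML function):
# build a colClasses list that keeps only the columns at positions `keep`.
make_col_classes <- function(n_cols, keep, col_class = "character") {
  classes <- vector("list", n_cols)  # a list of n_cols NULL entries
  classes[keep] <- col_class         # fill only the positions to keep
  classes
}

event_medal <- make_col_classes(5, keep = c(1, 5))
```

The result could then be passed as colClasses = event_medal in the readHTMLTable call above, under the assumption that the target table has exactly five columns.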