Meli Meli - 2 months ago 6
R Question

download files in R while maintaining original structure

The code example below is to download census data by state from the census.gov website. The problem I run into is that the leading zeros are deleted when I download as a csv file. How do I maintain the original structure of the files while downloading?

stateFIPScodes<-seq(10,13,1)

for(i in 1:length(stateFIPScodes)){
URL<-paste("https://www.census.gov/popest/data/intercensal/county/files/CO-EST00INT- ALLDATA-",stateFIPScodes[i],".csv",sep="" )
destfile<-paste("state2000_2010_",stateFIPScodes[i],".csv" ,sep="") # CSV files drop leading zero!!
download.file(URL, destfile)
}


Thank you!

Answer

It appears as if the files are downloaded correctly, but you are correct: if you use read.csv to load the data into R, then some of the columns are interpreted as numeric, and therefore they lose the leading zeros.

To take your code that downloads files -

stateFIPScodes<-seq(10,13,1)

for(i in seq_along(stateFIPScodes)){
  code <- stateFIPScodes[[i]]
  URL <- paste0("https://www.census.gov/popest/data/intercensal/county/files/CO-EST00INT-ALLDATA-", code, ".csv")
  destfile <- paste0("state2000_2010_",code,".csv") 
  download.file(URL, destfile)
}

If we use base read.csv we get no trailing zeros:

library(dplyr)

read.csv("state2000_2010_10.csv") %>%
  select(1:5) %>%
  head

#>   SUMLEV STATE COUNTY   STNAME     CTYNAME
#> 1     50    10      1 Delaware Kent County
#> 2     50    10      1 Delaware Kent County
#> 3     50    10      1 Delaware Kent County
#> 4     50    10      1 Delaware Kent County
#> 5     50    10      1 Delaware Kent County
#> 6     50    10      1 Delaware Kent County

That's because the first few columns are being read in as numeric.

read.csv("state2000_2010_10.csv") %>% str()
#> 'data.frame':    780 obs. of  50 variables:
#>  $ SUMLEV      : int  50 50 50 50 50 50 50 50 50 50 ...
#>  $ STATE       : int  10 10 10 10 10 10 10 10 10 10 ...
#>  $ COUNTY      : int  1 1 1 1 1 1 1 1 1 1 ...

There are two ways to resolve this:

  1. Manually pass the data types into read.csv, or just prevent all conversions by adding colClasses = "character".
  2. Use readr::read_csv which handles it correctly.

We could just prevent automatic coercion:

read.csv("state2000_2010_10.csv", colClasses = "character") %>% str()
#> 'data.frame':    780 obs. of  50 variables:
#>  $ SUMLEV      : chr  "050" "050" "050" "050" ...
#>  $ STATE       : chr  "10" "10" "10" "10" ...
#>  $ COUNTY      : chr  "001" "001" "001" "001" ...
#>  $ STNAME      : chr  "Delaware" "Delaware" "Delaware" "Delaware" ...

You would need to choose what columns you wanted to cast to as.numeric.

Or you could select the columns, e.g.

read.csv("state2000_2010_10.csv", 
         colClasses = c(
           SUMLEV = "character",
           STATE = "numeric",
           COUNTY = "character"
         )) %>%
  select(1:5) %>%
  head

#>   SUMLEV STATE COUNTY   STNAME     CTYNAME
#> 1    050    10    001 Delaware Kent County
#> 2    050    10    001 Delaware Kent County
#> 3    050    10    001 Delaware Kent County
#> 4    050    10    001 Delaware Kent County
#> 5    050    10    001 Delaware Kent County
#> 6    050    10    001 Delaware Kent County

Second, you could use readr, which has more intelligent column type inference:

#> read_csv("state2000_2010_10.csv") %>%
#>    select(1:5) %>%
#>    head
#> # A tibble: 6 × 5
#>   SUMLEV STATE COUNTY   STNAME     CTYNAME
#>    <chr> <int>  <chr>    <chr>       <chr>
#> 1    050    10    001 Delaware Kent County
#> 2    050    10    001 Delaware Kent County
#> 3    050    10    001 Delaware Kent County
#> 4    050    10    001 Delaware Kent County
#> 5    050    10    001 Delaware Kent County
#> 6    050    10    001 Delaware Kent County