skyindeer - 1 year ago
R Question

Use R to clean old, raw data

So I have some quite old, raw data that looks like the following:

    1
*******
*******
*******
*******
  S  H
 HHHHH
    2
*******
JSH   K
*******
*******
*******
*******


The first row holds a single number, 1, which is an ID. Rows 2 to 7 are each supposed to have 7 elements, corresponding to 7 categories, say a1, a2, a3, a4, a5, a6, a7. Row 8 is again an ID, so each individual has 6 data rows.

In the end, I want output that looks like this:

ID a1 a2 a3 a4 a5 a6 a7
1 1 * * * * * * *
2 1 * * * * * * *
3 1 * * * * * * *
4 1 * * * * * * *
5 1 <NA> <NA> S <NA> <NA> H <NA>
6 1 <NA> H H H H H <NA>
7 2 * * * * * * *
8 2 J S H <NA> <NA> <NA> K
9 2 * * * * * * *
10 2 * * * * * * *
11 2 * * * * * * *
12 2 * * * * * * *


The data file does not have a filename extension (like .csv or .txt). So the first question is how to read it into R while keeping the column layout unchanged.

I tried read.csv(), but it turns the 6th row into

SH


which assigns S to the 1st category instead of the 3rd, and H to the 2nd category instead of the 6th. Ultimately, how can I generate the desired data frame?

Answer Source

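Seems to me you're probably looking for read.fwf, which reads fixed-width data and so keeps every character in its own column. Below is the method I used. Of course, you'd want to get rid of the textConnection part and replace it with the path to your file (the missing file extension doesn't matter to R), but otherwise I think this should work.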

d <- read.fwf(textConnection(
"    1  
*******
*******
*******
*******
  S  H 
 HHHHH 
    2  
*******
JSH   K
*******
*******
*******
*******"), 
    widths = rep(1, 7),        # seven one-character columns
    na.strings = " ",          # a blank position becomes NA
    stringsAsFactors = FALSE)

# Every 7th row (1, 8, ...) holds an ID in column 5
id_rows <- seq(1, nrow(d), 7)
id <- as.numeric(d[id_rows, 5])
id <- rep(id, each = 6)

# Drop the ID rows, keeping only the 6 data rows per ID
d <- d[-id_rows, ]
d <- cbind(id, d)
names(d)[-1] <- paste0("a", 1:7)
rownames(d) <- NULL
d

   id   a1   a2 a3   a4   a5   a6   a7
1   1    *    *  *    *    *    *    *
2   1    *    *  *    *    *    *    *
3   1    *    *  *    *    *    *    *
4   1    *    *  *    *    *    *    *
5   1 <NA> <NA>  S <NA> <NA>    H <NA>
6   1 <NA>    H  H    H    H    H <NA>
7   2    *    *  *    *    *    *    *
8   2    J    S  H <NA> <NA> <NA>    K
9   2    *    *  *    *    *    *    *
10  2    *    *  *    *    *    *    *
11  2    *    *  *    *    *    *    *
12  2    *    *  *    *    *    *    *
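
If you'd rather read straight from the file than from a textConnection, the same steps could be wrapped up roughly like this. It's only a sketch: read_id_blocks() and its arguments are names I made up for illustration, and it assumes every ID is a single character sitting in column 5, followed by exactly 6 data rows, as in the sample. The temporary file stands in for your real file and, like yours, has no extension.

```r
# Hypothetical helper (not part of base R): reads a fixed-width file made of
# repeated blocks of one ID row followed by `rows_per_id` data rows.
read_id_blocks <- function(path, n_cols = 7, rows_per_id = 6, id_col = 5) {
  raw <- read.fwf(path,
                  widths = rep(1, n_cols),    # one character per column
                  na.strings = " ",           # blank position -> NA
                  colClasses = "character")   # keep every column as text

  block <- rows_per_id + 1
  stopifnot(nrow(raw) %% block == 0)          # file must hold whole blocks only

  id_rows <- seq(1, nrow(raw), by = block)    # rows 1, 8, 15, ...
  ids <- as.numeric(raw[id_rows, id_col])     # assumption: ID lives in one column

  out <- data.frame(id = rep(ids, each = rows_per_id),
                    raw[-id_rows, ],          # everything except the ID rows
                    stringsAsFactors = FALSE)
  names(out)[-1] <- paste0("a", seq_len(n_cols))
  rownames(out) <- NULL
  out
}

# Demo on a temporary file with no extension, using the sample data
path <- tempfile()
writeLines(c("    1  ", "*******", "*******", "*******", "*******",
             "  S  H ", " HHHHH ",
             "    2  ", "*******", "JSH   K", "*******", "*******",
             "*******", "*******"), path)
cleaned <- read_id_blocks(path)
```

The stopifnot() check is there so a truncated or misaligned file fails loudly instead of silently pairing rows with the wrong ID.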