S.Fischer S.Fischer - 25 days ago 20
R Question

Importing dataset in specific way RStudio

I am trying to import a dataset in RStudio which looks like this:

segment 1 rank 1 points 10
11.11111 22.22222
33.33333 44.44444
... ...
99.99999 99.99999
segment 2 rank 4 points 40
12.12345 45.56789
... ...
98.76543 54.43211


The dataset is split up in segments, all of which have a rank and some pairs of values. The amount of pairs is indicated by the 'points' variable.

I have tried many different ways of importing this dataset, using
scan()
,
read.table()
and
read.csv()
for example. I have also tried many different options with these importing methods, none of them worked properly.

Using
read.table()
function, I constantly got the following error:


Line 2 did not have 6 elements


As of now I managed to import the dataset the following way:

read.table(DF, fill=TRUE)


However this way the dataset in R is absolutely littered with missing values, due to the fill option, making working with the dataset extremely tedious.

How would I go about importing the dataset in a cleaner way? So I can eliminate the missing fields.
For example:

Segment Rank Points Value1 Value2
1 2 10 11.111 22.222
1 2 10 33.333 44.444
... ... ... ... ...
22 4 60 88.888 99.999

Answer

Your task is more than just reading in data. You are trying to reformat it too. readLines will import the data as a vector. You can sort everything out from there.

pull <- function(s) {
  .s <- str_extract(dat[indx], paste0("(?<=",s,").*?\\d+"))
  as.numeric(rep(.s, (rle(grps)$lengths-1)))
}

library(stringr)
dat <- readLines("ex.data")
indx <- grep("segment", dat)
grps <- cumsum(1:length(dat) %in% indx)
labels <- sapply(c("segment", "rank", "points"), pull)
nums <- do.call("rbind", str_split(trimws(dat[-indx]), " "))
data.frame(labels, values=unlist(nums))
#    segment rank points values.1 values.2
# 1        1    1     10 11.11111 22.22222
# 2        1    1     10 33.33333 44.44444
# 3        1    1     10     56.2 49.95662
# 4        1    1     10     46.9 50.00784
# 5        1    2     10     50.7 48.55493
# 6        1    2     10     53.6 48.70022
# 7        1    2     20     42.7 52.51411
# 8        1    2     20     51.5 47.53790
# 9        1    2     20     50.0 50.22578
# 10       2    3     30     50.3 49.68094
# 11       2    3     30     50.4 49.78991
# 12       2    3     30 99.99999 99.99999
# 13       2    5     60 12.12345 45.56789
# 14       2    5     60 98.76543 54.43211

Example Data

cat("segment 1 rank 1 points 10  
11.11111 22.22222    
33.33333 44.44444   
56.2 49.95662
46.9 50.00784
segment 1 rank 2 points 10
50.7 48.55493
53.6 48.70022
segment 1 rank 2 points 20
42.7 52.51411
51.5 47.53790
50.0 50.22578
segment 2 rank 3 points 30
50.3 49.68094
50.4 49.78991 
99.99999 99.99999    
segment 2 rank 5 points 60   
12.12345 45.56789
98.76543 54.43211", file="ex.data")
Comments