Djiggy Djiggy - 3 months ago 14
R Question

multiple level Data frame

I am looking for a way to store multiple instances of different variable types in R. I tried data frames (and lists) but cannot get them to do what I want. Let me show with an example what I would like to achieve.

Let's say I create a kind of data type (a basket) that contains a number and a string, like this:

iNbLine = 2
# one row per fruit; stringsAsFactors = FALSE keeps Color as character
df <- data.frame(Weight = double(iNbLine), Color = character(iNbLine), stringsAsFactors = FALSE)
row.names(df) <- c("apples", "pears")
df
       Weight Color
apples      0
pears       0


I can now update my data structure as I want. For example :

df$Weight[1]=158
df$Color[1]="green"
df
       Weight Color
apples    158 green
pears       0


What I would like, however, is a higher-level data structure that contains several of these baskets along with additional data (here the price), so I tried this:

iNbBasket =5
df2<-data.frame(Price=double(iNbBasket), Basket=rep(df,iNbBasket))


But this gives me

Error in data.frame(Price = double(iNbBasket), Basket = rep(df, iNbBasket)) : arguments imply differing number of rows: 5, 2


What I would like to be able to do is access, for example, the weight of the apples in my 2nd basket, while still being able to set the price of the 2nd basket. I hope this is clear enough. In C, I think I could define a new data type (basket) using "struct" and then include it in another data type, but I cannot figure out how to do that here.

For @joran this is an attempt to show what I would like :

Baskets
Name     Price  Names   Weight  Color
Basket1  250    apples  158     green
                pears   32      yellow
Basket2  120    apples  70      green
                pears   10      yellow


But being able to access line 3, by something like :

myBasket<-myData[2]
myBasket$Weight[1]
70


and do :

myBasket$Price = 130


Update 1
I looked through lists, S3 variable types, and dplyr. I have to admit I did not understand everything, but so far I do not have exactly what I want. I currently do the following:

iNbLine = 2
df<-data.frame(Weight=double(iNbLine), Color=c("green","yellow"),stringsAsFactors=F)
row.names(df)<-c("apples","pears")

iNbBasket=3
dfBaskets<-data.frame(Price=double(iNbBasket))
row.names(dfBaskets)=c("Basket1","Basket2","Basket3")

lBasketsContent<-list()
for(i in 1:iNbBasket){
lBasketsContent[[i]]=df
}


This way I can access the price :

iBasket = 2
dfBaskets$Price[iBasket] = 150


and any element of a given basket :

lBasketsContent[[2]]$Weight[1] = 300


as well as the basket itself (I pass it to a function in my real case)

dfBasket<-lBasketsContent[[2]]


It is easy to read but requires 2 containers.

Answer

Hadley's tidyr (with purrr) provides something like this. Take a look at "tidyr 0.4.0" for a demonstration of complex structures nested within a data.frame cell.
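To make "nested within a data.frame cell" concrete for the baskets in the question, here is a minimal sketch. It uses a base-R list column, swapped in for tidyr's nest() so that no packages are needed; the names contents and baskets and the starting prices are illustrative assumptions, not part of tidyr or of the original post.

```r
# Inner table: one row per fruit (the basket contents from the question)
contents <- data.frame(Names  = c("apples", "pears"),
                       Weight = c(0, 0),
                       Color  = c("green", "yellow"),
                       stringsAsFactors = FALSE)

# Outer table: one row per basket, with an ordinary Price column
baskets <- data.frame(Name  = c("Basket1", "Basket2"),
                      Price = c(250, 120),
                      stringsAsFactors = FALSE)

# List column: each cell holds its own copy of the inner data frame
baskets$Contents <- list(contents, contents)

# Update one value inside the second basket, then that basket's price
baskets$Contents[[2]]$Weight[1] <- 70
baskets$Price[2] <- 130

# The second basket's contents, as a plain data frame
baskets$Contents[[2]]
```

Because R copies on modification, changing the second basket's contents leaves the first basket untouched, even though both started from the same template.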

The tidyr examples typically rely on having relevant information in the other cells before populating the nested ones, and even then populate them based on some form of grouping. For example, using mtcars:

library(dplyr)
library(tidyr)
library(purrr)

mtcars %>%
  transmute(model = rownames(mtcars), mpg, cyl, disp, gear) %>%
  group_by(cyl)
# Source: local data frame [32 x 5]
# Groups: cyl [3]
#                model   mpg   cyl  disp  gear
#                <chr> <dbl> <dbl> <dbl> <dbl>
# 1          Mazda RX4  21.0     6 160.0     4
# 2      Mazda RX4 Wag  21.0     6 160.0     4
# 3         Datsun 710  22.8     4 108.0     4
# 4     Hornet 4 Drive  21.4     6 258.0     3
# 5  Hornet Sportabout  18.7     8 360.0     3
# 6            Valiant  18.1     6 225.0     3
# 7         Duster 360  14.3     8 360.0     3
# 8          Merc 240D  24.4     4 146.7     4
# 9           Merc 230  22.8     4 140.8     4
# 10          Merc 280  19.2     6 167.6     4
# # ... with 22 more rows

If we call nest() on the grouped data, you can see how things are compacted a bit:

quux1 <- mtcars %>%
  transmute(model = rownames(mtcars), mpg, cyl, disp, gear) %>%
  group_by(cyl) %>%
  nest()
quux1
# # A tibble: 3 x 2
#     cyl              data
#   <dbl>            <list>
# 1     6  <tibble [7 x 4]>
# 2     4 <tibble [11 x 4]>
# 3     8 <tibble [14 x 4]>
quux1$data[[1]]
# # A tibble: 7 x 4
#            model   mpg  disp  gear
#            <chr> <dbl> <dbl> <dbl>
# 1      Mazda RX4  21.0 160.0     4
# 2  Mazda RX4 Wag  21.0 160.0     4
# 3 Hornet 4 Drive  21.4 258.0     3
# 4        Valiant  18.1 225.0     3
# 5       Merc 280  19.2 167.6     4
# 6      Merc 280C  17.8 167.6     4
# 7   Ferrari Dino  19.7 145.0     5

You can do some processing on this, dplyr-style:

quux2 <- mtcars %>%
  transmute(model = rownames(mtcars), mpg, cyl, disp, gear) %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(mpg2 = purrr::map(data, ~ lm(mpg ~ disp + gear, data = .)))
quux2
# # A tibble: 3 x 3
#     cyl              data     mpg2
#   <dbl>            <list>   <list>
# 1     6  <tibble [7 x 4]> <S3: lm>
# 2     4 <tibble [11 x 4]> <S3: lm>
# 3     8 <tibble [14 x 4]> <S3: lm>

And deal with the models individually:

summary(quux2$mpg2[[2]])
# Call:
# lm(formula = mpg ~ disp + gear, data = .)
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -3.2691 -1.7130  0.0708  1.7617  3.4351 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)   
# (Intercept) 30.77234    7.33123   4.197  0.00301 **
# disp        -0.13189    0.03094  -4.263  0.00275 **
# gear         2.38529    1.54132   1.548  0.16032   
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Residual standard error: 2.623 on 8 degrees of freedom
# Multiple R-squared:  0.7294,  Adjusted R-squared:  0.6618 
# F-statistic: 10.78 on 2 and 8 DF,  p-value: 0.005361

A more robust use of this would deal with the models programmatically, of course, but this is just a start.
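One possible way to handle the models programmatically is to map over the list column again and pull a single number out of each fit. The sketch below rebuilds the pipeline so it is self-contained; the names fits, fit and r_squared are made up for this example, and it assumes a tidyr version in which nest() works on grouped data as shown above.

```r
library(dplyr)
library(tidyr)
library(purrr)

fits <- mtcars %>%
  transmute(model = rownames(mtcars), mpg, cyl, disp, gear) %>%
  group_by(cyl) %>%
  nest() %>%
  # one lm per cylinder group, stored in a list column
  mutate(fit = map(data, ~ lm(mpg ~ disp + gear, data = .x)),
         # extract R-squared from each model as a plain numeric column
         r_squared = map_dbl(fit, ~ summary(.x)$r.squared)) %>%
  ungroup()

# one row per cyl value, with its R-squared
fits %>% select(cyl, r_squared)
```

The value for the 4-cylinder group should match the "Multiple R-squared" line in the summary printed above.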

NB: I am not suggesting that mpg ~ disp + gear is a reasonable model :-)

Update 1

How about this:

Start with "default" basket contents, a hybrid list/data.frame:

# template basket: a scalar Price plus a data.frame of Contents
df <- list(Price = 0,
           Contents = data.frame(Names = c("apples", "pears"),
                                 Weight = rep(0, 2L),
                                 Color = c("green", "yellow"),
                                 stringsAsFactors = FALSE)
           )

Create a list of three baskets (three customers?):

nBaskets <- 3L
# start with 3 empty baskets
lBaskets <- replicate(nBaskets, df, simplify = FALSE)
str(lBaskets)
# List of 3
#  $ :List of 2
#   ..$ Price   : num 0
#   ..$ Contents:'data.frame':  2 obs. of  3 variables:
#   .. ..$ Names : chr [1:2] "apples" "pears"
#   .. ..$ Weight: num [1:2] 0 0
#   .. ..$ Color : chr [1:2] "green" "yellow"
#  $ :List of 2
#   ..$ Price   : num 0
#   ..$ Contents:'data.frame':  2 obs. of  3 variables:
#   .. ..$ Names : chr [1:2] "apples" "pears"
#   .. ..$ Weight: num [1:2] 0 0
#   .. ..$ Color : chr [1:2] "green" "yellow"
#  $ :List of 2
#   ..$ Price   : num 0
#   ..$ Contents:'data.frame':  2 obs. of  3 variables:
#   .. ..$ Names : chr [1:2] "apples" "pears"
#   .. ..$ Weight: num [1:2] 0 0
#   .. ..$ Color : chr [1:2] "green" "yellow"

Now, customer 2 wants to buy something:

cust <- 2
lBaskets[[ cust ]]$Contents$Weight[1] <- 300
lBaskets[[ cust ]]$Price <- 150
lBaskets[[ cust ]]
# $Price
# [1] 150
# $Contents
#    Names Weight  Color
# 1 apples    300  green
# 2  pears      0 yellow

Without getting into S4 objects (perhaps over-engineered for what you are trying to do), I think this is the most straightforward way. If you want or need to make a quick reference to a specific customer's Contents and reassign it back into the list, that's certainly doable but not required.
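As a rough sketch of that "pull out, modify, reassign" pattern: the helper add_weight() and its unit_price argument are hypothetical, invented here only to show a basket being passed to a function and stored back.

```r
# Template basket, as above: a scalar Price plus a data.frame of Contents
basket_template <- list(Price = 0,
                        Contents = data.frame(Names  = c("apples", "pears"),
                                              Weight = rep(0, 2L),
                                              Color  = c("green", "yellow"),
                                              stringsAsFactors = FALSE))
lBaskets <- replicate(3L, basket_template, simplify = FALSE)

# Hypothetical helper: add some weight of one item and update the price.
# It returns the modified basket; the caller's copy is not changed in place.
add_weight <- function(basket, item, weight, unit_price) {
  row <- basket$Contents$Names == item
  basket$Contents$Weight[row] <- basket$Contents$Weight[row] + weight
  basket$Price <- basket$Price + weight * unit_price
  basket
}

cust <- 2
myBasket <- lBaskets[[cust]]                          # pull the basket out
myBasket <- add_weight(myBasket, "apples", 300, 0.5)  # work on the copy
lBaskets[[cust]] <- myBasket                          # store it back

# Prices of all baskets, as a plain numeric vector
vapply(lBaskets, function(b) b$Price, numeric(1))
```

Since R passes values rather than references, the reassignment on the last step is what actually updates the list; skipping it would leave lBaskets unchanged.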
