user3206440 user3206440 - 1 year ago 73
R Question

R - histogram data in aggreagation

With a data frame like below:

set.seed(100)
df <- data.frame(id = sample(1:5, 6, replace = TRUE),
prop1 = rep(c("A", "B"), 3),
prop2 = sample(c(TRUE, FALSE), 6, replace = TRUE),
prop3=sample(3:6, 6, replace = TRUE))

> df
id prop1 prop2 prop3
1 2 A FALSE 4
2 2 B TRUE 4
3 3 A FALSE 6
4 1 B TRUE 5
5 3 A FALSE 3
6 3 B FALSE 4


I need to do an aggregation by
id
such that ,for each col
prop1
to
propN
, a histogram data is generated as follows.

For each
id
,


  • prop1
    need to capture ratio of number of discrete values -
    "A"
    s ,
    "B"
    s for all records with same
    id
    which can be accessed via names like
    prop1[["A"]]
    &
    prop1[["B"]]

  • prop2
    need to capture ratio of number of discrete values -
    "TRUE"
    s ,
    "FALSE"
    s for all records with same
    id
    which can be accessed via names like
    prop1[["TRUE"]]
    &
    prop1[["FALSE"]]

  • prop3
    need to capture ratio of number of discrete values -
    "3, 4, 5, 6"
    for all records with same
    id
    which can be accessed via names like
    prop1[["3"]]
    ,
    prop1[["4"]]
    ,
    prop1[["5"]]
    ,
    prop1[["6"]]



How to get the aggregation for
prop1
to
propN
done in the above format - using base R

Update:Adding output representation.

I'm not certain about the right data type to represent the output and various components in the output. However a spreadsheet view of the output would be as follows. In realty the output desired is in a form such that it can be used as a look-up table for the distribution on an
id
basis for further computation.

spreadsheet view of output

Answer Source

Here is an idea which uses a custom function defined as follows:

It splits the data frame based on the id and applies the formula (prop.table(table(...))) for finding the ratio. The n acts as an index so as to identify for which column you need the ratio. If n is 2 for example, then fun1 will apply the formula of finding the ratio to column 2 for each element of the list (effectively for each id). Finally, we apply the function via looping through 2:ncol(df) (so in your case 2:4) in order to get the ratio for all columns of interest, for each id.

#convert to factors to make sure you will get 0 frequencies with table as well
df[-1] <- lapply(df[-1], as.factor)

fun1 <- function(df, n){as.data.frame(t(sapply(split(df, df$id), function(i) 
                                                         prop.table(table(i[,n])))))}

data.frame(id = unique(sort(df$id)), 
           do.call(cbind, sapply(2:ncol(df), function(i)fun1(df, i))))

#   id        A         B FALSE. TRUE.        X3        X4  X5       X6
#1  1 0.0000000 1.0000000    0.0   1.0 0.0000000 0.0000000  1 0.0000000
#2  2 0.5000000 0.5000000    0.5   0.5 0.0000000 1.0000000  0 0.0000000
#3  3 0.6666667 0.3333333    1.0   0.0 0.3333333 0.3333333  0 0.3333333

Another way to structure this, would be to create a list and name each element of the list with the column names of your original df. i.e.

l1 <- sapply(2:ncol(df), function(i)fun1(df, i))
names(l1) <- names(df[-1])

#so you can extract each one separately,

l1[['prop1']]
#          A         B
#1 0.0000000 1.0000000
#2 0.5000000 0.5000000
#3 0.6666667 0.3333333
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download