Jonas Ungerbäck Jonas Ungerbäck - 2 months ago 7
R Question

Deriving number of unique ID counts from a data frame and creating new columns for counts and values

I have a dataframe reshape/counting problem. Consider the following data frame with a column of non-unique IDs and a value column (could be uniques but doesn't have to be):


id value_df
1 1 A
2 1 B
3 1 C
4 2 D
5 2 E
6 3 F
7 4 G
8 4 H
9 4 I
10 4 J

What I am after is to create a dataframe where the first column contains the unique id's, the second the counts of these and the rest n columns the values for each id, like this:

id number_id value_df_1 value_df_2 value_df_3 value_df_4
1 1 3 A B C N
2 2 2 D E <NA> <NA>
3 3 1 F <NA> <NA> <NA>
4 4 4 G I J K

Using the plyr-package deriving the counts are easy, like this


but then my problem starts. To get the rest of the dataframe I have tried melt and dcast from the reshape2-package.

df_melted<-melt(df,id.vars =c("id"), measure.vars = c("value_df"))

This, however, results in the following output:

id A B C D E F G H I J
1 1 A B C <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 2 <NA> <NA> <NA> D E <NA> <NA> <NA> <NA> <NA>
3 3 <NA> <NA> <NA> <NA> <NA> F <NA> <NA> <NA> <NA>
4 4 <NA> <NA> <NA> <NA> <NA> <NA> G H I J

This output has many more "value" columns than what I wished for but I cannot find and easy way to simplify it to the output I want above. I was also thinking that the last step to get the number_id in would be with rbind.

Needless to say my actual dataframes contain many thousands of row which make the current melt/dcast output very clumsy with thousands of columns.


Use this:

df$num <- ave(as.character(df$value_df), df$id, FUN = seq_along)
df = reshape(df,idvar = "id",direction = "wide",timevar = "num" )


> df
  id value_df.1 value_df.2 value_df.3 value_df.4
1  1          A          B          C       <NA>
4  2          D          E       <NA>       <NA>
6  3          F       <NA>       <NA>       <NA>
7  4          G          H          I          J

What this does is it adds a numeric column that sequences 1:n for each group of ids. Then it uses those values as the new column names