ragesz ragesz - 3 months ago 20
R Question

R convert summary result (statistics with all dataframe columns) into dataframe

[I'm new to R...] I have this dataframe:

df1 <- data.frame(c(2,1,2), c(1,2,3,4,5,6), seq(141,170)) #create data.frame
names(df1) <- c('gender', 'age', 'height') #column names


I want df1's summary in a dataframe object that looks like this:

          count      mean     std       min       25%       50%       75%       max
age     30.0000    3.5000  1.7370    1.0000    2.0000    3.5000    5.0000    6.0000
gender  30.0000    1.6667  0.4795    1.0000    1.0000    2.0000    2.0000    2.0000
height  30.0000  155.5000  8.8034  141.0000  148.2500  155.5000  162.7500  170.0000


I've generated this in Python with df1.describe().T. How can I do this in R?

It would be a bonus if my summary dataframe also contained the "dtype", "null" (number of NULL values), "unique" (number of unique values) and "range" columns, to give a comprehensive set of summary statistics:

          count      mean     std       min       25%       50%       75%       max  null  unique  range  dtype
age     30.0000    3.5000  1.7370    1.0000    2.0000    3.5000    5.0000    6.0000     0       6      5  int64
gender  30.0000    1.6667  0.4795    1.0000    1.0000    2.0000    2.0000    2.0000     0       2      1  int64
height  30.0000  155.5000  8.8034  141.0000  148.2500  155.5000  162.7500  170.0000     0      30     29  int64


The Python code for the above result is:

df1.describe().T.join(pd.DataFrame(df1.isnull().sum(), columns=['null']))\
.join(pd.DataFrame.from_dict({i:df1[i].nunique() for i in df1.columns}, orient='index')\
.rename(columns={0:'unique'}))\
.join(pd.DataFrame.from_dict({i:(df1[i].max() - df1[i].min()) for i in df1.columns}, orient='index')\
.rename(columns={0:'range'}))\
.join(pd.DataFrame(df1.dtypes, columns=['dtype']))


Thank you!

Answer

I commonly use a little function (adapted from a script found on the net) to do this kind of transformation:

sumstats <- function(x) {
  # Apply stat_fn to a numeric column and round to 2 digits;
  # non-numeric columns get the placeholder "N*N".
  stat <- function(col, stat_fn) {
    if (is.numeric(col)) round(stat_fn(col), digits = 2) else "N*N"
  }
  # One row per column of x, one column per statistic
  sumtable <- cbind(
    as.matrix(colSums(!is.na(x))),   # number of non-missing values
    sapply(x, stat, stat_fn = mean),
    sapply(x, stat, stat_fn = median),
    sapply(x, stat, stat_fn = sd),
    sapply(x, stat, stat_fn = min),
    sapply(x, stat, stat_fn = max)
  )
  sumtable <- as.data.frame(sumtable)
  names(sumtable) <- c("Count", "Mean", "Med", "sd", "min", "max")
  sumtable
}
sumstats(df1)
####        Count   Mean   Med   sd min max
#### gender    30   1.67   2.0 0.48   1   2
#### age       30   3.50   3.5 1.74   1   6
#### height    30 155.50 155.5 8.80 141 170

You can easily adapt it to add more descriptive columns, such as quantiles, nulls, range, etc. It returns a data.frame. You may also want to decide up front how NAs are handled, for example by passing na.rm = TRUE to mean(), sd() and the other statistics functions.
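As a rough sketch of that kind of adaptation (not the function above, just one possible variant): the helper below adds quartiles, a null count, a unique count, the range and the column class. The name describe_df, the digits argument and the column names are assumptions chosen to mirror the pandas output in the question. It drops NAs with na.rm = TRUE, uses quantile() with its default interpolation, and reports R's class() in the dtype column rather than a pandas dtype such as int64.

```r
# Hypothetical helper: name, arguments and column layout are illustrative only.
describe_df <- function(x, digits = 4) {
  num <- sapply(x, is.numeric)            # TRUE for numeric columns

  # Apply f(column, na.rm = TRUE) to each numeric column;
  # non-numeric columns are left as NA.
  stat <- function(f) {
    out <- rep(NA_real_, ncol(x))
    out[num] <- vapply(x[num], f, numeric(1), na.rm = TRUE)
    round(out, digits)
  }
  # Single quantile at probability p, using quantile()'s default type
  qs <- function(p) {
    stat(function(v, ...) quantile(v, probs = p, names = FALSE, ...))
  }

  data.frame(
    count  = colSums(!is.na(x)),          # non-missing values per column
    mean   = stat(mean),
    std    = stat(sd),
    min    = stat(min),
    `25%`  = qs(0.25),
    `50%`  = qs(0.50),
    `75%`  = qs(0.75),
    max    = stat(max),
    null   = colSums(is.na(x)),           # missing values per column
    unique = sapply(x, function(v) length(unique(v[!is.na(v)]))),
    range  = stat(max) - stat(min),
    dtype  = sapply(x, function(v) class(v)[1]),
    check.names = FALSE,                  # keep "25%" etc. as column names
    row.names = names(x)
  )
}

# Usage with the data frame from the question
df1 <- data.frame(c(2, 1, 2), c(1, 2, 3, 4, 5, 6), seq(141, 170))
names(df1) <- c("gender", "age", "height")
describe_df(df1)
```

Rows come out in the column order of df1 (gender, age, height) rather than sorted alphabetically as in the pandas output. If you need that ordering, something like res[order(rownames(res)), ] on the result should do it.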

Hope it helps.