agenis agenis - 6 months ago 51
R Question

visual structure of a data.frame: locations of NAs and much more

I want to represent the structure of a data.frame (or matrix, or data.table, whatever) on a single plot with color coding. I guess that could be very useful for many people handling various types of data, to visualize it at a single glance.

Perhaps someone has already developed a package to do it, but I couldn't find one (just this). So here is a rough mockup of my "vision", a kind of heatmap showing in color codes:

  • the NA locations,

  • the class of variables (factors (how many levels?), numeric (with color gradient, zeros, outliers...), strings)

  • dimensions

  • etc.

[image: mockup of the envisioned plot]

So far I have just written a function to plot the NA locations. It goes like this:

ggSTR = function(data, alpha=0.5){
  require(ggplot2)
  DF <- data
  if (!is.matrix(data)) DF <- as.matrix(DF)

  # one row per cell: y = row index, x = column index if the cell is NA, else 0
  to.plot <- data.frame('y'=rep(1:nrow(DF), each=ncol(DF)),
                        'x'=as.logical(t(is.na(DF)))*rep(1:ncol(DF), nrow(DF)))
  size <- 20 / log( prod(dim(DF)) )  # point size depends on the size of the table
  # xlim() drops the x = 0 (non-NA) cells, so only the NAs are drawn
  g <- ggplot(data=to.plot) + aes(x,y) +
    geom_point(size=size, color="red", alpha=alpha) +
    scale_y_reverse() + xlim(1,ncol(DF)) +
    ggtitle("location of NAs in the data frame")

  pc <- round(sum(is.na(DF))/prod(dim(DF))*100, 2) # % NA
  print(paste("percentage of NA data: ", pc))

  return(g)
}


It takes any data.frame as input and returns this image:

[image: plot of the NA locations returned by the function]

It's too big a challenge for me to achieve the first image.


Have you encountered the CSV fingerprint service? It creates a similar image, although not with all the details you have outlined above, and it's not based on R. There is also an R version of a similar idea, but the text is in Finnish. Its main function is csvSormenjalki(). Maybe that could be adapted further to fulfill your whole vision?
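
If you want a starting point for adapting it, here is a minimal sketch of the fingerprint idea in R. It is my own assumption of how this could be done, not the code of csvSormenjalki(): the names cell_types() and plot_fingerprint() are hypothetical. Each cell is labelled either "NA" or with the class of its column, and the result is drawn as colored tiles with ggplot2.

```r
# Hypothetical helper (not from csvSormenjalki): label every cell of a
# data.frame as "NA" or with the class of its column, in long format.
cell_types <- function(df) {
  df <- as.data.frame(df)
  # first class of each column, e.g. "numeric", "factor", "character"
  col_class <- vapply(df, function(v) class(v)[1], character(1))
  # matrix with the column class repeated down each column
  type <- matrix(rep(col_class, each = nrow(df)), nrow = nrow(df))
  type[is.na(df)] <- "NA"          # overwrite the missing cells
  data.frame(
    row  = as.vector(row(type)),   # row index of each cell
    col  = as.vector(col(type)),   # column index of each cell
    type = as.vector(type),
    stringsAsFactors = FALSE
  )
}

# Hypothetical plotting wrapper: one tile per cell, colored by cell type,
# with the dimensions of the table in the title. Requires ggplot2.
plot_fingerprint <- function(df) {
  cells <- cell_types(df)
  ggplot2::ggplot(cells, ggplot2::aes(x = col, y = row, fill = type)) +
    ggplot2::geom_tile() +
    ggplot2::scale_y_reverse() +
    ggplot2::labs(x = "column", y = "row", fill = "cell type",
                  title = paste(nrow(df), "rows x", ncol(df), "columns"))
}

# Toy data with one NA per column
toy <- data.frame(
  num = c(1.5, NA, 3),
  fac = factor(c("a", "b", NA)),
  chr = c("x", NA, "z"),
  stringsAsFactors = FALSE
)
cells <- cell_types(toy)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- plot_fingerprint(toy)       # print(p) to display it
}
```

This only covers the NA locations, the variable classes and the dimensions. Gradients for numeric values, factor levels or outliers would need an extra column in the long table and probably a separate color scale per type.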