Ranjan Pandey - 1 year ago
R Question

How do I identify the type of variable in a dataframe in R?

I am trying to write comprehensive, automated code for my team for missing value imputation using several different methods. I know the logic, but I am having trouble identifying the data class, which is important in deciding which imputation method to choose.

The data that I am working on looks like this:
[image not available]

Now, I want my code to identify the type of variables as:

  1. Categorical/Factor with multiple levels

  2. Factor with two levels, 1 and 0 (binary)

  3. Factor with two levels other than 1 and 0, such as 'yes' and 'no'

  4. Continuous

Here is the WIP code that I have, but it isn't doing the job well, and I understand the logic will fail when the data is different:



for (a in 1:ncol(data)) {

  if (length(unique(data[, a])) >= 2 && length(unique(data[, a])) < 15 &&
      max(as.character(data[, a]), na.rm = TRUE) != 1 && min(as.character(data[, a]), na.rm = TRUE) != 0) {

    # (branch body not shown in the original post)

  } else if (max(as.character(data[, a]), na.rm = TRUE) == 1 && min(as.character(data[, a]), na.rm = TRUE) == 0) {

    # (branch body not shown in the original post)

  } else if (length(unique(data[, a])) == 2) {

    # this basically defines categorical variables with two categories like male/female
    # which don't have 1 0 values in the data but are still binary
    # we are keeping them separate for the purpose of further analysis

  } else {

    # (branch body not shown in the original post)

  }
}


I am trying to improve the logic I have used and make it generic so that others can use it, but I have hit a wall here. I would appreciate any help.

Answer Source

This can be done by checking the number of levels and the levels themselves. categorize is the generic; it invokes categorize.data.frame when given a data.frame, which in turn invokes categorize.default for each column. categorize can also be called directly on a column.

It works by computing the number of levels, capped at 3 (so three or more levels all count as 3), and then adding 2 if the levels are exactly "0" and "1". This gives a number between 0 and 4 inclusive, which is then turned into a factor with meaningful level names.
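
To make the arithmetic concrete, here is a small sketch that computes the raw code for a few example vectors. The helper name level_code and the sample vectors are made up for illustration; they are not part of the answer's code.

```r
# Hypothetical helper: the raw 0..4 code, before it is turned into a labelled factor
level_code <- function(x) {
  # nlevels() is 0 for anything that is not a factor; cap the count at 3
  capped <- min(nlevels(x), 3)
  # add 2 only when the levels are exactly "0" and "1" (in that order)
  capped + 2 * identical(levels(x), c("0", "1"))
}

level_code(c(1.5, 2.5))                     # not a factor: 0 + 0 = 0
level_code(factor(c("a", "a")))             # one level:    1 + 0 = 1
level_code(factor(c("yes", "no")))          # two levels:   2 + 0 = 2
level_code(factor(c("a", "b", "c", "d")))   # 3 or more:    3 + 0 = 3
level_code(factor(c(0, 1, 0)))              # "0"/"1":      2 + 2 = 4
```

The five possible codes line up, in order, with the labels "continuous", "factor1", "factor2", "factor" and "zero-one".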

Note that anything that is not a factor will be identified as "continuous". For example, as implied by the question, a column containing just 0's and 1's is continuous as it is not a factor.

categorize <- function(x, ...) UseMethod("categorize")

categorize.data.frame <- function(x, ...) sapply(x, categorize)

categorize.default <- function(x, ...) {
   factor(min(nlevels(x), 3) + 2*identical(levels(x), c("0", "1")), levels = 0:4, 
    labels = c("continuous", "factor1", "factor2", "factor", "zero-one"))
}

Now test it out:

DF <- data.frame(a = factor(c(0, 1, 0)), b = factor(c("male", "female", "male")), 
         c = factor(1:3), d = 1:3)

categorize(DF)
##          a          b          c          d 
##   zero-one    factor2     factor continuous 
## Levels: continuous factor1 factor2 factor zero-one

categorize(DF$a)
## [1] zero-one
## Levels: continuous factor1 factor2 factor zero-one

categorize(DF$d)
## [1] continuous
## Levels: continuous factor1 factor2 factor zero-one
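
Because a numeric column of 0's and 1's is reported as "continuous", such columns would need to be converted to factors first if they should count as binary. The following is only a sketch: it assumes that any numeric column whose non-missing values are all 0 or 1 is meant to be binary, and the helper name as_binary_factors is hypothetical. The categorize definitions are repeated so the block runs on its own.

```r
categorize <- function(x, ...) UseMethod("categorize")
categorize.data.frame <- function(x, ...) sapply(x, categorize)
categorize.default <- function(x, ...) {
  factor(min(nlevels(x), 3) + 2 * identical(levels(x), c("0", "1")), levels = 0:4,
         labels = c("continuous", "factor1", "factor2", "factor", "zero-one"))
}

# Hypothetical helper (an assumption, not part of the answer):
# turn numeric columns that contain only 0/1 (ignoring NA) into factors
as_binary_factors <- function(df) {
  is_01 <- function(col) {
    is.numeric(col) && any(!is.na(col)) && all(col %in% c(0, 1, NA))
  }
  df[] <- lapply(df, function(col) {
    if (is_01(col)) factor(col, levels = c(0, 1)) else col
  })
  df
}

raw <- data.frame(flag = c(0, 1, NA, 1), score = c(2.5, 3.1, 0, 1))

before <- categorize(raw)                     # both columns are numeric
after  <- categorize(as_binary_factors(raw))  # flag is now a "0"/"1" factor
```

Whether a 0/1 numeric column really is categorical depends on the data, so this conversion is best applied only to columns known to be indicators.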

Note: Since what is being asked for is close to just asking for the number of levels, an alternative might be to just return the number of levels and use -2 to mean a binary factor with "0", "1" levels. That is,

categorize.default <- function(x, ...) nlevels(x) - 4 * identical(levels(x), c("0", "1"))
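
As a sketch of how this numeric variant behaves, the block below applies it to the same test data frame. It assumes the generic and the data.frame method are unchanged, and repeats them so the example is self-contained.

```r
categorize <- function(x, ...) UseMethod("categorize")
categorize.data.frame <- function(x, ...) sapply(x, categorize)
categorize.default <- function(x, ...) nlevels(x) - 4 * identical(levels(x), c("0", "1"))

DF <- data.frame(a = factor(c(0, 1, 0)), b = factor(c("male", "female", "male")),
                 c = factor(1:3), d = 1:3)

codes <- categorize(DF)
codes
# a: two levels "0" and "1" -> 2 - 4 = -2
# b: two other levels       -> 2
# c: three levels           -> 3
# d: not a factor           -> 0
```

With this coding, 0 means not a factor, a positive value is the number of levels, and -2 marks a 0/1 binary factor.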