Ranjan Pandey - 3 months ago 13

R Question

I am trying to write comprehensive, automated code for my team that imputes missing values using several different methods. I know the logic, but I am having trouble identifying the class of each variable, which is important in deciding which imputation method to choose.

The data that I am working on looks like this:

Now, I want my code to identify the type of variables as:

- Categorical/Factor with multiple levels
- Factor with two levels, 1 and 0 (binary)
- Factor with two levels other than 1 and 0, like 'yes' and 'no'
- Continuous

Here is my WIP code. It isn't doing the job well, and I understand the logic will fail when the data is different:

```
data_type_vector <- function(x) {
  categorical_index <- character()
  binary_index <- character()
  continuous_index <- character()
  binary_index_1 <- character()
  data <- x
  for (a in seq_len(ncol(data))) {
    if (length(unique(data[, a])) >= 2 & length(unique(data[, a])) < 15 &
        max(as.character(data[, a]), na.rm = TRUE) != 1 &
        min(as.character(data[, a]), na.rm = TRUE) != 0) {
      categorical_index <- c(categorical_index, colnames(data[a]))
    } else if (max(as.character(data[, a]), na.rm = TRUE) == 1 &
               min(as.character(data[, a]), na.rm = TRUE) == 0) {
      binary_index <- c(binary_index, colnames(data[a]))
    } else if (length(unique(data[, a])) == 2) {
      # this basically defines categorical variables with two categories like male/female
      # which don't have 1 0 values in the data but are still binary
      # we are keeping them separate for the purpose of further analysis
      binary_index_1 <- c(binary_index_1, colnames(data[a]))
    } else {
      continuous_index <- c(continuous_index, colnames(data[a]))
    }
  }
  assign("categorical_index", categorical_index, envir = globalenv())
  assign("binary_index", binary_index, envir = globalenv())
  assign("continuous_index", continuous_index, envir = globalenv())
  assign("binary_index_1", binary_index_1, envir = globalenv())
}
```

I am trying to improve the logic so that it is generic enough for others to use, but I have hit a wall here. I'd appreciate any help.

Answer

This can be done by checking the number of levels and the levels themselves. `categorize` is the generic that invokes `categorize.data.frame` if given a data.frame. It in turn invokes `categorize.default` for each column. `categorize` can also be called directly on a column.

It works by computing the number of levels, capped at 3 (so three or more levels all count as 3), and then adding 2 if the levels are exactly "0" and "1". This gives a number between 0 and 4 inclusive, which is then turned into a factor with meaningful level names.

Note that anything that is not a factor will be identified as "continuous", because `nlevels` returns 0 for it. For example, as implied by the question, a column containing just 0's and 1's is continuous if it is not a factor.
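To make the arithmetic concrete, here is a small sketch that computes only the raw 0 to 4 score for a few inputs. The helper name `raw_code` is hypothetical and used only for illustration; the expression inside it is the one the answer's code relies on.

```r
# Hypothetical helper: the raw score before it is converted to a labelled factor.
# min(nlevels(x), 3) caps the level count at 3; the identical() term adds 2
# only when the levels are exactly "0" and "1".
raw_code <- function(x) {
  min(nlevels(x), 3) + 2 * identical(levels(x), c("0", "1"))
}

raw_code(1:3)                     # 0: not a factor, so nlevels() is 0
raw_code(c("yes", "no"))          # 0: character vectors have no levels either
raw_code(factor("a"))             # 1: one level
raw_code(factor(c("yes", "no")))  # 2: two levels that are not "0"/"1"
raw_code(factor(1:5))             # 3: five levels, capped at 3
raw_code(factor(c(0, 1, 0)))      # 4: two levels (2) plus the "0"/"1" bonus (2)
```

Note that the character vector scores 0, so string columns would need to be converted to factors first if they should be treated as categorical.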

```
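# Generic: dispatches on the class of x
categorize <- function(x, ...) UseMethod("categorize")

# Data frame method: classify each column
categorize.data.frame <- function(x, ...) sapply(x, categorize)

# Default method: score 0-4 from the levels, then attach readable labels
categorize.default <- function(x, ...) {
  factor(min(nlevels(x), 3) + 2 * identical(levels(x), c("0", "1")),
         levels = 0:4,
         labels = c("continuous", "factor1", "factor2", "factor", "zero-one"))
}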
```

Now test it out:

```
DF <- data.frame(a = factor(c(0, 1, 0)), b = factor(c("male", "female", "male")),
                 c = factor(1:3), d = 1:3)
categorize(DF)
##          a          b          c          d
##   zero-one    factor2     factor continuous
## Levels: continuous factor1 factor2 factor zero-one
categorize(DF$a)
## [1] zero-one
## Levels: continuous factor1 factor2 factor zero-one
categorize(0:1)
## [1] continuous
## Levels: continuous factor1 factor2 factor zero-one
```
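To get back to the per-type vectors of column names that the question was building, one option is to split the column names by their detected type. This is only a sketch: the `categorize` definitions are repeated so the snippet runs on its own, and the names `types` and `cols_by_type` are illustrative, not part of the answer.

```r
# Definitions repeated from the answer so this snippet is self-contained.
categorize <- function(x, ...) UseMethod("categorize")
categorize.default <- function(x, ...) {
  factor(min(nlevels(x), 3) + 2 * identical(levels(x), c("0", "1")),
         levels = 0:4,
         labels = c("continuous", "factor1", "factor2", "factor", "zero-one"))
}

DF <- data.frame(a = factor(c(0, 1, 0)), b = factor(c("male", "female", "male")),
                 c = factor(1:3), d = 1:3)

# One type label per column, as a named character vector.
types <- vapply(DF, function(col) as.character(categorize(col)), character(1))

# Group the column names by type; only types that actually occur appear.
cols_by_type <- split(names(types), types)

cols_by_type[["zero-one"]]    # columns that are 0/1 factors
cols_by_type[["continuous"]]  # columns that are not factors
```

Returning a named list like this avoids the `assign(..., envir = globalenv())` calls in the question, so the function has no side effects on the caller's workspace.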

**Note:** Since what is being asked for is close to just the number of levels, an alternative is to return the number of levels itself and use -2 (2 levels minus 4) to flag a binary factor with levels "0" and "1". That is,

```
categorize.default <- function(x, ...) nlevels(x) - 4 * identical(levels(x), c("0", "1"))
```
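As a rough illustration of how the numeric variant could be used, the sketch below applies it column by column and then selects columns by code. The standalone name `level_code` is hypothetical, chosen here so the snippet does not depend on the S3 generic; `DF` is the same toy data frame as in the test above.

```r
# Hypothetical standalone version of the numeric variant:
# number of levels, or -2 for a factor whose levels are exactly "0" and "1".
level_code <- function(x) nlevels(x) - 4 * identical(levels(x), c("0", "1"))

DF <- data.frame(a = factor(c(0, 1, 0)), b = factor(c("male", "female", "male")),
                 c = factor(1:3), d = 1:3)

# Named numeric vector with one code per column.
codes <- sapply(DF, level_code)

names(codes)[codes == -2]  # 0/1 factors
names(codes)[codes == 2]   # other two-level factors
names(codes)[codes >= 3]   # factors with three or more levels
names(codes)[codes == 0]   # non-factors, treated as continuous
```

The trade-off is that the exact number of levels is kept for multi-level factors, at the cost of a less self-explanatory result than the labelled factor.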