Sam - 8 months ago 73

R Question

I have a data set that has 900 columns of numeric data and I need to convert the numeric columns to factors that have labels. Many labels will repeat. I am trying to write a function that will take the numeric columns, identify the type of label that the column needs, and then apply that label.

Here is an example data frame:

`#create data frame with columns a,b,c,d`

a<-c(1,2,3,4,5)

b<-c(0,1,0,1,0)

c<-c(1,0,1,0,1)

d<-c(2,3,4,5,3)

x<-as.data.frame(cbind(a,b,c,d))

I have a separate dataframe (i.e. y) that includes a key (i.e. column e) that identifies which factor labels should be applied to which of the columns (i.e. column f). Notice that b and c should have the same label.

`e<-c(1,2,2,3)`

f<-c("a","b","c","d")

y<-as.data.frame(cbind(e,f))

I would like to write a function that does the following, but automated. Here are the example labels that I would like to apply to a,b,c,d--where a and d are different, but b and c are the same.

`x$a<-factor(x$a,`

levels=c(1,2,3,4,5),

labels=c("Less than 25%",

"25-50%",

"51-75%",

"76-90%",

"More than 90%"))

x$b<-factor(x$b,

levels=c(0,1),

labels=c("Yes","No"))

x$c<-factor(x$c,

levels=c(0,1),

labels=c("Yes","No"))

x$d<-factor(x$c,

levels=c(1,2,3,4,5),

labels=c("l","m","n","o","p"))

With the final data set looking like:

`>x`

a b c d

1 Less than 25% Yes No m

2 25-50% No Yes n

3 51-75% Yes No o

4 76-90% No Yes p

5 More than 90% Yes No n

In the actual data set, there will be close to 60 labels.

Answer

If you can get the labels and levels appropriately associated with the `e`

column, which is your link to the columns of the dataset, you can do this via `purrr:pmap_df`

.

Here's how that would look. Most of the work is in getting the labels and levels as a list column, which I do via `tibble`

(loaded with dplyr).

Starting with your second dataset, `y`

, which is an important part of this.

```
e = c(1,2,2,3)
f = names(x)
y = data.frame(e,f)
e f
1 1 a
2 2 b
3 2 c
4 3 d
```

Make sure the levels and labels are available and can be associated with your `e`

vector. If they are in a long format, you could get them into a list-column format via `tidyr::nest`

. I found this to be the most time-consuming step in terms of getting this info written out.

```
library(dplyr)
levels.labels = tibble(e = c(1, 2, 3),
levels = list(1:5, 0:1, 1:5),
labels = list(c("Less than 25%",
"25-50%",
"51-75%",
"76-90%",
"More than 90%"),
c("Yes","No"),
c("l","m","n","o","p")))
```

If you needed to write your levels and labels out within R you might want to try `tribble`

, which is available in the development version of the *tibble* package.

```
library(tibble)
levels.labels = tribble(~e, ~levels, ~labels,
1, 1:5, c("Less than 25%",
"25-50%",
"51-75%",
"76-90%",
"More than 90%"),
2, 0:1, c("Yes","No"),
3, 1:5, c("l","m","n","o","p"))
```

Merge the levels and labels with your `y`

dataset based on `e`

. The rows of the result is a 1 to 1 match of the columns of `x`

.

```
key = left_join(y, levels.labels)
e f levels labels
1 1 a 1, 2, 3, 4, 5 Less than 25%, 25-50%, 51-75%, 76-90%, More than 90%
2 2 b 0, 1 Yes, No
3 2 c 0, 1 Yes, No
4 3 d 1, 2, 3, 4, 5 l, m, n, o, p
```

To factor each column, put the x dataset, the levels, and the labels all into a named list. The names of each element correspond to the names of the arguments you need to use from `factor`

. This allows you to easily use `pmap_df`

from *purrr* to `factor`

each column of `x`

, using the known levels and labels information.

```
library(purrr)
pmap_df(list(x = x, levels = key$levels, labels = key$labels), factor)
# A tibble: 5 x 4
a b c d
<fctr> <fctr> <fctr> <fctr>
1 Less than 25% Yes No m
2 25-50% No Yes n
3 51-75% Yes No o
4 76-90% No Yes p
5 More than 90% Yes No n
```

In `pmap`

functions, the elements within the list must be all the same size. In this case, the first element has 4 columns and the second two are vectors with length 4.