Imlerith - 1 year ago 248
R Question

# Normalize data in R data.frame column

Suppose I have the following data:

```
a <- data.frame(var1 = letters, var2 = runif(26))
```

Suppose I want to scale every value in `var2` such that the sum of the `var2` column is equal to 1 (basically turn the `var2` column into a probability distribution).

I have tried the following:

```
a$var2 <- lapply(a$var2, function(x) (x - min(a$var2)) / (max(a$var2) - min(a$var2)))
```

This not only gives an overall sum greater than 1 but also turns the `var2` column into a list on which I can't do operations like `sum`.

Is there any valid way of turning this column into a probability distribution?

If you have a vector `x` with non-negative values, a positive sum, and no `NA`, you can normalize it with

```
x / sum(x)
```

which is a proper probability mass function.
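Applied to the data frame from the question, this might look like the sketch below. The `set.seed(1)` call is my own addition so the random draw is reproducible; it is not part of the original question.

```r
set.seed(1)  # arbitrary seed, only for reproducibility (an assumption, not in the question)
a <- data.frame(var1 = letters, var2 = runif(26))

# Divide each value by the column total so the column sums to 1
a$var2 <- a$var2 / sum(a$var2)

sum(a$var2)       # 1, up to floating-point rounding
all(a$var2 >= 0)  # TRUE, since runif() only returns positive values
```

The column stays an ordinary numeric vector, so `sum`, `mean`, etc. keep working on it.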

The transform you tried:

```
(x - min(x)) / (max(x) - min(x))
```

only rescales `x` onto `[0, 1]`, but does not ensure "summation to 1".
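A small numeric comparison makes the difference concrete. The values in `x` are made up for illustration only.

```r
x <- c(1, 2, 3, 4)  # hypothetical example values

# Min-max scaling: the smallest value becomes 0, the largest becomes 1
minmax <- (x - min(x)) / (max(x) - min(x))
range(minmax)  # 0 1
sum(minmax)    # 2, not 1

# Dividing by the total instead
p <- x / sum(x)
p       # 0.1 0.2 0.3 0.4
sum(p)  # 1
```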

Regarding your code

There is no need to use `lapply` here:

```
lapply(a$var2, function(x) (x - min(a$var2)) / (max(a$var2) - min(a$var2)))
```

Just use a vectorized operation:

```
a$var2 <- with(a, (var2 - min(var2)) / (max(var2) - min(var2)))
```

As you said, `lapply` gives you a list, and that is what the "l" in "lapply" refers to. You can use `unlist` to collapse that list into a vector, or you can use `sapply`, where the "s" implies "simplification (when possible)".
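To see that these routes agree, here is a rough sketch using the data from the question. The helper name `rescale` and the `set.seed(1)` call are my own, added only for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
a <- data.frame(var1 = letters, var2 = runif(26))

# Hypothetical helper: the min-max transform from the question, one value at a time
rescale <- function(x) (x - min(a$var2)) / (max(a$var2) - min(a$var2))

as_list <- lapply(a$var2, rescale)  # a list of 26 length-1 numerics
v1 <- unlist(as_list)               # collapsed back into a numeric vector
v2 <- sapply(a$var2, rescale)       # simplified to a numeric vector directly
v3 <- with(a, (var2 - min(var2)) / (max(var2) - min(var2)))  # vectorized

is.list(as_list)   # TRUE
is.numeric(v2)     # TRUE
all.equal(v1, v2)  # TRUE
all.equal(v1, v3)  # TRUE
```

All three vectors hold the same values; the vectorized form avoids recomputing `min` and `max` for every element.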
