Imlerith - 4 months ago 64

R Question

Suppose I have the following data:

`a <- data.frame(var1=letters,var2=runif(26))`

Suppose I want to scale every value in `var2` so that the `var2` column sums to 1.

I have tried the following:

`a$var2 <- lapply(a$var2,function(x) (x-min(a$var2))/(max(a$var2)-min(a$var2)))`

This not only gives an overall sum greater than 1 but also turns the `var2` column into a list, so I can no longer call `sum` on it.

Is there any valid way of turning this column into a probability distribution?

Answer

Suppose you have a vector `x` with non-negative values and no `NA`; you can normalize it by

```
x / sum(x)
```

which is a proper probability mass function.
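Applied to the data frame from the question, this could look like the minimal sketch below. The `set.seed` call and the `stopifnot` checks are additions for reproducibility and illustration; they are not part of the original question.

```r
set.seed(1)  # assumption: fixed seed, only so the example is reproducible
a <- data.frame(var1 = letters, var2 = runif(26))

# Divide each value by the column total so the column sums to 1
a$var2 <- a$var2 / sum(a$var2)

# Compare with a tolerance, since floating-point sums are rarely exactly 1
stopifnot(isTRUE(all.equal(sum(a$var2), 1)))
stopifnot(all(a$var2 >= 0))
```

Note that this only works when `sum(x)` is strictly positive; a column of all zeros would produce `NaN`.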

The transform you used:

```
(x - min(x)) / (max(x) - min(x))
```

only rescales `x` onto `[0, 1]`, but does not ensure "summation to 1".
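A small illustrative comparison of the two transforms, using a made-up vector `x` (an assumption, chosen only so the arithmetic is easy to follow):

```r
x <- c(1, 2, 3, 4)  # hypothetical example values

# Min-max rescaling: the smallest value maps to 0, the largest to 1
scaled <- (x - min(x)) / (max(x) - min(x))

# Normalizing by the total: values become proportions of the sum
p <- x / sum(x)

stopifnot(isTRUE(all.equal(range(scaled), c(0, 1))))
stopifnot(isTRUE(all.equal(sum(scaled), 2)))  # not 1
stopifnot(isTRUE(all.equal(sum(p), 1)))
```

Here the min-max version sums to 2, while the normalized version sums to 1.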

**Regarding your code**

There is no need to use `lapply` here:

```
lapply(a$var2, function(x) (x-min(a$var2)) / (max(a$var2) - min(a$var2)))
```

Just use a vectorized operation:

```
a$var2 <- with(a, (var2 - min(var2)) / (max(var2) - min(var2)))
```

As you said, `lapply` gives you a list, and that is what "l" in "lapply" refers to. You can use `unlist` to collapse that list into a vector; or, you can use `sapply`, where "s" implies "simplification (when possible)".
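To make the difference concrete, here is a short sketch on a made-up vector. The names `x`, `rescale`, and the result variables are hypothetical and used only for illustration.

```r
x <- c(2, 4, 6)  # hypothetical example values
rescale <- function(v) (v - min(x)) / (max(x) - min(x))

as_list    <- lapply(x, rescale)   # a list with one element per value
from_list  <- unlist(as_list)      # list collapsed into a numeric vector
simplified <- sapply(x, rescale)   # simplified to a numeric vector directly
vectorized <- (x - min(x)) / (max(x) - min(x))  # no apply function needed

stopifnot(is.list(as_list), is.numeric(from_list))
stopifnot(identical(from_list, simplified))
stopifnot(isTRUE(all.equal(simplified, vectorized)))
```

All three non-list results hold the same values, so the vectorized form is usually the simplest choice.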