j_simskii j_simskii - 3 months ago 10
R Question

Altering distribution of one dataset to match another dataset

I have 2 datasets, one of modeled (artificial) data and another with observed data. They have slightly different statistical distributions and I want to force the modeled data to match the observed data distribution in the spread of the data. In other words, I need the modeled data to better represent the tails of the observed data. Here's an example.

model <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

observed <- c(39.50,44.79,58.28,56.04,53.40,59.25,48.49,54.51,35.38,39.98,28.00,
28.49,27.74,51.92,42.53,44.91,44.91,40.00,41.51,47.92,36.98,53.40,
42.26,42.89,43.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
52.81,36.87,47.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
51.34,43.37,51.15,42.77,42.88,44.26,27.14,39.31,24.80,12.62,30.30,
34.39,25.60,38.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
34.65,39.54,47.70,38.11,43.05,29.95,22.48,24.63,35.33,41.34)

summary(model)
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.62 36.98 40.38 40.28 44.91 54.15

summary(observed)
Min. 1st Qu. Median Mean 3rd Qu. Max.
12.62 35.54 42.58 41.10 47.76 59.2


How can I force the model data to have the variability that the observed has in R?

Answer

Are you just modeling the distribution of observed? If so, you could generate a kernel density estimate from the observations and then resample from that modeled density distribution. For example:

library(ggplot2)

Density estimate for observed values. This is our model of the distribution of the observed values:

dens.obs = density(observed)

Resample from density estimate to get modeled values. We set prob=dens$y so that the probability of a value in dens$x being chosen is proportional to its modeled density.

set.seed(439)
resample.obs = sample(dens.obs$x, 1000, replace=TRUE, prob=dens.obs$y)

Put observed and modeled values in a data frame in preparation for plotting:

dat = data.frame(value=c(observed,resample.obs), 
                 group=rep(c("Observed","Modeled"), c(length(observed),length(resample.obs))))

ggplot(dat, aes(value, fill=group, colour=group)) +
  geom_density(alpha=0.4) +
  theme_bw()

enter image description here