luchonacho - 4 months ago 23

Python Question

Consider the following (excel) dataset:

`m | r`

----|------

2.0 | 3.3

0.8 |

| 4.0

1.3 |

2.1 | 5.2

| 2.3

| 1.9

2.5 |

1.2 | 3.0

2.0 | 2.6

Denote as R the pairwise correlation between the above two columns (around 0.68). Denote as R* the correlationafterthe empty cells have been filled in.Fill in the table so that (R - R*)^2 = 0. This is, I want to keep the correlation structure of the data intact.

So far I have done it using Matlab:

`clear all;`

m = xlsread('data.xlsx','A2:A11') ;

r = xlsread('data.xlsx','B2:B11') ;

rho = corr(m,r,'rows','pairwise');

x0 = [1,1,1,1,1,1];

lb = [0,0,0,0,0,0];

f = @(x)my_correl(x,rho);

SOL = fmincon(f,x0,[],[],[],[],lb)

where the function

`my_correl`

`function X = my_correl(x,rho)`

sum_m = (11.9 + x(1) + x(2) + x(3));

sum_r = (22.3 + x(1) + x(2) + x(3));

avg_m = (11.9 + x(1) + x(2) + x(3))/8;

avg_r = (22.3 + x(4) + x(5) + x(6))/8;

rho_num = 8*(26.32 + 4*x(1) + 2.3*x(2) + 1.9*x(3) + 0.8*x(4) + 1.3*x(5) + 2.5*x(6)) - sum_m*sum_r;

rho_den = sqrt(8*(22.43 + (4*x(1))^2 + (2.3*x(2))^2 + (1.9*x(3))^2) - sum_m^2)*sqrt(8*(78.6 + (0.8*x(4))^2 + (1.3*x(5))^ + (2.5*x(6))^2) - sum_r^2);

X = (rho - rho_num/rho_den)^2;

end

This function computes the correlation manually, where every missing data is a variable

`x(i)`

Note 1: I am open to use alternative languages like Python, Julia, or R. Matlab it's just my default one.

Note 2: a 100 points bounty will be awarded to the answer. Promise from now.

Answer

This is how I would approach it, with an implementation in R provided:

There is not a unique solution for imputing the missing data points, such that the pairwise correlation of the complete (imputed) data is equal to the pairwise correlation of the incomplete data. So to find a 'good' solution rather than just 'any' solution, we can introduce an additional criteria that the complete imputed data should also share the same linear regression with the original data. This leads us to a fairly simple approach.

- calculate a linear regression model for the original data.
- find the imputed values for missing values that would lie exactly on this regression line.
- generate a random scatter of residuals for the imputed values around this regression line
- scale the imputed residuals to force the correlation of the complete imputed data to be equal to that of the original data

A solution like this in R:

```
library(data.table)
set.seed(123)
rho = cor(dt$m,dt$r,'pairwise')
# calculate linear regression of original data
fit1 = lm(r ~ m, data=dt)
fit2 = lm(m ~ r, data=dt)
# extract the standard errors of regression intercept (in each m & r direction)
# and multiply s.e. by sqrt(n) to get standard deviation
sd1 = summary(fit1)$coefficients[1,2] * sqrt(dt[!is.na(r), .N])
sd2 = summary(fit2)$coefficients[1,2] * sqrt(dt[!is.na(m), .N])
# find where data points with missing values lie on the regression line
dt[is.na(r), r.imp := coefficients(fit1)[1] + coefficients(fit1)[2] * m]
dt[is.na(m), m.imp := coefficients(fit2)[1] + coefficients(fit2)[2] * r]
# generate randomised residuals for the missing data, using the s.d. calculated above
dt[is.na(r), r.ran := rnorm(.N, sd=sd1)]
dt[is.na(m), m.ran := rnorm(.N, sd=sd2)]
# function that scales the residuals by a factor x, then calculates how close correlation of imputed data is to that of original data
obj = function(x, dt, rho) {
dt[, r.comp := r][, m.comp := m]
dt[is.na(r), r.comp := r.imp + r.ran*x]
dt[is.na(m), m.comp := m.imp + m.ran*x]
rho2 = cor(dt$m.comp, dt$r.comp,'pairwise')
(rho-rho2)^2
}
# find the value of x that minimises the discrepencay of imputed versus original correlation
fit = optimize(obj, c(-5,5), dt, rho)
x=fit$minimum
dt[, r.comp := r][, m.comp := m]
dt[is.na(r), r.comp := r.imp + r.ran*x]
dt[is.na(m), m.comp := m.imp + m.ran*x]
rho2 = cor(dt$m.comp, dt$r.comp,'pairwise')
(rho-rho2)^2 # check that rho2 is approximately equal to rho
```

As a final check, calculate linear regression of the complete imputed data and plot to show that regression line is same as for original data. Note that the plot below was for the large data set shown below, to demonstrate use of this method on large data.

```
fit.comp = lm(r.comp ~ m.comp, data=dt)
plot(dt$m.comp, dt$r.comp)
points(dt$m, dt$r, col="red")
abline(fit1, col="green")
abline(fit.comp, col="blue")
mtext(paste(" Rho =", round(rho,5)), at=-1)
mtext(paste(" Rho2 =", round(rho2, 5)), at=6)
```

**DATA**

original toy data from OP example:

```
dt=structure(list(m = c(2, 0.8, NA, 1.3, 2.1, NA, NA, 2.5, 1.2, 2),
r = c(3.3, NA, 4, NA, 5.2, 2.3, 1.9, NA, 3, 2.6)),
.Names = c("m", "r"), row.names = c(NA, -10L),
class = c("data.table", "data.frame"))
```

A larger data set to demonstrate on big data

```
dt = data.table(m=rnorm(1e5, 3, 2))[, r:=1.5 + 1.1*m + rnorm(1e5,0,2)]
dt[sample(.N, 3e4), m:=NA]
dt[sample(which(!is.na(m)), 3e4), r := NA]
```