Jamie Leigh - 1 year ago
R Question

# Squared Euclidean distance and correlation between two normalized variables: a proportional factor?

I am working on a homework assignment and am not sure I understand the question. We are using the built-in `iris` dataset. I have already reduced the data to only the numerical columns and created a scaled data set, but when I get to this portion of the question I am lost.

> Calculate the Euclidean distances between the columns of the scaled data set using `dist()` function. Show that the squares of these Euclidean distances are proportional to the `(1-correlation)`s. What is the value of the proportional factor here?

``````
data <- iris[1:4]
scaled <- scale(data)
``````

I tried using `dist()`, but I don't think I am getting the correct output:

``````
dist(scaled)
``````

This prints a massive output that I am not sure what to do with, and I don't know how else to approach this. I don't even know what the question means when it asks for the value of the proportional factor. I am pretty sure the correlations it wants me to compare against are these:

``````
cor(data)
#             Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
#Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
#Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
#Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
``````

But how do I compare the massive output from `dist()` to this?

I am just hoping someone can help explain the question, and point me in the correct direction.

You want `dist(t(scaled))`, because `dist()` computes distances between the rows of its input, not the columns. Consider your scaled dataset:

``````
x <- scale(data.matrix(iris[1:4]))
``````
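To see why the untransposed call prints so much, here is a minimal sketch comparing the number of pairwise distances each call produces. The variable names `n_row_pairs` and `n_col_pairs` are illustrative and not part of the original answer.

```r
scaled <- scale(data.matrix(iris[1:4]))

# dist() on the 150 x 4 matrix compares rows (flowers):
# one distance per pair of rows, i.e. choose(150, 2) values
n_row_pairs <- length(dist(scaled))

# dist() on the transposed 4 x 150 matrix compares columns (variables):
# one distance per pair of columns, i.e. choose(4, 2) values
n_col_pairs <- length(dist(t(scaled)))

n_row_pairs
# [1] 11175
n_col_pairs
# [1] 6
```

The six values from the transposed call line up with the six distinct off-diagonal entries of the 4 x 4 correlation matrix.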

The squared Euclidean distances between columns are:

``````
## wrapping in `c()` coerces the "dist" object into a plain vector
d <- c(dist(t(x)) ^ 2)
# [1] 333.03580  38.21737  54.25354 425.67515 407.10553  11.06610
``````

The lower triangular part of the correlation matrix is below. We take the lower triangle because `dist()` returns only the lower triangular part of the distance matrix, in the same column-by-column order:

``````
cx <- cor(x)[lower.tri(diag(4))]
# [1] -0.1175698  0.8717538  0.8179411 -0.4284401 -0.3661259  0.9628654
``````
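If you would rather not take the ordering on trust, the following sketch checks it by expanding the distances into a full matrix and extracting the lower triangle the same way as for `cor(x)`. The names `D2`, `d_lower`, and `d_vec` are illustrative assumptions, not from the original answer.

```r
x <- scale(data.matrix(iris[1:4]))

# Full 4 x 4 symmetric matrix of squared distances between columns
D2 <- as.matrix(dist(t(x)))^2

# Lower triangle, read column by column (same extraction as used for cor(x))
d_lower <- D2[lower.tri(D2)]

# The plain vector obtained directly from the "dist" object
d_vec <- c(dist(t(x))^2)

# TRUE if both orderings agree
same_order <- isTRUE(all.equal(d_lower, d_vec))
same_order
```

Working with the full matrices also lets you read off which pair of variables each distance belongs to, since `as.matrix()` keeps the column names as row and column labels.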

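The ratio of the squared distances to `1 - correlation` is then constant:

``````
d / (1 - cx)
# [1] 298 298 298 298 298 298
``````

So the proportional factor is 298. Since the `iris` dataset has 150 rows, you should recognize that `298 = 2 * (150 - 1)`.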
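A short derivation shows where the factor comes from. The notation here is mine rather than the assignment's: let $n$ be the number of rows, let $x$ and $y$ be two columns of the scaled data, and let $r_{xy}$ be their correlation. With its default arguments, `scale()` centers each column and divides by the sample standard deviation (denominator $n - 1$), so $\sum_i x_i^2 = \sum_i y_i^2 = n - 1$ and $\sum_i x_i y_i = (n - 1)\, r_{xy}$. Then:

$$
\begin{aligned}
d(x, y)^2 &= \sum_{i=1}^{n} (x_i - y_i)^2 \\
          &= \sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - 2 \sum_{i=1}^{n} x_i y_i \\
          &= (n - 1) + (n - 1) - 2 (n - 1)\, r_{xy} \\
          &= 2 (n - 1) (1 - r_{xy})
\end{aligned}
$$

With $n = 150$ this gives $2 \times 149 = 298$, matching the ratio computed above.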