Jamie Leigh - 13 days ago
R Question

squared Euclidean distance and correlation between two normalized variables: a proportional factor?

I am working on a homework assignment and am not sure I understand the question. We are using the built-in iris dataset. I have already reduced the data to only the numerical columns and created a scaled data set, but when I get to this portion of the question I am lost.

Calculate the Euclidean distances between the columns of the scaled data set using the dist() function. Show that the squares of these Euclidean distances are proportional to the (1-correlation)s. What is the value of the proportional factor here?


data <- iris[1:4]
scaled <- scale(data)


I tried using dist(), but I don't think I am getting the correct output:

dist(scaled)


This prints out a massive output that I am not entirely sure what to do with. I don't know how else to approach this, and I don't even know what the question means when it asks for the value of the proportional factor. I am pretty sure that the correlations it wants me to compare against are

cor(data)
#              Sepal.Length Sepal.Width Petal.Length Petal.Width
# Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
# Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
# Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
# Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000


But how do I compare the massive output from dist() to this?

I am just hoping someone can help explain the question, and point me in the correct direction.

Answer

"This prints out a massive output that I am not entirely sure what to do with."

"I am just hoping someone can help explain the question, and point me in the correct direction."

You want dist(t(scaled)), because dist computes distances between the rows of its argument and you need distances between columns. Consider your scaled dataset:

x <- scale(data.matrix(iris[1:4]))
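As a quick sanity check (a sketch, not part of the assignment itself), comparing the sizes of the two results shows why the untransposed call floods the console: dist on the scaled data holds one distance per pair of the 150 rows, while dist on its transpose holds one per pair of the 4 columns. The names n_row_pairs and n_col_pairs are just illustrative.

```r
x <- scale(data.matrix(iris[1:4]))

# dist() works on rows: 150 rows give choose(150, 2) pairwise distances
n_row_pairs <- length(dist(x))

# After transposing, the 4 columns become rows: choose(4, 2) distances
n_col_pairs <- length(dist(t(x)))

c(n_row_pairs, n_col_pairs)
# [1] 11175     6
```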

The squared Euclidean distances between the columns are

## I have used `c()` outside to coerce it into a plain vector
d <- c(dist(t(x)) ^ 2)
# [1] 333.03580  38.21737  54.25354 425.67515 407.10553  11.06610

The lower triangular part of the correlation matrix is (we want the lower triangle because the dist object stores only the lower triangular part of the distance matrix):

cx <- cor(x)[lower.tri(diag(4))]
# [1] -0.1175698  0.8717538  0.8179411 -0.4284401 -0.3661259  0.9628654
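The element-wise comparison only makes sense if both vectors list the variable pairs in the same order. One way to check this, as a minimal sketch, is to expand the dist object into a full matrix and pull out its lower triangle exactly as was done for the correlation matrix. The names dm, lt and same_order are illustrative.

```r
x <- scale(data.matrix(iris[1:4]))

# Full symmetric 4 x 4 distance matrix between the columns of x
dm <- as.matrix(dist(t(x)))

# Extract its lower triangle the same way as for the correlation matrix
lt <- dm[lower.tri(dm)]

# Same values in the same order as the flattened dist object
same_order <- isTRUE(all.equal(lt, c(dist(t(x)))))
same_order
# [1] TRUE
```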

We then just do what your question asks and compare the two:

d / (1 - cx)
# [1] 298 298 298 298 298 298

The iris dataset has 150 rows, so the proportional factor is 298 = 2 * (150 - 1).
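To see that the factor really tracks the number of rows rather than being a coincidence of this dataset, the same computation can be repeated on a subset. This is only a sketch: prop_factor is a hypothetical helper introduced here, and the choice of the first 50 rows is arbitrary.

```r
# Hypothetical helper: ratio of squared column distances to (1 - correlation)
prop_factor <- function(m) {
  x  <- scale(data.matrix(m))               # center and scale each column
  d  <- c(dist(t(x))^2)                     # squared distances between columns
  cx <- cor(x)[lower.tri(diag(ncol(x)))]    # matching pairwise correlations
  d / (1 - cx)
}

full <- prop_factor(iris[1:4])         # n = 150 rows
part <- prop_factor(iris[1:50, 1:4])   # n = 50 rows

# Both should match 2 * (n - 1) up to floating-point error
full_ok <- isTRUE(all.equal(full, rep(2 * (150 - 1), 6)))
part_ok <- isTRUE(all.equal(part, rep(2 * (50 - 1), 6)))
c(full_ok, part_ok)
```

If both checks come back TRUE, the ratio is 298 for all 150 rows and 98 for the 50-row subset, in line with 2 * (n - 1).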


Update

I had no intention of posting a theoretical justification here, but the downvote irritates me, so I am going to do it now.

(Image: theoretical justification)
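The justification was apparently posted as an image, and it is not reproduced here. What follows is a reconstruction rather than the answerer's exact steps. It assumes that scale() centers each column and divides it by its sample standard deviation (denominator n - 1), which is its default behaviour. The symbols n, x_{ki} and r_{ij} are notation introduced for this sketch.

```latex
% Let x_i and x_j be two columns of the scaled n-row matrix, so that
%   \sum_k x_{ki} = 0   and   \sum_k x_{ki}^2 = n - 1   for every column i,
% and the sample correlation is
%   r_{ij} = \frac{1}{n-1} \sum_k x_{ki} x_{kj}.
\begin{align*}
\lVert x_i - x_j \rVert^2
  &= \sum_{k=1}^{n} (x_{ki} - x_{kj})^2 \\
  &= \sum_{k=1}^{n} x_{ki}^2 + \sum_{k=1}^{n} x_{kj}^2
     - 2 \sum_{k=1}^{n} x_{ki} x_{kj} \\
  &= (n - 1) + (n - 1) - 2 (n - 1)\, r_{ij} \\
  &= 2 (n - 1) (1 - r_{ij}).
\end{align*}
```

With n = 150 this gives the factor 2 * (150 - 1) = 298 seen in the output of d / (1 - cx).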