lorenzov lorenzov -4 years ago 120
R Question

Euclidean distance between vectors grouped by another variable in SPSS, R or Excel

I have a dataset containing something like this:

case,group,val1,val2,val3,val4
1,1,3,5,6,8
2,1,2,7,5,4
3,2,1,3,6,8
4,2,5,4,3,7
5,1,8,6,5,3


I'm trying to compute programmatically the Euclidean distance between the vectors of values in groups.

This means that I have x cases in n groups. The Euclidean distance (ED) is computed between each pair of rows within a group and then averaged for that group. So, in the example above, I first compute the mean and standard deviation of group 1 (cases 1, 2 and 5), then standardise the values, i.e. (original value - mean) / standard deviation, then compute the ED between cases 1 and 2, cases 2 and 5, and cases 1 and 5, and finally average the EDs for the group.
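Written out, the quantity described above would look roughly like this for one group with n cases and p value columns. The symbols are my own notation, and I am assuming that standardisation is done per column within the group using the sample standard deviation:

```latex
z_{ik} = \frac{x_{ik} - \bar{x}_k}{s_k}, \qquad
d_{ij} = \sqrt{\sum_{k=1}^{p} \left( z_{ik} - z_{jk} \right)^2}, \qquad
\bar{d} = \frac{2}{n(n-1)} \sum_{i<j} d_{ij}
```

Here x_{ik} is the value of column k for case i, and the last sum runs over the n(n-1)/2 distinct pairs of cases in the group.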

Can anyone suggest a neat and reasonably efficient way of achieving this?

Answer Source

As an example of how I would approach this in SPSS, first let's read the example data into SPSS.

data list list (",") / case group val1 val2 val3 val4 (6F1.0).
begin data
1,1,3,5,6,8
2,1,2,7,5,4
3,2,1,3,6,8
4,2,5,4,3,7
5,1,8,6,5,3
end data.
dataset name orig.

Then we can use SPLIT FILE and PROXIMITIES to get our distance matrix by group. Note, as you mentioned in the comments to flodel's answer, this produces a separate dataset we need to work with (also note that case practically never matters in SPSS syntax, e.g. split file and SPLIT FILE are equivalent).

sort cases by group.
split file by group.
dataset declare dist.
PROXIMITIES val1, val2, val3, val4
/STANDARDIZE = Z
/MEASURE = EUCLID
/PRINT = NONE
/MATRIX = OUT('dist').

Unlike in R, basically everything in SPSS lives in a data matrix that behaves like an R data.frame, so SPLIT FILE functionally replaces nearly all of the different *ply functions in R. Very convenient, but less flexible in general. So now we need to aggregate the distances in the dist dataset I saved the results to. We first sum across rows, and then sum by group via an AGGREGATE command.

dataset activate dist.
compute dist_sum = SUM(VAR1 to VAR3).
*it appears SPSS keeps empty cases - we do not want them in the aggregation.
select if MISSING(dist_sum) = 0.
DATASET DECLARE dist_agg.
AGGREGATE
  /OUTFILE='dist_agg'
  /BREAK=group
  /dist_sum = SUM(dist_sum)
  /N_Cases=N.
dataset activate dist_agg.
compute mean_dist = dist_sum /(N_Cases*(N_Cases - 1)).

Here I save the aggregated results into another dataset named dist_agg. Because SPSS (annoyingly) saves the full symmetric distance matrix, every pair is counted twice in the sum, so the denominator for the mean is not n*(n-1)/2 (as in the equivalent R syntax) but n*(n-1), assuming you do not want to count the diagonal elements towards the mean. Then we can just merge these back into the orig data file via a MATCH FILES command.
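If you want to sanity-check the numbers outside SPSS, here is a minimal Python sketch of the same calculation. It is not part of the SPSS solution. All function names are my own, and I am assuming z-scores are computed per column within each group with the sample standard deviation (n - 1), which needs at least two cases per group. It computes the group mean both ways: over the distinct pairs, and over the full matrix divided by n*(n-1).

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

# Example data from the question: (case, group, val1, val2, val3, val4).
rows = [
    (1, 1, 3, 5, 6, 8),
    (2, 1, 2, 7, 5, 4),
    (3, 2, 1, 3, 6, 8),
    (4, 2, 5, 4, 3, 7),
    (5, 1, 8, 6, 5, 3),
]

def zscore_columns(vectors):
    """Standardise each column within one group (sample std dev)."""
    cols = list(zip(*vectors))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]  # raises if the group has < 2 cases
    # Assumption: a constant column (sd == 0) contributes 0 to distances.
    return [[(x - m) / s if s else 0.0 for x, m, s in zip(v, mus, sds)]
            for v in vectors]

def euclid(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_distance_pairs(z):
    """Average over the n*(n-1)/2 distinct pairs."""
    pairs = list(combinations(z, 2))
    return sum(euclid(a, b) for a, b in pairs) / len(pairs)

def mean_distance_full(z):
    """Sum the full symmetric matrix (zero diagonal), divide by n*(n-1)."""
    n = len(z)
    total = sum(euclid(a, b) for a in z for b in z)
    return total / (n * (n - 1))

# Collect the value vectors of each group.
groups = {}
for case, group, *vals in rows:
    groups.setdefault(group, []).append(vals)

# result[group] = (mean over distinct pairs, mean via the full matrix).
result = {}
for g, vectors in groups.items():
    z = zscore_columns(vectors)
    result[g] = (mean_distance_pairs(z), mean_distance_full(z))

for g, (by_pairs, by_full) in sorted(result.items()):
    print(g, by_pairs, by_full)
```

Both numbers should agree within each group, since the full matrix simply counts every pair twice and the diagonal is zero.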

*merge back into the original dataset.
dataset activate orig.
match files file = *
/table = 'dist_agg'
/by group.
exe.

*clean out old datasets if you like.
dataset close dist.
dataset close dist_agg.

The flexibility of R to go back and forth between matrix and data.frame objects makes SPSS a bit more clunky for this job. I could write a much more concise program to do this in SPSS's MATRIX language, but to do it across groups in MATRIX is a pain in the butt (compared to R's *ply syntax).
