lorenzov -4 years ago 120
R Question

# Euclidean distance between vectors grouped by another variable in SPSS, R or Excel

I have a dataset containing something like this:

``````
case,group,val1,val2,val3,val4
1,1,3,5,6,8
2,1,2,7,5,4
3,2,1,3,6,8
4,2,5,4,3,7
5,1,8,6,5,3
``````

I'm trying to programmatically compute the Euclidean distance between the vectors of values within each group.

This means that I have x cases in n groups. The Euclidean distance is computed between pairs of rows and then averaged for the group. So, in the example above, I first compute the mean and standard deviation of group 1 (cases 1, 2 and 5), then standardise the values (i.e. (original value - mean) / standard deviation), then compute the Euclidean distance (ED) between cases 1 and 2, cases 2 and 5, and cases 1 and 5, and finally average the ED for the group.

Can anyone suggest a neat way of achieving this in a reasonably efficient way?

As an example of how I would approach this in SPSS, first let's read the example data into SPSS.

``````
data list list (",") / case group val1 val2 val3 val4 (6F1.0).
begin data
1,1,3,5,6,8
2,1,2,7,5,4
3,2,1,3,6,8
4,2,5,4,3,7
5,1,8,6,5,3
end data.
dataset name orig.
``````

Then we can use `SPLIT FILE` and `PROXIMITIES` to get a distance matrix for each group. Note that, as you mentioned in the comments to flodel's answer, this produces a separate dataset we need to work with (also note that case practically never matters in SPSS syntax, e.g. `split file` and `SPLIT FILE` are equivalent).

``````
sort cases by group.
split file by group.
dataset declare dist.
PROXIMITIES val1, val2, val3, val4
/STANDARDIZE = Z
/MEASURE = EUCLID
/PRINT = NONE
/MATRIX = OUT('dist').
``````
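For readers who want to cross-check the SPSS output outside SPSS, here is a minimal Python sketch of what this step is meant to compute: z-standardise each variable within its group, then build the full Euclidean distance matrix per group. It assumes the standardisation uses the sample standard deviation (n - 1 denominator) and is applied per variable within each split; the function names are my own and are not part of SPSS.

```python
import math
import statistics

# Example rows from the question: (case, group, val1, val2, val3, val4)
rows = [
    (1, 1, 3, 5, 6, 8),
    (2, 1, 2, 7, 5, 4),
    (3, 2, 1, 3, 6, 8),
    (4, 2, 5, 4, 3, 7),
    (5, 1, 8, 6, 5, 3),
]

def zscore_columns(vectors):
    """Standardise each column to mean 0 and sample SD 1 (n - 1 denominator, an assumption).

    Raises ZeroDivisionError if a column is constant within the group.
    """
    columns = list(zip(*vectors))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(vec, means, sds)] for vec in vectors]

def distance_matrix(vectors):
    """Full symmetric Euclidean distance matrix with zeros on the diagonal."""
    return [[math.dist(a, b) for b in vectors] for a in vectors]

def group_distance_matrices(rows):
    """Map each group id to the distance matrix of its standardised cases."""
    groups = {}
    for _case, group, *vals in rows:
        groups.setdefault(group, []).append(vals)
    return {g: distance_matrix(zscore_columns(vecs)) for g, vecs in groups.items()}

matrices = group_distance_matrices(rows)
for group, matrix in sorted(matrices.items()):
    print("group", group)
    for row in matrix:
        print([round(d, 3) for d in row])
```

Like the `dist` dataset, each matrix here is the full square matrix, so every pair of cases appears twice.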

Unlike in R, basically everything within an SPSS data matrix is like an R `data.frame`, so `SPLIT FILE` functionally replaces nearly all of the different `*ply` functions in R. Very convenient, but less flexible in general. So now we need to aggregate the distances in the `dist` dataset I saved the results to. We first sum across rows, and then sum by group via an `AGGREGATE` command.

``````
dataset activate dist.
compute dist_sum = SUM(VAR1 to VAR3).
*it appears SPSS keeps empty cases - we do not want them in the aggregation.
select if MISSING(dist_sum) = 0.
DATASET DECLARE dist_agg.
AGGREGATE
/OUTFILE='dist_agg'
/BREAK=group
/dist_sum = SUM(dist_sum)
/N_Cases=N.
dataset activate dist_agg.
compute mean_dist = dist_sum /(N_Cases*(N_Cases - 1)).
``````

Here I save the aggregated results into another dataset named `dist_agg`. Because SPSS (annoyingly) saves the full distance matrix, the denominator for the mean is not `n*(n-1)/2` (as in the equivalent R syntax) but `n*(n-1)`, assuming you do not want to count the diagonal elements towards the mean: every pair appears twice in the full matrix, and the diagonal entries are zero. Then we can just merge these back into the `orig` data file via a `MATCH FILES` command.
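The denominator argument can be checked with a small Python sketch. The three vectors are made-up illustration values (not taken from the question's data), chosen so that the pairwise distances come out as 5, 5 and 6.

```python
import math
from itertools import combinations

# Hypothetical, already-standardised vectors for one group of n = 3 cases
vectors = [(0.0, 1.0), (3.0, 5.0), (6.0, 1.0)]
n = len(vectors)

# Full square matrix, as SPSS saves it: each pair appears twice, diagonal is 0
full = [[math.dist(a, b) for b in vectors] for a in vectors]
full_sum = sum(sum(row) for row in full)
mean_from_full = full_sum / (n * (n - 1))

# Each unordered pair counted once, as in the n*(n-1)/2 convention
pair_dists = [math.dist(a, b) for a, b in combinations(vectors, 2)]
mean_from_pairs = sum(pair_dists) / (n * (n - 1) / 2)

# Both routes give the same mean pairwise distance
print(math.isclose(mean_from_full, mean_from_pairs))
```

Summing the full matrix doubles the numerator, and dividing by `n*(n-1)` instead of `n*(n-1)/2` doubles the denominator, so the mean is unchanged.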

``````
*merge back into the original dataset.
dataset activate orig.
match files file = *
/table = 'dist_agg'
/by group.
exe.

*clean out old datasets if you like.
dataset close dist.
dataset close dist_agg.
``````

The flexibility of R to go back and forth between `matrix` and `data.frame` objects makes SPSS a bit more clunky for this job. I could write a much more concise program to do this in SPSS's `MATRIX` language, but to do it across groups in `MATRIX` is a pain in the butt (compared to R's `*ply` syntax).
