lorenzov -4 years ago 120

R Question

I have a dataset containing something like this:

`case,group,val1,val2,val3,val4`

1,1,3,5,6,8

2,1,2,7,5,4

3,2,1,3,6,8

4,2,5,4,3,7

5,1,8,6,5,3

I'm trying to compute programmatically the Euclidean distance between the vectors of values in groups.

This means that I have x cases in n groups. The Euclidean distance (ED) is computed between pairs of rows and then averaged within the group. So, in the example above, I first compute the mean and standard deviation of group 1 (cases 1, 2 and 5), then standardise the values (i.e. (original value - mean) / standard deviation), then compute the ED between cases 1 and 2, cases 2 and 5, and cases 1 and 5, and finally average the EDs for the group.
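To make the steps described above concrete, here is a minimal sketch in plain Python, which is not a language used in this thread and is only meant as a cross-check of the arithmetic. The names `ROWS` and `mean_group_distances` are hypothetical. It assumes that standardisation is done per variable within each group, that the sample standard deviation (n - 1 divisor) is intended, and that a variable that is constant within a group contributes nothing to the distances.

```python
from itertools import combinations
from math import dist
from statistics import mean, stdev

# Example rows from the question: (case, group, val1, val2, val3, val4).
ROWS = [
    (1, 1, 3, 5, 6, 8),
    (2, 1, 2, 7, 5, 4),
    (3, 2, 1, 3, 6, 8),
    (4, 2, 5, 4, 3, 7),
    (5, 1, 8, 6, 5, 3),
]


def mean_group_distances(rows):
    """Return {group: mean pairwise Euclidean distance of z-scored rows}."""
    groups = {}
    for _case, group, *values in rows:
        groups.setdefault(group, []).append(values)

    result = {}
    for group, members in groups.items():
        if len(members) < 2:
            continue  # assumption: a group with one case has no pairs, so skip it
        columns = list(zip(*members))
        centers = [mean(col) for col in columns]
        # Assumption: sample standard deviation (n - 1 in the denominator).
        spreads = [stdev(col) for col in columns]
        scaled = [
            # Assumption: a constant column contributes 0 to every distance.
            [(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, centers, spreads)]
            for row in members
        ]
        # Each unordered pair of cases is counted exactly once.
        pair_distances = [dist(a, b) for a, b in combinations(scaled, 2)]
        result[group] = mean(pair_distances)
    return result


print(mean_group_distances(ROWS))
```

With only two cases in a group, every standardised variable takes the values plus and minus 1/sqrt(2), so the single pairwise distance depends only on how many variables differ between the two cases.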

Can anyone suggest a neat way of achieving this in a reasonably efficient way?


Answer

As an example of how I would approach this in SPSS, first let's read the example data into SPSS.

```
data list list (",") / case group val1 val2 val3 val4 (6F1.0).
begin data
1,1,3,5,6,8
2,1,2,7,5,4
3,2,1,3,6,8
4,2,5,4,3,7
5,1,8,6,5,3
end data.
dataset name orig.
```

Then we can use `SPLIT FILE` and `PROXIMITIES` to get our distance matrix by group. Note, as you mentioned in the comments to flodel's answer, this produces a separate dataset we need to work with (also note that case practically never matters in SPSS syntax, e.g. `split file` and `SPLIT FILE` are equivalent).

```
sort cases by group.
split file by group.
dataset declare dist.
PROXIMITIES val1, val2, val3, val4
/STANDARDIZE = Z
/MEASURE = EUCLID
/PRINT = NONE
/MATRIX = OUT('dist').
```

Unlike in R, basically everything in SPSS lives in a data matrix that behaves like an R `data.frame`, so `SPLIT FILE` functionally replaces nearly all of the different `*ply` functions in R. Very convenient, but less flexible in general. So now we need to aggregate the distances in the `dist` dataset I saved the results to. We first sum across each row, and then sum by group via an `AGGREGATE` command.

```
dataset activate dist.
compute dist_sum = SUM(VAR1 to VAR3).
*it appears SPSS keeps empty cases - we do not want them in the aggregation.
select if MISSING(dist_sum) = 0.
DATASET DECLARE dist_agg.
AGGREGATE
/OUTFILE='dist_agg'
/BREAK=group
/dist_sum = SUM(dist_sum)
/N_Cases=N.
dataset activate dist_agg.
compute mean_dist = dist_sum /(N_Cases*(N_Cases - 1)).
```

Here I save the aggregated results into another dataset named `dist_agg`. Because SPSS (annoyingly) saves the full distance matrix, the divisor for the mean is not `n*(n-1)/2` (as in the equivalent R syntax) but `n*(n-1)`, assuming you do not want to count the diagonal elements towards the mean. Then we can just merge these back into the `orig` data file via a `MATCH FILES` command.

```
*merge back into the original dataset.
dataset activate orig.
match files file = *
/table = 'dist_agg'
/by group.
exe.
*clean out old datasets if you like.
dataset close dist.
dataset close dist_agg.
```
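As a check on the divisor used for `mean_dist` above, the following plain Python sketch (again not a language used in this thread) compares the two ways of averaging. The `points` are made-up standardised values for a hypothetical group of three cases. Summing a full square distance matrix counts every pair twice and adds a zero diagonal, so dividing by `n*(n-1)` should give the same mean as summing each pair once and dividing by `n*(n-1)/2`.

```python
from itertools import combinations
from math import dist, isclose

# Hypothetical standardised points for one group of three cases.
points = [(0.0, 1.0), (2.0, -1.0), (-1.5, 0.5)]
n = len(points)

# Full n x n matrix: symmetric, with zeros on the diagonal.
full = [[dist(a, b) for b in points] for a in points]
full_sum = sum(sum(row) for row in full)
mean_from_full = full_sum / (n * (n - 1))

# Each unordered pair counted once (upper triangle only).
pairs = [dist(a, b) for a, b in combinations(points, 2)]
mean_from_pairs = sum(pairs) / (n * (n - 1) / 2)

print(isclose(mean_from_full, mean_from_pairs))
```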

The flexibility of R to go back and forth between `matrix` and `data.frame` objects makes SPSS a bit more clunky for this job. I could write a much more concise program to do this in SPSS's `MATRIX` language, but to do it across groups in `MATRIX` is a pain in the butt (compared to R's `*ply` syntax).
