xiii1408 - 9 months ago 29

R Question

I'm working on a research project for school. I've written some text mining software that analyzes legal texts in a collection and spits out a score that indicates how similar they are. I ran the program to compare each text with every other text, and I have data like this (although with many more points):

`codeofhammurabi.txt crete.txt 0.570737`

codeofhammurabi.txt iraqi.txt 1.13475

codeofhammurabi.txt magnacarta.txt 0.945746

codeofhammurabi.txt us.txt 1.25546

crete.txt iraqi.txt 0.329545

crete.txt magnacarta.txt 0.589786

crete.txt us.txt 0.491903

iraqi.txt magnacarta.txt 0.834488

iraqi.txt us.txt 1.37718

magnacarta.txt us.txt 1.09582

Now I need to plot them on a graph. I can easily invert the scores so that a small value now indicates texts that are similar and a large value indicates texts that are dissimilar: the value can be the distance between points on a graph representing the texts.

`codeofhammurabi.txt crete.txt 1.75212`

codeofhammurabi.txt iraqi.txt 0.8812

codeofhammurabi.txt magnacarta.txt 1.0573

codeofhammurabi.txt us.txt 0.7965

crete.txt iraqi.txt 3.0344

crete.txt magnacarta.txt 1.6955

crete.txt us.txt 2.0329

iraqi.txt magnacarta.txt 1.1983

iraqi.txt us.txt 0.7261

magnacarta.txt us.txt 0.9125

SHORT VERSION:

Those values directly above are distances between points on a scatter plot (1.75212 is the distance between the codeofhammurabi point and the crete point). I can imagine a big system of equations with circles representing the distances between points. What's the best way to make this graph? I have MATLAB, R, Excel, and access to pretty much any software I might need.

If you can even point me in a direction, I'll be infinitely grateful.

Answer

Your data are really distances (of some form) in the multivariate space spanned by the corpus of words contained in the documents. Dissimilarity data such as these are often ordinated to provide the best *k*-d mapping of the dissimilarities. Principal coordinates analysis and non-metric multidimensional scaling are two such methods. I would suggest you plot the results of applying one or the other of these methods to your data. I provide examples of both below.

First, load in the data you supplied (without labels at this stage)

```
con <- textConnection("1.75212
0.8812
1.0573
0.7965
3.0344
1.6955
2.0329
1.1983
0.7261
0.9125
")
vec <- scan(con)
close(con)
```

What you effectively have is the following distance matrix:

```
mat <- matrix(ncol = 5, nrow = 5)
mat[lower.tri(mat)] <- vec
colnames(mat) <- rownames(mat) <-
c("codeofhammurabi","crete","iraqi","magnacarta","us")
> mat
codeofhammurabi crete iraqi magnacarta us
codeofhammurabi NA NA NA NA NA
crete 1.75212 NA NA NA NA
iraqi 0.88120 3.0344 NA NA NA
magnacarta 1.05730 1.6955 1.1983 NA NA
us 0.79650 2.0329 0.7261 0.9125 NA
```

R, in general, needs a dissimilarity object of class `"dist"`

. We could use `as.dist(mat)`

now to get such an object, or we could skip creating `mat`

and go straight to the `"dist"`

object like this:

```
class(vec) <- "dist"
attr(vec, "Labels") <- c("codeofhammurabi","crete","iraqi","magnacarta","us")
attr(vec, "Size") <- 5
attr(vec, "Diag") <- FALSE
attr(vec, "Upper") <- FALSE
> vec
codeofhammurabi crete iraqi magnacarta
crete 1.75212
iraqi 0.88120 3.03440
magnacarta 1.05730 1.69550 1.19830
us 0.79650 2.03290 0.72610 0.91250
```

Now we have an object of the right type we can ordinate it. R has many packages and functions for doing this (see the Multivariate or Environmetrics Task Views on CRAN), but I'll use the **vegan** package as I am somewhat familiar with it...

```
require("vegan")
```

First I illustrate how to do principal coordinates analysis on your data using **vegan**.

```
pco <- capscale(vec ~ 1, add = TRUE)
pco
> pco
Call: capscale(formula = vec ~ 1, add = TRUE)
Inertia Rank
Total 10.42
Unconstrained 10.42 3
Inertia is squared Unknown distance (euclidified)
Eigenvalues for unconstrained axes:
MDS1 MDS2 MDS3
7.648 1.672 1.098
Constant added to distances: 0.7667353
```

The first PCO axis is by far the most important at explaining the between text differences, as exhibited by the Eigenvalues. An ordination plot can now be produced by plotting the Eigenvectors of the PCO, using the `plot`

method

```
plot(pco)
```

which produces

A non-metric multidimensional scaling (nMDS) does not attempt to find a low dimensional representation of the original distances in an Euclidean space. Instead it tries to find a mapping in *k* dimensions that best preserves the **rank** ordering of the distances between observations. There is no closed-form solution to this problem (unlike the PCO applied above) and an iterative algorithm is required to provide a solution. Random starts are advised to assure yourself that the algorithm hasn't converged to a sub-optimal, locally optimal solution. Vegan's `metaMDS`

function incorporates these features and more besides. If you want plain old nMDS, then see `isoMDS`

in package **MASS**.

```
set.seed(42)
sol <- metaMDS(vec)
> sol
Call:
metaMDS(comm = vec)
global Multidimensional Scaling using monoMDS
Data: vec
Distance: user supplied
Dimensions: 2
Stress: 0
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation
Species: scores missing
```

With this small data set we can essentially represent the rank ordering of the dissimilarities perfectly (hence the warning, not shown). A plot can be achieved using the `plot`

method

```
plot(sol, type = "text", display = "sites")
```

which produces

In both cases the distance on the plot between samples is the best 2-d approximation of their dissimilarity. In the case of the PCO plot, it is a 2-d approximation of the real dissimilarity (3 dimensions are needed to represent all of the dissimilarities fully), whereas in the nMDS plot, the distance between samples on the plot reflects the rank dissimilarity not the actual dissimilarity between observations. But essentially distances on the plot represent the computed dissimilarities. Texts that are close together are most similar, texts located far apart on the plot are the most dissimilar to one another.