user2117258 user2117258 - 1 month ago 8
R Question

Connect points on a scatterplot to the two other most similar points given a correlation matrix

I have clustered a few points and have computed the mean of each cluster as a landmark point for that cluster. I have then computed a correlation matrix among all landmark points to see which are most similar. Now I'd like to connect each landmark points to its two most similar neighbors. Since these landmark points do not have X,Y coordinates on the clustering map, I am using the centroid points for each cluster as starting point to connect landmarks.

My

assignments
data.frame looks something like this:

> head(assignments)
Transcripts Genes Timepoint Run Cluster V1 V2 Cell meanX meanY
8A_0_AATCTGCACCAA 143327 10542 Day 0 8A 6 113.8933 -2.1280855 8A_0_AATCTGCACCAA 124.3976 -8.682189
8A_0_CATGTCCTATCT 117322 10334 Day 0 8A 6 110.0499 -2.1553971 8A_0_CATGTCCTATCT 124.3976 -8.682189
8A_0_ATGCTCAATTGG 102764 9974 Day 0 8A 6 104.7227 -0.8397611 8A_0_ATGCTCAATTGG 124.3976 -8.682189
8A_0_CTACGGGAGAGT 92832 9651 Day 0 8A 6 101.3370 -5.0928108 8A_0_CTACGGGAGAGT 124.3976 -8.682189
8A_0_GTAGGGCGCGCT 90264 8807 Day 0 8A 6 113.3947 -18.9441484 8A_0_GTAGGGCGCGCT 124.3976 -8.682189
8A_0_ACGAGCTAACGG 83663 9148 Day 0 8A 7 114.6545 -31.6095622 8A_0_ACGAGCTAACGG 113.3952 -38.072025


.. and is used to generate the plot below:

ggplot(assignments, aes(V1, V2)) + geom_point(aes(colour=Cluster)) + geom_text(aes(meanX, meanY, label=Cluster), hjust=0.5, vjust=0.5, color='black', size=10)


enter image description here

Now given the following landmark correlation matrix (shown below), I'd like to connect each centroid point to it's nearest/most correlated two others.

> correlations
1 2 3 4 5 6 7 8 9 10
1 1.0000000 0.8269796 0.7542429 0.8443087 0.5627945 0.7106869 0.6511076 0.7880531 0.7279651 0.7842836
2 0.8269796 1.0000000 0.9491927 0.9723831 0.6921389 0.9001103 0.8452948 0.9581868 0.9001655 0.9408375
3 0.7542429 0.9491927 1.0000000 0.9376269 0.7786622 0.8843569 0.8662250 0.9243512 0.9026685 0.9570069
4 0.8443087 0.9723831 0.9376269 1.0000000 0.6919623 0.9091975 0.8542862 0.9568544 0.9019741 0.9461385
5 0.5627945 0.6921389 0.7786622 0.6919623 1.0000000 0.7064235 0.7538936 0.6941766 0.7517064 0.7844258
6 0.7106869 0.9001103 0.8843569 0.9091975 0.7064235 1.0000000 0.9341175 0.9404398 0.8969552 0.8830658
7 0.6511076 0.8452948 0.8662250 0.8542862 0.7538936 0.9341175 1.0000000 0.8822696 0.9116052 0.8958741
8 0.7880531 0.9581868 0.9243512 0.9568544 0.6941766 0.9404398 0.8822696 1.0000000 0.9316483 0.9219810
9 0.7279651 0.9001655 0.9026685 0.9019741 0.7517064 0.8969552 0.9116052 0.9316483 1.0000000 0.9402076
10 0.7842836 0.9408375 0.9570069 0.9461385 0.7844258 0.8830658 0.8958741 0.9219810 0.9402076 1.0000000


The resulting plot is anticipated to look similar to the plot above but with an over lay of a sort of network, where centroids are connect to 2 most similar neighbors/centroids. Any help would be greatly appreciated!

EDIT1:

I should mention that the landmark cells which are used to produce the correlation matrix is simply an average of the underlying data for cells within the designated cluster:

# compute `landmark cell` for each cluster
data = cbind(assignments, t(dge[,assignments$Cell]))
cluster.gene.avg.list = list()
for(n in unique(data$Cluster)) {temp.cluster = subset(data, Cluster==n)[,11:ncol(data)]; cluster.gene.avg.list[[n]] = rowMeans(t(temp.cluster))}
landmark = do.call(cbind, cluster.gene.avg.list)


.. Where
dge
are gene expression values and a matrix with dimensions of 16015 by 2449:

> head(dge[,1:5])
8A_3_GACACGTAGGCC 8A_3_TTACAAATGTCA 8A_3_GCTCAAATCTTC 8A_7_CCGCCCCGACTT 8A_0_AATCTGCACCAA
0610005C13RIK 0.00000000 0.00000000 0.09081976 0.00000000 0.0000000
0610007P14RIK 0.34322315 0.39803339 0.72224870 0.80916196 0.3551089
0610009B22RIK 0.07548816 0.25172063 0.17625931 0.18493077 0.4317327
0610009L18RIK 0.00000000 0.17259527 0.09081976 0.00000000 0.0000000
0610009O20RIK 0.00000000 0.08887713 0.09081976 0.09542651 0.0000000
0610010B08RIK 0.56896378 0.91807267 0.83163550 0.86439381 0.7635860


EDIT2

Thanks to /u/sandipan for the help!

# correlation between each landmark
correlations = cor(landmark, method="spearman") # correlation methods: pearson, spearman or kendall
dist.correlations = dist(1-cor(landmark, method="spearman"))
diag(correlations) = 0

# find the 2 nearest neighbors by highest correlation
nnbrs <- as.data.frame(t(apply(correlations, 1, function(x) {y <- sort(x, index.return=TRUE, decreasing=TRUE); c(y$ix[1],y$x[1],y$ix[2],y$x[2])})),stringsAsFactors = FALSE)
names(nnbrs) <- c('id1', 'dist1', 'id2', 'dist2')
nnbrs$id <- seq(1,length(names(landmark)))
nnbrs1 <- nnbrs[c('id', 'id1', 'dist1')]
nnbrs2 <- nnbrs[c('id', 'id2', 'dist2')]
names(nnbrs2) <- c('id', 'id1', 'dist1')
nnbrs <- rbind(nnbrs1, nnbrs2)

# create data.frame of center coordinates for each cluster
centers = data.frame(unique(cbind(assignments$Cluster,assignments$meanX, assignments$meanY)))
names(centers) = c("Cluster", "X", "Y")
centers = centers[order(centers$Cluster),]

# create data.frame of line segements based on 2 nearest correlations
segments = t(apply(nnbrs, 1, function(x) c(centers[as.integer(x[1]), 2:3], centers[as.integer(x[2]), 2:3], as.numeric(x[3]))))
segments = data.frame(t(do.call(cbind, segments)))
names(segments) <- c('x', 'y', 'xend', 'yend', 'corr')
segments = data.frame(sapply(segments, as.numeric))
segments$corr <- as.factor(segments$corr)
plot + geom_segment(data=segments, aes(x=x, y=y, xend=xend, yend=yend, col=corr), lwd=1.2) + guides(col=FALSE)


Result:

enter image description here

Now time to figure out how to keep the cluster colors and create a continuous color scale for the correlation based segments!

Answer

Try this (with synthetically generated data with 8 clusters and randomly generated correlation matrix):

head(assignments)
          V1       V2 Cluster     meanX    meanY
1  -96.93875 89.73655       8 -99.24848 50.61038
2  -96.86518 63.81925       8 -99.24848 50.61038
3  -76.63706 59.05426       8 -99.24848 50.61038
4 -105.90429 60.40880       8 -99.24848 50.61038
5 -100.39240 54.27822       8 -99.24848 50.61038
6  -99.53031 39.01734       8 -99.24848 50.61038

#res <- kmeans(assignments, 8) # 8 clusters
#centers <- res$centers # for kmeans

centers <- centers[,2:3] # in you case

correlations # this will be a 10x10 matrix in your case
           [,1]       [,2]       [,3]      [,4]      [,5]       [,6]      [,7]      [,8]
[1,] 0.28708827 0.12476841 0.24545908 0.2588388 0.2074115 0.75373879 0.8104132 0.5754160
[2,] 0.73768137 0.47982080 0.67638982 0.7976242 0.9919874 0.68068729 0.9534392 0.2404903
[3,] 0.94252193 0.03406601 0.87475370 0.4167443 0.9181345 0.75985783 0.6763228 0.9912269
[4,] 0.09300806 0.26816248 0.77741727 0.3892989 0.8545009 0.79482925 0.5123970 0.3311057
[5,] 0.69044589 0.04903995 0.14823010 0.9018917 0.9461897 0.04739289 0.6008395 0.2522856
[6,] 0.07651553 0.36061880 0.92448094 0.2414908 0.9768005 0.50474048 0.1748254 0.9701859
[7,] 0.07449400 0.30025228 0.05877126 0.1055387 0.6143566 0.87633754 0.8646951 0.1123956
[8,] 0.58755791 0.44420559 0.17486185 0.3668967 0.7989782 0.21354636 0.3137961 0.1086797

p <- ggplot(assignments, aes(V1, V2)) + geom_point(aes(colour=Cluster)) + geom_text(aes(meanX, meanY, label=Cluster), hjust=0.5, vjust=0.5, color='black', size=10)

# compute the endpoints of the segments to draw with the 2 NNs for each cluster
library(reshape2)
nnbrs <- as.data.frame(t(apply(correlations, 1, function(x) sort(x, index.return=TRUE)$ix[1:2])),stringsAsFactors = FALSE)
nnbrs$id <- 1:8 # 8 clusters
nnbrs <- melt(nnbrs, id='id')
segments <- as.data.frame(t(apply(nnbrs, 1, function(x) cbind(centers[as.integer(x[1]),],centers[as.integer(x[3]),]))))
names(segments) <- c('x', 'y', 'xend', 'yend')

p + geom_segment(data=segments, aes(x=x, y=y, xend=xend, yend=yend))

enter image description here

If you want segments colored w.r.t. correlation value, try this (with a different set of randomly generated points):

p <- ggplot(assignments, aes(V1, V2)) + geom_point(aes(colour=Cluster)) + geom_text(aes(meanX, meanY, label=Cluster), hjust=0.5, vjust=0.5, color='black', size=10)
nnbrs <- as.data.frame(t(apply(correlations, 1, function(x) {y <- sort(x, index.return=TRUE); c(y$ix[1],y$x[1],y$ix[2],y$x[2])})),stringsAsFactors = FALSE)
names(nnbrs) <- c('id1', 'dist1', 'id2', 'dist2')
nnbrs$id <- 1:8
nnbrs1 <- nnbrs[c('id', 'id1', 'dist1')]
nnbrs2 <- nnbrs[c('id', 'id2', 'dist2')]
names(nnbrs2) <- c('id', 'id1', 'dist1')
nnbrs <- rbind(nnbrs1, nnbrs2)
segments <- as.data.frame(t(apply(nnbrs, 1, function(x) c(centers[as.integer(x[1]),],centers[as.integer(x[2]),],as.numeric(x[3])))))
names(segments) <- c('x', 'y', 'xend', 'yend', 'corr')
segments$corr <- as.factor(segments$corr)
p + geom_segment(data=segments, aes(x=x, y=y, xend=xend, yend=yend, col=corr),lwd=1.2) +  guides(col=FALSE)

enter image description here