Carolina Karoullas Carolina Karoullas - 17 days ago 4
R Question

Creating a data set with paired data and converting it into a matrix

So, I'm using R to try and do a phylogenetic PCA on a dataset that I have using the phyl.pca function from the phytools package. However, I'm having issues organising my data in a way that the function will accept! And that's not all: I did a bit of experimenting and I know that there are more issues further down the line, which I will get into...

Getting straight to the issue, here's the data frame (with dummy data) that I'm using:

Taxa Tibia Feather
1 Microraptor 138 101
2 Microraptor 139 114
3 Microraptor 145 141
4 Anchiornis 160 81
5 Anchiornis 14 NA
6 Archaeopteryx 134 82
7 Archaeopteryx 136 71
8 Archaeopteryx 132 NA
9 Archaeopteryx 14 NA
10 Scansoriopterygidae 120 85
11 Scansoriopterygidae 116 NA
12 Scansoriopterygidae 123 NA
13 Sapeornis 108 NA
14 Sapeornis 112 86
15 Sapeornis 118 NA
16 Sapeornis 103 NA
17 Confuciusornis 96 NA
18 Confuciusornis 107 30
19 Confuciusornis 148 33
20 Confuciusornis 128 61

The taxa are arranged into a tree (called "tree") with Microraptor being the most basal and then progressing in order through to Confuciusornis:


Phylogenetic tree: tree

Number of tips: 6
Number of nodes: 5
Branch lengths:
mean: 1
variance: 0
distribution summary:
Min. 1st Qu. Median 3rd Qu. Max.
1 1 1 1 1
No root edge.
Tip labels: Confuciusornis
No node labels.

And the function:

>phyl.pca(tree, all, method="BM", mode="corr")

And this is the error that is coming up:

Error in phyl.pca(tree, all, method = "BM", mode = "corr") :
number of rows in Y cannot be greater than number of taxa in your tree

Y being the "all" data frame. So I have 6 taxa in my tree (matching the 6 taxa in the data frame) but there are 20 rows in my data frame. So I used this function:

> all_agg <- aggregate(all[,-1],by=list(all$Taxa),mean,na.rm=TRUE)

And got this:

Group.1 Tibia Feather
1 Anchiornis 153 81
2 Archaeopteryx 136 77
3 Confuciusornis 120 41
4 Microraptor 141 119
5 Sapeornis 110 86
6 Scansoriopterygidae 120 85

It's a bit odd that the order of the taxa has changed... Is this ok?

In any case, I converted it into a matrix:

> all_agg_matrix <- as.matrix(all_agg)
> all_agg_matrix
Group.1 Tibia Feather
[1,] "Anchiornis" "153" "81"
[2,] "Archaeopteryx" "136" "77"
[3,] "Confuciusornis" "120" "41"
[4,] "Microraptor" "141" "119"
[5,] "Sapeornis" "110" "86"
[6,] "Scansoriopterygidae" "120" "85"

And then used the phyl.pca function:

> phyl.pca(tree, all_agg_matrix, method = "BM", mode = "corr")
[1] "Y has no names. function will assume that the row order of Y matches tree$tip.label"
Error in invC %*% X : requires numeric/complex matrix/vector arguments

So, now the order that the function is considering taxa in is all wrong (but I can fix that relatively easily). The issue is that phyl.pca doesn't seem to believe that my matrix is actually a matrix. Any ideas why?


I think you may have bigger problems. Most phylogenetic methods, I suspect including phyl.pca, assume that traits are fixed at the species level (i.e., they don't account for within-species variation). Thus, if you want to use phyl.pca, you probably need to collapse your data to a single value per species, e.g. via

dd_agg <- aggregate(dd[,-1],by=list(dd$Taxa),mean,na.rm=TRUE)

Extract the numeric columns and label the rows properly so that phyl.pca can match them up with the tips correctly:

dd_mat <- dd_agg[,-1]
rownames(dd_mat) <- dd_agg[,1]

Using these aggregated data, I can make up a tree (since you didn't give us one) and run phyl.pca ...

tt <- rcoal(nrow(dd_agg),tip.label=dd_agg[,1])

If you do need to do an analysis that takes within-species variation into account you might need to ask somewhere more specialized, e.g. the mailing list ...