Kyle - 2 months ago 32

R Question

After generating a simulated data set with 20 observations in each of three classes (i.e., 60 observations total), and 50 variables, I need to plot the first two principal component score vectors, using a different color to indicate the three unique classes.

I believe I can create the simulated data set (please verify), but I am having issues figuring out how to color the classes and plot. I need to make sure the three classes appear separated in the plot (or else I need to re-run the simulated data).

```
# Response variable y (60 values, classes 1, 2, 3, 20 observations per class)
y <- rep(c(1,2,3),20)

# Matrix of 50 variables, i.e. 50 columns and 60 rows
# (60x50 dimensions = 3000 table cells)
x <- matrix( rnorm(3000), ncol=50)
xymatrix <- cbind(y,x)

dim(x)
# [1] 60 50
dim(xymatrix)
# [1] 60 51

pca=prcomp(xymatrix, scale=TRUE)
```

How should I correctly plot and color this principal component analysis as noted above? Thank you.

Answer

If I understand your question correctly, `ggparcoord` in the `GGally` package would help you.

```
library(GGally)
y <- rep(c(1,2,3), 20)
# matrix of 50 variables i.e. 50 columns and 60 rows
# i.e. 60x50 dimensions (=3000 table cells)
x <- matrix(rnorm(3000), ncol=50)
xymatrix <- cbind(y,x)
pca <- prcomp(xymatrix, scale=TRUE)
# Principal component scores and group label 'y'
pc_label <- data.frame(pca$x, y=as.factor(y))
# Plot the first two principal component scores of each sample
ggparcoord(data=pc_label, columns=1:2, groupColumn=ncol(pc_label))
```

However, I think it makes more sense to do PCA on `x` rather than on `xymatrix`, which includes the target `y`. So the following code should be more appropriate in your case.

```
pca <- prcomp(x, scale=TRUE)
pc_label <- data.frame(pca$x, y=as.factor(y))
ggparcoord(data=pc_label, columns=1:2, groupColumn=ncol(pc_label))
```
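
Note that `ggparcoord` draws a parallel coordinate plot. If what you want is a scatter plot of PC1 against PC2 with one color per class, base R `plot` should be enough. Below is a minimal sketch, not a definitive solution. It also touches on the "please verify" part: `rnorm(3000)` alone gives every class the same distribution, so nothing in the data separates them. The mean shift of 2, the choice of shifted columns, the seed, and the colors are all my own assumptions for illustration, and I use `rep(1:3, each = 20)` so that each class occupies a contiguous block of rows.

```r
set.seed(1)  # arbitrary seed (assumption) so the simulation is reproducible

n_per_class <- 20
n_vars <- 50

# Class labels: 20 observations in each of classes 1, 2, 3
y <- rep(1:3, each = n_per_class)

# Pure noise: 60 x 50 matrix of standard normal draws
x <- matrix(rnorm(3 * n_per_class * n_vars), ncol = n_vars)

# Assumed mean shift so the classes actually differ:
# class 2 is shifted on variables 1-10, class 3 on variables 11-20
shift <- 2
x[y == 2, 1:10]  <- x[y == 2, 1:10]  + shift
x[y == 3, 11:20] <- x[y == 3, 11:20] + shift

# PCA on the predictors only (the label y is left out)
pca <- prcomp(x, scale. = TRUE)
scores <- pca$x[, 1:2]

# Scatter plot of the first two score vectors, one color per class
class_cols <- c("red", "darkgreen", "blue")
plot(scores, col = class_cols[y], pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "First two principal component scores")
legend("topright", legend = paste("Class", 1:3),
       col = class_cols, pch = 19)
```

If the three groups still overlap in the plot, increasing `shift` or shifting more columns and re-running the simulation should pull them apart.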