Kyle Kyle - 1 month ago 22
R Question

Plotting two principal component score vectors, using a different color to indicate three unique classes

After generating a simulated data set with 20 observations in each of three classes (i.e., 60 observations total), and 50 variables, I need to plot the first two principal component score vectors, using a different color to indicate the three unique classes.

I believe I can create the simulated data set (please verify), but I am having issues figuring out how to color the classes and plot. I need to make sure the three classes appear separated in the plot (or else I need to re-run the simulated data).

#for the response variable y (60 values - 3 classes 1,2,3 - 20 observations per class)
y <- rep(c(1,2,3),20)

#matrix of 50 variables i.e. 50 columns and 60 rows i.e. 60x50 dimensions (=3000 table cells)
x <- matrix( rnorm(3000), ncol=50)

xymatrix <- cbind(y,x)
dim(x)
# [1] 60 50
dim(xymatrix)
# [1] 60 51
pca <- prcomp(xymatrix, scale=TRUE)


How should I correctly plot and color this principal component analysis as noted above? Thank you.

Answer

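If I understand your question correctly, ggparcoord() from the GGally package may help. Note that it draws a parallel coordinate plot (one line per observation running across PC1 and PC2), not a scatterplot.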

library(GGally)
y <- rep(c(1,2,3), 20)

# matrix of 50 variables i.e. 50 columns and 60 rows 
# i.e. 60x50 dimensions (=3000 table cells)   
x <- matrix(rnorm(3000), ncol=50)

xymatrix <- cbind(y,x)
pca <- prcomp(xymatrix, scale=TRUE)

# Principal component scores and group label 'y'
pc_label <- data.frame(pca$x, y=as.factor(y))

# Plot the first two principal component scores of each sample
ggparcoord(data=pc_label, columns=1:2, groupColumn=ncol(pc_label))

However, I think it makes more sense to run PCA on x rather than on xymatrix, which includes the target y: otherwise the class label itself contributes to the components. The following code should be more appropriate in your case.

pca <- prcomp(x, scale=TRUE)

pc_label <- data.frame(pca$x, y=as.factor(y))

ggparcoord(data=pc_label, columns=1:2, groupColumn=ncol(pc_label))
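If a scatterplot of PC1 against PC2 is what you are after, base R graphics may be enough. Note also that x above is pure noise with no relationship to y, so the classes have no reason to separate in any plot. The sketch below is one possible way to handle both points: it adds a mean shift per class before running PCA. The shift size (3), the choice of shifted columns, the seed, and the colors are all arbitrary assumptions made for illustration, not part of the original question.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Class labels: 20 observations in each of three classes
y <- rep(1:3, each = 20)

# 60 x 50 matrix of standard normal noise
x <- matrix(rnorm(60 * 50), ncol = 50)

# Assumed mean shift so the classes actually differ:
# class 2 is shifted in variables 1-10, class 3 in variables 11-20
x[y == 2, 1:10]  <- x[y == 2, 1:10]  + 3
x[y == 3, 11:20] <- x[y == 3, 11:20] + 3

# PCA on the predictors only, scaled to unit variance
pca <- prcomp(x, scale. = TRUE)

# Scatterplot of the first two score vectors, one color per class
class_cols <- c("red", "darkgreen", "blue")
plot(pca$x[, 1], pca$x[, 2], col = class_cols[y], pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = paste("class", 1:3), col = class_cols, pch = 19)
```

With shifts this large the three groups should form visibly distinct clusters. If they overlap, increasing the shift or the number of shifted variables is one thing to try. Using rep(1:3, each = 20) instead of rep(c(1,2,3), 20) only changes the row order of the classes; both give 20 observations per class.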