MAPK - 4 months ago
R Question

How to expand data matrix for corresponding column names

I have a data matrix called mymat. It has .GT columns for samples 00860 and 00861. I want to expand this matrix with a new .AD column for each sample. The .AD column should hold 50,0 where the sample's .GT is 0/0, 25/25 where .GT is 0/1, and 0,50 where .GT is 1/1. I also want to add a .DP column next to each sample's columns, filled with 50 in every row, to get the result shown below. How can I do this kind of conditional expansion of a matrix in R?

mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L), .Dimnames = list(
c("chr1:1163804", "chr1:1888193"
), c("00860.GT", "00861.GT")))


result:

             00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
chr1:1163804 0/1      25/25    50       0/0      50,0     50
chr1:1888193 1/1      0/50     50       0/0      50,0     50

jav
Answer

Here's a data.table solution, with each line commented. It is written to handle any number of columns in your mymat object. I will explain briefly:

1) First, we convert the matrix to a data.table, which lets us handle any number of sample columns as long as they follow the same naming pattern.

2) We find all of the ".GT" columns and extract the sample ID that precedes ".GT".

3) We create a ".DP" column for each ".GT" column found.

4) We build a "GT" to "AD" lookup as a named character vector: the values are the "AD" strings (the "to" part) and the names are the "GT" genotypes (the "from" part).

5) We use the .SDcols argument of data.table to apply the "GT" to "AD" lookup to every ".GT" column and create the ".AD" columns.
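Step 4 relies on indexing a named vector by name, which may be easier to see in isolation. The snippet below is only a minimal sketch of that lookup on a plain character vector; the names gt_to_ad, gt and ad are made up for illustration, and the values follow the mapping used in this answer ("0/50" for "1/1").

```r
# Named character vector: names are the GT genotypes, values are the AD strings
gt_to_ad <- c("0/0" = "50,0", "0/1" = "25/25", "1/1" = "0/50")

# Some example genotypes (the four cells of mymat, column by column)
gt <- c("0/1", "1/1", "0/0", "0/0")

# Indexing by name returns one AD value per genotype; unname() drops the names
ad <- unname(gt_to_ad[gt])
# ad is now c("25/25", "0/50", "50,0", "50,0")

# A genotype that is not in the lookup (e.g. "./.") comes back as NA
missing_ad <- unname(gt_to_ad["./."])
```

Because an unmapped genotype yields NA rather than an error, it is probably worth checking for NA values in the new columns if the real data contain other genotype codes.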

# Your matrix
mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L), 
                   .Dimnames = list(c("chr1:1163804", "chr1:1888193"), 
                    c("00860.GT", "00861.GT")))

# Using a data table approach
library(data.table)

# Casting to data table - row.names will be converted to a column called 'rn'.
mymat = as.data.table(mymat, keep.rownames = TRUE)

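# Find ".GT" columns (anchored pattern, so only names ending in ".GT" match)
GTcols = grep("\\.GT$", colnames(mymat))

# Get the sample ID before ".GT" (the dot is escaped so it is matched literally)
selectedCols = sub("\\.GT$", "", colnames(mymat)[GTcols])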

selectedCols
[1] "00860" "00861"

# Create ".DP" columns (wrap the names in parentheses; using with = FALSE
# together with := is deprecated in data.table)
mymat[, (paste0(selectedCols, ".DP")) := 50]

mymat
             rn 00860.GT 00861.GT 00860.DP 00861.DP
1: chr1:1163804      0/1      0/0       50       50
2: chr1:1888193      1/1      0/0       50       50

# Create "GT" to "AD" mapping
GTToADMapping = c("50,0", "25/25", "0/50")
names(GTToADMapping) = c("0/0", "0/1", "1/1")

GTToADMapping
0/0     0/1     1/1 
"50,0" "25/25"  "0/50" 

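# This function will return the "AD" mapping given the values of "GT"
# (unname() drops the lookup names so the new columns are plain character vectors)
mapGTToAD <- function(x){
  return (unname(GTToADMapping[x]))
}

# Here, we create the AD columns using the GT mapping
# (no with = FALSE here: the parenthesized left-hand side is enough for :=)
mymat[, (paste0(selectedCols, ".AD")) := lapply(.SD, mapGTToAD),
        .SDcols = colnames(mymat)[GTcols]]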

             rn 00860.GT 00861.GT 00860.DP 00861.DP 00860.AD 00861.AD
1: chr1:1163804      0/1      0/0       50       50    25/25     50,0
2: chr1:1888193      1/1      0/0       50       50     0/50     50,0

# We can sort the data now as you have it
colOrder = as.vector(rbind(paste0(selectedCols, ".GT"), 
                     paste0(selectedCols, ".AD"), 
                     paste0(selectedCols, ".DP")))
mymat = mymat[, c("rn", colOrder), with = FALSE]

mymat
             rn 00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
1: chr1:1163804      0/1    25/25       50      0/0     50,0       50
2: chr1:1888193      1/1     0/50       50      0/0     50,0       50

# Put it back in the format you had
mymat2 = as.matrix(mymat[, -1, with = FALSE])
rownames(mymat2) = mymat$rn

mymat2
             00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
chr1:1163804 "0/1"    "25/25"  "50"     "0/0"    "50,0"   "50"    
chr1:1888193 "1/1"    "0/50"   "50"     "0/0"    "50,0"   "50"
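
If you would rather stay with a plain character matrix, the same expansion can be sketched in base R without data.table. This is only an illustrative alternative, not the approach above: expand_gt() is a hypothetical helper name, the depth argument defaulting to 50 is an assumption based on the question, and the lookup again uses "0/50" for "1/1" as in the expected result.

```r
# Your matrix
mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L),
                   .Dimnames = list(c("chr1:1163804", "chr1:1888193"),
                                    c("00860.GT", "00861.GT")))

# GT -> AD lookup (names are GT values, values are AD strings)
gt_to_ad <- c("0/0" = "50,0", "0/1" = "25/25", "1/1" = "0/50")

# Hypothetical helper: expands every "<sample>.GT" column into GT, AD, DP
expand_gt <- function(m, depth = 50) {
  gt_cols <- grep("\\.GT$", colnames(m), value = TRUE)
  samples <- sub("\\.GT$", "", gt_cols)

  # Build one three-column block per sample
  blocks <- lapply(seq_along(gt_cols), function(i) {
    gt <- m[, gt_cols[i]]
    block <- cbind(gt, unname(gt_to_ad[gt]), as.character(depth))
    colnames(block) <- paste0(samples[i], c(".GT", ".AD", ".DP"))
    block
  })

  # Bind the blocks side by side and restore the original row names
  out <- do.call(cbind, blocks)
  rownames(out) <- rownames(m)
  out
}

result <- expand_gt(mymat)
```

Building each sample's block separately keeps the GT, AD and DP columns adjacent, so no reordering step is needed afterwards. Genotypes missing from gt_to_ad would end up as NA in the .AD column.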