AGG - 5 months ago 10

R Question

I have output data in which each row holds multiple isoforms for one gene. The isoforms are separated by commas (','). When I import the table into R, the data frame looks like this:

`Df:`

| gene | isoform | sample1_read_number | p-value |
|------|---------|---------------------|---------|
| A | 'A1','A2','A3' | 0:23,1:12,2:122 | 0.9,0.01,0.5 |
| B | 'B1','B2','B3' | 0:3,1:45,2:76 | 0.43,0.001,0.12 |
| C | 'C1','C2','C3','C4' | 0:5,1:56,2:166,3:7 | 0.004,0.002,0.23,0.12 |
| D | 'D1','D2' | 0:43,1:100 | 0.1,0.0003 |

Each gene has multiple isoforms. For each isoform I have a read number, separated by commas (0:23 for A1 means A1 has 23 reads), and a p-value, also separated by commas (the p-value for A1 is 0.9 and for A2 is 0.01). So within each cell the values follow the same comma-separated order.
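To make the example reproducible, the table above could be rebuilt in R roughly as follows. This is only a sketch: it assumes every column is stored as a character string and that the p-value column ends up named `p.value` (the name R typically gives `p-value` on import). The single quotes around the isoform names are kept as they appear in the table.

```r
# Minimal reconstruction of the example table.
# Assumption: each multi-valued cell is one plain character string.
Df <- data.frame(
  gene = c("A", "B", "C", "D"),
  isoform = c("'A1','A2','A3'", "'B1','B2','B3'",
              "'C1','C2','C3','C4'", "'D1','D2'"),
  sample1_read_number = c("0:23,1:12,2:122", "0:3,1:45,2:76",
                          "0:5,1:56,2:166,3:7", "0:43,1:100"),
  p.value = c("0.9,0.01,0.5", "0.43,0.001,0.12",
              "0.004,0.002,0.23,0.12", "0.1,0.0003"),
  stringsAsFactors = FALSE
)

# A single cell is one comma-separated string; splitting it gives the
# per-isoform pieces, which line up by position across the columns.
first_pvalues <- strsplit(Df$p.value[1], ",", fixed = TRUE)[[1]]
print(first_pvalues)
```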

For example when I call,

`df[1,2]`

`[1] 'A1','A2','A3'`

or

`df[1,4]`

`[1] 0.9,0.01,0.5`

I want to do this because I need to filter the data by p-value or read number. To do that, I first have to break the data frame up by isoform, which means finding a way to separate the values in each cell.

The final data frame should look like this (only genes A and B are shown here):

`Df_I:`

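| gene | isoform | sample1_read_number | p-value |
|------|---------|---------------------|---------|
| A | A1 | 0:23 | 0.9 |
| A | A2 | 1:12 | 0.01 |
| A | A3 | 2:122 | 0.5 |
| B | B1 | 0:3 | 0.43 |
| B | B2 | 1:45 | 0.001 |
| B | B3 | 2:76 | 0.12 |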

Can anybody suggest how to build this second data frame?

Any help would be much appreciated!

Cheers!

A

Answer

This can be done easily with `cSplit` from the `splitstackshape` package:

```
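library(splitstackshape)
# Split columns 2..ncol(Df) on "," and stack the pieces into long format;
# na.omit() then drops any NA-padded rows for genes with fewer isoforms.
na.omit(cSplit(Df, 2:ncol(Df), ",", "long"))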
# gene isoform sample1_read_number p.value
# 1: A A1 0:23 0.9000
# 2: A A2 1:12 0.0100
# 3: A A3 2:122 0.5000
# 4: B B1 0:3 0.4300
# 5: B B2 1:45 0.0010
# 6: B B3 2:76 0.1200
# 7: C C1 0:5 0.0040
# 8: C C2 1:56 0.0020
# 9: C C3 2:166 0.2300
#10: C C4 3:7 0.1200
#11: D D1 0:43 0.1000
#12: D D2 1:100 0.0003
```
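If adding a package is not an option, the same reshaping could be sketched in base R with `strsplit`. The block below is an illustration, not part of the original answer. It rebuilds the example data on the assumption that all columns are character strings and that the p-value column is named `p.value`. `split_rows` is a hypothetical helper name, `reads` is a hypothetical column name, and the 0.05 cutoff is an arbitrary example threshold for the p-value filter mentioned in the question.

```r
# Example data, assumed to hold each multi-valued cell as one string.
Df <- data.frame(
  gene = c("A", "B", "C", "D"),
  isoform = c("'A1','A2','A3'", "'B1','B2','B3'",
              "'C1','C2','C3','C4'", "'D1','D2'"),
  sample1_read_number = c("0:23,1:12,2:122", "0:3,1:45,2:76",
                          "0:5,1:56,2:166,3:7", "0:43,1:100"),
  p.value = c("0.9,0.01,0.5", "0.43,0.001,0.12",
              "0.004,0.002,0.23,0.12", "0.1,0.0003"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split the given columns on `sep` and repeat the
# gene label once per isoform, giving one row per isoform.
split_rows <- function(df, cols, sep = ",") {
  parts <- lapply(df[cols], strsplit, split = sep, fixed = TRUE)
  n <- lengths(parts[[1]])  # number of isoforms per gene
  # Every split column must yield the same number of pieces per row.
  stopifnot(all(vapply(parts, function(p) identical(lengths(p), n),
                       logical(1))))
  out <- data.frame(gene = rep(df$gene, times = n),
                    stringsAsFactors = FALSE)
  for (col in cols) out[[col]] <- unlist(parts[[col]])
  out
}

Df_I <- split_rows(Df, c("isoform", "sample1_read_number", "p.value"))

# Tidy up: drop the quotes, convert p-values to numbers, and pull the
# read count out of the "index:count" strings.
Df_I$isoform <- gsub("'", "", Df_I$isoform, fixed = TRUE)
Df_I$p.value <- as.numeric(Df_I$p.value)
Df_I$reads <- as.integer(sub("^[^:]*:", "", Df_I$sample1_read_number))

# Filter on p-value (example cutoff only).
Df_sig <- Df_I[Df_I$p.value < 0.05, ]
print(Df_sig)
```

The length check inside the helper stops early if a row has, say, three isoforms but only two p-values, rather than silently misaligning values. Filtering on read count works the same way, for example `Df_I[Df_I$reads >= 20, ]`.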