sursek sursek - 3 months ago 8
R Question

Reorganize data in R

I have one problem that I can't solve it by myself. Here http://www.filedropper.com/data_31 you can download my data. It's small txt file with information about Pathway, Seqs in Pathway, Enzyme, Enzyme ID, Seqs of Enzyme, Seqs Pathway ID.

I would like to reshape/reorganize my data so it would look like something like this:

NODE_1114.... map00592 alpha-Linolenic acid metabolism map01040 Biosynthesis of unsaturated fatty acids
NODE_11280... map00592 alpha-Linolenic acid metabolism NA NA
NODE_1307.... NA NA map01040 Biosynthesis of unsaturated fatty acids


The problem is I don't know how to reorganize my data from this

NODE_12982_length_530_cov_49.8358_ID_25963,NODE_24530_length_385_cov_7.38485_ID_49059,NODE_44451_length_263_cov_34.6298_ID_88901,NODE_19986_length_437_cov_5.82461_ID_39971,
NODE_28195_length_354_cov_77.194_ID_56389


to this

NODE_12982_length_530_cov_49.8358_ID_25963
NODE_24530_length_385_cov_7.38485_ID_49059
NODE_44451_length_263_cov_34.6298_ID_88901
NODE_19986_length_437_cov_5.82461_ID_39971
NODE_28195_length_354_cov_77.194_ID_56389


and how to add additional information about Pathway and Pathway ID to each Seqs (NODE...).

Thank you for your help!

EDITED



Thanks Imo & nilsole for your answers but you missing the point.

Here is my code for data:

Pathway<-rep(c("alpha-Linolenic acid metabolism","Biosynthesis of unsaturated fatty acids"), each=5)
Seq<-c("NODE_12982_length_530_cov_49.8358_ID_25963, NODE_24530_length_385_cov_7.38485_ID_49059, NODE_44451_length_263_cov_34.6298_ID_88901, NODE_19986_length_437_cov_5.82461_ID_39971, NODE_28195_length_354_cov_77.194_ID_56389","NODE_8410_length_627_cov_229.406_ID_16819, NODE_3911_length_812_cov_32.037_ID_7821, NODE_13098_length_528_cov_13.4376_ID_26195, NODE_956_length_1151_cov_11.6797_ID_1911, NODE_4501_length_777_cov_61.2355_ID_9001, NODE_60851_length_208_cov_61.9935_ID_121701, NODE_50593_length_239_cov_608.397_ID_101185, NODE_29294_length_345_cov_1.22069_ID_58587, NODE_57887_length_216_cov_22.6087_ID_115773, NODE_14782_length_501_cov_3.03139_ID_29563, NODE_18662_length_451_cov_798.495_ID_37323, NODE_26461_length_368_cov_3.02556_ID_52921, NODE_56026_length_221_cov_2.91566_ID_112051, NODE_12405_length_540_cov_270.652_ID_24809, NODE_2990_length_874_cov_45.3675_ID_5979, NODE_4753_length_763_cov_7.11864_ID_9505, NODE_17275_length_467_cov_4.0267_ID_34549, NODE_21751_length_416_cov_41.4155_ID_43501, NODE_53355_length_230_cov_19.48_ID_106709, NODE_49191_length_244_cov_1.51852_ID_98381"
,"NODE_61001_length_208_cov_76.3987_ID_122001, NODE_14350_length_507_cov_66.9845_ID_28699, NODE_16148_length_482_cov_189.293_ID_32295, NODE_42206_length_273_cov_135.404_ID_84411, NODE_11280_length_561_cov_335.174_ID_22559, NODE_21858_length_415_cov_31.0306_ID_43715, NODE_824_length_1186_cov_6.48364_ID_1647, NODE_41473_length_276_cov_2.73303_ID_82945, NODE_46025_length_257_cov_166.455_ID_92049",
"NODE_32320_length_325_cov_56.6037_ID_64639, NODE_38741_length_289_cov_27.1795_ID_77481, NODE_9047_length_611_cov_12.6511_ID_18093, NODE_1114_length_1113_cov_24.6059_ID_2227, NODE_47802_length_250_cov_40.8513_ID_95603, NODE_60092_length_210_cov_142.471_ID_120183, NODE_28312_length_353_cov_8.38926_ID_56623",
"NODE_4925_length_754_cov_2.56509_ID_9849, NODE_16010_length_484_cov_322.536_ID_32019, NODE_51261_length_237_cov_33.9011_ID_102521, NODE_19986_length_437_cov_5.82461_ID_39971, NODE_1384_length_1058_cov_1.86939_ID_2767",
"NODE_12982_length_530_cov_49.8358_ID_25963, NODE_24530_length_385_cov_7.38485_ID_49059, NODE_44451_length_263_cov_34.6298_ID_88901, NODE_19986_length_437_cov_5.82461_ID_39971, NODE_28195_length_354_cov_77.194_ID_56389",
"NODE_32320_length_325_cov_56.6037_ID_64639, NODE_38741_length_289_cov_27.1795_ID_77481, NODE_9047_length_611_cov_12.6511_ID_18093, NODE_1114_length_1113_cov_24.6059_ID_2227, NODE_47802_length_250_cov_40.8513_ID_95603, NODE_60092_length_210_cov_142.471_ID_120183, NODE_28312_length_353_cov_8.38926_ID_56623",
"NODE_1114_length_1113_cov_24.6059_ID_2227, NODE_28195_length_354_cov_77.194_ID_56389",
"NODE_1307_length_1072_cov_19.1504_ID_2613, NODE_3418_length_843_cov_15.3959_ID_6835","NODE_4925_length_754_cov_2.56509_ID_9849, NODE_16010_length_484_cov_322.536_ID_32019, NODE_51261_length_237_cov_33.9011_ID_102521, NODE_19986_length_437_cov_5.82461_ID_39971, NODE_1384_length_1058_cov_1.86939_ID_2767")
Pathway_ID<-rep(c("map00592","map01040"),each=5)
df<-data.frame(Pathway,Seq,Pathway_ID)


Data looks like this:

Pathway Seq Pathway_ID
aplha-Linolenic acid metabolism NODE_12982...,NODE_8410.. map00592
aplha-Linolenic acid metabolism NODE....,NODE... map00592
aplha-Linolenic acid metabolism NODE....,NODE... map00592
aplha-Linolenic acid metabolism NODE....,NODE... map00592
aplha-Linolenic acid metabolism NODE....,NODE... map00592
Biosynthesis of unsaturated fatty acids NODE....,NODE... map01040
Biosynthesis of unsaturated fatty acids NODE....,NODE... map01040
Biosynthesis of unsaturated fatty acids NODE....,NODE... map01040
Biosynthesis of unsaturated fatty acids NODE....,NODE... map01040
Biosynthesis of unsaturated fatty acids NODE....,NODE... map01040


I would like to look like this:

NODE_1114.... map00592 alpha-Linolenic acid metabolism map01040 Biosynthesis of unsaturated fatty acids
NODE_11280... map00592 alpha-Linolenic acid metabolism NA NA
NODE_1307.... NA NA map01040 Biosynthesis of unsaturated fatty acids


With
strsplit
function I lost information to which
NODE...
each
Pathway
and
Pathway_ID
belongs to. In Seq column are different numbers of
NODEs
, also some
NODEs
belong to both pathways in such a case I want both pathways to be assign to specific
NODE
; like this

NODE_1114.... map00592 alpha-Linolenic acid metabolism map01040 Biosynthesis of unsaturated fatty acids


I hope you can help me out! Thank you!

Answer

I have found the solution with help of reshape2 package and some base elements such as cbind, unique and subset.

df<-data.frame(Pathway=rep(c("acid_metabolism", "fatty acids biosynthesis"), each=5),
                 Seq=c("Contig_A, Contig_B, Contig_C, Contig_D", "Contig_C, Contig_E,
                        Contig_F,","Contig_D, Contig_F, Contig_G, Contig_H, Contig_I,
                        Contig_J, Contig_K","Contig_C, Contig_D","Contig_H, Contig_I,
                        Contig_J","Contig_H, Contig_I, Contig_L, Contig_M","Contig_C",
                       "Contig_F, Contig_G, Contig_N","Contig_E, Contig_F, Contig_D",
                       "Contig_N, Contig_O"),Path_ID=rep(c("map_A","map_B"),each=5))
> head(df)
                   Pathway
1          acid_metabolism
2          acid_metabolism
3          acid_metabolism
4          acid_metabolism
5          acid_metabolism
6 fatty acids biosynthesis
                                                                   Seq Path_ID
1                               Contig_A, Contig_B, Contig_C, Contig_D   map_A
2                                        Contig_C, Contig_E, Contig_F,   map_A
3 Contig_D, Contig_F, Contig_G, Contig_H, Contig_I, Contig_J, Contig_K   map_A
4                                                   Contig_C, Contig_D   map_A
5                                         Contig_H, Contig_I, Contig_J   map_A
6                               Contig_H, Contig_I, Contig_L, Contig_M   map_B

Step 1: Spliting data from one column...

                                      Seq 
1  Contig_A, Contig_B, Contig_C, Contig_D   

to multi columns...

         V1        V2        V3        V4
1  Contig_A  Contig_B  Contig_C  Contig_D  

The problem with my data are strings with different number of column entries. Answer from @G. Grothendieck to Split strings into columns in R where each string has a potentially different number of column entries question helped me out.

df2<-cbind(df, read.table(text = as.character(df$Seq), sep = ",", fill = TRUE, as.is = TRUE))
> head(df2)
                   Pathway
1          acid_metabolism
2          acid_metabolism
3          acid_metabolism
4          acid_metabolism
5          acid_metabolism
6 fatty acids biosynthesis
                                                                   Seq Path_ID       V1
1                               Contig_A, Contig_B, Contig_C, Contig_D   map_A Contig_A
2                                        Contig_C, Contig_E, Contig_F,   map_A Contig_C
3 Contig_D, Contig_F, Contig_G, Contig_H, Contig_I, Contig_J, Contig_K   map_A Contig_D
4                                                   Contig_C, Contig_D   map_A Contig_C
5                                         Contig_H, Contig_I, Contig_J   map_A Contig_H
6                               Contig_H, Contig_I, Contig_L, Contig_M   map_B Contig_H
         V2        V3        V4        V5        V6        V7
1  Contig_B  Contig_C  Contig_D                              
2  Contig_E  Contig_F                                        
3  Contig_F  Contig_G  Contig_H  Contig_I  Contig_J  Contig_K
4  Contig_D                                                  
5  Contig_I  Contig_J                                        
6  Contig_I  Contig_L  Contig_M

Step 2: Reshape data so each Contig becomes a string with additional information of Pathway and Path_ID. melt function from reshape2 solved this out.

df2.m<-melt(df2, id.var = c("Pathway","Path_ID"))
> df2.m[11:20,]
                    Pathway Path_ID variable    value
11          acid_metabolism   map_A       V1 Contig_A
12          acid_metabolism   map_A       V1 Contig_C
13          acid_metabolism   map_A       V1 Contig_D
14          acid_metabolism   map_A       V1 Contig_C
15          acid_metabolism   map_A       V1 Contig_H
16 fatty acids biosynthesis   map_B       V1 Contig_H
17 fatty acids biosynthesis   map_B       V1 Contig_C
18 fatty acids biosynthesis   map_B       V1 Contig_F
19 fatty acids biosynthesis   map_B       V1 Contig_E
20 fatty acids biosynthesis   map_B       V1 Contig_N 

Step 3: Subseting data to get rid of unnecessary column (variable) and conciliate together Pathway and Path_ID.

df2.m.subset<-subset(df2.m, select=c("Pathway","Path_ID","value"))
df2.m.subset2<-data.frame(df2.m.subset$value, 
                          KEGG=paste(df2.m.subset$Pathway,df2.m.subset$Path_ID,sep="; "))
> df2.m.subset2[11:20,]
   df2.m.subset.value                            KEGG
11               Contig_A          acid_metabolism; map_A
12               Contig_C          acid_metabolism; map_A
13               Contig_D          acid_metabolism; map_A
14               Contig_C          acid_metabolism; map_A
15               Contig_H          acid_metabolism; map_A
16               Contig_H fatty acids biosynthesis; map_B
17               Contig_C fatty acids biosynthesis; map_B
18               Contig_F fatty acids biosynthesis; map_B
19               Contig_E fatty acids biosynthesis; map_B
20               Contig_N fatty acids biosynthesis; map_B

Step 4: Reshape data in a way each Contig will have assigned all KEGG pathways. For that I used dcast function from reshape2 package. dcast requires unique values. So I used unique from base package...

df2.u<-unique(df2.m.subset2)

...and then dcast

df2.d<-dcast(df2.u, df2.m.subset.value~KEGG, value.var="KEGG")
> df2.d[3:11,]
       df2.m.subset.value acid_metabolism; map_A fatty acids biosynthesis; map_B
3                Contig_C acid_metabolism; map_A                            <NA>
4                Contig_D acid_metabolism; map_A fatty acids biosynthesis; map_B
5                Contig_E acid_metabolism; map_A                            <NA>
6                Contig_F acid_metabolism; map_A fatty acids biosynthesis; map_B
7                Contig_G acid_metabolism; map_A fatty acids biosynthesis; map_B
8                Contig_H acid_metabolism; map_A                            <NA>
9                Contig_I acid_metabolism; map_A fatty acids biosynthesis; map_B
10               Contig_J acid_metabolism; map_A                            <NA>
11               Contig_K acid_metabolism; map_A                            <NA>