Cleb - 5 months ago
Python Question

How to access a KEGG entry without specifying the organism using bioservices?

I am trying to access KEGG via bioservices to get information about a list of genes. The problem is that I do not know beforehand which organism each gene belongs to; my list can contain many genes from different organisms, and I do not know how to retrieve the desired information without specifying the organism.

To give an example:

gene_list = ['YMR293C', 'b3640']

The first gene belongs to yeast, while the second one belongs to E.coli.

If I now try:

from bioservices import KEGG
kegg_con = KEGG()
res = kegg_con.get('b3640', parse=True)['NAME']

I end up with an error, because

kegg_con.get('b3640', parse=True)

does not return a dictionary but just a number (since I do not specify the organism the gene belongs to). It works, however, when I specify the organism (here eco, which stands for E. coli):

kegg_con.get('eco:b3640', parse=True)['NAME']



This returns the correct name, as the KEGG entry for eco:b3640 confirms.

I then tried to get the associated organism by using find. That works fine for YMR293C but fails for b3640:

kegg_con.find('genes', 'YMR293C')

returns

u'sce:YMR293C\tHER2, GEP6, QRS1, RRG6; glutamyl-tRNA(Gln)
amidotransferase subunit HER2 (EC:; K02433
aspartyl-tRNA(Asn)/glutamyl-tRNA(Gln) amidotransferase subunit A
[EC:]\ncal:CaO19.11438\tlikely amidase similar to S.
cerevisiae YMR293C mitochondrial putative glutamyl-tRNA
amidotransferase\ncal:CaO19.3956\tlikely amidase similar to S.
cerevisiae YMR293C mitochondrial putative glutamyl-tRNA
amidotransferase; K02433 aspartyl-tRNA(Asn)/glutamyl-tRNA(Gln)
amidotransferase subunit A [EC:]\n'

from which I can easily extract the required information (in this case: sce). However, when I run

kegg_con.find('genes', 'b3640')

I get

u'cnb:CNBB3640\thypothetical protein; K06316 oligosaccharide
translocation protein RFT1\ncgi:CGB_B3640C\thypothetical
protein\neco:b3640\tdut; deoxyuridinetriphosphatase (EC:;
K01520 dUTP pyrophosphatase [EC:]\nsea:SeAg_B3640\tbfd;
bacterioferritin-associated ferredoxin; K02192
bacterioferritin-associated ferredoxin\nyps:YPTB3640\tconserved
hypothetical protein\nreu:Reut_B3640\tconserved hypothetical
protein\nbbr:BB3640\tphage-related exported
protein\nbcg:BCG9842_B3640\tflagellar hook-associated protein; K02407
flagellar hook-associated protein 2\ncbi:CLJ_B3640\tconserved
hypothetical protein; K09963 uncharacterized
protein\nmmo:MMOB3640\thypothetical protein\nmbo:Mb3640c\tftsH;
membrane-bound protease FTSH (cell division protein) (EC:3.4.24.-);
K03798 cell division protease FtsH [EC:3.4.24.-]\n'

which does not provide the information about E.coli.

My questions are therefore:

1) Is there a way to access the information about a gene based only on its gene ID, without specifying the organism it belongs to?

2) What would be the best way to find out which organism a gene belongs to? And why does find fail when I search for the E. coli gene?


The output of the find() method is a plain string that is not easy to read, but I believe the information you are looking for is in the output. On the third line, you can see:

eco:b3640\tdut; deoxyuridinetriphosphatase (EC:; K01520 dUTP pyrophosphatase [EC:]

Now, I am not sure whether the output from KEGG always has the same structure. If it does, and assuming the line of interest is the third one, you could use:

res = kegg_con.find('genes', 'b3640')
organism = res.split("\n")[2].split()[0].split(":")[0]

You can further check that it is a valid organism as follows:

assert organism in kegg_con.organismIds

To be on the safe side, you could search for the identifier in the string (rather than taking the third line):

[x for x in res.split() if "b3640" in x]
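Note that, with the output shown in the question, the substring test above also picks up mbo:Mb3640c, because "Mb3640c" contains "b3640". A stricter variant is to split each line into organism code and gene ID and keep only exact matches. The following is a minimal sketch under a few assumptions: it targets Python 3, the helper names (parse_find_output, organisms_for_gene, lookup_entry) are made up for illustration and are not part of bioservices, and `client` is any object whose find() and get() behave as described in the question (find() returns the tab/newline-separated string, get() returns a dictionary on success and a number otherwise).

```python
def parse_find_output(text):
    """Split the raw find() string into (organism, gene_id, description) tuples."""
    hits = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Each line looks like "eco:b3640<TAB>description"
        full_id, _, description = line.partition("\t")
        organism, _, gene_id = full_id.partition(":")
        hits.append((organism, gene_id, description))
    return hits


def organisms_for_gene(text, gene_id):
    """Return the organism codes whose gene ID equals gene_id exactly."""
    return [org for org, gene, _ in parse_find_output(text) if gene == gene_id]


def lookup_entry(client, gene_id):
    """Resolve gene_id to 'organism:gene_id' via find(), then fetch it via get().

    Returns (full_id, entry). Either value is None when the step that
    produces it did not succeed (no unique exact match, or get() did not
    return a dictionary).
    """
    raw = client.find("genes", gene_id)
    if not isinstance(raw, str):
        # Assumption: a failed request yields a number rather than text.
        return None, None
    organisms = organisms_for_gene(raw, gene_id)
    if len(organisms) != 1:
        # Zero or several exact matches: do not guess.
        return None, None
    full_id = "%s:%s" % (organisms[0], gene_id)
    entry = client.get(full_id, parse=True)
    if not isinstance(entry, dict):
        return full_id, None
    return full_id, entry


if __name__ == "__main__":
    # Abridged from the find() output shown in the question.
    sample = (
        "cnb:CNBB3640\thypothetical protein\n"
        "eco:b3640\tdut; deoxyuridinetriphosphatase\n"
        "mbo:Mb3640c\tftsH; membrane-bound protease FTSH\n"
    )
    print(organisms_for_gene(sample, "b3640"))
```

With a real connection this could be called as lookup_entry(kegg_con, 'b3640') for each gene in gene_list. Returning None instead of guessing when several organisms share the same gene ID is a deliberate choice here; you may prefer to return all candidates and choose among them yourself.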

Hope this helps.

TC, the main author of bioservices