Hami Hami - 1 month ago 12
Python Question

Counting gene segments in Python and printing them in columns

I need to convert a text file into a table of species and counts of gene segments. For this I want to create a dictionary whose keys are the species names, which I find with a pattern. Every key should hold three counts, each starting at 0. With further patterns I want to look for the gene segments, and whenever one is found, increase the count for it.

I'm searching for 3 different gene segments, which is why I only want to increase

. Is there a way to do this with python?

This is the code I have written so far, but I don't know how to continue.

matrix = {}
pattern = re.compile(r"[A-Za-z ]*")
pattern_v = re.compile(r";[A_Z]+V[0-9]?;")
pattern_d = re.compile(r";[A_Z]+D[0-9]?;")
pattern_j = re.compile(r";[A_Z]+J[0-9]?;")
for i in file.readlines():
    name = pattern.search(i)
    if pattern_v.search:
        if name.group() not in matrix:
            matrix.update(name.group(), (1,0,0))
        matrix[(name.group()[0]] = matrix[(name.group()[0]]+1

As you can see, if
was found, I want to increase the item at position zero.
I know that the last command doesn't work; I just wrote it to explain what I want to do.

EDIT ADD: I got the algorithm working, but now I have the problem that I can't print the result the way I want.

{'Mus cookii': [0, 0, 0], 'Ovis aries': [0, 7, 9], 'Camelus dromedarius': [2, 0, 0], 'Danio rerio': [1, 1, 5], 'Mus saxicola': [0, 0, 0], 'Homo sapiens': [21, 6, 33], 'Rattus norvegicus': [0, 1, 12], 'Sus scrofa': [0, 5, 13], 'Vicugna pacos': [0, 9, 7], 'Macaca nemestrina': [0, 0, 0], 'Mus spretus': [4, 0, 2], 'Mus musculus': [30, 5, 28], 'Mus minutoides': [0, 0, 0], 'Oncorhynchus mykiss': [0, 11, 16], 'Canis lupus familiaris': [4, 2, 0], 'Bos taurus': [2, 5, 12], 'Cercocebus atys': [0, 0, 0], 'Oryctolagus cuniculus': [0, 0, 10], 'Rattus rattus': [0, 0, 0], 'Ornithorhynchus anatinus': [0, 4, 9], 'Macaca mulatta': [1, 3, 16], 'Papio anubis anubis': [0, 0, 0], 'Macaca fascicularis': [0, 0, 0], 'Mus pahari': [0, 0, 0]}

is the output, but I need to make it easier to read. The idea is to produce output in columns (name, v, d, j). I tried:

def printStatistics(dict):
    for i in range(0,len(dict)):
        print(" {0:30s}{1:30d}{2:30d}{3:30d}".format(dict[i],dict[i][0],dict[i][1],dict[i][2]), sep = "")

but I get

"TypeError: non-empty format string passed to object.format"


You can make your algorithm work with collections.defaultdict:

input data

import re
from collections import defaultdict
import numpy as np

data= '''Bos taurus;TRGV8-1;F;Bos taurus T cell receptor gamma variable 8-1;1;4;4q3.1;AY644517;-;
Bos taurus;TRGV8-2;(F) F;Bos taurus T cell receptor gamma variable 8-2;2;4;4q3.1;AY644517;-;
Camelus dromedarius;TRDV1S3;F;Camelus dromedarius T cell receptor delta variable 1S3;1;-;-;FN298223;-;
Camelus dromedarius;TRDV1S4;F;Camelus dromedarius T cell receptor delta variable 1S4;2;-;-;FN298224;-;
Canis lupus familiaris;TRBD2;F;Canis lupus familiaris T cell receptor beta diversity 2;1;16;-;HE653929;-;'''
patterns = [
result = defaultdict(lambda:np.array([0,0,0]))


for line in data.splitlines():
    result[line.split(';')[0]]+=np.array([len(pattern.findall(line)) for pattern in patterns])


defaultdict(<function <lambda> at 0x7f622f81c140>, {'Camelus dromedarius': array([2, 0, 0]), 'Canis lupus familiaris': array([0, 1, 0]), 'Bos taurus': array([2, 0, 0])})

defaultdict works like a dictionary, but every missing key is initialized on first access with the value returned by a callable of your choice. Here, lambda: np.array([0,0,0]) lets you increment the group occurrences immediately, instead of having to check for the key, insert it, and then increment.
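To make the defaultdict behaviour concrete, here is a minimal sketch that uses plain lists instead of numpy arrays. The species names are taken from the sample data; the variable name counts is only illustrative.

```python
from collections import defaultdict

# Each missing key gets its own fresh [0, 0, 0] list the first time it is accessed.
counts = defaultdict(lambda: [0, 0, 0])

counts["Bos taurus"][0] += 1              # key is created on access, then incremented
counts["Bos taurus"][0] += 1              # key already exists, just incremented
counts["Canis lupus familiaris"][1] += 1  # a different key gets a separate list

print(dict(counts))
# {'Bos taurus': [2, 0, 0], 'Canis lupus familiaris': [0, 1, 0]}
```

Because the factory is called once per missing key, every species gets an independent list rather than a shared one.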

I decided to work with numpy arrays because they support element-wise addition, which makes the algorithm tidier; you could also do it without numpy.
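As a rough sketch of the numpy-free variant, the following counts the segments with plain lists and then formats the result in columns, as asked in the edit to the question. Several things here are assumptions rather than part of the original code: the regular expressions (uppercase letters, then V, D or J, then a digit, matched against the second field only), the helper names count_segments and format_statistics, and the column widths. The sample lines are the first three fields of the rows shown above.

```python
import re
from collections import defaultdict

# Assumed patterns (the original list is not shown): uppercase locus letters,
# then V, D or J, then a digit, tested against the gene-name field only.
SEGMENT_PATTERNS = [re.compile(r"[A-Z]+" + kind + r"[0-9]") for kind in "VDJ"]


def count_segments(lines):
    """Return {species: [v_count, d_count, j_count]} for semicolon-separated lines."""
    counts = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        fields = line.split(";")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        species, gene = fields[0], fields[1]
        row = counts[species]  # creates [0, 0, 0] for a species seen for the first time
        for index, pattern in enumerate(SEGMENT_PATTERNS):
            if pattern.match(gene):
                row[index] += 1
    return dict(counts)


def format_statistics(counts):
    """Build a header line plus one aligned line per species, sorted by name."""
    rows = ["{0:30s}{1:>5s}{2:>5s}{3:>5s}".format("name", "v", "d", "j")]
    for name, (v, d, j) in sorted(counts.items()):
        rows.append("{0:30s}{1:5d}{2:5d}{3:5d}".format(name, v, d, j))
    return "\n".join(rows)


# First three fields of the sample rows from the answer above.
sample = [
    "Bos taurus;TRGV8-1;F",
    "Bos taurus;TRGV8-2;(F) F",
    "Camelus dromedarius;TRDV1S3;F",
    "Camelus dromedarius;TRDV1S4;F",
    "Canis lupus familiaris;TRBD2;F",
]

counts = count_segments(sample)
print(format_statistics(counts))
```

Iterating over counts.items() yields the species name together with its list, so the name can go into the string field and each count into its own integer field. The TypeError in the question is the kind of error Python raises when an object without its own format support, such as a list, is passed to a field like {0:30s}, so unpacking the three counts before formatting is likely what was missing.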