jax jax -4 years ago 129
Python Question

Extract lines from a huge text file between two identifiers (start and end) using Python

I wrote a function to extract a particular block of text from a large text file. Example text is shown below:

ATP(1):C39(3) - A:TYR(58):CD2(67)
ATP(1):C39(3) - A:TYR(58):CE2(69)
ATP(1):C59(6) - A:ILE(61):CD1(100)
ATP(1):C59(6) - A:LYS(87):CE(344)

Hydrogen bonds:
Location of Donor | Sidechain/Backbone | Secondary Structure | Count
-------------------|--------------------|---------------------|-------
LIGAND | SIDECHAIN | OTHER | 1

RECEPTOR | BACKBONE | BETA | 1

Raw data:
ATP(1):O2A(9) - A:ILE(61):HN(93) - A:ILE(61):N(92)

Hydrophobic contacts (C-C):
Sidechain/Backbone | Secondary Structure | Count
--------------------|---------------------|-------
SIDECHAIN | OTHER | 2
SIDECHAIN | BETA | 23

Raw data:
ATP(1):C39(3) - A:TYR(58):CD2(67)
ATP(1):C39(3) - A:TYR(58):CE2(69)
ATP(1):C59(6) - A:ILE(61):CD1(100)
ATP(1):C59(6) - A:LYS(87):CE(344)
ATP(1):C4(23) - A:PHE(209):CD1(1562)
ATP(1):C4(23) - A:PHE(209):CE1(1564)
ATP(1):C2(26) - A:PHE(209):CD2(1563)
ATP(1):C6(28) - A:PHE(209):CB(1560)
ATP(1):C6(28) - A:PHE(209):CG(1561)
ATP(1):C6(28) - A:PHE(209):CD1(1562)
ATP(1):C6(28) - A:VAL(286):CG2(2266)

pi-pi stacking interactions:
ATP(1):C8(30) - A:LYS(87):CG(342)
ATP(1):C8(30) - A:GLU(159):CD(1066)
ATP(1):C8(30) - A:PHE(209):CE1(1564)


I wrote a function to extract the chunk:

from itertools import islice

def start_end_points(file_name):
    f = open(file_name)
    lines = f.readlines()

    for s, line in enumerate(lines):
        if "Hydrogen bonds:" in line:
            print s

    for e, line in enumerate(lines):
        if "pi-pi stacking interactions:" in line:
            print e

    print islice(lines, s, e)

start_end_points("foo.txt")


Is there a more efficient way to write this code? I want to use it as part of a web tool, so efficiency is very important.

Thanks.

Answer Source

You don't need to load the whole file into memory!

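def start_end_points(file_name):
    with open(file_name) as f:
        found = False
        for line in f:  # the file object yields one line at a time
            if found or ("Hydrogen bonds:" in line):
                found = True
                print line,  # trailing comma: line already ends with a newline
            if "pi-pi stacking interactions:" in line:
                break  # end marker reached, stop reading

start_end_points("foo.txt")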

That way you keep only one buffer in memory, process each line once, and stop reading the file as soon as you reach the pi-pi... line. Note that the pi-pi... line itself is also printed, because the print happens before the end check.
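
For use inside a web tool it may be handier to get the lines back rather than print them. The following is a minimal sketch of the same streaming idea as a generator, written for Python 3 (an assumption; the code above uses Python 2 print statements). The name lines_between and the in-memory sample are hypothetical and only for illustration. Unlike the function above, this version leaves out the end-marker line, which seems closer to what islice(lines, s, e) in the question was aiming for.

```python
import io


def lines_between(lines, start_marker, end_marker):
    """Yield lines from the first one containing start_marker up to,
    but not including, the first later line containing end_marker."""
    found = False
    for line in lines:
        if not found:
            if start_marker in line:
                found = True
                yield line
        elif end_marker in line:
            return  # end marker reached: stop consuming the input
        else:
            yield line


# Small in-memory stand-in for the real file (hypothetical sample data).
sample = io.StringIO(
    "ATP(1):C59(6) - A:LYS(87):CE(344)\n"
    "\n"
    "Hydrogen bonds:\n"
    "Raw data:\n"
    "ATP(1):O2A(9) - A:ILE(61):HN(93) - A:ILE(61):N(92)\n"
    "\n"
    "pi-pi stacking interactions:\n"
    "ATP(1):C8(30) - A:LYS(87):CG(342)\n"
)

block = list(lines_between(sample, "Hydrogen bonds:", "pi-pi stacking interactions:"))
print("".join(block), end="")
```

With a real file, the same generator can be fed an open file object, for example by opening foo.txt in a with block and passing the file object as the first argument, so lines are still read one at a time and reading stops at the end marker.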
