Identical Identical - 3 months ago 11
Python Question

Parsing with regex

I'm trying to count the number of lines in each block of a file that looks like this:

-StartACheck
---Lines--
-EndACheck
-StartBCheck
---Lines--
-EndBCheck


with this:

count=0
z={}
for line in file:
    s=re.search(r'\-+Start([A-Za-z0-9]+)Check',line)
    if s:
        e=s.group(1)
        for line in file:
            z.setdefault(e,[]).append(count)
            q=re.search(r'\-+End',line)
            if q:
                count=0
                break

for a,b in z.items():
    print(a,len(b))


Basically, I want to store the number of lines inside ACheck, BCheck, etc. in a dictionary, but I keep getting the wrong output.

The expected output is something like this:

A,15
B,9


etc

Answer

You could consider using something like:

import re
from collections import defaultdict

counts = defaultdict(int)  # missing keys start at zero
curr_key = None  # None means "outside any Start/End block"

for line in file:  # file is assumed to be an open file object
    line = line.rstrip('\n')
    # fullmatch returns None when the line is not a marker line
    start = re.fullmatch(r'-+Start([A-Za-z0-9]+)Check', line)
    end = re.fullmatch(r'-+End([A-Za-z0-9]+)Check', line)
    if start:
        curr_key = start.group(1)
    elif end:
        assert curr_key == end.group(1), "ending line {!r} doesn't match the opening line for {}".format(line, curr_key)
        curr_key = None
    else:  # it's a normal line
        counts[curr_key] += 1

Bonus: this detects End lines that don't match the current Start line, and it counts the lines outside any Start/End block under the key None.
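As a quick check of this approach, here is a minimal sketch that wraps the loop in a function and runs it on a small in-memory sample. The function name count_block_lines and the sample lines are made up for illustration, and the patterns assume that marker lines look like -StartACheck and -EndACheck, as in the question.

```python
import re
from collections import defaultdict

# Patterns assumed from the question's sample: dashes, then Start<name>Check / End<name>Check.
START = re.compile(r'-+Start([A-Za-z0-9]+)Check')
END = re.compile(r'-+End([A-Za-z0-9]+)Check')


def count_block_lines(lines):
    """Count the lines between each Start/End pair (hypothetical helper)."""
    counts = defaultdict(int)
    curr_key = None  # None means "outside any block"
    for line in lines:
        line = line.rstrip('\n')
        start = START.fullmatch(line)
        end = END.fullmatch(line)
        if start:
            curr_key = start.group(1)
        elif end:
            if end.group(1) != curr_key:
                raise ValueError('unexpected end line: {!r}'.format(line))
            curr_key = None
        else:  # a normal line, counted under the current key
            counts[curr_key] += 1
    return dict(counts)


sample = [
    '-StartACheck\n', 'a\n', 'b\n', 'c\n', '-EndACheck\n',
    '-StartBCheck\n', 'a\n', 'b\n', '-EndBCheck\n',
]
result = count_block_lines(sample)
print(result)  # {'A': 3, 'B': 2}
```

Raising ValueError instead of using assert is a design choice of this sketch: assertions are skipped when Python runs with -O, so an explicit exception keeps the check active.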

Without defaultdict

Replace the else clause with:

    else:  # it's a normal line
        if curr_key in counts:
            counts[curr_key] += 1
        else:
            counts[curr_key] = 1

And define counts as a regular dict:

counts = {}

Fixing the given code

The given code seems to work:

Here is an (apparently valid) file definition:

FILE = iter((  # iterator over lines
    '-StartACheck',
    'a',
    'b',
    'c',
    '-EndACheck',
    '-StartBCheck',
    'a',
    'b',
    '-EndBCheck',
))

Here are the missing definitions:

import re
z = {}

And the provided code:

count=0
for line in FILE:
    s=re.search(r'\-+Start([A-Za-z0-9]+)Check',line)
    if s:
        e=s.group(1)
        for line in FILE:
            z.setdefault(e,[]).append(count)
            q=re.search(r'\-+End',line)
            if q:
                count=0
                break

for a,b in z.items():
    print(a,len(b))

Output is:

A 4
B 3

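Each count is one higher than the number of lines inside the block, because the closing line (EndACheck) is counted too. The Start line is consumed by the outer loop, but the append runs for every line of the inner loop, before the End check:

    if s:
        e=s.group(1)
        for line in FILE:
            z.setdefault(e,[]).append(count)  # runs for every line, including the End line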
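To count only the lines strictly between the markers, one option is to test for the End line before recording anything. The sketch below keeps the structure of the original code but stores a plain integer per key instead of a list; the name lines is an illustrative stand-in for the file iterator.

```python
import re

# Illustrative stand-in for the file: a single iterator shared by both loops.
lines = iter((
    '-StartACheck', 'a', 'b', 'c', '-EndACheck',
    '-StartBCheck', 'a', 'b', '-EndBCheck',
))

z = {}
for line in lines:
    s = re.search(r'-+Start([A-Za-z0-9]+)Check', line)
    if s:
        e = s.group(1)
        z.setdefault(e, 0)
        for line in lines:  # continues from the line after the Start marker
            if re.search(r'-+End', line):
                break  # the End marker itself is not counted
            z[e] += 1

for a, b in z.items():
    print(a, b)
```

On this sample it prints A 3 and B 2.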

The error could come from how the file's lines are read. If the file is read as:

file = tuple(open('filename.ext'))

Then each for loop gets its own fresh iterator over the tuple, so the inner loop of the source code restarts from the first line of the file for every line of the outer loop. Example:

filelines = (1, 2, 3, 4)
for line in filelines:
    for line in filelines:
        print(line)

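And the almost identical version (valid in this case), where both loops share a single iterator, so the inner loop continues from where the outer loop stopped: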

filelines = iter((1, 2, 3, 4))
for line in filelines:
    for line in filelines:
        print(line)
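The difference between the two snippets can be made visible by recording which pairs the nested loops visit. This is a small sketch; the helper name nested_visits is made up for illustration.

```python
def nested_visits(lines):
    """Return the (outer, inner) pairs visited by two nested loops over lines."""
    visits = []
    for outer in lines:
        for inner in lines:
            visits.append((outer, inner))
    return visits


# A tuple hands out a fresh iterator to each loop: the inner loop restarts every time.
as_tuple = nested_visits((1, 2, 3, 4))

# A single iterator is shared: the inner loop consumes whatever the outer loop left.
as_iterator = nested_visits(iter((1, 2, 3, 4)))

print(len(as_tuple))  # 16
print(as_iterator)    # [(1, 2), (1, 3), (1, 4)]
```

A file object returned by open() is its own iterator, so looping over it directly behaves like the second case, while wrapping it in tuple() or list() gives the first.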