Keo Rithy, 4 years ago
Python Question

Python. Regular expression not returning output

I am trying to findall instances of the string "PB" and the digits that follow it, but when I call

number_all = re.findall(r'\bPB\b([0-9])\d+', ' '.join(number_list))


the ([0-9])\d+ doesn't return an output. I check my output file, sequence.txt, but there is nothing inside it. If I just do \bPB\b it outputs "PB" but no numbers.

My input file, raw-sequence.txt, looks like this:

WB (19, 21, 24, 46, 60)
WB (12, 11, 9, 23, 49)
PB (18, 21, 10, 5, 5)
WB (2, 14, 2, 29, 67)
WB (1, 8, 1, 16, 52)
PB (2, 11, 8, 3, 4)


How can I output the following lines to sequence.txt?

PB (18, 21, 10, 5, 5)
PB (2, 11, 8, 3, 4)


Here is my current code:

sequence_raw_buffer = open('c:\\sequence.txt', 'a')
with open('c:\\raw-sequence.txt') as f:
    number_list = f.read().splitlines()
    number_all = re.findall(r'\bPB\b([0-9])\d+', ' '.join(number_list))
    unique = list(set(number_all))
    for i in unique:
        sequence_raw_buffer.write(i + '\n')
    print "done"
    f.close()
sequence_raw_buffer.close()

Answer

Given the code you show, a regex is an unnecessary complication for your problem. You can just iterate over the lines of the input file and write out the ones for which line.startswith("PB") returns True.

with open(r'c:\raw-sequence.txt', 'r') as f, open(r'c:\sequence.txt', 'a') as sequence_raw_buffer:
    for line in f:
        if line.startswith("PB"):
            sequence_raw_buffer.write(line)

This illustrates the fact that files can be iterated over line-by-line. Each line keeps its trailing newline during iteration, so I write it out unchanged with write(). Calling print(line, file=sequence_raw_buffer) instead would append a second newline and leave a blank line after every match.

This example also shows how to put multiple context managers into a single with block. You should open all your files in a with block, whether input or output, because I/O errors are a possibility in both directions.
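To try the filtering logic without touching files on disk, here is a minimal sketch that uses io.StringIO objects in place of the two files. The helper name copy_pb_lines and the in-memory SAMPLE text are assumptions made for illustration; they are not part of the original code.

```python
import io

# In-memory stand-in for raw-sequence.txt (sample data from the question).
SAMPLE = (
    "WB (19, 21, 24, 46, 60)\n"
    "WB (12, 11, 9, 23, 49)\n"
    "PB (18, 21, 10, 5, 5)\n"
    "WB (2, 14, 2, 29, 67)\n"
    "WB (1, 8, 1, 16, 52)\n"
    "PB (2, 11, 8, 3, 4)\n"
)


def copy_pb_lines(source, destination):
    """Copy every line that starts with "PB" from source to destination."""
    for line in source:
        if line.startswith("PB"):
            # Lines read from a file-like object keep their trailing
            # newline, so write() reproduces them unchanged.
            destination.write(line)


source = io.StringIO(SAMPLE)
destination = io.StringIO()
copy_pb_lines(source, destination)
result = destination.getvalue()
print(result, end="")
```

Because the helper only needs objects that can be iterated and written to, the same function should work with real files opened in a with block.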

Now, if you are trying to use regex for practice or because the match is really more complicated than what you present here, you can try

PB\s*\((?:\d+,\s*)*\d+\)

This matches as follows:

  • Literal PB
  • Optional unlimited number of spaces \s*
  • Literal open parens \(
  • Optional non-capturing group (?:)*, repeated as many times as necessary, containing
    • At least one digit \d+
    • Literal comma ,
    • Any number of spaces \s*
  • A final number of at least one digit \d+
  • Literal close parens \)
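The breakdown above can be checked against the sample input. The sketch below is only an illustration; the names OLD_PATTERN, NEW_PATTERN and sample_lines are made up for this example. It also shows why the pattern from the question finds nothing: the \b after PB requires the next character to be a non-word character (or the end of the string), while [0-9] requires it to be a digit, so \bPB\b([0-9])\d+ can never match.

```python
import re

# Pattern from the question: \b after "PB" conflicts with the digit that follows.
OLD_PATTERN = re.compile(r'\bPB\b([0-9])\d+')
# Pattern from the answer: PB, optional whitespace, parenthesised number list.
NEW_PATTERN = re.compile(r'PB\s*\((?:\d+,\s*)*\d+\)')

sample_lines = [
    "WB (19, 21, 24, 46, 60)",
    "PB (18, 21, 10, 5, 5)",
    "PB (2, 11, 8, 3, 4)",
]
text = ' '.join(sample_lines)

old_matches = OLD_PATTERN.findall(text)
# The group in NEW_PATTERN is non-capturing, so findall returns whole matches.
new_matches = NEW_PATTERN.findall(text)

print(old_matches)
print(new_matches)
```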

I would not bother concatenating the whole file together and using findall on that though, unless your expression can span multiple lines. I would prefer to still use the approach shown above, because in all but a few cases that I can think of, textual data will generally be delimited by newlines:

pattern = re.compile(r'PB\s*\((?:\d+,\s*)*\d+\)')
...
            if pattern.match(line):
...

Pre-compiling the pattern once saves a lookup in the re module's internal cache on every call, which makes the loop slightly faster, but you could call re.match(..., line) every time as well.
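As a rough sketch of the line-by-line regex approach, the snippet below filters a small list of lines with a pre-compiled pattern. The list named lines and the third sample entry are hypothetical test data; note that match() only succeeds when the pattern matches at the start of the line.

```python
import re

pattern = re.compile(r'PB\s*\((?:\d+,\s*)*\d+\)')

# Hypothetical sample lines, each ending in a newline as a file would supply them.
lines = [
    "WB (1, 8, 1, 16, 52)\n",
    "PB (2, 11, 8, 3, 4)\n",
    "PB without numbers\n",
]

# Keep only the lines where the pattern matches at the beginning.
kept = [line for line in lines if pattern.match(line)]
print(kept)
```

Unlike the startswith("PB") check, this version also rejects lines that begin with PB but are not followed by a parenthesised list of numbers.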
