Kay Carosfeild - 8 months ago
Python Question

Using Python to Scrape Data Between Patterns

I have a data set made up of blocks separated by lines of dashes, and I want to keep only certain blocks. A block should be kept if the first word on its first line matches

regex = re.compile(r'\A([A-Z][a-z][A-Z]\w*[-]\w*')
. How would I extract the data between the dashed lines, keep each block whose identifier matches "regex", and remove each block whose identifier does not?

For example: I want to keep the data under AbD000000-10 and DeD000000-10, but not the data under 888888-10.

-------------------------------------------------------------------------------

AbD000000-10
Issue 1
Issue 2 Q Q Q
ID: 2 MsEhdiehsla2 MsEhasdhsla2 hiGndiehsla2
ID: 3

-------------------------------------------------------------------------------
888888-10
Q Q Q
ID: 2 MsEhdiehsla2 MsEhasdhsla2 hiGndiehsla2
ID: 3

-------------------------------------------------------------------------------
DeD000000-10
Issue 1
Issue 2 Q Q Q
ID: 2 MsEhdiehsla2 MsEhasdhsla2 hiGndiehsla2
ID: 3

-------------------------------------------------------------------------------


I would like my output to look like:

-------------------------------------------------------------------------------

AbD000000-10
Issue 1
Issue 2 Q Q Q
ID: 2 MsEhdiehsla2 MsEhasdhsla2 hiGndiehsla2
ID: 3

-------------------------------------------------------------------------------
DeD000000-10
Issue 1
Issue 2 Q Q Q
ID: 2 MsEhdiehsla2 MsEhasdhsla2 hiGndiehsla2
ID: 3

-------------------------------------------------------------------------------


How would I do this in Python?

I am able to grab all the information inside, but is there a way to split the data into segments that I can then work with?

Thank you!

Answer Source

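I think your regex is broken: the opening parenthesis is never closed, and that \A doesn't belong, since it only matches at the very start of the whole string.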
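To make the two problems concrete, here is a small check. It is only a sketch: the `block` string is a made-up sample that starts with a blank line, the way the chunks tend to after splitting on the separator.

```python
import re

# The pattern from the question: the "(" is never closed, so it cannot compile.
try:
    re.compile(r'\A([A-Z][a-z][A-Z]\w*[-]\w*')
    compiles = True
except re.error:
    compiles = False

block = "\nAbD000000-10\nIssue 1\n"  # hypothetical block that starts with a blank line

# \A matches only at the very start of the string, even with re.MULTILINE.
anchored_a = re.search(r'\A[A-Z][a-z][A-Z]\w*-\w*', block, re.MULTILINE)
# ^ with re.MULTILINE matches at the start of any line.
anchored_caret = re.search(r'^[A-Z][a-z][A-Z]\w*-\w*', block, re.MULTILINE)

print(compiles)                 # False
print(anchored_a)               # None
print(anchored_caret.group(0))  # AbD000000-10
```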

In this approach, I assume that the separator will always be the same. I assume you don't want to break the blocks down any further. This grabs only the blocks you want. You can format them however is convenient (including printing the separator back out when you print the blocks).

import re

# Identifier such as "AbD000000-10": upper, lower, upper, word chars, hyphen, word chars.
r = re.compile(r'[A-Z][a-z][A-Z]\w*-\w*')
# Match a whole separator line, regardless of how many dashes it has.
sep = re.compile(r'^#-+#[ \t]*$', re.MULTILINE)
input_text = """
#------------------------------------------------------------------------------#

AbD000000-10
Issue 1  
Issue 2          Q             Q            Q 
ID: 2             MsEhdiehsla2 MsEhasdhsla2  hiGndiehsla2
ID: 3 

#------------------------------------------------------------------------------#
888888-10
         Q             Q            Q 
ID: 2             MsEhdiehsla2 MsEhasdhsla2  hiGndiehsla2
ID: 3 

#------------------------------------------------------------------------------#
DeD000000-10
Issue 1  
Issue 2          Q             Q            Q 
ID: 2             MsEhdiehsla2 MsEhasdhsla2  hiGndiehsla2
ID: 3 

#------------------------------------------------------------------------------#
"""
# Split into blocks, then keep those whose first non-blank line starts with an identifier.
s = sep.split(input_text)
keep = [x for x in s if r.match(x.lstrip())]

for v in keep:
    print(v)
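If you want segments you can work with afterwards, one option is to key each kept block by its identifier and rebuild the dashed layout only when printing. This is a rough sketch, not the only way to do it: `keep_blocks` and `render` are names I made up, the separator is assumed to be any line of three or more dashes (with or without the surrounding `#`), and a repeated identifier would overwrite the earlier block.

```python
import re

# Assumed identifier shape, taken from the examples (e.g. "AbD000000-10").
ID_RE = re.compile(r'[A-Z][a-z][A-Z]\w*-\w*')
# Assumption: a separator is a line of 3+ dashes, optionally wrapped in '#'.
SEP_RE = re.compile(r'^#?-{3,}#?[ \t]*$', re.MULTILINE)


def keep_blocks(text):
    """Return {identifier: [lines]} for blocks whose first line starts with an identifier."""
    kept = {}
    for chunk in SEP_RE.split(text):
        lines = [ln.rstrip() for ln in chunk.strip().splitlines()]
        if not lines:
            continue  # empty chunk before the first or after the last separator
        m = ID_RE.match(lines[0])
        if m:
            kept[m.group(0)] = lines  # a repeated identifier overwrites the earlier block
    return kept


def render(kept, sep="-" * 79):
    """Rebuild the text with a separator before, between, and after the blocks."""
    parts = [sep]
    for lines in kept.values():
        parts.extend([*lines, "", sep])
    return "\n".join(parts)
```

Usage would be something like `kept = keep_blocks(input_text)` followed by `print(render(kept))`; `kept["AbD000000-10"]` then gives you the lines of that one block.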

Really, though, if you can help it, it would be better to consume this data from a cleaner source. If this is a log file, you may not have much control over it. But if you do, see whether you can get the data in a more structured format (CSV, maybe?).
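If you cannot change the source, you could at least convert the blocks you keep into CSV yourself. A minimal sketch, assuming you already have a dict that maps each identifier to the lines of its block; the function name and the column names `identifier` and `line` are made up for illustration:

```python
import csv
import io


def blocks_to_csv(kept):
    """Write one CSV row per (identifier, line) pair and return the CSV text.

    `kept` is assumed to map an identifier to the list of lines in its block.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["identifier", "line"])  # hypothetical column names
    for ident, lines in kept.items():
        for line in lines:
            writer.writerow([ident, line])
    return buf.getvalue()


print(blocks_to_csv({"AbD000000-10": ["Issue 1", "ID: 3"]}))
```

To write to a file instead, the same `csv.writer` can be given a file opened with `newline=""`.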
