Saras Arya - 3 months ago
Python Question

How to parse certain text data?

I have a text file with such a format:

B2100 Door Driver Key Cylinder Switch Failure B2101 Head Rest Switch Circuit Failure B2102 Antenna Circuit Short to Ground, plus 1000 lines more.


This is how I want it to be:

B2100*Door Driver Key Cylinder Switch Failure
B2101*Head Rest Switch Circuit Failure
B2102*Antenna Circuit Short to Ground
B2103*Antenna Not Connected
B2104*Door Passenger Key Cylinder Switch Failure


so that I can paste this data into LibreOffice Calc and have it split into two columns, one for the code and one for the meaning.

My thought process:

Apply a regular expression over Bxxxx and put an asterisk in front of it (it acts as a delimiter) and a \n before the meaning (I don't know if that will work), and remove whitespace until the next character is encountered.

I am trying to isolate the B2100 codes and have failed so far. My naive attempt:

import re

text = """B2100 Door Driver Key Cylinder Switch Failure B2101 Head Rest Switch Circuit Failure B2102 Antenna Circuit Short to Ground B2103 Antenna Not Connected B2104 Door Passenger Key Cylinder Switch Failure B2105 Throttle Position Input Out of Range Low B2106 Throttle Position Input Out of Range High B2107 Front Wiper Motor Relay Circuit Short to Vbatt B2108 Trunk Key Cylinder Switch Failure"""
# text_arr = text.split("\^B[0-9][0-9][0-9][0-9]$\gi");
l = re.compile('\^B[0-9][0-9][0-9][0-9]$\gi').split(text)
print(l)


This outputs:

['B2100\tDoor Driver Key Cylinder Switch Failure B2101\tHead Rest Switch Circuit Failure B2102\tAntenna Circuit Short to Ground B2103\tAntenna Not Connected B2104\tDoor Passenger Key Cylinder Switch Failure B2105\tThrottle Position Input Out of Range Low B2106\tThrottle Position Input Out of Range High B2107\tFront Wiper Motor Relay Circuit Short to Vbatt B2108\tTrunk Key Cylinder Switch Failure']


How do I achieve the desired result?

To break it down further, what I want to do is this:

Break everything down into an array of codes (B1001) and meanings (the text after each code), and then apply each operation (the \n thing) to them individually. If you have a better idea for how to do the whole thing, I would love to hear it.

Answer

Basically, you want to:

  • Find any Bxxxx strings in the input.
  • Replace any whitespace before them with a newline.
  • Replace any whitespace after them with a *.

This can all be done with a single re.sub():

re.sub(r'\s*(B\d{4})\s*', r'\n\1*', text).strip()

Matching pattern:

\s*              # Any amount of whitespace
   (B\d{4})      # "B" followed by exactly 4 digits
           \s*   # Any amount of whitespace
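If you want to keep those comments inside the actual code, the same pattern can be compiled with re.VERBOSE, which tells the regex engine to ignore whitespace and # comments in the pattern. This is only a sketch; the name CODE_PATTERN and the sample string are made up for illustration:

```python
import re

# re.VERBOSE lets the pattern span several lines and carry comments.
CODE_PATTERN = re.compile(r"""
    \s*          # any amount of whitespace
    (B\d{4})     # "B" followed by exactly 4 digits (capture group 1)
    \s*          # any amount of whitespace
""", re.VERBOSE)

match = CODE_PATTERN.search("Switch Failure B2101 Head Rest")
print(match.group(1))        # B2101
print(repr(match.group(0)))  # ' B2101 ' (the surrounding whitespace is part of the match)
```

The second print shows that the whitespace on both sides of the code is consumed by the match, which is why the substitution can replace it.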

Replacement pattern:

\n               # Newline
  \1             # The first parenthesized sequence from the matching pattern (B####)
    *            # Literal "*"
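To see what the replacement does before any cleanup, here is a small sketch on a shortened sample (the variable names are just for illustration). Printing with repr() makes the inserted newlines visible:

```python
import re

sample = ("B2100 Door Driver Key Cylinder Switch Failure "
          "B2101 Head Rest Switch Circuit Failure")

# Each match (optional whitespace + code + optional whitespace) is replaced
# by a newline, the captured code (\1), and a literal asterisk.
result = re.sub(r'\s*(B\d{4})\s*', r'\n\1*', sample)
print(repr(result))
# '\nB2100*Door Driver Key Cylinder Switch Failure\nB2101*Head Rest Switch Circuit Failure'
```

Note that the result starts with a newline, because the very first code also gets one put in front of it.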

The purpose of the strip() is to prune any leading or trailing whitespace, including the leading newline that results from substituting the first B#### sequence.
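Putting it together, a minimal end-to-end sketch could look like this. The function name format_codes is hypothetical, and the sample text is a shortened version of the one in the question; for the real file you would read its contents into text instead:

```python
import re

# Sample input taken from the question; the real file has many more entries.
text = ("B2100 Door Driver Key Cylinder Switch Failure "
        "B2101 Head Rest Switch Circuit Failure "
        "B2102 Antenna Circuit Short to Ground "
        "B2103 Antenna Not Connected "
        "B2104 Door Passenger Key Cylinder Switch Failure")

def format_codes(raw):
    """Put each Bxxxx code on its own line, separated from its meaning by '*'."""
    return re.sub(r'\s*(B\d{4})\s*', r'\n\1*', raw).strip()

formatted = format_codes(text)
print(formatted)

# If you also want the "code and meaning" array from the question,
# split each line once on the '*' delimiter.
pairs = [line.split('*', 1) for line in formatted.splitlines()]
print(pairs[0])  # ['B2100', 'Door Driver Key Cylinder Switch Failure']
```

One caveat: the pattern matches any "B" followed by four digits, so if a meaning ever contained such a sequence it would be treated as a new code. That does not happen in the sample data, but it is worth checking on the full file.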