James Vici James Vici - 3 months ago 10
Python Question

Find and Edit Text File

I'm looking to find if there is a way of automating this process. Basically I have 300,000 rows of data needed to download on a daily basis. There are a couple of rows that need to be edited before it can be uploaded to SQL.


Jordan || Michael | 23 | Bulls | Chicago

Bryant | Kobe ||| 8 || LA


What I want to accomplish is to just have 4 vertical bars per row. Normally, I would search for a keyword then edit it manually then save. These two are the only anomalies in my data.


  1. Find "Jordan", then remove the excess 1 vertical bar "|" right after it.

  2. I need to find "Kobe", then remove the two excess vertical bars "|" right after it.



Correct format is below -


Jordan | Michael | 23 | Bulls | Chicago

Bryant | Kobe | 8 || LA


Not sure if this can be done in vbscript or Python.
Any help would be appreciated. Thanks!

Answer

Python or vbscript could be used but they are overkill for something this simple. Try sed:

$ sed -E 's/(Jordan *)\|/\1/g; s/(Kobe *)\| *\|/\1/g' file 
Jordan | Michael | 23 | Bulls | Chicago
Bryant | Kobe | 8 || LA

To change the file in-place:

sed -Ei.bak 's/(Jordan *)\|/\1/g; s/(Kobe *)\| *\|/\1/g' file 

How it works

sed reads and processes a file line by line. In our case, we need only the substitute command which has the form s/old/new/g where old is a regular expression and, if it is found, it is replaced by new. The optional g at the end of the command tells sed to perform the substitution command 'globally', meaning not just once but as many times as it appears on the line.

  • s/(Jordan *)\|/\1/g

    This tells sed to look for Jordan followed by zero or more spaces followed by a vertical bar and remove the vertical bar.

    In more detail, the parens in (Jordan *) tell sed to save the string Jordan followed by zero or more spaces as a group. In the replacement side, we reference that group as \1.

  • s/(Kobe *)\| *\|/\1/g

    Similarly, this tells sed to look for Kobe followed by zero or more spaces followed by a vertical bar and remove the vertical bar.

Using python

Using the same logic as above, here is a python program:

$ cat kobe.py
import re
with open('file') as f:
    for line in f:
        line = re.sub(r'(Jordan *)\|', r'\1', line)
        line = re.sub(r'(Kobe *)\| *\|', r'\1', line)
        print(line.rstrip('\n'))
$ python kobe.py
Jordan | Michael | 23 | Bulls | Chicago
Bryant | Kobe | 8 || LA
Comments