Mary Mary - 2 months ago 9
Python Question

exporting some of data in a text file to a CSV file, while text file has different types of delimiters

TT1 4444 | Drowsy | 9 19 bit drowsy
TT2 45888 | Blurred see - hazy | 29 50 little seeing vision
TT4 45933 | Excessive upper pain | 62 78 pain problems


I want export part of the information to a excel sheet or CSV file. my expected CSV file is like this:

Column 1 Column 2 column 3
4444 Drowsy bit drowsy
45888 Blurred see - hazy little seeing vision
45933 Excessive upper pain pain problems


As you see, I do not need information in first, fourth, and fifth column of a text file.

UPDATE FOR THE QUESTION:
structure of information in some of the rows is as follows:

TT6 112397013 | ari | or 76948002|pain| 22 345 agony


The expected output is as follows:

Column 1 Column 2 column 3
112397013 air agony
76948002 pain agony


Any suggestion ? Thanks !

Answer

I'm assuming you can read in the data as a list of strings. The code parses them using regular expressions (re) to the desired output which you can then write out to a csv file:

import re

#read lines from file using:
#lines = my_file.readlines()
lines = ["TT1  4444 | Drowsy | 9 19   bit drowsy",
         "TT2  45888 | Blurred see - hazy | 29 50 little seeing vision",
         "TT4  45933 | Excessive upper pain  | 62 78  pain problems"]

#Looks for TT some whitespace then numbers until another whitespace and vertical bar
tt_num_pattern = "TT.*\s([0-9].*?)\s"

#Only looks for letters after a space
describe_pattern = "\s(\D.*)"

#Format the output lines
out_lines = []
for line in lines:
    split_line = line.split("|")
    tt_num = re.findall(tt_num_pattern,split_line[0])[0]

    state = split_line[1].strip() #Just trim edges of whitespace
    describe = re.findall(describe_pattern,split_line[2])[0]
    describe = describe.strip()

    out_line = tt_num+","+state+","+describe
    out_lines.append(out_line)

#Print them out (would normally want to write to file after header line)
for out_line in out_lines:
    print out_line

output:

4444,Drowsy,bit drowsy
45888,Blurred see - hazy,little seeing vision
45933,Excessive upper pain,pain problems

Glad this helped. Here is the update you asked for. Honestly it's not very good (flexible) code but it works:

import re

#read lines from file using:
#lines = my_file.readlines()
lines = ["TT1  4444 | Drowsy | 9 19   bit drowsy",
         "TT2  45888 | Blurred see - hazy | 29 50 little seeing vision",
         "TT4  45933 | Excessive upper pain  | 62 78  pain problems",
         "TT6 112397013 | air | or 76948002|pain| 22 345  agony"]

#Looks for TT some whitespace then numbers until another whitespace and vertical bar
tt_num_pattern = "TT.*\s([0-9].*?)\s"

#Only looks for letters after a space
describe_pattern = "\s(\D.*)"

#Format the output lines
out_lines = []
for line in lines:

    split_line = line.split("|")

    #If there is an 'or'
    if len(split_line) == 5:
        tt_num = split_line[2].replace("or","").strip()
        state = split_line[3].strip()
        describe = re.findall(describe_pattern,split_line[4])[0].strip()
        out_line = tt_num+","+state+","+describe
        out_lines.append(out_line)

        tt_num = re.findall(tt_num_pattern,split_line[0])[0]
        state = split_line[1].strip()
        out_line = tt_num+","+state+","+describe
        out_lines.append(out_line)


    #If there is no 'or'
    elif len(split_line) == 3:
        tt_num = re.findall(tt_num_pattern,split_line[0])[0]

        state = split_line[1].strip() #Just trim edges of whitespace
        describe = re.findall(describe_pattern,split_line[2])[0]
        describe = describe.strip()

        out_line = tt_num+","+state+","+describe
        out_lines.append(out_line)

#Print them out (would normally want to write to file after header line)
for out_line in out_lines:
    print out_line

updated output:

4444,Drowsy,bit drowsy
45888,Blurred see - hazy,little seeing vision
45933,Excessive upper pain,pain problems
76948002,pain,agony
112397013,air,agony