SCY SCY - 4 months ago 16
Python Question

reading a complicated CSV file with python

I didn't know how to title this question so if it should be renamed please inform me and I will.

I'm reading a

csv
file that I saved off from a piece of equipment I used in taking some measurements. The data and various other key information is all being saved off. I've worked all day long on this and I cannot figure out how to retrieve every piece of information from this file properly. I need to access each piece of data/information and after doing so plot the data and read in the other various information like the data/timestamp/timezone/modelnumber/serialnumber and so on... I apologize if this question is too generic but I'm at lost on how to resolve this.

I've typed up several versions of code so I'll list only what I've been able to get to work. I have no idea why I have to use
sep=delimiter
. I would think delimiter=',' would work but it doesn't. I've found from research that
header=none
because my file doesn't have headings.

Canopy is telling me that the engine 'C' will not work so as a result I specify 'python'. From the output of this code it seems to capture everything. But its telling me I only have one column and have no idea how to separate out all this information.

Here is part of my csv file.

! FILETYPE CSV
! VERSION 1.0 1
! TIMESTAMP Friday 15 April 2016 04:50:05
! TIMEZONE (GMT+08:00) Kuala Lumpur Singapore
! NAME Keysight Technologies
! MODEL N9917A
! SERIAL US52240515
! FIRMWARE_VERSION A.08.05
! CORRECTION
! Application SA
! Trace TIMESTAMP: 2016-04-15 04:50:05Z
! Trace GPS Info...
! GPS Latitude:
! GPS Longitude:
! GPS Seconds Since Last Read: 0
! CHECKSUM 1559060681
! DATA Freq SA Max Hold SA Blank SA Blank SA Blank
! FREQ UNIT Hz
! DATA UNIT dBm
BEGIN
2000000000,-62.6893499803169,0,0,0
2040000000,-64.1528386206532,0,0,0
2080000000,-63.7751897198055,0,0,0
2120000000,-63.663056855945,0,0,0
2160000000,-64.227155790167,0,0,0
2200000000,-63.874804848758,0,0,0
END


Here is my code:

import pandas as pd
df = pd.read_csv('/Users/_XXXXXXXXX_/Desktop/TP1_041416_C.csv',
sep='delimter',
header=None,
engine='python')

Answer

UPDATE: - this version will read and parse the header part first, it will generate info dictionary from the parsed header and will prepare skiprows list for the pd.read_csv() function

from collections import defaultdict
import io
import re
import pandas as pd
import pprint as pp

fn = r'D:\temp\.data\36671176.data'

def parse_header(filename):
    """
    parses useful (check `header_flt` variable) information from the header
           and saves it into `defaultdict`,
    generates `skiprow` list for `pd.read_csv()`,
    breaks after parsing the header, so the data block will NOT be read

    returns: parsed info as defaultdict obj, skiprows list    
    """
    # useful header information that will be saved into defaultdict
    header_flt = r'TIMESTAMP|TIMEZONE|NAME|MODEL|SERIAL|FIRMWARE_VERSION|Trace TIMESTAMP:'

    with open(fn) as f:
        d = defaultdict(list)
        i = 0
        skiprows = []
        for line in f:
            line = line.strip()
            if line.startswith('!'):
                skiprows.append(i)
                # parse: `key` (first RegEx group)
                #    and `value` (second RegEx group)
                m = re.search(r'!\s+(' + header_flt + ')\s+(.*)\s*', line)
                if m:
                    # print(m.group(1), '->', m.group(2))
                    # save parsed `key` and `value` into defaultdict
                    d[m.group(1)] = m.group(2)
            elif line.startswith('BEGIN'):
                skiprows.append(i)
            else:
                # stop loop if line doesn't start with: '!' or 'BEGIN'
                break
            i += 1
        return d, skiprows

info, skiprows = parse_header(fn)

# parses data block, the header part will be skipped
# NOTE: `comment='E'` - will skip the footer row: "END"
df = pd.read_csv(fn, header=None, usecols=[0,1], names=['freq', 'dBm'],
                 skiprows=skiprows, skipinitialspace=True, comment='E',
                 error_bad_lines=False)

print(df)
print('-' * 60)
pp.pprint(info)

OLD version: - reads the whole file in memory and parses it. It might cause problems with big files because it will allocate memory for the whole file content plus the resulting DF:

from collections import defaultdict
import io
import re
import pandas as pd
import pprint as pp

fn = r'D:\temp\.data\36671176.data'

header_pat = r'(TIMESTAMP|TIMEZONE|NAME|MODEL|SERIAL|FIRMWARE_VERSION)\s+([^\r\n]*?)\s*[\r\n]+'

def parse_file(filename):
    with open(fn) as f:
        txt = f.read()

    m = re.search(r'BEGIN\s*[\r\n]+(.*)[\n\r]+END', txt, flags=re.DOTALL|re.MULTILINE)
    if m:
        data = m.group(1)
        df = pd.read_csv(io.StringIO(data), header=None, usecols=[0,1], names=['freq', 'dBm'])
    else:
        df = pd.DataFrame()

    d = defaultdict(list)
    for m in re.finditer(header_pat, txt, flags=re.S|re.M):
        d[m.group(1)] = m.group(2)

    return df, d

df, info = parse_file(fn)
print(df)
pp.pprint(info)

Output:

         freq        dBm
0  2000000000 -62.689350
1  2040000000 -64.152839
2  2080000000 -63.775190
3  2120000000 -63.663057
4  2160000000 -64.227156
5  2200000000 -63.874805
defaultdict(<class 'list'>,
            {'FIRMWARE_VERSION': 'A.08.05',
             'MODEL': 'N9917A',
             'NAME': 'Keysight Technologies',
             'SERIAL': 'US52240515',
             'TIMESTAMP': 'Friday   15 April 2016 04:50:05',
             'TIMEZONE': '(GMT+08:00) Kuala Lumpur  Singapore'})