Will Will - 2 months ago 9
Python Question

Pandas missing rows when importing using delimiter ="|"

I have a dataset with 51,347 rows. When import the data using pandas and set the delimiter to "|" , I lose 394 rows.

import pandas as pd
df = pd.read_csv("Basin11.txt", sep='|', error_bad_lines=False,
dtype={'Start Date': str, 'Greater Than/Less Than': str,
'Parameter Code': float, 'Start Time': str, 'Start Depth': float, 'Composite Category': str,
'Composite Type': str})
print(len(df.index))


If I remove the sep variable, the data won't load as multiple columns but will load the proper number of rows. It only seems to be an issue for this file.
Basin11.txt File

Does anyone know why I'm losing data?

Answer

I started going through your input file and found a number of errors that may be leading to the "missing rows".

The comments line 3491 and 9805 have an opening " but is missing the closing ". This would cause matching issues, including the following rows as part of the comment body. As I started fixing those, the line counts started going up. There are probably more cases of this.

Also, some lines have double-double quotes ("") for opening and closing comments. For example:

""green, med tide, 10-15 mph winds""

Edit: I added the following code:

for comment in df['Comments'].values:
    print(comment)

Then ran python3 sample.py | grep '|' | wc -l, to find the number of comments that contained |, and got 394 (The number of rows that you are missing)

Comments