s.matthew.english - 7 months ago
Python Question

fast, accurate, reliable way to remove undesirable values from a csv file

I have a large-ish csv file with a substantial amount of dirty data in it, and I'd like to clean it up a bit by eliminating all the values that are not absolutely necessary.

Here is the file I'm talking about.

It has the columns:

Website, Title, Start Date, Employer, Location, lat, lon, Country, Skills11, Jobs

but I would like to obliterate all but:

Employer, Location, Country, Jobs


Is there a particular tool that is uniquely suited for this task?

Or maybe someone has a handy Python script that can get the job done?

Answer

You can easily do it with Python by writing to a temporary file and then replacing the original.

import csv
from operator import itemgetter
from tempfile import NamedTemporaryFile
from shutil import move

# Open the temp file in text mode with newline="" so csv.writer works correctly,
# and with delete=False so it still exists for the move after the with block.
with open("edsa_data.csv", newline="") as f, \
        NamedTemporaryFile("w", dir=".", delete=False, newline="") as tmp:
    next(f)  # skip the original header row
    tmp.write("Employer,Location,Country,Jobs\n")
    # Employer=3, Location=4, Country=7, Jobs=9
    csv.writer(tmp).writerows(map(itemgetter(3, 4, 7, 9), csv.reader(f)))

move(tmp.name, "edsa_data.csv")
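
As a variation on the same temp-file-and-replace idea, you could also pick the columns by header name rather than by index, using csv.DictReader and csv.DictWriter. This is only a sketch and assumes the header row contains exactly the names listed in the question:

import csv
from tempfile import NamedTemporaryFile
from shutil import move

wanted = ["Employer", "Location", "Country", "Jobs"]

with open("edsa_data.csv", newline="") as f, \
        NamedTemporaryFile("w", dir=".", delete=False, newline="") as tmp:
    # extrasaction="ignore" silently drops every column not listed in wanted
    writer = csv.DictWriter(tmp, fieldnames=wanted, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(csv.DictReader(f))

move(tmp.name, "edsa_data.csv")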

For a more generic approach:

import csv
from operator import itemgetter
from tempfile import NamedTemporaryFile
from shutil import move


def keep_columns(csv_f, keep_cols, **kwargs):
    with open(csv_f, newline="") as f, \
            NamedTemporaryFile("w", dir=".", delete=False, newline="") as tmp:
        reader = csv.reader(f, **kwargs)
        wr = csv.writer(tmp, **kwargs)
        # Write the reduced header, then the selected columns of every row.
        wr.writerow(itemgetter(*keep_cols)(next(reader)))
        wr.writerows(itemgetter(*keep_cols)(row) for row in reader)
    # Replace the original only after the temp file has been closed and flushed.
    move(tmp.name, csv_f)


keep_columns("edsa_data.csv", (3, 4, 7, 9))

For the kwargs you can pass delimiter=",", skipinitialspace=True, etc. (the csv module uses delimiter rather than sep).
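
For example, a hypothetical semicolon-separated variant of the file with spaces after the separators could be handled by forwarding those format parameters:

keep_columns("edsa_data.csv", (3, 4, 7, 9), delimiter=";", skipinitialspace=True)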
