user441521 user441521 - 9 months ago 94
C# Question

CsvHelper and querying a large csv file in parallel

I have a 3.6 gig csv file. I'm using CsvHelper to work with it. When I use linq to query it, it takes a couple mins and I see the CPU only max out at about 25% on my PC. Linq seems to handle memory well when doing this as that doesn't go up much at all.

So I figured by adding .AsParallel() I should see some performance gains. When I run it with that I see my CPU go up to about 95% but it takes just as long.

Why would I not see performance gains with .AsParallel(), and is there any way to get better performance with this (leaving it as a csv).

string path = @"C:\my_3_gig_file.csv";

using (var csv = new CsvHelper.CsvReader(new StreamReader(path, Encoding.Default)))
csv.Configuration.Delimiter = ",";
csv.Configuration.TrimFields = true;
var records = csv.GetRecords<MyClass>();

var q1 = (from a in records.AsParallel()
where a.MY_HOUR == "1"
&& a.MY_DATE == "14-JUN-13"
select a).ToList();

Answer Source

You will not gain anything by trying to process the rows in parallel so there's no point in trying. You can only read and process the file using CsvHelper linearly. And even then, there's not enough work there to justify the parallel processing of the rows. What's probably hurting you is building up every single record that is read. If there's a lot of columns, that's a lot of processing per row.

You're trying to filter out rows to use from the file. You should read the rows in directly, check the columns, then build the records as you go. That way, you're not wasting a lot of time building out records for each row, only to potentially throw it away.

Here's one such way you could do this.

List<T> GetCsvRecordsFiltered<T>(string path, Func<CsvReader, bool> filter, Action<CsvConfiguration> configure = null) where T : class
    using (var file = File.OpenText(path))
    using (var reader = new CsvReader(file))
        var result = new List<T>();
        while (reader.Read())
            if (filter(reader))
        return result;

Then when you read the file, you'd do this:

var q1 = GetCsvRecordsFiltered<MyClass>(path,
    reader => reader["MY_HOUR"] == "1" && reader["MY_DATE"] == "14-JUN-13",
    config => config.TrimFields = true