Nicholas Ibarra Nicholas Ibarra - 6 months ago 31
PowerShell Question

Quickly Split CSV line that is Comma Delimited with quotes around Strings in Powerhell

I have large CSV files that 0.5-2gb+ files I am trying to import with Powershell.

Data looks like so:

Name, Date, Value

"Joe, John", 2016-08-01, "value"

"Smith, Jane", 2016-08-01, "value"

I have this function

$elapsed = [System.Diagnostics.Stopwatch]::StartNew()
$reader = new-object System.IO.StreamReader($csv)

while (($line = $reader.ReadLine()) -ne $null) {

# Use RegEx to only split on (,) outside quotes and remove quoted strings
$row = ($line -split ',(?=(?:[^"]|"[^"]*")*$)').Replace("`"","")

# Row Indicator

if (($i % 50000) -eq 0) {
Write-Host "$i rows have been processed in $($elapsed.Elapsed.ToString())."

Splitting the line by a comma "," works perfect as I get ~16K a second, but I need to only split outside of any quotes, so I implemented the regular expression, however the performance tanks to 900 rows a second.

I am looking for a more efficient way to loop through a CSV file that is comma delimited but has commas in the quotes that need to be excluded.


Import-Csv, as noted in the comments above, does not load everything into memory unless you ask it to. Like the example in the question it implements a stream reader and pushes the content it's read off to the output pipeline.

You will see significant memory usage if you do something like this:

$var = Import-Csv thefile.csv

After all, the content of the CSV has to go somewhere.

Whereas if you do something with the output pipeline there's less impact. e.g.

Import-Csv thefile.csv | ForEach-Object {

Finally, Import-Csv really doesn't work for you I have a CSV reader class along with a side-by-side implementation of Import-Csv called Indented.Text.Csv on github. This implementation provides a public class with a number of features I needed so I could process CSV files very quickly.