JohnBigs - 19 days ago
Python Question

Regex to fix CSV quotes

I have a simple CSV with quoted values, something like:


"something","something","something","something",...


But sometimes I get a CSV with stray quotes inside the values:


"something","som"ething"","s"omething",...


I want to create a regex that fixes this problem. Does anyone have a suggestion?

It should strip everything from each value that is not a number or text. But when I remove the " characters, I need to make sure I keep the ones that bound each value, because I need those.

So from
"som"ething"","s"ometh8 ing"
I'd expect:
"something","someth8 ing"


I'm using Scala, but a solution in any language would be great!

thanks!!

Answer

A simple solution in Scala:

scala> val input = """"som"ething"","s"ometh8 ing""""
input: String = "som"ething"","s"ometh8 ing"

scala> val values = input.split("\",\"").map(_.filter(c => c.isLetterOrDigit || c.isWhitespace))
values: Array[String] = Array(something, someth8 ing)

scala> val output = values.mkString("\"", "\",\"", "\"")
output: String = "something","someth8 ing"

This assumes you never have "," inside your values. If you do, there is no way to fix your CSV unambiguously anyway.
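To illustrate why that case is ambiguous, here is a small sketch. The object name AmbiguityDemo and the helper clean are made up for this example; clean just repeats the split-and-filter step from above. A line meant to hold two values, where the first value is x","y, looks exactly like a line with three values:

```scala
object AmbiguityDemo {
  // Hypothetical helper: the same split-and-filter step as in the answer above.
  def clean(line: String): Array[String] =
    line.split("\",\"").map(_.filter(c => c.isLetterOrDigit || c.isWhitespace))

  def main(args: Array[String]): Unit = {
    // Suppose this line was meant to hold TWO values: x","y and z.
    // Written out, it is indistinguishable from three values x, y, z.
    val line = "\"x\",\"y\",\"z\""
    val values = clean(line)
    println(values.length) // prints 3, not the intended 2
  }
}
```

Nothing in the text itself says which reading was intended, so no regex or parser can recover it.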

This isn't the most efficient solution in terms of speed or memory, but it's short and simple.
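Since the question asked for a regex, here is a rough sketch of that variant. It is not equivalent to the filter above: it only removes stray " characters and leaves other punctuation alone. It treats a quote as stray when it has a non-comma character on both sides, which assumes there are no spaces around the separating commas and, as before, no "," inside a value. The names FixCsvQuotes and fixLine are made up for this example.

```scala
object FixCsvQuotes {
  // A quote is considered stray when the characters directly before and after it
  // both exist and are not commas. Bounding quotes sit next to a comma or at the
  // start/end of the line, so they never match.
  private val strayQuote = "(?<=[^,])\"(?=[^,])".r

  // Removes stray quotes from one CSV line (assumption: one record per line).
  def fixLine(line: String): String = strayQuote.replaceAllIn(line, "")

  def main(args: Array[String]): Unit = {
    val input = "\"som\"ething\"\",\"s\"ometh8 ing\""
    println(fixLine(input)) // prints "something","someth8 ing"
  }
}
```

The lookbehind and lookahead do not consume any characters, so runs of several stray quotes in a row are all removed in a single pass.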