Zach Johnson - 4 months ago
C# Question

Best way to remove duplicate strings with different spacing?

I have a list of strings that contain some duplicates. They are not EXACT duplicates as some contain spaces in different locations. Example of a list:

best shoes for flat feet
bestshoes for flat feet
best shoesfor flatfeet
best shoes for flatfeet

Now what I would like to do is remove all these duplicate strings, keeping only the one with the MOST spaces (we will assume this is the correct spacing).

Can anyone recommend a way to accomplish this?

  • Start by constructing a "canonical" version of each string by removing all whitespace (for example, with Regex.Replace)
  • Use the canonical version as a key to group your strings
  • Pick the longest string in each group; strings in the same group differ only in whitespace, so the longest one has the most spaces

You can do it with LINQ's GroupBy:

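// Requires: using System.Linq; using System.Text.RegularExpressions;
var res = orig
    .GroupBy(s => Regex.Replace(s, @"\s+", ""))               // key: string with all whitespace removed
    .Select(g => g.OrderByDescending(s => s.Length).First()); // longest in group = most spaces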
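For anyone who wants to try this end to end, here is a minimal sketch of the same GroupBy approach as a small console program. The class name SpacingDeduper, the method name KeepMostSpaced, and the extra "running shoes" entries are made up for illustration and are not part of the question; the first four strings are the ones from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SpacingDeduper
{
    // Hypothetical helper name, chosen for this sketch only.
    public static List<string> KeepMostSpaced(IEnumerable<string> source)
    {
        return source
            // Group strings that are identical once all whitespace is removed.
            .GroupBy(s => Regex.Replace(s, @"\s+", ""))
            // Within a group the strings differ only in whitespace,
            // so the longest one is the one with the most whitespace.
            .Select(g => g.OrderByDescending(s => s.Length).First())
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        var orig = new List<string>
        {
            "best shoes for flat feet",
            "bestshoes for flat feet",
            "best shoesfor flatfeet",
            "best shoes for flatfeet",
            "runningshoes",   // illustrative extra group, not from the question
            "running shoes",
        };

        foreach (var s in SpacingDeduper.KeepMostSpaced(orig))
            Console.WriteLine(s);
        // Prints:
        // best shoes for flat feet
        // running shoes
    }
}
```

Two things to be aware of: GroupBy returns groups in the order their keys first appear, so the output order follows the input, and OrderByDescending is a stable sort, so if two strings in a group have the same length the one that appeared first is kept. Also note that "longest" counts every whitespace character, so a variant with a doubled space would win over the correctly spaced one.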