DefColin - 9 months ago
C# Question

Remove all duplicate lines from a text document

I'm trying to remove all duplicate rows from a small text document with 300 rows:

string[] lines = File.ReadAllLines("doc1.txt");
File.WriteAllLines("doc1.txt", lines.Distinct().ToArray());

and this way:

List<string> lines = File.ReadAllLines(mypath).ToList();
File.WriteAllLines(mypath, lines.Distinct().ToArray());

but only some rows are removed and duplicates are still there. It seems to fail in two cases: when the duplicates are next to each other, or when they are too far from each other. I'm not sure what I'm doing wrong.

All rows are lowercase without any punctuation, with only one space allowed between words, and with leading and trailing spaces or punctuation trimmed.

So if two identical rows come one after another, this code does not work for me. If a duplicate is separated from its double by a different string, it works. If the duplicate is row 7 and is equal to row 287, it does not work.

Answer Source

Maybe the lines aren't exactly the same; there might be leading or trailing whitespace.

You can remove that whitespace by calling the Trim() method on each line's string:

File.WriteAllLines("doc1.txt", lines.Select(line => line.Trim()).Distinct().ToArray());
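
If trimming alone is not enough, the lines may also differ in their internal whitespace (a tab, or two spaces where one is expected). Below is a minimal sketch that normalizes each line before comparing. The class name LineDeduplicator and the helpers Normalize and DistinctLines are hypothetical names made up for this example, and collapsing every whitespace run into a single space is an assumption based on the "only one space between words" rule in the question:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration; not part of any library.
public static class LineDeduplicator
{
    // Trims leading/trailing whitespace and collapses every internal
    // whitespace run (spaces, tabs) into a single space.
    // Assumption: lines that differ only in whitespace count as duplicates.
    public static string Normalize(string line)
    {
        return Regex.Replace(line.Trim(), @"\s+", " ");
    }

    // Normalizes every line, then removes exact duplicates.
    // The comparison stays case-sensitive, as in the original code.
    public static string[] DistinctLines(IEnumerable<string> lines)
    {
        return lines
            .Select(Normalize)
            .Distinct()
            .ToArray();
    }
}
```

It could be used the same way as the one-liner above: File.WriteAllLines("doc1.txt", LineDeduplicator.DistinctLines(File.ReadAllLines("doc1.txt"))). Note that the file is rewritten with the normalized lines, not the original ones.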