Acavier - 2 months ago

VB.NET Find Duplicate Lines in a Text File

I work for a company that deals with various import files of all different sizes. I would like to develop a pre-check on these files to find and identify any duplicate lines (where the entire line matches another line in the file). I have written code for this already, but as the line count of the file gets above 100,000, the code starts to really slow down. How can I make this code run faster and keep the code simple?

Using sr As New StreamReader(txtFile.Text)
    While Not sr.EndOfStream
        i += 1
        ' Save the header of the file if requested
        If chkKeepHeader.Checked AndAlso i = 1 Then
            sHLine = sr.ReadLine()
        End If
        sLine = sr.ReadLine()

        ' Compare the current line with the previous lines read
        If lstDistLines.Contains(sLine) Then
            iDupCount += 1
            lstDupLines.Add(i & "," & sLine)
        Else
            ' Remember lines seen for the first time
            lstDistLines.Add(sLine)
        End If

        ' Update the display at regular intervals
        If i Mod 50 = 0 Then
            lblProcessCount.Text = i.ToString()
        End If
    End While
End Using


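Updating lblProcessCount and calling Application.DoEvents() take a lot of the time, so drop the progress display if you can. If you insist on keeping track of the progress, you can still speed things up by using a HashSet(Of String) instead of lstDistLines to store the lines. lstDistLines.Contains has to scan the whole list for every line read, so the total work grows roughly with the square of the line count. A HashSet does not allow duplicates, and checking whether it contains an item takes almost the same time no matter how many items you add to it.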
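A minimal sketch of the HashSet approach, assuming the file fits in memory and that lines are compared exactly (case-sensitive). The module name DuplicateLineCheck, the function FindDuplicateLines, and its path and keepHeader parameters are hypothetical names chosen for illustration; they are not part of the original code.

```vbnet
Imports System.Collections.Generic
Imports System.IO

Module DuplicateLineCheck
    ' Hypothetical helper: returns one "lineNumber,line" entry for every
    ' line that exactly matches an earlier line in the file.
    Function FindDuplicateLines(path As String, keepHeader As Boolean) As List(Of String)
        Dim seen As New HashSet(Of String)()
        Dim duplicates As New List(Of String)()
        Dim lineNumber As Integer = 0

        ' File.ReadLines streams the file one line at a time
        For Each line As String In File.ReadLines(path)
            lineNumber += 1

            ' Skip the header so it is never compared with data lines
            If keepHeader AndAlso lineNumber = 1 Then Continue For

            ' HashSet.Add returns False when the value is already present,
            ' so a single call both tests for and records the line.
            If Not seen.Add(line) Then
                duplicates.Add(lineNumber.ToString() & "," & line)
            End If
        Next

        Return duplicates
    End Function
End Module
```

Calling FindDuplicateLines(txtFile.Text, chkKeepHeader.Checked) would give the same kind of list as lstDupLines, and its Count would play the role of iDupCount. If the progress label is still wanted, it could be updated inside the loop at a much larger interval than every 50 lines.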
