GoodJuJu GoodJuJu - 1 year ago 52 Question

Fastest way to compare a large amount of text against 125,000 database records

The aim of my application is to extract text from documents and search for specific entries matching records in a database.

  1. My application extracts text from documents and populates a textbox
    with the extracted text.

  2. Each document can have anywhere from 200 to 600,000 words
    (including a large amount of normal plain text).

  3. Extracted text is compared against database entries for specific
    values and matches are pushed into an array.

  4. My Database contains approximately 125,000 records

My code below loops through the database records, comparing against the extracted text. If a match is found in the text it is inserted into an array which I use later.

txtBoxExtraction.Text = "A whole load of text goes in here, " & _
"including the database entries I am trying to match," & _
"i.e. AX55F8000AFXZ and PP-Q4681TX/AA up to 600,000 words"

Dim dv As New DataView(_DBASE_ConnectionDataSet.Tables(0))
dv.Sort = "UNIQUEID"

'There are 125,000 entries here in my sorted DataView dv e.g.

For i = 0 to maxFileCount

Dim path As String = Filename(i)

If File.Exists(path) Then
Using sr As New StreamReader(path)
txtBoxExtraction.Text = sr.ReadToEnd()
End using
Catch e As Exception
Console.WriteLine("The process failed: {0}", e.ToString())
End Try
end if

For dvRow As Integer = 0 To dv.Table.Rows.Count - 1
strUniqueID = dv.Table.Rows(dvRow)("UNIQUEID").ToString()
If txtBoxExtraction.Text.ToLower().Contains(strUniqueID.ToLower) Then
' Add UniqueID to array and do some other stuff..
End if
next dvRow

next i

Whilst the code works, I am looking for a faster way of performing the database matching (the 'For dvRow' Loop).

If a document is small with around 200 words, the 'For dvRow..' Loop completes quickly, within a few seconds.

Where the document contains a large amount of text... 600,000 words and upwards, it can take several hours or longer to complete.

I came across a couple of posts that are similar, but not close enough to my issue to implement any of the recommendations.

High performance "contains" search in list of strings in C#

Any help is most gratefully appreciated.

Answer Source

This is an example of the comment a wrote.

If that's the actual code, I don't understand why you need to put the information in a textbox. You could save a bit of speed by not displaying the text on screen. If you have 125000 UNIQUEIDs, then it might be better to pull the id from your file and then search from that list. Instead of searching the whole text every time. Even just splitting your text by space and filtering by the "words" that are between a specific size could make it go faster.

Since it seems you want to do a word check and not a per-character check. And that you only want to check for those ids and not each word. You should pull up the ids from each text before doing any search. This will reduce the searching that need to be done by a lot. This list of id could also be saved if the text never changes.

Module Module1

    Private UNIQUEID_MIN_SIZE As Integer = 8
    Private UNIQUEID_MAX_SIZE As Integer = 12

    Sub Main()

        Dim text As String
        Dim startTime As DateTime
        Dim uniqueIds As List(Of String)

        text = GetText()
        uniqueIds = GetUniqueIds()

        '--- Very slow

        startTime = DateTime.Now

        ' Search
        For Each uniqueId As String In uniqueIds

        Console.WriteLine("Took {0}s", DateTime.Now.Subtract(startTime).TotalSeconds)

        '--- Very fast

        startTime = DateTime.Now

        ' Split the text by words
        Dim words As List(Of String) = text.Split(" ").ToList()

        ' Get all the unique key, assuming keys are between a specific size
        Dim uniqueIdInText As New Dictionary(Of String, String)

        For Each word As String In words
            If word.Length < UNIQUEID_MIN_SIZE Or word.Length > UNIQUEID_MAX_SIZE Then
                If Not uniqueIdInText.ContainsKey(word) Then
                    uniqueIdInText.Add(word, "")
                End If
            End If

        ' Search
        For Each uniqueId As String In uniqueIds

        Console.WriteLine("Took {0}s", DateTime.Now.Subtract(startTime).TotalSeconds)


    End Sub

    ' This only randomly generate words for testing
    ' You can ignore
    Function GetRandomWord(ByVal len As Integer) As String

        Dim builder As New System.Text.StringBuilder
        Dim alphabet As String = "abcdefghijklmnopqrstuvwxyz"
        Dim rnd As New Random()

        For i As Integer = 0 To len - 1
            builder.Append(alphabet.Substring(rnd.Next(0, alphabet.Length - 1), 1))

        Return builder.ToString()
    End Function

    Function GetText() As String

        Dim builder As New System.Text.StringBuilder
        Dim rnd As New Random()

        For i As Integer = 0 To 600000
            builder.Append(GetRandomWord(rnd.Next(1, 15)))
            builder.Append(" ")

        Return builder.ToString()
    End Function

    Function GetUniqueIds() As List(Of String)

        Dim wordCount As Integer = 600000
        Dim ids As New List(Of String)
        Dim rnd As New Random()

        For i As Integer = 0 To 125000
            ids.Add(GetRandomWord(rnd.Next(UNIQUEID_MIN_SIZE, UNIQUEID_MAX_SIZE)))

        Return ids
    End Function

End Module