Majgel - 19 days ago 4x
Vb.net Question

# N-gram function in vb.net -> create grams for words instead of characters

I recently found out about n-grams and the cool possibility to compare frequency of phrases in a text body with it. Now I'm trying to make an vb.net app that simply gets an text body and returns a list of the most frequently used phrases (where n >= 2).

I found an C# example of how to generate a n-gram from a text body so I started out with converting the code to VB. The problem is that this code does create one gram per character instead of one per word. The delimiters I want to use for the words is: VbCrLf (new line), vbTab (tabs) and the following characters: !@#\$%^&*()_+-={}|\:\"'?¿/.,<>’¡º×÷‘;«»[]

Does anyone have an idea how I can rewrite the following function for this purpose:

``````   Friend Shared Function GenerateNGrams(ByVal text As String, ByVal gramLength As Integer) As String()
If text Is Nothing OrElse text.Length = 0 Then
Return Nothing
End If

Dim grams As New ArrayList()
Dim length As Integer = text.Length
If length < gramLength Then
Dim gram As String
For i As Integer = 1 To length
gram = text.Substring(0, (i) - (0))
If grams.IndexOf(gram) = -1 Then
End If
Next

gram = text.Substring(length - 1, (length) - (length - 1))
If grams.IndexOf(gram) = -1 Then

End If
Else
For i As Integer = 1 To gramLength - 1
Dim gram As String = text.Substring(0, (i) - (0))
If grams.IndexOf(gram) = -1 Then

End If
Next

For i As Integer = 0 To (length - gramLength)
Dim gram As String = text.Substring(i, (i + gramLength) - (i))
If grams.IndexOf(gram) = -1 Then
End If
Next

For i As Integer = (length - gramLength) + 1 To length - 1
Dim gram As String = text.Substring(i, (length) - (i))
If grams.IndexOf(gram) = -1 Then
End If
Next
End If
End Function
``````

An n-gram for words is just a list of length n that stores these words. A list of n-grams is then simply a list of list of words. If you want to store frequency then you need a dictionary that is indexed by these n-grams. For your special case of 2-grams, you can imagine something like this:

``````Dim frequencies As New Dictionary(Of String(), Integer)(New ArrayComparer(Of String)())
Const separators as String = "!@#\$%^&*()_+-={}|\:""'?¿/.,<>’¡º×÷‘;«»[] " & _
ControlChars.CrLf & ControlChars.Tab
Dim words = text.Split(separators.ToCharArray(), StringSplitOptions.RemoveEmptyEntries)

For i As Integer = 0 To words.Length - 2
Dim ngram = New String() { words(i), words(i + 1) }
Dim oldValue As Integer = 0
frequencies.TryGetValue(ngram, oldValue)
frequencies(ngram) = oldValue + 1
Next
``````

`frequencies` should now contain a dictionary with all two consecutive word pairs contained in the text, and the frequency with which they appear (as a consecutive pair).

This code requires the `ArrayComparer` class:

``````Public Class ArrayComparer(Of T)
Implements IEqualityComparer(Of T())

Private ReadOnly comparer As IEqualityComparer(Of T)

Public Sub New()
Me.New(EqualityComparer(Of T).Default)
End Sub

Public Sub New(ByVal comparer As IEqualityComparer(Of T))
Me.comparer = comparer
End Sub

Public Overloads Function Equals(ByVal a As T(), ByVal b As T()) As Boolean _
Implements IEqualityComparer(Of T()).Equals
System.Diagnostics.Debug.Assert(a.Length = b.Length)
For i As Integer = 0 to a.Length - 1
If Not comparer.Equals(a(i), b(i)) Then Return False
Next

Return True
End Function

Public Overloads Function GetHashCode(ByVal arr As T()) As Integer _
Implements IEqualityComparer(Of T()).GetHashCode
Dim hashCode As Integer = 17
For Each obj As T In arr
hashCode = ((hashCode << 5) - 1) Xor comparer.GetHashCode(obj)
Next

Return hashCode
End Function
End Class
``````

Unfortunately, this code doesn’t compile on Mono because the VB compiler has problems finding the generic `EqualityComparer` class. I’m therefore unable to test whether the `GetHashCode` implementationw works as expected but it should be fine.