N-gram function in vb.net -> create grams for words instead of characters_问答_开发者

I recently found out about n-grams and the cool possibility to compare frequency of phrases in a text body with it. Now I'm trying to make an vb.net app that simply gets an text body and returns a list of the most frequently used phrases (where n >= 2).

I found an C# example of how to generate a n-gram from a text body so I started out with converting the code to VB. The problem is that this code does create one gram per cha开发者_运维百科racter instead of one per word. The delimiters I want to use for the words is: VbCrLf (new line), vbTab (tabs) and the following characters: !@#$%^&*()_+-={}|\:\"'?¿/.,<>’¡º×÷‘;«»[]

Does anyone have an idea how I can rewrite the following function for this purpose:

   Friend Shared Function GenerateNGrams(ByVal text As String, ByVal gramLength As Integer) As String()
    If text Is Nothing OrElse text.Length = 0 Then
        Return Nothing
    End If

    Dim grams As New ArrayList()
    Dim length As Integer = text.Length
    If length < gramLength Then
        Dim gram As String
        For i As Integer = 1 To length
            gram = text.Substring(0, (i) - (0))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next

        gram = text.Substring(length - 1, (length) - (length - 1))
        If grams.IndexOf(gram) = -1 Then
            grams.Add(gram)

        End If
    Else
        For i As Integer = 1 To gramLength - 1
            Dim gram As String = text.Substring(0, (i) - (0))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)

            End If
        Next

        For i As Integer = 0 To (length - gramLength)
            Dim gram As String = text.Substring(i, (i + gramLength) - (i))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next

        For i As Integer = (length - gramLength) + 1 To length - 1
            Dim gram As String = text.Substring(i, (length) - (i))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next
    End If
    Return Tokeniser.ArrayListToArray(grams)
End Function

An n-gram for words is just a list of length n that stores these words. A list of n-grams is then simply a list of list of words. If you want to store frequency then you need a dictionary that is indexed by these n-grams. For your special case of 2-grams, you can imagine something like this:

Dim frequencies As New Dictionary(Of String(), Integer)(New ArrayComparer(Of String)())
Const separators as String = "!@#$%^&*()_+-={}|\:""'?¿/.,<>’¡º×÷‘;«»[] " & _
                             ControlChars.CrLf & ControlChars.Tab
Dim words = text.Split(separators.ToCharArray(), StringSplitOptions.RemoveEmptyEntries)

For i As Integer = 0 To words.Length - 2
    Dim ngram = New String() { words(i), words(i + 1) }
    Dim oldValue As Integer = 0
    frequencies.TryGetValue(ngram, oldValue)
    frequencies(ngram) = oldValue + 1
Next

frequencies should now contain a dictionary with all two consecutive word pairs contained in the text, and the frequency with which they appear (as a consecutive pair).

This code requires the ArrayComparer class:

Public Class ArrayComparer(Of T)
    Implements IEqualityComparer(Of T())

    Private ReadOnly comparer As IEqualityComparer(Of T)

    Public Sub New()
        Me.New(EqualityComparer(Of T).Default)
    End Sub

    Public Sub New(ByVal comparer As IEqualityComparer(Of T))
        Me.comparer = comparer
    End Sub

    Public Overloads Function Equals(ByVal a As T(), ByVal b As T()) As Boolean _
            Implements IEqualityComparer(Of T()).Equals
        System.Diagnostics.Debug.Assert(a.Length = b.Length)
        For i As Integer = 0 to a.Length - 1
            If Not comparer.Equals(a(i), b(i)) Then Return False
        Next

        Return True
    End Function

    Public Overloads Function GetHashCode(ByVal arr As T()) As Integer _
            Implements IEqualityComparer(Of T()).GetHashCode
        Dim hashCode As Integer = 17
        For Each obj As T In arr
            hashCode = ((hashCode << 5) - 1) Xor comparer.GetHashCode(obj)
        Next

        Return hashCode
    End Function
End Class

Unfortunately, this code doesn’t compile on Mono because the VB compiler has problems finding the generic EqualityComparer class. I’m therefore unable to test whether the GetHashCode implementationw works as expected but it should be fine.

Thank you Konrad very much for this beginning of an solution!

I tried your code and got the following result:

Text = "Hello I am a test Also I am a test"
(I also included whitespace as a separator)

frequencies now has 9 items:
---------------------
Keys: "Hello", "I"
Value: 1
---------------------
Keys: "I", "am"
Value: 1
---------------------
Keys: "am", "a"
Value: 1
---------------------
Keys: "a", "test"
Value: 1
---------------------
Keys: "test", "Also"
Value: 1
---------------------
Keys: "Also", "I"
Value: 1
---------------------
Keys: "I", "am"
Value: 1
---------------------
Keys: "am", "a"
Value: 1
---------------------
Keys: "a", "test"
Value: 1
---------------------

My first question: shouldn't the 3 last key pairs get a value of 2 as they're found two times in the text?

Second: The reason I got into the n-gram approach is that I don't want to limit the word count (n) to a specific length. Is there a way to make a dynamic approach that tries to find the longest phrase match first and then step down to the last wordcount of 2?

My goal result for the sample query above is:

---------------------
Match: "I am a test"
Frequency: 2
---------------------
Match: "I am a"
Frequency: 2
---------------------
Match: "am a test"
Frequency: 2
---------------------
Match: "I am"
Frequency: 2
---------------------
Match: "am a"
Frequency: 2
---------------------
Match: "a test"
Frequency: 2
---------------------

There is an similar C++ approach for this written by Hatem Mostafa over at codeproject.com: N-gram and Fast Pattern Extraction Algorithm

Sadly I'm no C++ expert and have no idea how to convert this bit of code as it includes a lot of memory handling that .Net doesn't. The only problem with this example is that you have to specify the minimum word pattern length and I want it to be dynamic from 2 to max found.