开发者

Parse Phrases into different word pairings

开发者 https://www.devze.com 2023-01-22 21:07 出处:网络
I am trying to figure out what the best method would be for me to use to parse word phrases passed to me and build different groupings based on those phrases.

I am trying to figure out what the best method would be for me to use to parse word phrases passed to me and build different groupings based on those phrases.

Example XML:

<root>
   <keyword value=""My First Phrase""/>
   <keyword value=""My First Phrase Again""/>
   <keyword value=""My First Phrase Again and Again""/>
</root>

So I would extract these out of the xml:

My First Phrase
My First Phrase Again
My First Phrase Again and Again

I would then like to build these new phrases from the original:

My First Phrase   
My First
First Phrase
My
First
Phrase

My First Phrase Again
My First Phrase
First Phrase Again
My First
First Phrase
Phrase Again
My
First
Phrase
Again

This would let me break down the phrases and build a sort of ranking out of those开发者_StackOverflow社区 words. I am have built some lists and iterated over them, but it isn't work the way I would expect.

So for the ranking I mean this:

My First Phrase Again    Rank: 1 (Exact Match)
My First Phrase          Rank: 2
First Phrase Again       Rank: 2
My First                 Rank: 3
First Phrase             Rank: 3
Phrase Again             Rank: 3
My                       Rank: 4
First                    Rank: 4
Phrase                   Rank: 4
Again                    Rank: 4

Not sure what the best approach would be to parse this data.

Thanks,

S


It sounds like you're looking at developing a grammar. Your rankings look like they're the same as the depth of their tokens in a parse tree. Your terminal symbols would be any word, and your start symbols would be the sentences listed in your root element.

For instance:

S -> X Y
X -> M F
Y -> P A
M -> "My"
F -> "First"
P -> "Phrase"
A -> "Again"

In this instance, the depth of "My First Phrase Again" would be 0 in the parse tree, the depth of "My First" and "Phrase Again" would be 1, and the depth of "My", "First", "Phrase", and "Again" would be 2.

I would start looking around for grammar parsers. There are a lot of these available since they're used in writing compilers. Alternatively you could try and write your own. Context-free grammars are fairly simple to implement; All you really need is a stack and a way to interpret and operate on your grammar rules. There is a lot of literature on this since it is a well-studied area of computer science.


You need a suffix array, but rather than separating by character, separate by the " " token. http://en.wikipedia.org/wiki/Suffix_array

There is a good description of this in Programming Pearls.


If i understand correctly your definition of 'rank', you can solve it with something like this:

public class PhraseRanking : IEnumerable<KeyValuePair<string, int>>
{
    private readonly Dictionary<string, int> _ranking;

    public PhraseRanking()
    {
        _ranking = new Dictionary<string, int>();
    }

    public PhraseRanking(string phrase)
        : this()
    {
        var words = phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var sb = new StringBuilder(phrase.Length);
        for(int i = words.Length; i > 0; --i)
        {
            int rank = words.Length - i + 1;
            int lastFirstWordIndex = words.Length - i;
            for(int j = 0; j <= lastFirstWordIndex; ++j)
            {
                sb.Clear();
                int lastWordIndex = j + i - 1;
                for(int k = j; k <= lastWordIndex; ++k)
                {
                    sb.Append(words[k]);
                    if(k != lastWordIndex) sb.Append(' ');
                }
                _ranking[sb.ToString()] = rank;
            }
        }
    }

    public int this[string phrase]
    {
        get { return _ranking[phrase]; }
    }

    public int Count
    {
        get { return _ranking.Count; }
    }

    public IEnumerator<KeyValuePair<string, int>> GetEnumerator()
    {
        return _ranking.GetEnumerator();
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return _ranking.GetEnumerator();
    }
}

Usage:

var ranking = new PhraseRanking("My First Phrase Again");
var sb = new StringBuilder();
foreach(var rank in ranking)
{
    sb.AppendLine(rank.Value.ToString() + ": " + rank.Key);
}
MessageBox.Show(sb.ToString());

Output:

1: My First Phrase Again
2: My First Phrase
2: First Phrase Again
3: My First
3: First Phrase
3: Phrase Again
4: My
4: First
4: Phrase
4: Again
0

精彩评论

暂无评论...
验证码 换一张
取 消