Let's say I have a set of keywords in an array {"olympics", "sports tennis best", "tennis", "tennis rules"}
I then have a large list (up to 50 at a time) of strings (or actually tweets), so they are a max of 140 characters.
I want to look at each string and see what keywords are present there. In the case where a keyword is composed of multiple words like "sports tennis best", the words don't have to be together in the string, but all of them have to show up.
I've having trouble figuring out an algorithm that does this efficiently.
Do you guys have suggestions on a way to do this? Thanks!
Edit: To explain a bit better each keyword has an id associated with it, so {1:"olympics", 2:"sports tennis best", 3:"tennis", 4:"tennis rules"}
I want to go through the list of strings/tweets and see which group of keywords match. The output should be, this tweet belongs with keyword #4. (multiple matches may be made, so anything that matches keyword 2, would also match 3 -since they both contain tennis).
When there are multiple words in the keyword, e.g. "sports tennis best" they don't have to appear together but have to all appear. e.g. this will correctly match: "i just played tennis, i开发者_运维技巧 love sports, its the best"... since this string contains "sports tennis best" it will match and be associated with the keywordID (which is 2 for this example).
Edit 2: Case insensitive.
IEnumerable<string> tweets, keywords;
var x = tweets.Select(t => new
{
Tweet = t,
Keywords = keywords.Where(k => k.Split(' ')
.All(t.Contains))
.ToArray()
});
Multiple patterns can be searched very efficiently using several algorithms such as the algorithm of Aho-Corasick (using a trie) or the one from Wu and Manber.
If performance is critical, I suggest taking either of those. To search in multiple strings, it may be most efficient to concatenate all your 50 strings into one larger string, keeping book of the starting positions of individual strings.
Maybe something like this?
string[] keywords = new string[] {"olympics", "sports tennis best", "tennis", "tennis rules"};
string testString = "I like sports and the olympics and think tennis is best.";
string[] usedKeywords = keywords.Where(keyword => keyword.Split(' ').All(s => testString.Contains(s))).ToArray();
Whoops.
foreach (var s in strings)
{
foreach (var keywordList in keywordSet)
{
if (s.ContainsAll(keywordList))
{
// hit!
}
}
}
...
private bool ContainsAll(this string s, string keywordList)
{
foreach (var singleWord in keywordList.Split(' '))
{
if (!s.Contains(singleWord)) return false;
}
return true;
}
There are ways to pre-process strings to make searches more effective, but I think that the overhead is more than the gain for such short strings. It's not that much data, so I would just loop through the strings:
foreach (string tweet in tweets) {
foreach (string keywords in theArray) {[
string[] keyword = keywords.Split(' ');
bool found = true;
foreach (string word in keyword) {
if (tweet.indexOf(word) == -1) {
found = false;
break;
}
}
if (found) {
// all words exist in the tweet
}
}
}
I would suggest putting all of your keywords into a list of strings then going over your list of data (tweets, whatever) as another list of strings.
Do something like this:
Dim matchingStrings As Dictonary(String, String);
For Each stringToSearch As String In tweetList
For Each keyword As String In keywordList
If stringToSearch.Contains(keyword)
matchingString.Add(stringToSearch, keyword);
break; End IF End For End For
Then MatchingString Contains all of your matches
EDIT: In C# And per your multiple words in a keyword list
Dictionary<string, string> matchingString = New Dictionary<string, string>;
foreach (String stringToSearch In tweetList){
foreach (String keyword In keywordList){
If(stringToSearch.Contains(keyword){
matchingString.Add(stringToSearch, keyword);
break;
}
else if{
List<string> split = keyword.Split(" ")
foreach(String sKeyword In split){
If(stringToSearch.Contains(keyword){
matchingString.Add(stringToSearch, keyword);
break;
}
}
}
} }
精彩评论