RegEx help to remove noise words or stop words from a string

Developer https://www.devze.com 2023-03-22 11:56 Source: Internet
I want to remove all noise tags from a string of input tags. The tags are separated by commas. If a noise word is only part of a longer tag, that tag should remain.

This is what I have, but it is not working:

string input_string = "This,sure,about,all of our, all, values";
string stopWords = "this|is|about|after|all|also";
stopWords = string.Format(@"\s?\b(?:{0})\b\s?", stopWords);
string tags = Regex.Replace(input_string, stopWords, "", RegexOptions.IgnoreCase); 

This is what I want from above input: ",sure,,all of our,,values"

The words "This", "about", and "all" will be replaced with "" since they are noise words. But "all of our" will remain even though it contains the noise word "all", because the comma is the tag boundary.

Can anyone give me a helping hand?

I had an alternative solution that puts the noise words into a dictionary and then searches for each word in the input string, but I prefer a RegEx approach.


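        var input = "This,sure,about,all of our, all, values";
        // IgnoreCase so that "This" is matched by the lowercase "this".
        var stopWords = new Regex("^(this|is|about|after|all|also)$", RegexOptions.IgnoreCase);
        // Keep only the tags that are not exactly a stop word.
        var result = String.Join(",", input.Split(',').
            Where(x => !stopWords.IsMatch(x.Trim())));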
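
The Split-based snippet above drops noise tags entirely. If the empty slots should be kept, as in the output the question asks for, a minimal sketch might look like the following. The class name TagSlots and the method BlankStopTags are hypothetical names chosen for illustration, and the sketch assumes every tag should be trimmed before comparison and output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagSlots
{
    // Hypothetical helper: replaces each tag that is exactly a stop word
    // with an empty string, so the number of comma-separated slots is unchanged.
    public static string BlankStopTags(string input, IEnumerable<string> stopWords)
    {
        // Case-insensitive set, so "This" is treated the same as "this".
        var stopSet = new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);

        var slots = input.Split(',')
            .Select(tag => tag.Trim())                         // drop whitespace around each tag
            .Select(tag => stopSet.Contains(tag) ? "" : tag);  // blank out whole-tag matches only

        return string.Join(",", slots);
    }
}
```

Called with the question's input and stop words, this should return ",sure,,all of our,,values". Because the comparison is against the whole trimmed tag, "all of our" is left alone.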


Try

stopWords = string.Format(@"(?<=^|,)\s*(?:{0})\s*(?=$|,)", stopWords);

This uses a lookbehind (?<=) to require a preceding , or the start of the string, and a lookahead (?=) to require a trailing , or the end of the string. I've also dropped the word boundary \b because it's not needed, and replaced your optional whitespace \s? with \s* to match zero or more whitespace characters.

You could change the * back to a ? if you really do mean at most one space.
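
For reference, here is a sketch of how that pattern could be wired into a complete call. The class name TagFilter and the method BlankStopTags are hypothetical, and escaping each word with Regex.Escape is an added assumption in case a stop word ever contains regex metacharacters.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagFilter
{
    // Hypothetical helper: blanks out tags that consist solely of a stop word.
    public static string BlankStopTags(string input, params string[] stopWords)
    {
        // Escape each word so any regex metacharacters are matched literally.
        string alternation = string.Join("|", stopWords.Select(w => Regex.Escape(w)));

        // A match must begin at the start of the string or right after a comma,
        // and must end at the end of the string or right before a comma.
        string pattern = string.Format(@"(?<=^|,)\s*(?:{0})\s*(?=$|,)", alternation);

        return Regex.Replace(input, pattern, "", RegexOptions.IgnoreCase);
    }
}
```

With the question's input and stop words this should yield ",sure,,all of our,, values". The whitespace around a removed tag is consumed by the match, but the space before "values" stays because that tag is not a stop word; trim the remaining tags separately if that matters.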


I don't like using Regex for processing tasks, so I will offer an alternative solution and you can decide whether you want to use it.

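// Stop words held in a set so that Contains tests whole tags, not substrings.
var stopWords = new HashSet<string> { "this", "is", "about", "after", "all", "also" };
string[] inputWords = input_string.Split(',');
string tags = "";

foreach(string s in inputWords)
{
   // Trim so that " all" is compared as "all".
   string tag = s.Trim();
   if(!stopWords.Contains(tag.ToLowerInvariant()))
      tags += tag + ",";
}

tags = tags.TrimEnd(',');

//tags = "sure,all of our,values"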