
How can I prevent a specific string pattern from being replaced by Regex.Replace()?

Developer https://www.devze.com 2022-12-29 15:15 Source: Internet

I have a string like

Pakistan, officially the <a href="Page.aspx?Link=Islamic Republic of Pakistan">Islamic Republic of Pakistan</a>

Now I am using

System.Text.RegularExpressions.Regex.Replace(inputText, "(\\bPakistan\\b)", "something"); to replace Pakistan outside the tags. But I don't want to replace Pakistan occurring within the <a></a> tags.

Edit: an actual string

Pakistan (Urdu: پاکِستان), officially the Islamic Republic of Pakistan, is a country in South Asia. It has a 1,046-kilometre (650 mi) coastline along the Arabian Sea and Gulf of Oman in the south and is bordered by Afghanistan and Iran in the west, India in the east and China in the far northeast.[6] Tajikistan also lies very close to Pakistan but is separated by the narrow Wakhan Corridor.

And an array of strings:

string[] links = {"Pakistan","Islamic Republic","Republic of Pakistan","South Asia","Arabian Sea","Gulf","Oman","Gulf of Oman","the south","in the south","Afghanistan","Iran","the west","in the west","west India","the east","China","Tajikistan","the narrow","Wakhan Corridor","Central Asia","the Middle","Middle East","the Middle East"};

I want to replace every occurrence of every string in this array with <a href="page.aspx?link=thisString">thisString</a>, but I could not correctly add links for strings like "Republic of Pakistan", where "Pakistan" is itself another string in the array.


If you're trying to do something in the context of HTML syntax, use an HTML parser.


For the first part of your question, I would match either a link or the target word:

Regex r = new Regex(@"<a\s+.*?</a>|\bPakistan\b");

Then I would use a MatchEvaluator to check which one I matched and replace accordingly: if it's a link, plug it back in; if it's the target word, linkify it.
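A minimal sketch of that approach, assuming the links look like the sample in the question. The names Linkifier and LinkifyOutsideAnchors are made up for illustration, and the href format is copied from the question rather than taken from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Linkifier
{
    // Matches either a whole <a ...>...</a> element or the bare target word.
    // A match is found at the leftmost position, so once "<a " is reached the
    // whole link is consumed and the word inside it is never matched on its own.
    private static readonly Regex Pattern =
        new Regex(@"<a\s+.*?</a>|\bPakistan\b");

    public static string LinkifyOutsideAnchors(string input)
    {
        return Pattern.Replace(input, m =>
            m.Value.StartsWith("<a", StringComparison.Ordinal)
                ? m.Value   // existing link: plug it back in unchanged
                : "<a href=\"Page.aspx?Link=" + m.Value + "\">" + m.Value + "</a>");
    }

    public static void Main()
    {
        string input = "Pakistan, officially the <a href=\"Page.aspx?Link=Islamic Republic of Pakistan\">Islamic Republic of Pakistan</a>";
        Console.WriteLine(LinkifyOutsideAnchors(input));
    }
}
```

Only the leading Pakistan should be wrapped in a new link here; the existing link passes through untouched. Without RegexOptions.Singleline, a link that spans several lines would not be recognised by this pattern.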

For the second part, you can Join the strings in the array into a regex alternation, like this:

string regex = String.Format(@"\b({0})\b", String.Join("|", links));

Just remember that an alternation returns the first matching alternative, not the longest. If any alternative A is a prefix of alternative B, B should be listed before A. For example, the Middle East should come before the Middle in your list.
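One way to get that ordering without rearranging the array by hand is to sort the phrases by length, longest first, before joining them. This is only a sketch: AlternationBuilder and Build are hypothetical names, and the Regex.Escape call is an added precaution that is not part of the answer above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlternationBuilder
{
    // Builds \b(alt1|alt2|...)\b with longer phrases first. A phrase is always
    // longer than any of its prefixes, so "the Middle East" ends up before
    // "the Middle" and "Gulf of Oman" before "Gulf".
    public static Regex Build(string[] phrases)
    {
        var ordered = phrases
            .OrderByDescending(p => p.Length)
            .Select(p => Regex.Escape(p));   // added precaution: escape metacharacters
        return new Regex(@"\b(" + string.Join("|", ordered) + @")\b");
    }

    public static void Main()
    {
        string[] links = { "the Middle", "Middle East", "the Middle East", "Gulf", "Gulf of Oman" };
        Regex r = Build(links);
        string text = "Gulf of Oman lies near the Middle East.";
        string result = r.Replace(text,
            m => "<a href=\"page.aspx?link=" + m.Value + "\">" + m.Value + "</a>");
        Console.WriteLine(result);
    }
}
```

Because a match is consumed as a whole, the shorter phrases contained in a longer one (such as Gulf inside Gulf of Oman) are not linked a second time.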


Although @Chris's solution does not work exactly as written here, you can use it this way:

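string content = "Pakistan is <a href=\" Pakistan is\">Pakistan an islamic country</a>";
// First pass: replace every occurrence, including the ones inside the link.
string content2 = Regex.Replace(content, @"\bPakistan\b", "India");
// Second pass: restore the occurrences that follow an opening <a (attribute value or link text).
string content3 = Regex.Replace(content2, @"(?<=\<\s*a[^<]+)\bIndia\b(?=.*?\>)", "Pakistan");
Console.WriteLine(content3);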

This is not a very efficient solution, though, and it would also turn any "India" that was already inside a link into "Pakistan".


Here's how you can do the opposite of what you're asking (replace only the instances inside the tags):

content = Regex.Replace(content, @"(?<=\<\s*a[^>]+)\bPakistan\b(?=.*?\>)", "India");

This is untested and not exactly what you want, but it could give you some hints. It uses zero-width lookaround assertions. I'm sure there are many other ways to do it.

This is really pushing the limits of regex. You should probably use an HTML parser.

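Edit: using a negative lookbehind, this appears to work (please test it!):

content = Regex.Replace(content, @"(?<!\<\s*a[^>]+)\bPakistan\b", "India");

Note that [^>]+ stops at the end of the opening tag, so this only protects occurrences inside the tag itself (such as the href value), not the link text between <a ...> and </a>.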


Get each line of text into a string A

Remove the bit between <a></a> and store it in string B

Run your Regex on the remaining text in string A

return A + B
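The steps above could be sketched roughly as follows. This reads "return A + B" as putting each saved link back where it was rather than appending it at the end, which is an interpretation and not something the answer states. AnchorShield and ReplaceOutsideAnchors are hypothetical names, and the sketch assumes the input never contains the NUL character used as a placeholder.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorShield
{
    public static string ReplaceOutsideAnchors(string input, string word, string replacement)
    {
        var saved = new List<string>();   // plays the role of "string B"

        // Step 1: cut every <a ...>...</a> out of the text and leave a
        // placeholder behind. Assumption: the input contains no \u0000.
        string stripped = Regex.Replace(input, @"<a\s+.*?</a>", m =>
        {
            saved.Add(m.Value);
            return "\u0000";
        });

        // Step 2: run the ordinary replacement on what is left ("string A").
        string replaced = Regex.Replace(
            stripped, @"\b" + Regex.Escape(word) + @"\b", m => replacement);

        // Step 3: restore the saved links in their original order.
        int next = 0;
        return Regex.Replace(replaced, @"\u0000", m => saved[next++]);
    }

    public static void Main()
    {
        string content = "Pakistan is <a href=\" Pakistan is\">Pakistan an islamic country</a>";
        Console.WriteLine(ReplaceOutsideAnchors(content, "Pakistan", "India"));
    }
}
```

Compared with the lookbehind patterns, this keeps each regex simple, at the cost of three passes over the text.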
