Regular expression to replace quotation marks in HTML tags only

Developer https://www.devze.com 2022-12-23 13:05 Source: Internet
I have the following string:

<div id="mydiv">This is a "div" with quotation marks</div>

I want to use regular expressions to return the following:

<div id='mydiv'>This is a "div" with quotation marks</div>

Notice how the id attribute in the div is now surrounded by apostrophes?

How can I do this with a regular expression?

Edit: I'm not looking for a magic bullet to handle every edge case in every situation. We should all be wary of using regex to parse HTML but, in this particular case and for my particular need, regex IS the solution... I just need a bit of help getting the right expression.

Edit #2: Jens helped me find a solution, but anyone randomly coming to this page should think long and hard before using it. In my case it works because I am very confident about the type of strings that I'll be dealing with. I know the dangers and the risks; make sure you do too. If you're not sure whether you know them, you probably don't, and you shouldn't use this method. You've been warned.


I think you want to replace every " that appears between a < and a > with '.

So you look for each " in the input, look behind it for a <, and look ahead of it for a >, with no other angle bracket in between. The regex looks like this:

(?<=\<[^<>]*)"(?=[^><]*\>)

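You can then replace each match with whatever you like, for example using Regex.Replace. Note that this pattern relies on a variable-length lookbehind, which the .NET regex engine supports but many other engines do not.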
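As a rough illustration of how that pattern might be applied with Regex.Replace, here is a minimal sketch. The class and method names (TagQuoteFixer, UseSingleQuotes) are made up for this example and are not part of the original answer; the pattern itself is the one quoted above, written as a C# verbatim string, so the double quote is escaped as "".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class TagQuoteFixer
{
    // A double quote that has "<" plus non-bracket characters behind it
    // and non-bracket characters plus ">" ahead of it, i.e. a quote inside a tag.
    private static readonly Regex QuoteInsideTag =
        new Regex(@"(?<=\<[^<>]*)""(?=[^><]*\>)");

    public static string UseSingleQuotes(string html)
    {
        // Replace every matched double quote with an apostrophe.
        return QuoteInsideTag.Replace(html, "'");
    }

    public static void Main()
    {
        string input = "<div id=\"mydiv\">This is a \"div\" with quotation marks</div>";
        Console.WriteLine(UseSingleQuotes(input));
        // prints: <div id='mydiv'>This is a "div" with quotation marks</div>
    }
}
```

The quotes in the text content are left alone because a > sits between them and the nearest preceding <, so the lookbehind fails there. This still assumes that attribute values never contain < or >, and that no value contains an apostrophe that would clash with the new delimiters.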

Note: While I have found the Stack Overflow community most friendly and helpful, these Regex/HTML questions are answered with a little too much anger, in my opinion. After all, this question does not ask "What regex matches all valid HTML, and does not match anything else?"


I see you're aware of the dangers of using Regex to do these kinds of replacements. I've added the following answer for those looking for a more 'stable' method, one that will keep working as the input documents change.

Using the HTML Agility Pack (project page, nuget), this does the trick:

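using HtmlAgilityPack;

// Parse the markup into a document tree instead of pattern-matching on raw text.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html here");
// or doc.Load(stream);

// Walk every node in the document.
var nodes = doc.DocumentNode.DescendantNodes();

foreach (var node in nodes)
{
    foreach (var att in node.Attributes)
    {
        // Have this attribute's value written with single quotes on output.
        att.QuoteType = AttributeValueQuote.SingleQuote;
    }
}

// Serialize the modified document back to a string.
var fixedText = doc.DocumentNode.OuterHtml;
//doc.Save(/* stream */);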


You can match:

(<div.*?id=)"(.*?)"(.*?>)

and replace this with:

$1'$2'$3
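
For anyone trying this match-and-replace pair in .NET, it could be wired up roughly as follows. This is only a sketch: the names DivIdQuoteFixer and UseSingleQuotesForId are invented for the example, and the pattern and replacement string are copied from above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class DivIdQuoteFixer
{
    public static string UseSingleQuotesForId(string html)
    {
        // Group 1: "<div ... id=", group 2: the id value, group 3: the rest of the tag.
        // The two double quotes around group 2 are matched but not captured,
        // so the replacement can put apostrophes in their place.
        return Regex.Replace(html, @"(<div.*?id=)""(.*?)""(.*?>)", "$1'$2'$3");
    }

    public static void Main()
    {
        string input = "<div id=\"mydiv\">This is a \"div\" with quotation marks</div>";
        Console.WriteLine(UseSingleQuotesForId(input));
        // prints: <div id='mydiv'>This is a "div" with quotation marks</div>
    }
}
```

Unlike the lookaround version, this one only touches the id attribute of div tags, and the lazy .*? can run past the end of a div that has no id attribute, so it is best suited to input whose shape you already know.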