
Parsing Dreamweaver templates with Regular Expressions

https://www.devze.com 2022-12-08 19:43 Source: web

I have a requirement to parse the content out of Dreamweaver templates. I'm using C#.

Here is some example content that I will need to parse.

<div id="myDiv">
    <h1><!-- InstanceBeginEditable name="PageHeading" -->
    The Heading<!-- InstanceEndEditable --></h1>
    <!-- InstanceBeginEditable name="PageContent" -->
    <p>
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed nibh turpis, 
    sagittis vitae convallis at, fringilla nec augue.</p>
    <p>
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
    Sed nibh turpis, sagittis vitae convallis at, fringilla nec augue.</p>
    <!-- InstanceEndEditable -->
</div><!-- END #myDiv-->

Dreamweaver templates are based around HTML comments containing specific strings that denote their purpose. The key ones for me are the following, as they mark the start and end of editable regions in the page.

<!-- InstanceBeginEditable name="xxxxxx" -->
<!-- InstanceEndEditable --> 

As you can see from my example HTML, there may be other comments in the source code.

So starting simple, I have the following, which matches all the opening Editable region tags.

<!-- InstanceBeginEditable(.*)?--> 

So next I want to get everything between there and the next "<!-- InstanceEnd", but the following does not match what I need:

<!-- InstanceBeginEditable(.*)?-->(?<content>(.*)?)<!-- InstanceEnd

Can you tell me why this is so? I would have thought a non-greedy capture (.*)? in between my already working code and the literal

<!—InstanceEnd

would have matched what I need...


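You don't want to put parentheses around .* here.

This means "grab everything greedily, or grab nothing": the ? applies to the whole group and makes it optional, while the .* inside stays greedy.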

(.*)?

This means to grab everything lazily:

.*?
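
To make the difference concrete, here is a minimal sketch in C# that runs both patterns against the same input. The class name, method name, and the one-line sample string are made up for illustration; they are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical single-line sample containing two editable regions.
    public const string Sample =
        "<!-- InstanceBeginEditable name=\"A\" -->one<!-- InstanceEndEditable -->" +
        "<!-- InstanceBeginEditable name=\"B\" -->two<!-- InstanceEndEditable -->";

    // The question's pattern: (.*)? is a greedy .* inside an optional group.
    public const string OriginalPattern =
        "<!-- InstanceBeginEditable(.*)?-->(?<content>(.*)?)<!-- InstanceEnd";

    // The corrected pattern: .*? is a lazy quantifier.
    public const string LazyPattern =
        "<!-- InstanceBeginEditable.*?-->(?<content>.*?)<!-- InstanceEnd";

    // Returns what the "content" group captured in the first match.
    public static string FirstContent(string pattern)
    {
        return Regex.Match(Sample, pattern).Groups["content"].Value;
    }
}
```

On this sample, FirstContent(OriginalPattern) returns "two", because the first greedy group swallows the whole first region before the engine backtracks to the last usable "-->". FirstContent(LazyPattern) returns "one", the content of the first region.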

Also, in your regex, you have only one - in the ending token. Change it to this:

<!-- InstanceBeginEditable.*?-->(?<content>.*?)<!-- InstanceEnd

By the way, it's dangerous to have two .*s in a regex without an atomic group: on unexpected data, you can get catastrophic backtracking. I'd recommend changing the first .*? to [^-]* (this assumes the region name itself contains no hyphen). And, while I'm at it, I'll suggest you handle whitespace more forgivingly:

<!--\s*InstanceBeginEditable[^-]*-->(?<content>.*?)<!--\s*InstanceEnd

You probably already know this, but let me add that with .NET, you'll need to use RegexOptions.Singleline so that . also matches newlines; otherwise the content group cannot span multiple lines.
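
Putting the pieces together, one possible way to collect every editable region by name might look like the sketch below. It is an illustration rather than a definitive implementation: the EditableRegionParser class, the Parse method, and the extra "name" group are assumptions added here. The pattern is a variant of the one above that matches the name attribute explicitly instead of using [^-]*, so hyphenated region names would still match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EditableRegionParser
{
    // Variant of the whitespace-tolerant pattern, with an added "name" group.
    // RegexOptions.Singleline lets '.' match newlines inside the content.
    private static readonly Regex RegionPattern = new Regex(
        @"<!--\s*InstanceBeginEditable\s+name=""(?<name>[^""]*)""\s*-->" +
        @"(?<content>.*?)" +
        @"<!--\s*InstanceEndEditable\s*-->",
        RegexOptions.Singleline);

    // Maps each region name to its trimmed content.
    // If a name occurs more than once, the last occurrence wins.
    public static Dictionary<string, string> Parse(string html)
    {
        var regions = new Dictionary<string, string>();
        foreach (Match m in RegionPattern.Matches(html))
        {
            regions[m.Groups["name"].Value] = m.Groups["content"].Value.Trim();
        }
        return regions;
    }
}
```

Run against the example markup from the question, Parse should return two entries keyed "PageHeading" and "PageContent"; other comments such as "END #myDiv" are ignored because they do not match the begin marker.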


Use the HTML Agility Pack; see my answer to "How do I parse HTML using regular expressions in C#?"
