
RegEx to extract all HTML tag attributes including inline JavaScript

Developer https://www.devze.com 2022-12-22 09:58 Source: Internet

I found this useful regex code here while looking to parse HTML tag attributes:

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?

It works great, but it's missing one key element that I need. Some attributes are event handlers that contain inline JavaScript code, like this:

onclick="doSomething(this, 'foo', 'bar');return false;"

Or:

onclick='doSomething(this, "foo", "bar");return false;'

I can't figure out how to make the original expression ignore the quotes from the JS (single or double) when they are nested inside the quotes that delimit the attribute's value.
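To make the failure concrete, here is a minimal sketch that runs the original pattern against the first onclick example. The "~" delimiters and the variable names are my own choices for illustration; only the pattern and the sample attribute come from the question.

```php
<?php
// The original pattern from the question, wrapped in "~" delimiters.
$pattern = '~(\S+)=["\']?((?:.(?!["\']?\s+(?:\S+)=|[>"\']))+.)["\']?~';

// First sample attribute from the question, and the value we would like back.
$attribute = 'onclick="doSomething(this, \'foo\', \'bar\');return false;"';
$expected  = "doSomething(this, 'foo', 'bar');return false;";

preg_match($pattern, $attribute, $m);

$name      = $m[1];                     // attribute name
$captured  = $m[2];                     // what the pattern took as the value
$truncated = ($captured !== $expected); // true when the value was cut short

var_dump($name, $captured, $truncated);
```

The lookahead rejects any character that is followed by a quote, so the capture cannot run past the first quote inside the JavaScript.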

I SHOULD add that this is not being used to parse an entire HTML document. It's used as an argument in an older "array to select menu" function that I've updated. One of the arguments is a tag that can append extra HTML attributes to the form element.

I've made an improved function and am deprecating the old... but in case somewhere in the code is a call to the old function, I need it to parse these into the new array format. Example:

// Old Function
function create_form_element($array, $type, $selected="", $append_att="") { ... }
// Old Call
create_form_element($array, SELECT, $selected_value, "onchange=\"something(this, '444');\"");

The new version takes an array of attr => value pairs to create the extra attributes.

create_select($array, $selected_value, array('style' => 'width:250px;', 'onchange' => "doSomething('foo', 'bar')"));

This is merely a backwards-compatibility issue: all calls to the OLD function are routed to the new one, but the $append_att argument of the old function needs to be converted into an array for the new one, hence my need to use regex to parse small HTML snippets. If there is a better, lightweight way to accomplish this, I'm open to suggestions.
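The routing described above could be sketched roughly like this. Note the assumptions: parse_attribute_string() is a hypothetical helper, its pattern is just one quote-aware possibility, create_select() is reduced to a stub that returns the attribute array so the result is visible, and the SELECT constant is replaced by a plain string.

```php
<?php
// Hypothetical helper: 'a="x" b=\'y\' c=z' becomes ['a' => 'x', 'b' => 'y', 'c' => 'z'].
function parse_attribute_string(string $html): array
{
    // Name, "=", then a double-quoted, single-quoted, or unquoted value.
    $pattern = '~([^\s=<>"\'/]+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'<>=`]+))~';
    $attributes = [];
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // Only one of the three value groups is non-empty in any match.
        $attributes[$m[1]] = ($m[2] ?? '') . ($m[3] ?? '') . ($m[4] ?? '');
    }
    return $attributes;
}

// Stub standing in for the new function; the real one builds the markup.
function create_select(array $array, $selected = '', array $attributes = []): array
{
    return $attributes; // returned only so this example has something to show
}

// Deprecated wrapper: keeps the old signature and forwards to create_select().
function create_form_element($array, $type, $selected = '', $append_att = '')
{
    return create_select($array, $selected, parse_attribute_string($append_att));
}

// Old-style call from the question ('SELECT' is a string here, not the constant).
$attrs = create_form_element([], 'SELECT', '', "onchange=\"something(this, '444');\"");
print_r($attrs);
```

This sketch skips valueless attributes such as a bare "disabled", so it only suits snippets made of name=value pairs.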


The problem with your regular expression is that it tries to handle both single and double quotes at the same time, so it cannot cope with a value that is delimited by one kind of quote and contains the other. This regex will work better:

(\w+)=("[^<>"]*"|'[^<>']*'|\w+)
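As a quick check, this sketch applies the pattern above to both onclick examples from the question. The "/" delimiters, the variable names and the quote-stripping step are my additions; group 2 of the pattern still includes the surrounding quotes, so they are removed afterwards.

```php
<?php
// Pattern from the answer: group 1 is the name, group 2 the value with its quotes.
$pattern = '/(\w+)=("[^<>"]*"|\'[^<>\']*\'|\w+)/';

// The two sample attributes from the question.
$samples = [
    'onclick="doSomething(this, \'foo\', \'bar\');return false;"',
    'onclick=\'doSomething(this, "foo", "bar");return false;\'',
];

$values = [];
foreach ($samples as $sample) {
    preg_match($pattern, $sample, $m);
    // Strip one matching pair of surrounding quotes, if there is one.
    $values[] = preg_replace('/^(["\'])(.*)\1$/s', '$2', $m[2]);
}

print_r($values);
```

Two things to keep in mind: the quoted branches exclude < and >, so JavaScript containing a comparison would not match as a quoted value, and \w+ does not cover attribute names with hyphens such as data-id.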


The following regex patterns follow the HTML syntax spec available here:

http://www.w3.org/TR/html-markup/syntax.html

Regex patterns:

// valid tag names
$tagname = '[0-9a-zA-Z]+';
// valid attribute names
$attr = "[^\s\\x00\"'>/=\pC]+";
// valid unquoted attribute values
$uqval = "[^\s\"'=><`]*";
// valid single-quoted attribute values
$sqval = "[^'\\x00\pC]*";
// valid double-quoted attribute values
$dqval = "[^\"\\x00\pC]*";
// valid attribute-value pairs
$attrval = "(?:\s+$attr\s*=\s*\"$dqval\")|(?:\s+$attr\s*=\s*'$sqval')|(?:\s+$attr\s*=\s*$uqval)|(?:\s+$attr)"; 

And the final regex will be:

    // start tags + all attr formats
    $patt[] = "<(?'starttags'$tagname)(?'tagattrs'($attrval)*)\s*(?'voidtags'[/]?)>";

    // end tags
    $patt[] = "</(?'endtags'$tagname)\s*>"; // end tag

    // full regex pcre pattern
    $patt = implode("|", $patt);
    // search and match
    preg_match_all("#$patt#imuUs",$data,$matches);
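For reference, here is a self-contained sketch of how these pieces fit together on a small snippet. The sub-patterns are copied from above; the $data string and the variables that read the results are made up for the example.

```php
<?php
// Sub-patterns copied from the answer.
$tagname = '[0-9a-zA-Z]+';
$attr    = "[^\s\\x00\"'>/=\pC]+";
$uqval   = "[^\s\"'=><`]*";
$sqval   = "[^'\\x00\pC]*";
$dqval   = "[^\"\\x00\pC]*";
$attrval = "(?:\s+$attr\s*=\s*\"$dqval\")|(?:\s+$attr\s*=\s*'$sqval')|(?:\s+$attr\s*=\s*$uqval)|(?:\s+$attr)";

// Start tags (with attributes) or end tags.
$patt = implode("|", [
    "<(?'starttags'$tagname)(?'tagattrs'($attrval)*)\s*(?'voidtags'[/]?)>",
    "</(?'endtags'$tagname)\s*>",
]);

// Hypothetical input: one start tag with two attributes, then its end tag.
$data = "<select onchange=\"something(this, '444');\" style='width:250px;'></select>";

preg_match_all("#$patt#imuUs", $data, $matches);

// Named groups are indexed by match number: 0 is the start tag, 1 the end tag.
$startTag = $matches['starttags'][0];
$rawAttrs = trim($matches['tagattrs'][0]);
$endTag   = $matches['endtags'][1];

var_dump($startTag, $rawAttrs, $endTag);
```

The tagattrs group holds the whole attribute string of the tag, so splitting it into individual name/value pairs still needs a second pass.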

hope this helps.


Even better would be to use backreferences. In PHP the regular expression would be:

([a-zA-Z_:][-a-zA-Z0-9_:.]+)=(["'])(.*?)\\2

Where \\2 is a reference to (["'])

This regular expression also matches attribute names containing _, - and :, which are allowed according to the W3C. However, it won't match attributes whose values are not enclosed in quotes.
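A short sketch of the backreference pattern in use, showing both what it captures and the unquoted-value limitation. The sample string and variable names are invented for the illustration.

```php
<?php
// Pattern from the answer; \\2 becomes \2, a backreference to the opening quote.
$pattern = '/([a-zA-Z_:][-a-zA-Z0-9_:.]+)=(["\'])(.*?)\\2/';

// Hypothetical snippet: two quoted attributes and one unquoted attribute.
$html = 'data-id="42" onclick=\'doSomething(this, "foo");return false;\' size=5';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$attributes = [];
foreach ($matches as $m) {
    // $m[1] is the name, $m[2] the quote character, $m[3] the value.
    $attributes[$m[1]] = $m[3];
}

print_r($attributes);
```

Here size=5 is left out because its value has no quotes. As written, the + after the second character class also means a one-character attribute name would not match.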
