I have the following text
hello <?tag?> world <?tag2?> xx <?/tag2?> hello <?/tag?> world
And I need it converted into
array( 'hello ', array( ' world ', array( ' xx ' ), ' hello ' ), ' world' );
Tag names are alphanumeric, and each tag must be closed either with its matching closing tag or with the short form <?/?>. Tags with the same name may repeat, but are never nested inside each other.
My question is which would be the most CPU-efficient way to go?
- use recursive preg_replace with callback
- use preg_match_all with PREG_OFFSET_CAPTURE
- use preg_split to flatten all tags (PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE) into a linear array, then walk through it and group the tags.
If you can also provide the expression, I would be really happy.
This turned out to be less straightforward than expected, but hopefully it will be helpful to others. The biggest complication was returning a non-string value from the preg_replace callback function.
Thanks to all who tried to help!
class Parser {
    public $ret = array();

    function loadTemplateFromString($str) {
        /* First expand self-closing tags <?$tag?> -> <?tag?><?/tag?> */
        ...
        /* Next fix short ending tag <?tag?> <?/?> -> <?tag?> <?/tag?> */
        ...
            return preg_replace('/(.*<\?([^\/][\w]+)\?>)(.*?)(<\?\/?\?>)/',
                ...
        /* Finally recursively build tag structure */
        ...
    }

    function recursiveReplace($x) {
        // Called recursively
        ...
        return '';
    }
}

$p = new Parser();
$p->loadTemplateFromString('bla <?name?> name <?/name?> bla bla <?$surname?> bla '.
    '<?middle?> mm <?/?> blah <?outer?> you <?inner?> are <?/?> inside <?/outer?>'.
    ' bobobo');
This outputs:
0 => string 'bla ' (length=4)
1 =>
    0 => string ' name ' (length=6)
2 => string ' bla bla ' (length=9)
3 =>
    0 => string '' (length=0)
4 => string ' bla ' (length=5)
5 =>
    0 => string ' mm ' (length=4)
6 => string ' blah ' (length=6)
7 =>
    0 => string ' you ' (length=5)
    1 =>
        0 => string ' are ' (length=5)
    2 => string ' inside ' (length=8)
8 => string ' bobobo' (length=7)
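For comparison, here is a minimal sketch of the preg_split option from the question. It is not the code that produced the output above, and it has not been benchmarked against the regex-recursion approach. The function names parseTemplate and buildTree are made up for illustration. It assumes the tag syntax described in the question, drops empty text pieces (so a self-closing tag yields an empty array rather than an array holding an empty string), and throws on mismatched tags.

```php
<?php
/* Hypothetical helper: flatten the template with preg_split, then group. */
function parseTemplate(string $str): array
{
    /* Expand self-closing tags: <?$tag?> becomes <?tag?><?/tag?> */
    $str = preg_replace('/<\?\$(\w+)\?>/', '<?$1?><?/$1?>', $str);

    /* Without PREG_SPLIT_NO_EMPTY the result strictly alternates:
       even indexes hold text, odd indexes hold the captured tags. */
    $tokens = preg_split('/(<\?(?:\w+|\/\w*)\?>)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);

    $i = 0;
    return buildTree($tokens, $i, null);
}

/* Consumes tokens from position $i until the tag named $open is closed. */
function buildTree(array $tokens, int &$i, ?string $open): array
{
    $out = array();
    $n = count($tokens);
    while ($i < $n) {
        $tok = $tokens[$i];
        $isTag = ($i % 2 === 1);
        $i++;
        if (!$isTag) {
            if ($tok !== '') {
                $out[] = $tok;              /* plain text piece */
            }
        } elseif ($tok[2] !== '/') {
            /* Opening tag: collect its content into a nested array */
            $out[] = buildTree($tokens, $i, (string) substr($tok, 2, -2));
        } else {
            /* Closing tag: either the short form or the matching name */
            $name = (string) substr($tok, 3, -2);
            if ($open === null || ($name !== '' && $name !== $open)) {
                throw new InvalidArgumentException('Unexpected closing tag ' . $tok);
            }
            return $out;
        }
    }
    if ($open !== null) {
        throw new InvalidArgumentException('Unclosed tag ' . $open);
    }
    return $out;
}
```

Calling parseTemplate() on the sample string from the question should give the nested array asked for there. The split pattern runs once over the input and the grouping is a single linear walk over the tokens, which is the main appeal of this variant.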
How about converting <?tag to <elem and then parsing it as XML?
After you get a raw structure looking like the result you mentioned, you could/would verify it against your element structure (that is, ensure items are numerically inside each other etc).
Just add in a document element and you are set with this stylesheet:
Edit: After the fact that these tags are mixed with HTML came up, I thought I'd change my strategy. Please check out the following code first before a description:
$data = '<b>H</b>ello <?tag?> <b>W</b>orld <?/tag?>';

$conv1 = array(
    // original => entity
    '<?tag'  => '%START-BEGIN%',
    '<?/tag' => '%START-END%',
    '?>'     => '%END-END%',
);
$conv2 = array(
    // entity => xml
    '%START-BEGIN%' => '<element',
    '%START-END%'   => '</element',
    '%END-END%'     => '>',
);

// Protect the template tags so they survive the HTML encoding
$data = str_replace(array_keys($conv1), array_values($conv1), $data);
$data = htmlentities($data, ENT_QUOTES); // encode HTML characters
// Turn the placeholders into XML elements
$data = str_replace(array_keys($conv2), array_values($conv2), $data);
// Wrap everything in a document element so the XML has a single root
$xml = '<?xml version="1.0" encoding="UTF-8"?><document>'.$data.'</document>';
// You must apply the following function to each output text
// html_entity_decode($data, ENT_QUOTES);
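To show where this leads, here is a rough sketch that takes the placeholder idea above all the way to a nested array using DOMDocument. The function names templateToArray and nodeToArray are hypothetical. It departs from the snippet above in three ways, all of them assumptions: it accepts any alphanumeric tag name rather than only "tag", it uses htmlspecialchars instead of htmlentities so that only entities XML already knows are produced (which makes the later html_entity_decode step unnecessary), and it assumes the literal strings %OPEN% and %CLOSE% never occur in the input.

```php
<?php
/* Hypothetical helper: convert a DOM node's children into a nested array. */
function nodeToArray(DOMNode $node): array
{
    $out = array();
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $out[] = nodeToArray($child);           /* nested tag */
        } elseif ($child instanceof DOMText) {
            $last = count($out) - 1;
            if ($last >= 0 && is_string($out[$last])) {
                $out[$last] .= $child->nodeValue;   /* merge adjacent text nodes */
            } else {
                $out[] = $child->nodeValue;
            }
        }
    }
    return $out;
}

/* Hypothetical helper: template string to nested array via XML. */
function templateToArray(string $tpl): array
{
    /* 1. Protect the template tags with placeholders */
    $s = preg_replace('/<\?\w+\?>/', '%OPEN%', $tpl);
    $s = preg_replace('/<\?\/\w*\?>/', '%CLOSE%', $s);
    /* 2. Escape the surrounding HTML so it survives as plain text */
    $s = htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
    /* 3. Turn the placeholders into XML elements inside a document element */
    $s = str_replace(array('%OPEN%', '%CLOSE%'), array('<element>', '</element>'), $s);

    $doc = new DOMDocument();
    if (!$doc->loadXML('<document>' . $s . '</document>')) {
        throw new InvalidArgumentException('Template tags are not properly nested');
    }
    return nodeToArray($doc->documentElement);
}
```

Because the tag names are discarded before parsing, this version only checks that tags are balanced, not that each closing tag names the tag it closes. If that check matters, the name could be kept as an attribute on the element and verified afterwards.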