I have an input string that contains tags like:
<image id="1234" caption="text1" alt="text2">
...blah blah...
There can be multiple instances of such strings in the input.
I want to retrieve the attributes (caption, alt, etc.) of each such tag along with the id, and then print the id, alt, caption, etc. Some images may have no attributes other than the id.
Please advise.
First things first: don't parse XML or [X]HTML with regex; it is generally considered a poor approach.
I understand that for quick-and-dirty applications you don't want to deal with third-party libraries. Still, you have to consider the following questions, each of which makes regex an even worse approach:
- Is your XML valid, or does it contain "broken" tags?
- Are the attributes always in the same order? Or does caption sometimes occur before alt, by any chance?
- You already stated that some image tags only contain the id attribute.
These (and more) aspects determine the complexity of your regex solution. You need a double loop in order to get all the required data.
- Find all the image tags: (<image[^>]+)> (this assumes there are no > characters in the attribute values)
- Then, inside the image tags you found, use this: [ ]+([a-zA-Z0-9]+)="([^"]*)"
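The two steps above can be sketched as a double loop. This is only an illustration in Python (the question does not name a language), and the names IMAGE_TAG, ATTRIBUTE, and extract_images are made up for the example. It uses the two patterns exactly as given, so it inherits their limits: no > inside attribute values, and double-quoted values only.

```python
import re

# Outer pattern: one <image ...> tag; assumes no '>' inside attribute values.
IMAGE_TAG = re.compile(r'(<image[^>]+)>')
# Inner pattern: name="value" pairs; only double-quoted values are recognized.
ATTRIBUTE = re.compile(r'[ ]+([a-zA-Z0-9]+)="([^"]*)"')


def extract_images(text):
    """Return one dict of attribute name -> value per image tag in text."""
    images = []
    for tag_match in IMAGE_TAG.finditer(text):      # outer loop: each tag
        tag_body = tag_match.group(1)
        # Inner loop: findall yields (name, value) pairs for this tag.
        images.append(dict(ATTRIBUTE.findall(tag_body)))
    return images


sample = 'intro <image id="1234" caption="text1" alt="text2"> blah <image id="5678"> end'
for image in extract_images(sample):
    # .get() returns None for attributes a tag does not have.
    print(image.get("id"), image.get("alt"), image.get("caption"))
```

Collecting the attributes into a dict means the order of caption and alt no longer matters, and a tag with only an id simply yields a one-entry dict.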
I hope you already see that this is quite messy and does not cover all the cases of valid XML!
An XML parser is always the correct way to go.
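For comparison, here is a rough sketch of the parser route, again in Python as an assumption. Because the sample tag in the question is never closed, a strict XML parser would likely reject the input, so this sketch swaps in the lenient HTMLParser from Python's standard library instead. ImageCollector and collect_images are hypothetical names for the example.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collects the attributes of every <image> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attrs is a list of (name, value).
        if tag == "image":
            self.images.append(dict(attrs))


def collect_images(text):
    """Return one dict of attribute name -> value per image tag in text."""
    parser = ImageCollector()
    parser.feed(text)
    parser.close()
    return parser.images


sample = "intro <image id=\"1234\" alt='text2' caption=\"text1\"> blah <image id=\"5678\"> end"
for image in collect_images(sample):
    print(image.get("id"), image.get("alt"), image.get("caption"))
```

The parser takes care of attribute order and of single- versus double-quoted values, which the regex version would need extra patterns for.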