Optimize a regex that matches two HTML tags

Developer (https://www.devze.com), 2023-01-10 09:54. Source: Internet
((<(\\s*?)(object|OBJECT|EMBED|embed))+(.*?)+((object|OBJECT|EMBED|embed)(\\s*?)>))

I need to get object and embed tags from some HTML files stored locally on disk. I've come up with the above regex to match the tags in Java, and I then use matcher.group(1) to get the entire tag and its contents.

Can anyone improve this? Is there anything that stands out immediately that I should change?

It does work, by the way. I'm just looking for input on whether it can be better, because I'm fairly new to regex myself.


Yes, here's the improvement:

  1. Download a full-fledged Java HTML parser such as Jsoup and put it on the classpath.

  2. Now you can select all <object> and <embed> elements as follows:

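    import java.io.File;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    Document document = Jsoup.parse(new File("/path/to/file.html"), "UTF-8"); // throws IOException
    Elements elements = document.select("object,embed"); // CSS selector: every <object> and <embed>
    for (Element element : elements) {
        System.out.println(element.outerHtml()); // the tag together with its contents
    }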
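
If adding a parser dependency is not an option, the original pattern can at least be tightened. The following is only a minimal sketch, not a replacement for a real parser. It drops the nested quantifier `(.*?)+` (which invites heavy backtracking), uses flags instead of spelling out both cases, and lets `.` span line breaks. The class name `TagExtractor` and method name `extractTags` are hypothetical. The sketch assumes every tag has an explicit closing tag of the same name and that the tags are not nested, so it will miss an `<embed>` with no closing tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a regex-only fallback, assuming explicit, non-nested closing tags.
final class TagExtractor {

    // <object ...> ... </object> or <embed ...> ... </embed>.
    // \1 refers back to the tag name captured by group 1, so the closing tag must match it.
    private static final Pattern TAG = Pattern.compile(
            "<\\s*(object|embed)\\b.*?</\\s*\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group()); // group() is the whole match: the tag and its contents
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<p>intro</p>\n<OBJECT data=\"a.swf\">\n  <param name=\"x\">\n</OBJECT>\n"
                + "<embed src=\"b.swf\"></embed>";
        for (String tag : extractTags(html)) {
            System.out.println(tag);
        }
    }
}
```

Even in this form the pattern breaks on nested `<object>` elements, on tag-like text inside comments or scripts, and on unclosed tags, which is why the parser approach above is the more robust choice.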
    

See also:

  • Regular Expressions - Now you have two problems
  • Parsing HTML - The Cthulhu way
  • Pros and cons of HTML parsers in Java