Need help with regex to extract data inside tags

Source: 开发者 (https://www.devze.com), 2023-02-24 04:26; origin: the web

I have been struggling for some time to create a regex that suits my needs for the HTML below. I'm using the java.util.regex.* package, and for various reasons I need to use this package rather than any third-party library.

What I want is to extract the data inside the <span> tags; for this particular HTML that is 25 / 25, Lindhagen, 0, Spinninghall, 35 and Test Person.

Is it possible to create a regex for this?

<div id="rsv_detail">
  <hr />

  <label>Bokningsstatus</label>
  <span>&nbsp;</span>

  <label>Bokningar</label>

  <span>25 / 25 &nbsp;</span>

  <br />

  <label>Plats</label>
  <span>Lindhagen&nbsp;</span>

  <label>Anlänt</label>
  <span>0&nbsp;</span>

  <br />

  <label>Sal</label>
  <span>Spinninghall&nbsp;</span>

  <label>Max antal</label>
  <span>35&nbsp;</span>
  <br />

  <label>Ledare</label>

  <span>Test Person&nbsp;</span>
  <br /><br />


  <label>Visa mer</label>
  <span>      
    <a href="/index.php?instructors%5B%5D=X129518&amp;func=la&amp;tak=0.36507500+1302460619">Ledare</a>
    <a href="/index.php?locations=LI&amp;func=la&amp;tak=0.36507500+1302460619">Plats</a>
    <a href="/index.php?activities=SP_MEDEL&amp;func=la&amp;tak=0.36507500+1302460619">Aktivitet</a>

  </span>
  <br /><br />

  <br />
  <br />
  <hr />
</div>


As far as I know, the best way to extract information from HTML is to use an HTML parser or to convert the HTML to XHTML and extract it via standard XML techniques. Why can't you use 3rd party libraries?
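If third-party libraries are ruled out, the "treat it as XHTML" route can be tried with the XML parser that ships with the JDK. The following is only a sketch under two assumptions: the markup is well-formed XML with a single root element, as the fragment in the question happens to be, and `&nbsp;` is the only HTML-specific entity in it. The class name `SpanXmlExtractor` and its `extract` method are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name and structure are illustrative only.
class SpanXmlExtractor {

    static List<String> extract(String xhtml) {
        // &nbsp; is not a predefined XML entity, so swap it for a numeric
        // character reference before handing the text to the XML parser.
        String xml = xhtml.replace("&nbsp;", "&#160;");
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList spans = doc.getElementsByTagName("span");
            List<String> values = new ArrayList<>();
            for (int i = 0; i < spans.getLength(); i++) {
                Element span = (Element) spans.item(i);
                // Skip spans that wrap other elements, such as the link list.
                if (span.getElementsByTagName("*").getLength() > 0) {
                    continue;
                }
                // Turn the non-breaking space back into a plain space, then trim.
                String text = span.getTextContent().replace('\u00A0', ' ').trim();
                if (!text.isEmpty()) {
                    values.add(text);
                }
            }
            return values;
        } catch (Exception e) {
            // Assumption: malformed input is treated as a programming error here.
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<div id=\"rsv_detail\">"
            + "<label>Plats</label><span>Lindhagen&nbsp;</span>"
            + "<label>Max antal</label><span>35&nbsp;</span>"
            + "</div>";
        System.out.println(extract(html));
    }
}
```

Real-world HTML is often not well-formed XML, in which case the parser throws and this approach does not apply without a tidy-up step first.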


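import java.util.regex.Matcher;
import java.util.regex.Pattern;

// text holds the HTML from the question
Pattern p = Pattern.compile("<span>([^<&]+)&nbsp;</span>");
Matcher m = p.matcher(text);
while (m.find())
{
  // trim() drops the space that precedes &nbsp; in "25 / 25 &nbsp;"
  System.out.println(m.group(1).trim());
}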

output:

25 / 25
Lindhagen
0
Spinninghall
35
Test Person

This assumes the target <span> always ends with &nbsp;, and never contains any other entities or elements.
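To make those assumptions concrete, here is a small self-contained check of the same pattern against a few inputs. The class name `SpanRegexDemo` and the sample strings other than the Lindhagen one are invented for illustration; they are not taken from the page in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern comes from the answer above.
class SpanRegexDemo {

    private static final Pattern SPAN =
        Pattern.compile("<span>([^<&]+)&nbsp;</span>");

    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = SPAN.matcher(html);
        while (m.find()) {
            values.add(m.group(1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        // Matches: plain text followed by &nbsp;
        System.out.println(extract("<span>Lindhagen&nbsp;</span>"));
        // No match: the + quantifier needs at least one character before &nbsp;
        System.out.println(extract("<span>&nbsp;</span>"));
        // No match: the span does not end with &nbsp;
        System.out.println(extract("<span>Lindhagen</span>"));
        // No match: another entity appears inside the span
        System.out.println(extract("<span>A &amp; B&nbsp;</span>"));
    }
}
```

The second case is convenient here, since it means the empty Bokningsstatus span is skipped without extra code.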


If you first discard every line that doesn't open and close the <span> tag on the same line, you can use:

filtered.replaceAll ("<span>([^<]*)</span>", "$1")
  .replaceAll ("&nbsp;", "")

The parentheses build a capturing group, which you can reference later by number; groups are counted from left to right by their opening parenthesis, and here there is just one, hence $1. After the opening tag, [^<]* reads everything that is not a less-than sign (the ^ negates the character class), so it stops at the < that is expected to start the closing tag.
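The filtering step is not spelled out above, so here is one way it might look. This is a sketch only: the class name `SpanLineFilter` and the choice to drop empty values are assumptions, and the line filter is simply a full-line match against the same span pattern.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class showing one possible filter-then-replace flow.
class SpanLineFilter {

    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        // \R matches any line break (Java 8 and later).
        for (String line : html.split("\\R")) {
            String trimmed = line.trim();
            // Keep only lines where the span opens and closes on the same line.
            if (!trimmed.matches("<span>[^<]*</span>")) {
                continue;
            }
            String value = trimmed
                .replaceAll("<span>([^<]*)</span>", "$1") // keep group 1 only
                .replaceAll("&nbsp;", "")                 // drop the entity
                .trim();
            // Assumption: spans holding nothing but &nbsp; are not wanted.
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "  <label>Plats</label>\n"
            + "  <span>Lindhagen&nbsp;</span>\n"
            + "  <span>      \n"
            + "    <a href=\"/x\">Ledare</a>\n"
            + "  </span>\n";
        System.out.println(extract(html));
    }
}
```

The multi-line span around the links never passes the filter, so its opening and closing lines are ignored rather than mangled.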

However, in most cases I would agree with stema and Hovercraft Full Of Eels. The pitfalls of using regexes on HTML are:

  • Opening and closing tags are hard to match with a regex if they span multiple lines, and even more so if they are nested.
  • Tags inside comments are hard to detect.
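Both pitfalls are easy to reproduce with the span pattern from this answer. The snippet below is an illustrative sketch; the class name `RegexHtmlPitfalls` and the two sample inputs are invented for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the two pitfalls listed above.
class RegexHtmlPitfalls {

    private static final Pattern SPAN = Pattern.compile("<span>([^<]*)</span>");

    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = SPAN.matcher(html);
        while (m.find()) {
            values.add(m.group(1).replace("&nbsp;", "").trim());
        }
        return values;
    }

    public static void main(String[] args) {
        // Pitfall 1: the regex cannot tell that the first span is commented
        // out, so "Old value" is returned alongside the live value.
        String commented = "<!-- <span>Old value&nbsp;</span> -->\n"
            + "<span>35&nbsp;</span>";
        System.out.println(extract(commented));

        // Pitfall 2: a span that contains another tag is never matched,
        // because [^<]* stops at the nested opening tag.
        String nested = "<span>\n  <a href=\"/x\">Ledare</a>\n</span>";
        System.out.println(extract(nested));
    }
}
```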

However, there are rare cases where regexes are useful:

  • One-time jobs, where you can inspect all the input in advance.
  • Generated HTML that always looks the same, for example from routers or Javadoc.
  • HTML that you generate yourself with your program in mind.


'<span>(.*?)&nbsp;</span>' as a regex will do, won't it?
