I'm accessing a website and need to extract some data from it. More specifically, from this part:
<input type="hidden" value="1" name="d520783895194bd08750e47c744d553d">
I need to extract the value of the "name" attribute. I heard that regular expressions are not the best solution, so I'd like to ask what the best way is to access this piece of data.
After parsing the page with NekoHTML or TagSoup (which should take care of the fact that your input tag is not closed), I suggest using an XPath expression:
//input[@type='hidden'][@value=1]/@name
In Groovy, you can apply the same query in the form of a GPath expression.
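As a rough illustration of the XPath expression above, here is a minimal sketch in plain Java (which also runs alongside Groovy) using only the JDK's built-in XML and XPath APIs. It assumes the markup has already been tidied into well-formed XML, for example by NekoHTML or TagSoup, which is why the input tag is self-closed in the sample string. The class name `HiddenInputName`, the method `extractName`, and the surrounding `<form>` element are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class HiddenInputName {

    // Returns the name attribute of the first hidden input whose value is 1,
    // or an empty string when no such input exists.
    // Assumption: the markup is already well-formed (tidied by an HTML parser).
    public static String extractName(String wellFormedMarkup) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));
        // Same expression as in the answer; evaluate() returns the string value
        // of the first matching attribute node.
        return XPathFactory.newInstance()
                .newXPath()
                .evaluate("//input[@type='hidden'][@value=1]/@name", doc);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: the input from the question, self-closed and
        // wrapped in a form element so the string is valid XML.
        String markup = "<form><input type=\"hidden\" value=\"1\" "
                + "name=\"d520783895194bd08750e47c744d553d\"/></form>";
        System.out.println(extractName(markup));
    }
}
```

With a real page, the `Document` would come from the HTML parser rather than from `DocumentBuilder`, since raw HTML such as the unclosed input tag in the question is not valid XML.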
Use an HTML parsing library. These libraries fix malformed HTML and make it easy to navigate the document to find and update elements. Here is a link to a list of Java/Groovy implementations:
http://www.wavyx.net/2009/01/13/looking-for-a-java-html-parser-or-groovy/
It looks like NekoHTML and TagSoup are popular, but I haven't used either of them, or Groovy for that matter. I have, however, used HTML parsers in other languages.