HTML data extraction

Developer https://www.devze.com 2023-01-17 04:44 Source: Internet

I'm accessing a website and need to extract some data from it. To be more specific, from this part:

<input type="hidden" value="1" name="d520783895194bd08750e47c744d553d">

I need to extract the value of the "name" attribute. I heard that regular expressions are not the best solution, so I'd like to ask what the best way is to access this piece of data.

After parsing the page with NekoHTML or TagSoup (which should take care of the fact that your input tag is not closed), I suggest using an XPath expression:

//input[@type='hidden'][@value=1]/@name

In Groovy you can apply it in the form of a GPath expression.
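If you are on plain Java rather than Groovy, the same XPath can be evaluated with the JDK's built-in `javax.xml.xpath` package. The following is only a minimal sketch: the class name `HiddenInputName` and the helper `extractName` are made up for illustration, and it assumes the markup has already been turned into well-formed XML (here the input tag is simply written as self-closing; in practice that clean-up would be the job of NekoHTML or TagSoup). Depending on the parser you choose, element names may come out in a different case or in a namespace, in which case the expression would need adjusting.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class HiddenInputName {

    // The XPath expression from the answer above, unchanged.
    static final String NAME_XPATH = "//input[@type='hidden'][@value=1]/@name";

    // Hypothetical helper: expects markup that is already well-formed XML.
    static String extractName(String wellFormedMarkup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(wellFormedMarkup)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate(String, Object) returns the string value of the first match,
        // or an empty string when nothing matches.
        return xpath.evaluate(NAME_XPATH, doc);
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample page; the input tag is closed here so the JDK XML parser accepts it.
        String page = "<html><body><form>"
                + "<input type=\"hidden\" value=\"1\" name=\"d520783895194bd08750e47c744d553d\"/>"
                + "</form></body></html>";
        System.out.println(extractName(page));
    }
}
```

Running `main` should print the attribute value from the sample page. An empty string would mean the expression matched nothing.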

Use an HTML parsing library. These libraries fix malformed HTML and make it easy to navigate the document to find and update elements. Here is a link to a list of Java/Groovy implementations:

http://www.wavyx.net/2009/01/13/looking-for-a-java-html-parser-or-groovy/

NekoHTML and TagSoup look popular. I haven't used either of them, or Groovy for that matter, but I have used HTML parsers in other languages.