
Regular Expressions (HTML parsing on iPhone)

Developer https://www.devze.com 2023-01-21 13:09 Source: Internet
I am trying to pull data from a website using Objective-C. This is all very new to me, so I've done some research. What I know now is that I need to use XPath, and I have a wrapper for that called hpple for the iPhone. I've got it up and running in my project.

I am confused about the way I retrieve information from the site. Apparently I am to use regular expressions in this line of code:

NSArray * a = [doc search:@"//a[@class='sponsor']"];

This is just an example. Is that stuff in the search:@"...." the regular expression? If so, I guess I can develop the hundreds of patterns that I will need for my program to parse the site (I need a lot of data), but is there a better way? I'm very lost in this. Any help is appreciated.


The parameter is an XPath, not a regular expression. Here's a breakdown:

  • All XPath expressions are interpreted relative to a context node. In this case, it's the root node.
  • // is an abbreviation meaning "all descendants" (strictly, the context node itself plus everything below it)
  • a means "all child elements named 'a'" (in HTML, that's anchors)
  • [...] contains a predicate, refining just which a to match
    • @ is an abbreviation for attribute nodes
    • @class means an attribute named "class"
    • @class='sponsor' means a class attribute equal to "sponsor". Note this will not match nodes with a class containing "sponsor", such as <a class="big sponsor" ...>; the class must be equal.

All together, we have "'a' nodes descending from the root that have class equal to 'sponsor'".
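To make the "equal" versus "containing" point concrete, here is a minimal sketch. It is written in Python with the standard library's ElementTree rather than Objective-C and hpple, purely so it can be run anywhere; the sample markup and variable names are made up for illustration. ElementTree only supports a subset of XPath, so the "containing" case is done in plain Python here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
# (ElementTree parses XML, so real-world messy HTML would need a real HTML parser.)
SAMPLE = """
<html><body>
  <a class="sponsor" href="/one">One</a>
  <a class="big sponsor" href="/two">Two</a>
  <div><a class="sponsor" href="/three">Three</a></div>
  <a href="/four">Four</a>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# ElementTree wants a leading "." to search below the current element;
# apart from that, this is the same query as //a[@class='sponsor'].
exact = [a.get("href") for a in root.findall(".//a[@class='sponsor']")]

# Token match: any <a> whose class list includes "sponsor".
# Done in Python because ElementTree's XPath subset has no contains().
token = [a.get("href") for a in root.iter("a")
         if "sponsor" in a.get("class", "").split()]

print(exact)  # the "big sponsor" link is missing here
print(token)  # and present here
```

In a full XPath 1.0 engine, the usual idiom for the token match is //a[contains(concat(' ', normalize-space(@class), ' '), ' sponsor ')]. I would expect that string to work as the search: argument as well, but I have not tested it with hpple.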


That is an XPath expression, not a regular expression. The W3C has an XPath reference here: http://www.w3.org/TR/xpath/. Basically you are searching for <a> elements with the class "sponsor".

Note that this is a good thing! Regular expressions are a poor fit for parsing HTML: attribute order, quoting style, and commented-out markup quickly defeat any fixed pattern.
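A small illustration of why, again as a Python sketch using only the standard library. The markup and the deliberately naive pattern are invented for this example, not taken from any real site.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: same kind of link written two ways, plus a commented-out one.
SAMPLE = """<div>
  <a class="sponsor" href="/one">One</a>
  <a href="/two" class='sponsor'>Two</a>
  <!-- <a class="sponsor" href="/old">Old</a> -->
</div>"""

# A naive pattern that assumes one attribute order and one quoting style.
naive = re.compile(r'<a class="sponsor" href="([^"]*)"')
regex_hits = naive.findall(SAMPLE)

# The parser-based query does not care about attribute order or quote style,
# and ElementTree's default parser drops comments.
root = ET.fromstring(SAMPLE)
parsed_hits = [a.get("href") for a in root.findall(".//a[@class='sponsor']")]

print(regex_hits)   # misses /two, and picks up the commented-out /old
print(parsed_hits)  # finds /one and /two only
```

You could keep patching the pattern for each variation, but that is how you end up with the "hundreds of patterns" from the question. One XPath query per kind of data is usually enough.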
