I am trying to scrape some data from a page with a table based layout. So, to get some of the data I need to get something like 3rd table inside 2nd table inside 5th table inside 1st table inside body. I am trying to use enlive, but cannot figure out how to use nth-of-type and other 开发者_运维问答selector steps. To make matters worse, the page in question has a single top level table inside the body, but (select data [:body :> :table]) returns 6 results for some reason. What the hell am I doing wrong?
For nth-of-type
, does the following example help?
user> (require '[net.cgrand.enlive-html :as html])
user> (def test-html
"<html><head></head><body><p>first</p><p>second</p><p>third</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
[[:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["second"]})
No idea about the second issue. Your approach seems to work with a naive test:
user> (def test-html "<html><head></head><body><div><p>in div</p></div><p>not in div</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html)) [:body :> :p])
({:tag :p, :attrs nil, :content ["not in div"]})
Any chance of looking at your actual HTML?
Update: (in response to the comment)
Here's another example where "the second <p>
inside the <div>
inside the second <div>
inside whatever" is returned:
user> (def test-html "<html><head></head><body><div><p>this is not the one</p><p>nor this</p><div><p>or for that matter this</p><p>skip this one too</p></div></div><span><p>definitely not this one</p></span><div><p>not this one</p><p>not this one either</p><div><p>not this one, but almost</p><p>this one</p></div></div><p>certainly not this one</p></body></html>")
#'user/test-html
user> (html/select (html/html-resource (java.io.StringReader. test-html))
[[:div (html/nth-of-type 2)] :> :div :> [:p (html/nth-of-type 2)]])
({:tag :p, :attrs nil, :content ["this one"]})
精彩评论