Assume I am using a recursive loop for resilient discovery and location of DOM element(s), one that will work across the semi-structured, semi-uniform HTML documents of a website. For example, when crawling links on a website I come across small variations in a link's XPath location. Resilience is desired to allow flexible, uninterrupted crawling.
1) I know that I want a link which is located in a certain region of the page, distinguishable from the rest (e.g. a menu, footer, header, etc.).
2) It's distinguishable since it appears to be inside a table and paragraph, or a similar container.
3) There can be an acceptable number of unexpected parents or children before the desired link mentioned in 1), but I don't know how many. More unexpected elements would mean a departure from 1).
4) Identifying the element via its id, class, or any other unique attribute value is not desired.
I think the following XPath should sum it up:
`//p/table/tr/td/a`
On some pages there are variations to the XPath, but it still qualifies as the desired link from 1):
`//p/div/table/tr/td/a`
or `//p/div/span/span/table/tr/td/b/a`
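To make the problem concrete, here is a small sketch (my own illustration, using Python's built-in ElementTree on markup assumed to have been tidied into well-formed XML) showing that the exact path only matches the first variant, while a descendant-axis (`//`) version tolerates the intermediate elements in all three:

```python
import xml.etree.ElementTree as ET

# The three page variants from the question, wrapped in a body element.
variants = [
    "<p><table><tr><td><a href='#'>x</a></td></tr></table></p>",
    "<p><div><table><tr><td><a href='#'>x</a></td></tr></table></div></p>",
    "<p><div><span><span><table><tr><td><b><a href='#'>x</a></b>"
    "</td></tr></table></span></span></div></p>",
]

strict, relaxed = [], []
for markup in variants:
    root = ET.fromstring("<body>%s</body>" % markup)
    # Exact parent/child chain: breaks as soon as a div or span appears.
    strict.append(len(root.findall("./p/table/tr/td/a")))
    # Descendant axis: absorbs the unexpected intermediate elements.
    relaxed.append(len(root.findall(".//p//table//tr/td//a")))

print(strict, relaxed)  # the strict path matches only the first variant
```

(ElementTree's limited XPath requires the leading `.` for relative paths; a full XPath engine would accept `//p//table//tr/td//a` directly.)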
I have used indentation to mimic each loop iteration.
(Should I use plural or singular? Children vs. child, parents vs. parent. I think singular makes sense, as the immediate parent or child is of concern here.)
TOP DOWN SEARCHING:
How many p's are there?
How many of these p's have a table as a child? If none, search the next sub-level.
How many of these tables have a tr as a child? If none, search the next sub-level.
How many of these tr's have a td as a child? If none, search the next sub-level.
How many of these td's have an a as a child?
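The top-down steps above can be sketched roughly as follows (a minimal illustration with stdlib ElementTree; `find_top_down` and `max_skips` are names I made up for this sketch, where `max_skips` is the number of unexpected sub-levels tolerated between each expected hop):

```python
import xml.etree.ElementTree as ET

def find_top_down(root, chain, max_skips=2):
    """Follow chain (e.g. ['p','table','tr','td','a']) from the top,
    tolerating up to max_skips unexpected levels between each hop."""
    frontier = root.findall(".//" + chain[0])  # all candidate p's
    for want in chain[1:]:
        nxt = []
        for node in frontier:
            level, skips = list(node), 0  # direct children first
            while level and skips <= max_skips:
                hits = [el for el in level if el.tag == want]
                if hits:
                    nxt.extend(hits)
                    break
                # If none, search the next sub-level down.
                level = [g for el in level for g in el]
                skips += 1
        frontier = nxt
    return frontier

doc = ET.fromstring(
    "<body><p><div><table><tr><td><a href='#'>x</a></td></tr>"
    "</table></div></p></body>"
)
links = find_top_down(doc, ["p", "table", "tr", "td", "a"])
print(len(links))  # the div between p and table is absorbed
```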
BOTTOM UP SEARCHING:
How many a's are there?
How many of these a's have a td as a parent? If none, look up to the next super-level.
How many of these td's have a tr as a parent? If none, look up to the next super-level.
How many of these tr's have a table as a parent? If none, look up to the next super-level.
How many of these tables have a p as a parent? If none, look up to the next super-level.
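The bottom-up direction can be sketched the same way (again my own illustration; ElementTree has no parent pointers, so a child-to-parent map is built first, and `max_skips` is the per-hop tolerance for unexpected parents):

```python
import xml.etree.ElementTree as ET

def find_bottom_up(root, chain, max_skips=2):
    """Start from the innermost tag in chain and confirm each expected
    ancestor, allowing up to max_skips unexpected parents per hop."""
    parent = {child: par for par in root.iter() for child in par}
    matches = []
    for node in root.findall(".//" + chain[-1]):  # all candidate a's
        cur, ok = node, True
        for want in reversed(chain[:-1]):
            skips, anc = 0, parent.get(cur)
            while anc is not None and anc.tag != want:
                skips += 1  # an unexpected parent: look one level up
                anc = parent.get(anc)
            if anc is None or skips > max_skips:
                ok = False
                break
            cur = anc
        if ok:
            matches.append(node)
    return matches

doc = ET.fromstring(
    "<body><p><div><table><tr><td><a href='#'>x</a></td></tr>"
    "</table></div></p></body>"
)
links = find_bottom_up(doc, ["p", "table", "tr", "td", "a"])
print(len(links))
```

One practical advantage of bottom-up: there are usually far fewer `a` candidates than there are arbitrary subtrees to descend into, and a candidate is rejected as soon as one expected ancestor is missing.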
Does it matter whether it's top down or bottom up? I feel that top down is useless and inefficient if it turns out, by the end of the loop, that the desired anchor link is not found.
I think I would also measure how many unexpected parents or children were discovered in each iteration of the loop and compare that to a preset constant I am comfortable with, e.g. no more than 2. If there are 3 or more unexpected parents or children before the discovery of my desired anchor link, I would assume it's not what I am looking for.
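That thresholding idea might look like this (a sketch; `unexpected_between` and `MAX_UNEXPECTED` are names invented here, counting every element on the link-to-p path that is not part of the expected p/table/tr/td/a chain). Note that under this whole-path count, the deepest variant above (`span/span/.../b`) accumulates 4 unexpected elements, so a budget of 2 would reject it; the constant has to be chosen with such pages in mind:

```python
import xml.etree.ElementTree as ET

EXPECTED = {"p", "table", "tr", "td", "a"}
MAX_UNEXPECTED = 2  # the preset constant I am comfortable with

def unexpected_between(link, parent):
    """Walk upward from the link to the enclosing p, counting elements
    that are not part of the expected chain."""
    count, cur = 0, parent.get(link)
    while cur is not None and cur.tag != "p":
        if cur.tag not in EXPECTED:
            count += 1
        cur = parent.get(cur)
    return count

counts = []
for markup in [
    "<p><div><table><tr><td><a href='#'>x</a></td></tr></table></div></p>",
    "<p><div><span><span><table><tr><td><b><a href='#'>x</a></b>"
    "</td></tr></table></span></span></div></p>",
]:
    doc = ET.fromstring("<body>%s</body>" % markup)
    parent = {c: p for p in doc.iter() for c in p}
    counts.append(unexpected_between(doc.find(".//a"), parent))

print(counts)
for n in counts:
    print("accept" if n <= MAX_UNEXPECTED else "reject")
```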
Is this the correct approach? This is just something I came up with off the top of my head. I apologize if this problem is not clear; I have tried my best. I would love some input on this algorithm.
Seems that you want something like:
//p//table//a
If you have limitations for the number of intermediate elements in the path, say not more than 2, then the above would be modified to:
//p[not(ancestor::*[3])]
//table[ancestor::*[1][self::p] or ancestor::*[2][self::p]]
/tr/td//a[ancestor::*[1][self::td] or ancestor::*[2][self::td]]
This selects all `a` elements whose parent or grandparent is a `td`, whose parent is a `tr`, whose parent is a `table`, whose parent or grandparent is a `p` that has fewer than 3 ancestor elements.
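The expression above can be checked end to end; a sketch assuming lxml is available (it provides full XPath 1.0, including the `ancestor::` axis, which the stdlib ElementTree does not):

```python
from lxml import etree  # assumption: lxml is installed

# The answer's expression, joined into one path.
XPATH = (
    "//p[not(ancestor::*[3])]"
    "//table[ancestor::*[1][self::p] or ancestor::*[2][self::p]]"
    "/tr/td//a[ancestor::*[1][self::td] or ancestor::*[2][self::td]]"
)

# table is one div below p: within the 2-element budget.
accepted = etree.fromstring(
    "<body><p><div><table><tr><td><b><a href='#'>x</a></b></td></tr>"
    "</table></div></p></body>"
)
# span/span pushes table more than 2 levels below p: rejected.
rejected = etree.fromstring(
    "<body><p><div><span><span><table><tr><td><a href='#'>x</a></td>"
    "</tr></table></span></span></div></p></body>"
)

n_ok = len(accepted.xpath(XPATH))
n_bad = len(rejected.xpath(XPATH))
print(n_ok, n_bad)
```

(On the reverse `ancestor::` axis, the positional predicate `[1]` means the nearest ancestor, i.e. the parent, which is what makes the "parent or grandparent" tests work.)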