Reasons why SeleniumRC CSS locators might be slower than XPath?

Developer https://www.devze.com 2023-02-25 03:37 Source: Internet

I've got some code that does a simulated recursion tree walk to scrape stuff from an HTML tree using SeleniumRC. I've run the code using both XPath and CSS locators.

The tree is represented as a series of nested tables. If it matters at all, some of the tree content starts out not visible because branches are "collapsed". For both XPath and CSS, the tree is in the same state in terms of visible vs. not visible.

To get node values, my code starts with a "root" expression, adds "branch" tokens that can be incremented for each successive sibling node, and then uses a "node" token to get the text content.

It all works, but it is much slower with the CSS expressions I've come up with.

I suppose it is a kludgy way to make locator expressions, although it works for my purposes. I'm just trying to figure out how best to use CSS to get closer to the times I see with XPath.

The loop tests many invalid expressions (keeps looking for nth sibling until not found) and the expressions get really long, due to the way I am incrementally drilling further and further into nested tables.
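
To make the cost of that probing concrete, here is a minimal Java sketch of the sibling loop. It is not the actual code behind this question: `probeSiblings` and the `isPresent` predicate are made-up names, and the predicate only stands in for whatever call asks the browser whether a locator matches (Selenium RC's `isElementPresent`, for example). The level token is passed in with the same `{branchIncrement}` placeholder used in the tokens further down.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

class SiblingProbeSketch {
    // Probes branch indices 2, 3, 4, ... under the given prefix until a
    // locator is not found. Returns {siblingsFound, lookupsIssued}.
    static int[] probeSiblings(String prefix, String levelToken, Predicate<String> isPresent) {
        int found = 0;
        int lookups = 0;
        for (int index = 2; ; index++) {
            String locator = prefix + levelToken.replace("{branchIncrement}", Integer.toString(index));
            lookups++;
            if (!isPresent.test(locator)) {
                break; // the first miss ends the scan for this group of siblings
            }
            found++;
        }
        return new int[] {found, lookups};
    }

    public static void main(String[] args) {
        // Fake "page": a set of locators that are considered present.
        String root = "//*[contains(@id,\"treeBox\")]/div";
        String level = "/table/tbody/tr[{branchIncrement}]/td[2]";
        Set<String> page = new HashSet<>();
        for (int i = 2; i <= 4; i++) {
            page.add(root + level.replace("{branchIncrement}", Integer.toString(i)));
        }
        int[] result = probeSiblings(root, level, page::contains);
        System.out.println(result[0] + " siblings, " + result[1] + " lookups");
    }
}
```

With three siblings present, the loop issues four lookups: every group of siblings costs one extra lookup that is bound to fail, and each lookup is evaluated against the full, ever-longer expression.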

Below are the bits of expression and the examples that come from the recursion. If anyone can provide some insight into what I am doing that makes CSS take so much longer than XPath, that would be very helpful.

I am a total newb at this kind of manipulation of HTML content, so if you see something dumb in how I've moved from XPath to CSS, please say so.

XPath “tokens”:

final String rootbase = "//*[contains(@id,\"treeBox\")]/div";
// in next string, "{branchIncrement}" will be replaced with integer values from 2 to get to text content, and skip graphical elements
final String leveltoken = "/table/tbody/tr[{branchIncrement}]/td[2]";
final String nodetoken = "/table/tbody/tr/td[4]/span";

CSS “tokens”:

final String rootbase = "css=[id*=treeBox]>div";
// in next string, "{branchIncrement}" will be replaced with integer values from 2 to get to text content, and skip graphical elements
final String leveltoken = ">table>tbody>tr:nth-child({branchIncrement})>td:nth-child(2)";
final String nodetoken = ">table>tbody>tr>td:nth-child(4)>span";

The first XPath expression for the content at the "root" is:

//*[contains(@id,"treeBox")]/div/table/tbody/tr[2]/td[2]/table/tbody/tr/td[4]/span

The last XPath expression for a 40 node tree with four levels, three siblings at each level below the root (1+3+3x3+3x3x3) is:

//*[contains(@id,"treeBox")]/div/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[3]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr/td[4]/span

The first CSS expression is:

[id*=treeBox]>div>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr>td:nth-child(4)>span

The last CSS expression is:

[id*=treeBox]>div>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(3)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr>td:nth-child(4)>span
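
For reference, the way these expressions are assembled can be sketched in Java roughly as follows. The token strings are copied from above; the `build` helper and the list of branch indices are hypothetical and only illustrate the concatenation, not the real tree walk.

```java
import java.util.Arrays;
import java.util.List;

class LocatorBuilder {
    // Tokens copied from the question.
    static final String XPATH_ROOT = "//*[contains(@id,\"treeBox\")]/div";
    static final String XPATH_LEVEL = "/table/tbody/tr[{branchIncrement}]/td[2]";
    static final String XPATH_NODE = "/table/tbody/tr/td[4]/span";

    static final String CSS_ROOT = "css=[id*=treeBox]>div";
    static final String CSS_LEVEL = ">table>tbody>tr:nth-child({branchIncrement})>td:nth-child(2)";
    static final String CSS_NODE = ">table>tbody>tr>td:nth-child(4)>span";

    // Hypothetical helper: one level token per branch index, then the node token.
    static String build(String root, String level, String node, List<Integer> path) {
        StringBuilder sb = new StringBuilder(root);
        for (int index : path) {
            sb.append(level.replace("{branchIncrement}", Integer.toString(index)));
        }
        return sb.append(node).toString();
    }

    public static void main(String[] args) {
        // Root node: a single level token with index 2.
        System.out.println(build(XPATH_ROOT, XPATH_LEVEL, XPATH_NODE, Arrays.asList(2)));
        // Deepest node in the example: indices 2, 2, 3, 2, 2.
        System.out.println(build(CSS_ROOT, CSS_LEVEL, CSS_NODE, Arrays.asList(2, 2, 3, 2, 2)));
    }
}
```

Every level adds a fixed-size chunk to the locator, so the string handed to the locator engine grows linearly with depth.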


In Firefox, Selenium RC's XPath locators are processed by the browser's native XPath engine, while the CSS locators are processed by a JavaScript library (Dean Edwards' cssQuery.js). Later Selenium releases (e.g., the 2.0b* series) use jQuery's Sizzle library for CSS, but they still do it in JavaScript. On top of that implied difference in speed, you're doing pattern-matching in the root expression (i.e., [id*=treeBox]), which requires enumerating the entire DOM tree to locate the matches, even before you descend from there. Think about how you'd write that in pure JavaScript and you'll start to see the problem.
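
As a rough illustration of that enumeration, here is a toy model written in Java (to match the code in the question) rather than JavaScript. It is not how cssQuery or Sizzle is actually implemented; the `Node` class, `scan`, and `sampleTree` are invented for this sketch. It only shows that a substring test on `id` has to look at every element before the descent can even begin.

```java
import java.util.ArrayList;
import java.util.List;

class ContainsScanSketch {
    // Toy stand-in for a DOM element; not a browser or Selenium type.
    static class Node {
        final String id; // may be null, like an element without an id attribute
        final List<Node> children = new ArrayList<>();

        Node(String id) { this.id = id; }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Emulates [id*=fragment]: an id-to-element lookup table cannot answer
    // a substring test, so every node is inspected.
    // Returns how many nodes were visited.
    static int scan(Node node, String fragment, List<Node> matches) {
        int visited = 1;
        if (node.id != null && node.id.contains(fragment)) {
            matches.add(node);
        }
        for (Node child : node.children) {
            visited += scan(child, fragment, matches);
        }
        return visited;
    }

    // Small made-up page: five elements, one of which is the tree container.
    static Node sampleTree() {
        return new Node(null)
            .add(new Node("header"))
            .add(new Node("ns_7_treeBox").add(new Node(null)))
            .add(new Node("footer"));
    }

    public static void main(String[] args) {
        List<Node> matches = new ArrayList<>();
        int visited = scan(sampleTree(), "treeBox", matches);
        System.out.println(matches.size() + " match after visiting " + visited + " nodes");
    }
}
```

On a real page the visit count is the size of the whole document, and in this setup that work is repeated for every locator the loop evaluates.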

If it makes you feel any better, IE still doesn't have a native XPath implementation, so Selenium uses one of several JavaScript implementations in that browser, and it's anywhere from one-half to one-tenth the speed of XPath in Firefox 3.6 because of that.

Long answer short, there's not much you can do to make CSS locators faster in this particular case.


Usually, it's not something you can help. The XPath selector mechanism in Selenium makes use of the browser's XPath tools. Even IE6 has one of those. I'm not aware of a browser that provides CSS selector tools through JavaScript, so Selenium has to use its own code. As their code is all JavaScript and internal browser XPath parsing is usually done in native code, it's much slower (especially in IE6).


Thanks for that feedback. After reading your note, I wondered if I could get a substantial improvement by using a tiny bit of code to resolve a literal id value to replace the contains expression used repeatedly.

Here are four different locators I've used for the same thing. Two of the locators are XPath, and two are CSS. For each of those pairs, one uses a contains expression, and one resolves to a literal first. In each case, the example locators are for the last node of a three level 1307 node tree.

XPath with contains:

//*[contains(@id,"treeBox")]/div/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[26]/td[2]/table/tbody/tr/td[4]/span

XPath where literal replaces contains expression:

id('ns_7_5R4GAB1A0GKQ50IQJQR7VV10M6__treeBox')/div/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td[2]/table/tbody/tr[24]/td[2]/table/tbody/tr/td[4]/span

CSS with contains:

css=[id*=treeBox]>div>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(24)>td:nth-child(2)>table>tbody>tr>td:nth-child(4)>span

CSS where literal replaces contains expression:

css=[id=ns_7_5R4GAB1A0GKQ50IQJQR7VV10M6__treeBox]>div>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(2)>td:nth-child(2)>table>tbody>tr:nth-child(24)>td:nth-child(2)>table>tbody>tr>td:nth-child(4)>span
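
The post does not show the "tiny bit of code" used to resolve the literal id, so the following Java sketch is only one plausible way to do it. It assumes a single up-front attribute lookup using Selenium RC's attribute-locator form (an element locator followed by `@` and the attribute name). The lookup is passed in as a function so the sketch runs without a browser; `resolveId`, `xpathRoot`, and `cssRoot` are made-up names.

```java
import java.util.function.Function;

class LiteralRootResolver {
    static final String CONTAINS_ROOT = "//*[contains(@id,\"treeBox\")]";

    // attributeLookup stands in for a call such as selenium.getAttribute(...)
    // in Selenium RC (assumption: the real code may have done this differently).
    // The expensive contains() match is paid once here instead of on every locator.
    static String resolveId(Function<String, String> attributeLookup) {
        return attributeLookup.apply(CONTAINS_ROOT + "@id");
    }

    // Root expressions in the two literal styles shown above.
    static String xpathRoot(String id) {
        return "id('" + id + "')/div";
    }

    static String cssRoot(String id) {
        return "css=[id=" + id + "]>div";
    }

    public static void main(String[] args) {
        // Fake lookup that returns the id value quoted in this post.
        Function<String, String> fake = locator -> "ns_7_5R4GAB1A0GKQ50IQJQR7VV10M6__treeBox";
        String id = resolveId(fake);
        System.out.println(xpathRoot(id));
        System.out.println(cssRoot(id));
    }
}
```

The resolved roots then replace the contains-based root token, and the level and node tokens are appended exactly as before.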

Working with two different sized trees, one 102 nodes, the other 1307 nodes, I found the following.

102 nodes:
      | contains  | literal   |
XPath | 15 sec.   | 13 sec.   |
CSS   | 19 sec.   | 19 sec.   |

1307 nodes:
      | contains  | literal   |
XPath | 255 sec.  | 145 sec.  |
CSS   | 1893 sec. | 1811 sec. |

Clearly, a native implementation (XPath on Firefox with Selenium RC) is much faster than a JavaScript implementation. The trade-off is that it might not work as well across browsers.
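
The ratios behind that conclusion can be worked out directly from the two tables. The sketch below only does the arithmetic on the timings reported above; the class and method names are made up for illustration.

```java
import java.util.Locale;

class TimingRatios {
    // Seconds reported above: {XPath contains, XPath literal, CSS contains, CSS literal}.
    static final double[] SMALL = {15, 13, 19, 19};       // 102-node tree
    static final double[] LARGE = {255, 145, 1893, 1811}; // 1307-node tree

    // How many times slower CSS was than XPath for the same locator style.
    static double slowdown(double cssSeconds, double xpathSeconds) {
        return cssSeconds / xpathSeconds;
    }

    static String line(String label, double css, double xpath) {
        return String.format(Locale.ROOT, "%s: CSS is %.1fx slower", label, slowdown(css, xpath));
    }

    public static void main(String[] args) {
        System.out.println(line("102 nodes, contains", SMALL[2], SMALL[0]));
        System.out.println(line("102 nodes, literal", SMALL[3], SMALL[1]));
        System.out.println(line("1307 nodes, contains", LARGE[2], LARGE[0]));
        System.out.println(line("1307 nodes, literal", LARGE[3], LARGE[1]));
    }
}
```

On the small tree CSS is roughly 1.3 to 1.5 times slower than XPath; on the large tree the gap widens to roughly 7.4 times (contains) and 12.5 times (literal). Resolving the literal id cuts the XPath time on the large tree by about 43% (255 to 145 sec.) but the CSS time by only about 4% (1893 to 1811 sec.).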
