I have continually had problems with Html Agility Pack; my XPath queries only ever work when they are extremely simple:
//*[@id='some_id']
or
//input
However, anytime they get more complicated, then Html Agility Pack can't handle it. Here's an example demonstrating the problem, I'm using WebDriver to navigate to Google, and return the page source, which is passed to Html Agility Pack, and both WebDriver and HtmlAgilityPack attempt to locate the element/node (C#):
//The XPath query
const string xpath = "//form//tr[1]/td[1]//input[@name='q']";
//Navigate to Google and get page source
var driver = new FirefoxDriver(new FirefoxProfile()) { Url = "http://www.google.com" };
Thread.Sleep(2000);
//Can WebDriver find it?
var e = driver.FindElementByXPath(xpath);
Console.WriteLine(e!=null ? "Webdriver success" : "Webdriver failure");
//Can Html Agility Pack find it?
var source = driver.PageSource;
var htmlDoc = new HtmlDocument { OptionFixNestedTags = true };
htmlDoc.LoadHtml(source);
var node开发者_运维技巧s = htmlDoc.DocumentNode.SelectNodes(xpath);
Console.WriteLine(nodes!=null ? "Html Agility Pack success" : "Html Agility Pack failure");
driver.Quit();
In this case, WebDriver successfully located the item, but Html Agility Pack did not.
I know, I know, in this case it's very easy to change the xpath to one that will work: //input[@name='q'], but that will only fix this specific example, which isn't the point, I need something that will exactly or at least closely mirror the behavior of WebDriver's xpath engine, or even the FirePath or FireFinder add-ons to Firefox.
If WebDriver can find it, then why can't Html Agility Pack find it too?
The issue you're running into is with the FORM element. HTML Agility Pack handles that element differently - by default, it will never report that it has children.
In the particular example you gave, this query does find the target element:
.//div/div[2]/table/tr/td/table/tr/td/div/table/tr/td/div/div[2]/input
However, this does not, so it's clear the form element is tripping up the parser:
.//form/div/div[2]/table/tr/td/table/tr/td/div/table/tr/td/div/div[2]/input
That behavior is configurable, though. If you place this line prior to parsing the HTML, the form will give you child nodes:
HtmlNode.ElementsFlags.Remove("form");
精彩评论