
Can't get proper information from amazon.com using C#/HtmlAgilityPack

Developer https://www.devze.com 2023-02-28 20:22 Source: Internet
I want to get book information such as author name, page count, publication year, etc. from Amazon using HtmlAgilityPack, but the Amazon pages seem to have some problems and I can't access the appropriate fields.

Here is what I've done:

I use Firefox with Firebug + FirePath to retrieve the desired XPath, then in my code I tell HtmlAgilityPack to get the information using that XPath. No luck so far: I still can't access the "Product Details" part of the amazon.com page.

This is my XPath, which works only with HtmlAgilityPack:

HtmlAgilityPack.HtmlNodeCollection cnt = doc.DocumentNode.SelectNodes("//*[@class='content']");
int i=1;
foreach (HtmlAgilityPack.HtmlNode content in cnt)
{
    if (i != 3)
    {
        i++;
        continue;
    }
    if (i == 3) // i==3 means I've reached the product details but I can't go any further :(
    {

        s = content.SelectSingleNode("").OuterHtml;

      //  break;
    }

}

How can I access Product Details using an appropriate, understandable XPath for HtmlAgilityPack?

And why is the XPath syntax from Firebug + FirePath different from what HtmlAgilityPack expects?


As @Mystere said, I suggest using the API. But if you are doing this for testing, or simply because you want to scrape the page (I'm not sure whether Amazon allows it, so check before you do), here is the thing:

Why are you doing this?

s = content.SelectSingleNode("").OuterHtml;

If you want the HTML source of that part of the page, this is what you are looking for:

s = content.OuterHtml;
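
As a side note, the counting loop from the question can probably be replaced by indexing the collection directly. The following is only a minimal sketch: the sample markup and the class name ThirdBlockSketch are made up for illustration, and it assumes the Product Details block really is the third match, as in the question.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class ThirdBlockSketch
{
    static void Main()
    {
        // Hypothetical markup: three elements with class 'content',
        // the third one standing in for the Product Details block.
        const string html =
            "<div class='content'>first</div>" +
            "<div class='content'>second</div>" +
            "<div class='content'><ul><li>Paperback: 320 pages</li></ul></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cnt = doc.DocumentNode.SelectNodes("//*[@class='content']");
        if (cnt == null || cnt.Count < 3)
        {
            Console.WriteLine("Fewer than three 'content' blocks found.");
            return;
        }

        // Index 2 is the third match; no counter variable or loop needed.
        string s = cnt[2].OuterHtml;
        Console.WriteLine(s);
    }
}
```

Relying on the position of the match is fragile if the page layout changes, which is why a more specific XPath is usually the better choice.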

When you are scraping, I suggest you first identify the part you need to scrape and look at what is particular about that block of content.

If you use:

var node = doc.DocumentNode.SelectSingleNode("//td[@class='bucket']/div[@class='content']");

that will give you the Product Details block you are looking for. If you want to get some fields like Paperback, Publisher, ... you can do:

string paperback = node.SelectSingleNode("./ul/li[1]/text()").InnerText;
string publisher = node.SelectSingleNode("./ul/li[2]/text()").InnerText;
string language = node.SelectSingleNode("./ul/li[3]/text()").InnerText;
...
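
To try the idea without hitting Amazon, here is a rough, self-contained sketch. The HTML fragment, the field values, and the class name ProductDetailsSketch are invented for illustration; only the XPath expressions come from the answer above. The real page may be structured differently, so treat this as a starting point rather than the actual layout.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class ProductDetailsSketch
{
    static void Main()
    {
        // Hypothetical markup modeled on the XPath used in the answer.
        const string html = @"
<table><tr><td class='bucket'>
  <div class='content'>
    <ul>
      <li><b>Paperback:</b> 320 pages</li>
      <li><b>Publisher:</b> Example Press</li>
      <li><b>Language:</b> English</li>
    </ul>
  </div>
</td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns one node, or null if nothing matches.
        var content = doc.DocumentNode.SelectSingleNode(
            "//td[@class='bucket']/div[@class='content']");
        if (content == null)
        {
            Console.WriteLine("Product Details block not found.");
            return;
        }

        // SelectNodes also returns null when there is no match.
        var items = content.SelectNodes("./ul/li");
        if (items == null)
        {
            Console.WriteLine("No list items found.");
            return;
        }

        foreach (var li in items)
        {
            var label = li.SelectSingleNode("./b");      // e.g. "Paperback:"
            var value = li.SelectSingleNode("./text()"); // first text node of the <li>
            if (label == null || value == null)
                continue;

            Console.WriteLine("{0} {1}",
                label.InnerText.Trim(),
                HtmlEntity.DeEntitize(value.InnerText).Trim());
        }
    }
}
```

Looping over the li elements instead of hard-coding li[1], li[2], li[3] means the code does not depend on the fields always appearing in the same order.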

To make sure the XPath you use will work with HtmlAgilityPack, open the page in Internet Explorer 8 (or 9) and use the Developer Tools (F12) to get the XPath. The thing is that each browser builds its own tree from the HTML. For example, Firefox always shows a <tbody> tag right after a <table>, even when the source has none, while HtmlAgilityPack may not add it, so the simple detail of including /tbody/ in your XPath can make your program fail.
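
The <tbody> point can be checked with a small experiment. This sketch assumes, as far as I know, that HtmlAgilityPack parses the markup as served and does not insert an implied <tbody>. The table, its id, and the class name TbodySketch are made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class TbodySketch
{
    static void Main()
    {
        // Hypothetical source markup that, like many real pages, has no <tbody> tag.
        const string html = "<table id='details'><tr><td>Paperback</td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath as a browser tool would typically report it (with /tbody/).
        var browserStyle = doc.DocumentNode.SelectSingleNode(
            "//table[@id='details']/tbody/tr/td");

        // XPath that follows the raw source (no /tbody/).
        var sourceStyle = doc.DocumentNode.SelectSingleNode(
            "//table[@id='details']/tr/td");

        // Descendant axis: matches whether or not a <tbody> is present.
        var tolerant = doc.DocumentNode.SelectSingleNode(
            "//table[@id='details']//td");

        Console.WriteLine("with /tbody/:    " + Describe(browserStyle));
        Console.WriteLine("without /tbody/: " + Describe(sourceStyle));
        Console.WriteLine("using //td:      " + Describe(tolerant));
    }

    // Reports the node's text, or notes that the XPath matched nothing.
    static string Describe(HtmlNode node)
    {
        return node == null ? "no match" : node.InnerText;
    }
}
```

If the first expression reports no match while the other two find the cell, the XPath copied from the browser is the culprit, and the `//td` form is the safer one to use.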


Why don't you just use Amazon's web service API, which is designed to do this?
