HTML scraping using YQL

Developer (https://www.devze.com), 2023-03-16 04:57. Source: Internet

Related topics: web-scraping, yql

I am trying to use YQL to scrape some websites. When I test various queries in the YQL console, I get an empty results node. For example, when I run:

select * from html where url="http://www.reverbnation.com/" and xpath='/html/body'

I get an empty <results /> node (permalink). Thanks in advance!

http://www.reverbnation.com may be blocking the request coming from Yahoo! based on certain criteria, like headers. I had a look at reverbnation's robots.txt, and they aren't blocking Yahoo! based on the "Yahoo Pipes 2.0" user agent, so it must be something else.

To re-create the issue, make a YQL query against your own site, then look at the full access logs to see the full request and all headers that came from Yahoo! Then make a similar request using a tool like cURL.
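As a rough illustration of the replay step, here is a minimal Python sketch that does the same job as cURL: it sends a GET request with a chosen set of headers and reports the status code and body size. The function names build_request and fetch_status are hypothetical, and the headers you pass in should be the ones you actually find in your own access logs. Only the "Yahoo Pipes 2.0" user agent string comes from the discussion above; anything else is an assumption.

```python
import urllib.error
import urllib.request


def build_request(url, headers):
    """Build a GET request carrying the headers copied from the access log."""
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_status(request, timeout=10.0):
    """Send the request and return (HTTP status, response body length)."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, len(response.read())
    except urllib.error.HTTPError as err:
        # A 4xx/5xx reply is still useful: it shows how the site treats
        # this particular combination of headers.
        return err.code, len(err.read())


# Example call (not executed here; replace the headers with the logged ones):
# req = build_request("http://www.reverbnation.com/",
#                     {"User-Agent": "Yahoo Pipes 2.0"})
# print(fetch_status(req))
```

Comparing the result for the logged headers against the result for a normal browser user agent should indicate whether the site responds differently to the two.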

You can also run netcat listening on a port and point the query at http://yoursite.com:PORT to see the full request.
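If netcat is not available, a small Python server can play the same role. The sketch below is an alternative to netcat, not something from the original answer: it prints the request line and every header of each incoming GET request. The names EchoHeadersHandler and start_server are hypothetical, and the host and port defaults are assumptions to adjust for your own machine.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeadersHandler(BaseHTTPRequestHandler):
    """Records and prints the request line and headers of every GET request."""

    seen = []  # one (request line, {lowercased header name: value}) per request

    def do_GET(self):
        headers = {name.lower(): value for name, value in self.headers.items()}
        EchoHeadersHandler.seen.append((self.requestline, headers))
        print(self.requestline)
        for name, value in headers.items():
            print(f"{name}: {value}")
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default per-request log line


def start_server(host="127.0.0.1", port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), EchoHeadersHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

To receive a request from outside the machine, the server would need to listen on a public interface, for example start_server("0.0.0.0", 8080), with the query URL using the same port.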

Related issue discussed here.