开发者

AppEngine fetch through a free proxy

开发者 https://www.devze.com 2022-12-16 07:31 出处:网络
My (Python) AppEngine program fetches a web page from another site to scrape data from it -- but it seems like the 3rd party site is blocking requests from Google App Engine! -- I can fetch the page f

My (Python) AppEngine program fetches a web page from another site to scrape data from it -- but it seems like the 3rd party site is blocking requests from Google App Engine! -- I can fetch the page from development mode, but not when deployed.

Can I g开发者_JS百科et around this by using a free proxy of some sort?

Can I use a free proxy to hide the fact that I am requesting from App Engine?

How do I find/choose a proxy? -- what do I need? -- how do I perform the fetch?

Is there anything else I need to know or watch out for?


Probably the correct approach is to request permission from the owners of the site you are scraping.

Even if you use a proxy, there is still a big chance that requests coming through the proxy will end up blocked as well.


Have you considered changing the user-agent?

result = urlfetch.fetch(u,headers = {'User-Agent': "Mozilla/5.0"},allow_truncated=True) 

The API will always append "AppEngine-Google;" to the user-agent, but this might work if the restriction is not based on a IP address range.


What you are talking about is a valid bug in app engine sdk. Have a look at http://code.google.com/p/googleappengine/issues/detail?id=544 for bug updates, and workarounds for java and python.


I'm currently having the same problem and i was thinking about this solution (not yet tried) :

-> develop an app that fetch what you want -> run it locally -> fetch your local server from your initial

so the proxy is your computer which you know as not blocked

Let me know if it's works !


Well to be fair, if they don't want you doing that then you probably shouldn't. It's not nice to be mean.

But if you really want to do it, the best approach would be creating a simple proxy script and running it on a VPS or some computer with a decent enough connection.

Basically you expose a REST API from your server to your GAE, then the server just makes all the same requests it gets to the target site and returns the output.

0

精彩评论

暂无评论...
验证码 换一张
取 消