
Could not scrape data in English, help!

开发者 https://www.devze.com 2023-03-12 02:10 Source: web

I have a website that I'm trying to scrape using Python & BeautifulSoup. The site itself can be viewed in 2 languages (Thai or English); all you have to do is click on either the Thai or UK flag in the upper right corner of the screen and the data is displayed in the selected language. When it comes to the script, though, I can only scrape the data in Thai (which is the default language), and I couldn't figure out how to get the data in English because the URL doesn't change when you click on either the Thai or UK flag. Looking at the source for the page, there is no href associated with either flag. I turned on Firebug tracing and tried to search for something to give me a clue but haven't found anything (then again, you'd have to know exactly what to look for in order to know what's going on, and that's my problem).

Thanks, Glenn


You haven't said what the site is, so it's impossible to answer for sure, but here are a couple of suggestions. If the URL does not change when you click the flag, then one of the following is happening:

a) The English is already in the HTML document, and the relevant content is being switched with JavaScript.
b) The English content is being fetched via an Ajax request and JavaScript is being used to edit the DOM.
c) The page fully reloads with English content.

Presumably in all these cases the language preference must be stored either server-side in the session or client-side with cookies.

As a first test, try turning off cookies and JavaScript to see what happens. Then, with cookies and JavaScript back on, use Firebug or Firefox to view the network requests being made.


Here's the cookie:

Cookie  verify=test; LangName=th; ASP.NET_SessionId=ylulkp45qpjq2b453nurlp55; _cbclose=1; _cbclose30246=1; _uid30246=66B70BE9.1; _ctout30246=1

If you change the language, it sets LangName=en.

urllib2 can be used in conjunction with cookielib to store and reuse cookies.
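As a rough sketch of that idea, the snippet below pre-loads a cookie jar with LangName=en and builds an opener that sends it with every request. It uses the Python 3 names for these modules (urllib.request and http.cookiejar, which replaced urllib2 and cookielib). The domain example.com and the helper names make_language_cookie and build_english_opener are placeholders, since the real site was never named. It also assumes the server picks the language from the cookie alone; if the preference is tied to the ASP.NET session instead, this may not be enough.

```python
import http.cookiejar
import urllib.request


def make_language_cookie(domain, lang="en"):
    """Build a version-0 (Netscape-style) LangName cookie for `domain`.

    `domain` is a placeholder for the real site's host name.
    """
    return http.cookiejar.Cookie(
        version=0, name="LangName", value=lang,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def build_english_opener(domain):
    """Return (opener, jar); the jar already holds LangName=en.

    Cookies the server sets later (such as the session id) are stored
    in the same jar and sent back on subsequent requests.
    """
    jar = http.cookiejar.CookieJar()
    jar.set_cookie(make_language_cookie(domain, "en"))
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar
```

With the real host name in place of example.com, something like opener.open("http://example.com/page").read() should return the HTML to hand to BeautifulSoup. If the page still comes back in Thai, the language is probably being switched by one of the other mechanisms listed above.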
