I have some ideas for building a more intelligent web spider, one that interacts with a web page and extracts information in a manner closer to how we humans do.
To do this I need a representation of a web page that is similar or identical to the one we see in our browsers.
In other words, I need access to data about the location, colour and style of all the elements on the page, possibly at a pixel level.
But I don't want just a rendered bitmap; I want to be able to extract text, click links, push buttons and so on.
I get the feeling the DOM may be a starting point, but more concrete advice would be appreciated.
To clarify: I want programmatic access to web pages in a form similar to what a browser presents, so that I can, for example, check the colour or text at a specific pixel location or region.
You might want to check out Selenium (or other ways of scripting your browser, such as Greasemonkey). Since how a web page is displayed depends quite a bit on the particular browser, scripting one is the most precise way of getting at what the user actually sees.
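For instance, a minimal sketch using Selenium's Python bindings (this assumes a current Selenium install plus Pillow for pixel inspection; the URL, link text and pixel coordinates are placeholders):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from PIL import Image

driver = webdriver.Firefox()  # any supported browser driver works
driver.get("https://example.com")  # placeholder URL

# Geometry of a rendered element, in CSS pixels
link = driver.find_element(By.LINK_TEXT, "More information...")  # placeholder link text
print(link.location)  # e.g. {'x': 100, 'y': 240}
print(link.size)      # e.g. {'width': 120, 'height': 18}

# Computed style of the element, e.g. its colour
print(link.value_of_css_property("color"))

# For true pixel-level checks, screenshot the page and inspect pixels
driver.save_screenshot("page.png")
print(Image.open("page.png").getpixel((150, 250)))  # (r, g, b) tuple

# Interact with the page like a user would
link.click()

driver.quit()
```

This gives you both worlds: the DOM for structure, text and interaction, and a rendered screenshot for pixel-level questions.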