I'm trying to scrape a webpage using Selenium (in Python) that is almost entirely JavaScript.
For instance, this is the body of the page:
<body class="bodyLoading">
<!-- this is required for GWT history support -->
<iframe id="__gwt_historyFrame" role="presentation" width="0" height="0" tabindex="-1" title="empty" style="position:absolute;width:0;height:0;border:0" src="javascript:''"> </iframe>
<!-- For printing window contents -->
<iframe id="__printingFrame" role="presentation" width="0" height="0" tabindex="-1" title="empty" style="width:0;height:0;border:0;" />
<!-- TODO : RECOMMENDED if your web app will not function without JavaScript enabled -->
<noscript>
<div style="width: 22em; position: absolute; left: 50%; margin-left: -11em; color: red; background-color: white; border: 1px solid red; padding: 4px; font-family: sans-serif">
Your web browser must have JavaScript enabled in order for
Regulations.gov to display correctly.
</div>
</noscript>
</body>
For some reason, Selenium (using the Firefox engine) does not evaluate the JavaScript on this page. If I use the get_html_source function, it just returns the HTML above, not the JavaScript-generated HTML that I can see in my browser (and in the Selenium browser). And, unfortunately, the src attribute of the iframe just says javascript:'' which I can't figure out.
Any thoughts on how to make sure Selenium processes this iframe?
The iframes are separate documents, so you won't get their contents included in the HTML code for the main page; you have to read them separately.
You can do this using Selenium's select_frame function.
You can access a frame via its name, CSS selector, XPath reference, etc., as with other elements.
When you select the frame you change Selenium's context, so you can then access the frame's contents as if it were the current page.
If you have frames within frames, you can continue this process down through the frame tree.
Obviously, you need a way to get back up the frame path. Selenium provides this through the same select_frame function, with a parameter of either relative=up to move the context to the parent of the current frame, or relative=top to move to the main page in the browser.
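One way to keep each "step into a frame" paired with its "step back up" is a small context manager. This is only a sketch: it assumes sel is a Selenium RC client object exposing the select_frame function described above, and the helper name inside_frame and the frame names in the usage comment are made up for illustration.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(sel, locator):
    """Select a frame for the duration of a with-block, then step back up.

    sel     -- assumed to expose select_frame(locator), as discussed above
    locator -- name, CSS selector, XPath, etc. identifying the frame
    """
    sel.select_frame(locator)            # context moves into the frame
    try:
        yield sel
    finally:
        sel.select_frame("relative=up")  # context moves back to the parent


# Hypothetical usage with nested frames:
# with inside_frame(sel, "outerFrame"):
#     with inside_frame(sel, "innerFrame"):
#         html = sel.get_html_source()
```

Because the step back up sits in a finally clause, the context returns to the parent frame even if reading the frame raises an exception.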
So using this function you can navigate around the frames in a page.
You can't access them all at once; only one frame can be in context at a time, so you'll never be able to make a single get_html_source call and get all the frames' contents in one go. But you can navigate around the frames in the page within your Selenium script and get the HTML source for each frame separately.
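Collecting the source frame by frame could look roughly like the sketch below. It assumes sel is a Selenium RC client object with the select_frame and get_html_source functions mentioned above; the function name collect_frame_sources and the shape of frame_paths are my own choices for illustration, not part of Selenium.

```python
def collect_frame_sources(sel, frame_paths):
    """Return {path: html} for each frame path, one frame in context at a time.

    sel         -- assumed to expose select_frame() and get_html_source()
    frame_paths -- iterable of tuples of locators, outermost frame first
    """
    sources = {}
    for path in frame_paths:
        sel.select_frame("relative=top")   # always start from the main page
        for locator in path:               # descend one frame at a time
            sel.select_frame(locator)
        sources[tuple(path)] = sel.get_html_source()
    sel.select_frame("relative=top")       # leave the context on the main page
    return sources


# Hypothetical usage: one top-level frame, and one frame nested inside it.
# pages = collect_frame_sources(sel, [("outerFrame",), ("outerFrame", "innerFrame")])
```

Resetting to relative=top before each path costs a few extra calls, but it means every path is resolved from the main page and no path depends on where the previous one ended.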
Hope that helps.