Is there a way to access the IE DOM out of process? An example would be a web page scraper that reads the currently displayed page and grabs data from it. I have seen a few ways of downloading the page and processing it, but that will not work when websites return dynamic results and require a login.
I am hoping not to have to write a BHO to access the data and share it via WCF. I have seen some examples of grabbing the data using C++ and an MSAA server, but that does not really help me, as I would prefer not to use a C++ helper; I have not used C++ in years.
TIA.
Depending on how much stuff you need to do, you might want to consider using something simple like WatiN. It's a great tool for instantiating a browser instance and walking the tree. The DOM manipulation is quite easy and is well documented (with lots of examples on the web).
If you are only doing scraping and requests, you would probably be best off using the WebRequest object that ships with .NET to do your work.
WebRequest Class @ MSDN
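As a rough illustration of the WebRequest route, here is a minimal sketch that downloads a page as a string. The class name `PageFetcher` and the URL are placeholders invented for this example, and the sketch does nothing about logins or script-generated content, which is exactly the limitation raised in the question.

```csharp
using System;
using System.IO;
using System.Net;

static class PageFetcher
{
    // Downloads the raw response body for the given URL.
    public static string Fetch(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Placeholder URL, not taken from the original question.
        string html = Fetch("http://example.com/");
        Console.WriteLine("Downloaded {0} characters", html.Length);
    }
}
```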
However, if you must have exact access to what is represented in the IE DOM, you should use Microsoft Active Accessibility to gain access. Provided you can identify the window handle or reliable location for the target IE window, and it is visible in a user session, Active Accessibility is the best way to access the target IE window and dig into the DOM. It isn't absolutely necessary to use C++, but it will probably be easier to do most of this in C++.
Active Accessibility User Interface Services @ MSDN
You'll want to use EnumChildWindows to locate (or brute-force query) the DOM window, starting either from the desktop or from a frame window's handle retrieved by enumerating processes. In .NET, process enumeration is available through the System.Diagnostics.Process class.
EnumChildWindows @ MSDN
EnumWindows signature @ pinvoke.net
EnumChildWindows signature @ pinvoke.net
Process.GetProcesses() @ MSDN
Process.MainWindowHandle @ MSDN
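The enumeration step could be sketched roughly as follows. Two things here are assumptions that do not come from the text above: that the IE process is named `iexplore`, and that the window hosting the page uses the class name `Internet Explorer_Server`. The `IeWindowFinder` class and its method names are made up for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Text;

static class IeWindowFinder
{
    private delegate bool EnumWindowsProc(IntPtr hWnd, IntPtr lParam);

    [DllImport("user32.dll")]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool EnumChildWindows(
        IntPtr hWndParent, EnumWindowsProc callback, IntPtr lParam);

    [DllImport("user32.dll", CharSet = CharSet.Unicode)]
    private static extern int GetClassName(
        IntPtr hWnd, StringBuilder className, int maxCount);

    // Assumption: the window that renders the page has this class name.
    private const string IeServerClass = "Internet Explorer_Server";

    // Returns the first matching descendant window, or IntPtr.Zero if none is found.
    public static IntPtr FindIeServerWindow(IntPtr parent)
    {
        IntPtr found = IntPtr.Zero;
        EnumWindowsProc callback = (hWnd, lParam) =>
        {
            var name = new StringBuilder(256);
            GetClassName(hWnd, name, name.Capacity);
            if (name.ToString() == IeServerClass)
            {
                found = hWnd;
                return false; // stop enumerating
            }
            return true; // keep going
        };
        EnumChildWindows(parent, callback, IntPtr.Zero);
        GC.KeepAlive(callback); // keep the delegate alive for the native call
        return found;
    }

    static void Main()
    {
        // Assumption: IE runs under the process name "iexplore".
        foreach (Process p in Process.GetProcessesByName("iexplore"))
        {
            if (p.MainWindowHandle == IntPtr.Zero) continue;
            IntPtr hwnd = FindIeServerWindow(p.MainWindowHandle);
            Console.WriteLine("PID {0}: 0x{1:X}", p.Id, hwnd.ToInt64());
        }
    }
}
```

If a frame hosts several tabs, this only returns the first match; collecting every matching handle into a list would be the natural extension.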
To get the type declarations you need to walk the DOM in C# and to talk to MSAA, add a COM reference to 'Microsoft HTML Object Library' to your project, and add P/Invoke signatures for the MSAA functions.
AccessibleObjectFromWindow Signature @ pinvoke.net
Once you can call MSAA, retrieve an IDispatch through Active Accessibility from the window handle. You will want to send in OBJID_NATIVEOM, which will get you an IDispatch you can interrogate.
Retrieving an IAccessible Object @ MSDN
AccessibleObjectFromWindow() @ MSDN
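A minimal sketch of that call from C# might look like the following. The P/Invoke declaration is one possible marshaling of AccessibleObjectFromWindow (pinvoke.net lists variations), the `NativeOm` class and method name are invented for this example, and the constant values are the ones defined in the Windows SDK headers.

```csharp
using System;
using System.Runtime.InteropServices;

static class NativeOm
{
    // OBJID_NATIVEOM as defined in winuser.h.
    private const uint OBJID_NATIVEOM = 0xFFFFFFF0;

    // Standard interface ID of IDispatch.
    private static readonly Guid IID_IDispatch =
        new Guid("00020400-0000-0000-C000-000000000046");

    [DllImport("oleacc.dll")]
    private static extern int AccessibleObjectFromWindow(
        IntPtr hwnd,
        uint dwObjectID,
        ref Guid riid,
        [MarshalAs(UnmanagedType.IDispatch)] out object ppvObject);

    // Asks the window for its native object model as an IDispatch.
    // Throws if the HRESULT indicates failure.
    public static object GetNativeObjectModel(IntPtr hwnd)
    {
        Guid iid = IID_IDispatch;
        object result;
        int hr = AccessibleObjectFromWindow(hwnd, OBJID_NATIVEOM, ref iid, out result);
        if (hr < 0)
        {
            Marshal.ThrowExceptionForHR(hr);
        }
        return result;
    }
}
```

The `hwnd` argument is expected to be the handle of the window that hosts the DOM, not the top-level frame.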
From here, the IDispatch may be cast to IHTMLWindow2 or IHTMLDocument2 (and derivatives), which expose all of the DOM script model methods and more. Unfortunately I can't remember which one is returned via this method, but in any case, IHTMLWindow2 has the document property (same as window.document in script). Either can be resolved to provide access to the DOM, which is represented by IHTMLDocument2 and all derived interfaces.
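Since it is unclear which interface comes back, one hedged way to handle both cases is to try each cast in turn. This sketch assumes the project references 'Microsoft HTML Object Library' (the `mshtml` namespace); the `DomReader` class and its methods are hypothetical names, and `nativeOm` stands for whatever object the MSAA call returned.

```csharp
using System;
using mshtml;

static class DomReader
{
    // Resolves the returned object to IHTMLDocument2, whichever interface it is.
    public static IHTMLDocument2 ResolveDocument(object nativeOm)
    {
        var doc = nativeOm as IHTMLDocument2;
        if (doc != null) return doc;

        var window = nativeOm as IHTMLWindow2;
        if (window != null) return window.document; // same as window.document in script

        throw new InvalidCastException(
            "Object is neither IHTMLDocument2 nor IHTMLWindow2.");
    }

    // Prints a few properties to show that the live DOM is reachable.
    public static void Dump(object nativeOm)
    {
        IHTMLDocument2 doc = ResolveDocument(nativeOm);
        Console.WriteLine("Title: " + doc.title);
        Console.WriteLine("URL:   " + doc.url);
        foreach (IHTMLElement link in doc.links)
        {
            Console.WriteLine("Link:  " + link.getAttribute("href", 0));
        }
    }
}
```

Because this reads the document IE has already loaded, it sees the logged-in, script-modified page rather than a fresh download.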