I was wondering if someone could give me some guidance here. I'd like to be able to programmatically get every image on a webpage as quickly as possible. This is what I'm currently doing (note that clear is a WebBrowser control):
if (clear.ReadyState == WebBrowserReadyState.Complete)
{
    doc = (IHTMLDocument2)clear.Document.DomDocument;
    sobj = doc.selection;
    body = doc.body as HTMLBody;
    sobj.clear();
    range = body.createControlRange() as IHTMLControlRange;

    for (int j = 0; j < clear.Document.Images.Count; j++)
    {
        img = (IHTMLControlElement)clear.Document.Images[j].DomElement;
        HtmlElement ele = clear.Document.Images[j];
        string test = ele.OuterHtml;
        string test2 = ele.InnerHtml;

        // Select the image element and copy it to the clipboard.
        range.add(img);
        range.select();
        range.execCommand("Copy", false, null);

        // Pull the copied image back off the clipboard.
        Image image = Clipboard.GetImage();
        if (image != null)
        {
            temp = new Bitmap(image);
            Clipboard.Clear();
            // ......Rest of code ...........
        }
    }
}
However, I find this can be slow for a lot of images, and it also hijacks my clipboard. I was wondering if there is a better way?
I suggest using HttpWebRequest and HttpWebResponse. In your comment you asked about efficiency/speed.

From the standpoint of data being transferred, using HttpWebRequest will be at worst the same as using a browser control, but almost certainly much better. When you (or a browser) make a request to a web server, you initially get only the markup for the page itself. This markup may include image references, objects like Flash, and resources (like scripts and CSS files) that are referenced but not actually included in the page itself. A web browser will then proceed to request all the associated resources needed to render the page, but using HttpWebRequest you can request only the things you actually want (the images).
From the standpoint of resources or processing power required to extract entities from a page, there is no comparison: using a browser control is far more resource-intensive than scanning an HttpWebResponse. Scanning some data using C# code is extremely fast. Rendering a web page involves JavaScript, graphics rendering, CSS parsing, layout, caching, and so on. It's a pretty intensive operation, actually. Using a browser under programmatic control, this will quickly become apparent: I doubt you could process more than a page every second or so.
On the other hand, a C# program dealing directly with a web server (with no rendering engine involved) could probably handle dozens if not hundreds of pages per second. For all practical purposes, you'd really be limited only by the response time of the server and your internet connection.
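If it helps, here is a minimal sketch of downloading a single image directly with HttpWebRequest and HttpWebResponse. The URL and file name are placeholders and error handling is omitted; point it at each image URL you pull out of the page markup and nothing else on the page gets transferred.

using System.IO;
using System.Net;

class ImageFetcher
{
    // Request just the image itself; the rest of the page is never downloaded.
    static void DownloadImage(string imageUrl, string savePath)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(imageUrl);
        request.Method = "GET";

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream responseStream = response.GetResponseStream())
        using (FileStream fileStream = File.Create(savePath))
        {
            // Copy the response body to disk in chunks.
            byte[] buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = responseStream.Read(buffer, 0, buffer.Length)) > 0)
            {
                fileStream.Write(buffer, 0, bytesRead);
            }
        }
    }

    static void Main()
    {
        // Placeholder URL; substitute the image you actually want.
        DownloadImage("http://example.com/images/sample.jpg", "sample.jpg");
    }
}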
There are multiple approaches here.
If it's a one time thing, just browse to the site and select File > Save Page As... and let the browser save all the images locally for you.
If it's a recurring thing there are lots of different ways.
buy a program that does this. I'm sure there are hundreds of implementations.
use the HTML Agility Pack to grab the page and compile a list of all the images you want. Then spin a thread for each image that downloads and saves it. You might limit the number of threads depending on various factors like your (and the site's) bandwidth and local disk speed. Note that some sites place arbitrary limits on the number of concurrent requests per connection they will handle; depending on the site this might be as few as 3. (See the sketch after this answer.)
This is by no means an exhaustive list; there are lots of other ways. I probably wouldn't do it through a WebBrowser control, though. That code looks brittle.
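For example, here is a rough sketch of the HTML Agility Pack approach from the second bullet. The page URL is a placeholder, the HtmlAgilityPack library is assumed to be referenced, and error handling is omitted.

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Threading;
using HtmlAgilityPack;

class ImageScraper
{
    static void Main()
    {
        // Placeholder page URL; substitute the page you actually want to scrape.
        string pageUrl = "http://example.com/gallery.html";

        // Fetch and parse the markup; no rendering engine is involved.
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load(pageUrl);

        var imageUrls = new List<string>();
        var imgNodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (imgNodes != null)
        {
            foreach (HtmlNode img in imgNodes)
            {
                // Resolve relative src values against the page URL.
                Uri absolute = new Uri(new Uri(pageUrl), img.GetAttributeValue("src", ""));
                imageUrls.Add(absolute.ToString());
            }
        }

        // Cap concurrent downloads; some sites only allow a few simultaneous requests.
        Semaphore throttle = new Semaphore(3, 3);
        var threads = new List<Thread>();

        foreach (string url in imageUrls)
        {
            string imageUrl = url; // copy to a local so the lambda doesn't capture the loop variable
            var t = new Thread(() =>
            {
                throttle.WaitOne();
                try
                {
                    using (var client = new WebClient())
                    {
                        // Naive local file name derived from the URL path.
                        client.DownloadFile(imageUrl, Path.GetFileName(new Uri(imageUrl).LocalPath));
                    }
                }
                finally
                {
                    throttle.Release();
                }
            });
            threads.Add(t);
            t.Start();
        }

        foreach (Thread t in threads)
            t.Join();
    }
}

The Semaphore limits concurrency to three downloads at a time, in line with the per-connection caps some sites enforce; tune that number to your bandwidth and the site's tolerance.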