Read only the title and/or META tags of an HTML file, without loading the complete HTML file

Developer https://www.devze.com 2023-03-26 04:29 Source: Internet
Scenario:

I need to parse millions of HTML files/pages (as fast as I can), read only the title or META part of each one, and dump it to a database.

What I am doing now is using the System.Net.WebClient class's DownloadString(url_path) to download each page and then saving it to the database with LINQ to SQL.

But DownloadString gives me the complete HTML source; I only need the title and the META tags.

Any ideas on how to download only that much content?


I think you can open a stream to this URL and read only the first x characters from it. I can't tell you the exact number, but you can set it to a reasonable value that is large enough to cover the title and the description.

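// Needs: using System.IO; using System.Net;
// Number is a placeholder budget in characters (4096 is only an assumed value); tune it.
const int Number = 4096;
HttpWebRequest fileToDownload = (HttpWebRequest)WebRequest.Create("YourURL");
using (WebResponse fileDownloadResponse = fileToDownload.GetResponse())
using (Stream fileStream = fileDownloadResponse.GetResponseStream())
using (StreamReader fileStreamReader = new StreamReader(fileStream))
{
    char[] x = new char[Number];
    // ReadBlock keeps reading until the buffer is full or the stream ends;
    // a single Read call may return fewer characters than requested.
    int charsRead = fileStreamReader.ReadBlock(x, 0, Number);
    // Build the string once instead of concatenating character by character.
    string data = new string(x, 0, charsRead);
}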
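
Once the first chunk of the page is in a string such as data, the title and META tags still have to be pulled out of it. A minimal sketch of that step follows. HeadParser and its methods are hypothetical names, not part of the answer above, and the regular expressions assume quoted attribute values with name written before content. A real HTML parser would be more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: extracts the title and name/content META pairs
// from a possibly truncated chunk of HTML.
public static class HeadParser
{
    private static readonly Regex TitleRegex = new Regex(
        @"<title[^>]*>(?<title>.*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Assumption: name comes before content and both values are quoted.
    private static readonly Regex MetaRegex = new Regex(
        @"<meta\s+name\s*=\s*(?<q1>[""'])(?<name>.*?)\k<q1>\s+content\s*=\s*(?<q2>[""'])(?<content>.*?)\k<q2>",
        RegexOptions.IgnoreCase);

    // Returns the trimmed title, or an empty string when no complete
    // <title> element is present in the chunk.
    public static string ExtractTitle(string html)
    {
        Match m = TitleRegex.Match(html);
        return m.Success ? m.Groups["title"].Value.Trim() : string.Empty;
    }

    // Returns META name/content pairs; names are compared case-insensitively.
    public static Dictionary<string, string> ExtractMeta(string html)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in MetaRegex.Matches(html))
        {
            result[m.Groups["name"].Value] = m.Groups["content"].Value;
        }
        return result;
    }
}
```

If ExtractTitle returns an empty string, the chunk was probably cut off before the closing title tag, which is a hint that the character budget is too small for that page.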


I suspect that WebClient will try to download the whole page first, in which case you'd probably want a raw client socket. Send the appropriate HTTP request (manually, since you're using raw sockets), start reading the response (which will not arrive immediately), and kill the connection once you've read enough. However, the rest of the page will probably already have been sent by the server and be on its way to your PC whether you want it or not, so you might not save much bandwidth, if any.
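
A rough sketch of that raw-socket idea, under several assumptions: plain HTTP on port 80 (no HTTPS), an HTTP/1.0 request so the server should not use chunked encoding, UTF-8 content, and no redirect handling. HeadFetcher, ReadThroughHead and Fetch are hypothetical names chosen for illustration.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

public static class HeadFetcher
{
    // Reads until "</head>" has been seen, the stream ends, or maxBytes
    // have been read, whichever comes first.
    public static string ReadThroughHead(Stream stream, int maxBytes)
    {
        var collected = new MemoryStream();
        byte[] buffer = new byte[1024];
        string text = "";
        while (collected.Length < maxBytes)
        {
            int want = (int)Math.Min(buffer.Length, maxBytes - collected.Length);
            int n = stream.Read(buffer, 0, want);
            if (n == 0) break; // the server closed the connection
            collected.Write(buffer, 0, n);
            // Re-decode everything read so far, so a multi-byte character
            // split across two reads is decoded correctly on the next pass.
            text = Encoding.UTF8.GetString(collected.ToArray());
            if (text.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0) break;
        }
        return text;
    }

    // Sends a hand-written request over a raw socket and stops reading early.
    // The returned text starts with the HTTP status line and response headers.
    public static string Fetch(string host, string path, int maxBytes)
    {
        using (var client = new TcpClient(host, 80))
        using (NetworkStream stream = client.GetStream())
        {
            string request = "GET " + path + " HTTP/1.0\r\nHost: " + host + "\r\n\r\n";
            byte[] bytes = Encoding.ASCII.GetBytes(request);
            stream.Write(bytes, 0, bytes.Length);
            return ReadThroughHead(stream, maxBytes);
        }
    }
}
```

Disposing the TcpClient at the end of Fetch is what kills the connection. As noted above, whatever the server has already sent is still in flight, so the saving is mostly in what your code has to read and process.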

Depending on what you want it for, many half-decent websites have a custom 404 page that is a lot simpler than a known page. Whether it has the information you're after is another matter.


You can use the verb "HEAD" in an HttpWebRequest to return only the response headers (not the <head> element). To get the full <head> element with the metadata, you'll need to download the page and parse out the metadata you want.

var request = System.Net.WebRequest.Create(uri);
request.Method = "HEAD";