Extracting header from a web page

Developer https://www.devze.com 2022-12-23 14:26 Source: Internet

Related topics: asp.net

How can I extract the title and header of a web page directly from the internet?

You could do this using a combination of regular expressions and the WebRequest / WebResponse classes. For any web scraping needs, though, I'd strongly recommend looking into Simon Mourier's Html Agility Pack, which is much more tolerant of 'bad' HTML and also lets you traverse the DOM as a proper XML tree.
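For what it's worth, here is a minimal sketch of the Html Agility Pack route. It assumes the HtmlAgilityPack NuGet package is referenced; the class name AgilityTitle and the method name ExtractTitle are made up for illustration and are not part of the library.

```csharp
using HtmlAgilityPack;

public static class AgilityTitle
{
    // Returns the text of the first <title> element, or null if there is none.
    // (AgilityTitle / ExtractTitle are hypothetical names used for this sketch.)
    public static string ExtractTitle(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses the markup, including sloppy or unclosed tags

        // XPath query against the parsed tree
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//title");
        if (node == null)
        {
            return null;
        }

        // Decode entities such as &amp; and drop surrounding whitespace
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }
}
```

The same SelectSingleNode call with a different XPath (for example "//head" or "//h1") should get you other parts of the page, depending on what you mean by "header".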

Step 1 - use a WebRequest to obtain a WebResponse from the web page you want to extract information from.

Step 2 - you will end up with what is essentially a string, which represents the HTML or XHTML web page, so you need to pull out the bits you want.
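The two steps above could look roughly like this. It's only a sketch: PageFetcher and FetchHtml are names I've invented, and it assumes the page can be read as UTF-8 (the StreamReader default). On newer .NET versions WebRequest is marked obsolete in favour of HttpClient, but the idea is the same.

```csharp
using System;
using System.IO;
using System.Net;

public static class PageFetcher
{
    // Downloads the resource at the given URL and returns its body as a string.
    // (PageFetcher / FetchHtml are hypothetical names used for this sketch.)
    public static string FetchHtml(string url)
    {
        // Step 1: create the request and obtain the response
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            // Step 2: read the whole body into one string (assumes UTF-8)
            return reader.ReadToEnd();
        }
    }
}
```

You would call it as string html = PageFetcher.FetchHtml("https://example.com/"); where the URL is just a placeholder, and then pick the title or header out of the returned string.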

If you have any problems with either of these steps, make sure your question includes plenty of detail about the problem.

I would use Regex to search the page's HTML for <title>.*?</title>.
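As a rough sketch of that regex, assuming you already have the page's HTML in a string (TitleExtractor and ExtractTitle are invented names). I've added a capturing group, case-insensitive matching, and the Singleline option so that a title spread over several lines is still found; those are my additions to the pattern, not a requirement.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Non-greedy match of everything between <title ...> and </title>.
    // Singleline lets '.' match newlines; IgnoreCase also covers <TITLE>.
    private static readonly Regex TitlePattern = new Regex(
        @"<title\b[^>]*>(.*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the decoded, trimmed title text, or null if no <title> is found.
    // (TitleExtractor / ExtractTitle are hypothetical names used for this sketch.)
    public static string ExtractTitle(string html)
    {
        Match match = TitlePattern.Match(html);
        if (!match.Success)
        {
            return null;
        }

        // Turn entities such as &amp; back into plain characters
        return WebUtility.HtmlDecode(match.Groups[1].Value).Trim();
    }
}
```

Bear in mind that a regex like this can be fooled by badly formed markup, for instance a <title> string inside a comment or script.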

I'm not sure how you would get the "header" though. You would need some sort of rule as to what the header looks like.

If you just mean the <head> tag, you can use the same regex approach described for the title to get it.
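If "header" does mean the <head> element, a sketch along these lines might do, again assuming the HTML is already in a string (HeadExtractor and ExtractHead are invented names):

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadExtractor
{
    // \b after "head" stops the pattern from matching an HTML5 <header> tag.
    // Singleline is needed because <head> normally spans many lines.
    private static readonly Regex HeadPattern = new Regex(
        @"<head\b[^>]*>(.*?)</head>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the markup between <head> and </head>, or null if there is none.
    // (HeadExtractor / ExtractHead are hypothetical names used for this sketch.)
    public static string ExtractHead(string html)
    {
        Match match = HeadPattern.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

The returned string is still raw markup (title, meta tags, scripts and so on), so you would have to search it further for the specific pieces you want.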