HTML parser to get blog posts

I need to create an HTML parser that, given a blog URL, returns a list of all the posts on the page.

For example, if a page has 10 posts, it should return a list of 10 divs, where each div contains an h1 and a p.

I can't use the RSS feed, because I need to know exactly how each post looks to the user: whether it has ads, images, etc. Also, some blogs show only a summary on the page while the feed has the full content, and vice versa.

Anyway, I've made one that downloads the feed and searches the HTML for similar content. It works very well for some blogs, but not for others.

I don't think I can make a parser that works for 100% of the blogs it parses, but I want to make the best possible.

What would be the best approach? Look for tags whose id attribute equals "post" or "content"? Look for p tags? Something else?

Thanks in advance for any help!

I don't think you will succeed at that. You might be able to parse one blog, but if the blog engine changes its markup, your parser will stop working. I also don't think you'll be able to write a generic parser. You might be partially successful, but it will be a short-lived success, because everything is so error-prone in this context. If you need the content, go with RSS. If you only need to store how the page looks, you can do that too. But parsing based on how it looks? I don't see that working reliably.

"Best possible" turns out to be "best reasonable," and you get to define what is reasonable. You can get a very large number of blogs by looking at how common blogging tools (WordPress, LiveJournal, etc.) generate their pages, and code specially for each one.
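As a rough sketch of the per-engine idea: many engines announce themselves in a `<meta name="generator">` tag (themes can remove it, so this is only a first guess). The example below uses Python's standard `html.parser`; the `POST_HINTS` table and the `detect_engine` helper are hypothetical names, and the tag/class pairs in the table are illustrative assumptions, not values verified against real themes.

```python
from html.parser import HTMLParser

# Hypothetical per-engine hints: (tag, class) assumed to wrap one post.
# Illustrative only; check real pages from each engine before relying on them.
POST_HINTS = {
    "wordpress": ("article", "post"),
    "blogger": ("div", "post"),
}


class GeneratorFinder(HTMLParser):
    """Remembers the content of <meta name="generator" ...>, if present."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        if (a.get("name") or "").lower() == "generator":
            self.generator = a.get("content") or ""


def detect_engine(html):
    """Return a key of POST_HINTS, or None if the engine is not recognised."""
    finder = GeneratorFinder()
    finder.feed(html)
    if not finder.generator:
        return None
    generator = finder.generator.lower()
    for engine in POST_HINTS:
        if engine in generator:
            return engine
    return None
```

When `detect_engine` returns a key, `POST_HINTS[engine]` gives the engine-specific starting point; when it returns `None`, you fall back to the generic heuristics.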

The general case turns out to be a very hard problem because every blogging tool has its own format. You might be able to infer things using "standard" identifiers like "post", "content", etc., but it's doubtful.
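To see how far the "standard identifier" guess gets you on a given page, you can count how often each hinted id or class repeats; a container that repeats once per post is a better candidate than a one-off. This is a minimal sketch with Python's standard `html.parser`. The `HINTS` words and the `likely_post_containers` helper are assumptions made for illustration, not an established convention.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed hint words; no standard defines these.
HINTS = ("post", "entry", "article")
CONTAINER_TAGS = ("div", "article", "section", "li")


class HintCounter(HTMLParser):
    """Counts container tags whose id or class contains a hint word."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in CONTAINER_TAGS:
            return
        a = dict(attrs)
        names = (a.get("class") or "").split()
        if a.get("id"):
            names.append(a["id"])
        for name in names:
            if any(hint in name.lower() for hint in HINTS):
                self.counts[(tag, name)] += 1


def likely_post_containers(html):
    """Return ((tag, name), count) pairs, most frequent first."""
    counter = HintCounter()
    counter.feed(html)
    return counter.counts.most_common()
```

A pair that occurs about as many times as the page has posts is worth inspecting first; a pair that occurs once is more likely the wrapper around the whole list.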

You'll also have difficulty with ads. A lot of ads are generated with JavaScript. So downloading the page will give you just the JavaScript code rather than the HTML that gets generated. If you really want to identify the ads, you'll have to identify the JavaScript code that generates them. Or, your program will have to execute the JavaScript to create the final DOM. And then you're faced with a problem similar to that above: figuring out if some particular bit of HTML is an ad.
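If executing the JavaScript is too heavy, a cheaper first pass is to list the external scripts a page loads and flag the ones served from known ad hosts. The sketch below does only that; it does not find the rendered ad markup. The `AD_HOST_HINTS` list is a tiny illustrative sample and `suspected_ad_scripts` is a hypothetical helper name.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Small illustrative sample of ad-serving domains; a real list would be much longer.
AD_HOST_HINTS = ("doubleclick.net", "googlesyndication.com")


class ScriptCollector(HTMLParser):
    """Collects the src attribute of every external <script> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def suspected_ad_scripts(html):
    """Return script URLs whose host is, or is a subdomain of, a hinted ad host."""
    collector = ScriptCollector()
    collector.feed(html)
    flagged = []
    for src in collector.sources:
        host = urlparse(src).netloc.lower()
        if any(host == h or host.endswith("." + h) for h in AD_HOST_HINTS):
            flagged.append(src)
    return flagged
```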

There are heuristic methods that are somewhat successful. Check out Identifying a Page's Primary Content for answers to a similar question.
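One common family of such heuristics scores each block by how much text it holds and how much of that text sits inside links, since navigation and sidebars are mostly links. The following is a simplified sketch of that idea in Python, not the method from the linked question; the thresholds and the `rank_blocks` helper are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "article", "section"}
SKIP_TAGS = {"script", "style"}  # their contents are code, not readable text


class BlockScorer(HTMLParser):
    """Credits text length and link-text length to the innermost open block."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open blocks, innermost last
        self.blocks = []   # closed blocks
        self.link_depth = 0
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append({"id": dict(attrs).get("id"), "text": 0, "link": 0})
        elif tag == "a":
            self.link_depth += 1
        elif tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        length = len(data.strip())
        if not length or not self.stack or self.skip_depth:
            return
        block = self.stack[-1]
        block["text"] += length
        if self.link_depth:
            block["link"] += length


def rank_blocks(html, min_text=25, max_link_density=0.5):
    """Return text-heavy, link-light blocks, longest text first."""
    scorer = BlockScorer()
    scorer.feed(html)
    blocks = scorer.blocks + scorer.stack  # include blocks left unclosed
    keep = [b for b in blocks
            if b["text"] >= max(min_text, 1)
            and b["link"] / b["text"] <= max_link_density]
    return sorted(keep, key=lambda b: b["text"], reverse=True)
```

Text is credited only to the innermost block, so a wrapper div does not outrank the posts it contains. Real pages need more care than this (comments, footers, and short posts all confuse it).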

Use the HTML Agility Pack. It is an HTML parser made for this.

I just did something like this for our company's blog, which uses WordPress. This works for us because our WordPress blog's markup hasn't changed in years, but the others are right: if your HTML changes a lot, parsing becomes cumbersome to maintain.

Here is what I recommend:

Using NuGet, install RestSharp and HtmlAgilityPack. Then download Fizzler and add a reference to it in your project (http://code.google.com/p/fizzler/downloads/list).

Here is some sample code I used to implement the blog's search on my site.

using System;
using System.Collections.Generic;
using Fizzler.Systems.HtmlAgilityPack;
using RestSharp;
using RestSharp.Contrib;

namespace BlogSearch
{
    public class BlogSearcher
    {
        const string Site = "http://yourblog.com";

        public static List<SearchResult> Get(string searchTerms, int count=10)
        {            
            var searchResults = new List<SearchResult>();

            var client = new RestSharp.RestClient(Site);
            //note 10 is the page size for the search results
            var pages = (int)Math.Ceiling((double)count/10);

            for (int page = 1; page <= pages; page++)
            {
                var request = new RestSharp.RestRequest
                                  {
                                      Method = Method.GET,
                                      //the part after .com/
                                      Resource = "page/" + page
                                  };

                //Your search params here. RestSharp URL-encodes query parameters itself,
                //so pass the raw terms; encoding them first would double-encode them.
                request.AddParameter("s", searchTerms);

                var res = client.Execute(request);

                searchResults.AddRange(ParseHtml(res.Content));
            }

            return searchResults;
        }

        public static List<SearchResult> ParseHtml(string html)
        {            
            var doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(html);
            var results = doc.DocumentNode.QuerySelectorAll("#content-main > div");

            var searchResults = new List<SearchResult>();
            foreach(var node in results)
            {
                bool add = false;
                var sr = new SearchResult();

                var a = node.QuerySelector(".posttitle > h2 > a");
                if (a != null)
                {
                    add = true;
                    sr.Title = a.InnerText;
                    //GetAttributeValue avoids a NullReferenceException when href is missing
                    sr.Link = a.GetAttributeValue("href", "");
                }

                var p = node.QuerySelector(".entry > p");
                if (p != null)
                {
                    add = true;
                    sr.Excerpt = p.InnerText;
                }

                if(add)
                    searchResults.Add(sr);
            }

            return searchResults;
        }

    }

    public class SearchResult
    {
        public string Title { get; set; }
        public string Link { get; set; }
        public string Excerpt { get; set; }
    }
}

Good luck, Eric