My requirement is to extract the news content from approximately 250 different news websites. The news content is somewhere in the body, and I can locate its first paragraph using the Google snippet/meta info. To get the remaining paragraphs, I walk up the HTML tree until I reach a div or a table body, but that also pulls in undesired text that is not related to the news item. What I have noticed is that on most of these pages all the relevant paragraphs of a news item are styled or formatted in the same way. So is there a way to capture the styling applied to the first paragraph and then use that formatting information to filter out the unwanted text?
I am using HTML Agility Pack and XPath for this. Thank you.
You could look at my answer to the following question on SO: Advanced HTML Agility Pack usage. It seems to be somewhat related to yours.
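To make the idea in the question concrete, here is a minimal sketch of the "same formatting" filter. It is written in Python with only the standard library, not with HTML Agility Pack, and it makes some assumptions that are not in the question: "formatting" is taken to mean the tag name plus the `class` and `style` attributes, and the names `Node`, `TreeBuilder`, `signature` and `extract_article` are all hypothetical helpers invented for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Very small DOM node: children is a mix of Node objects and strings."""

    def __init__(self, tag, attrs=(), parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children = []

    def text(self):
        # Concatenate text in document order and normalize whitespace.
        parts = [c if isinstance(c, str) else c.text() for c in self.children]
        return " ".join("".join(parts).split())

    def iter(self):
        # Yield this node and all descendant elements in document order.
        yield self
        for child in self.children:
            if isinstance(child, Node):
                yield from child.iter()


class TreeBuilder(HTMLParser):
    """Builds a Node tree; tolerant of stray or missing end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def signature(node):
    # Assumption: "same formatting" means same tag, class and inline style.
    return (node.tag, node.attrs.get("class"), node.attrs.get("style"))


def extract_article(html, snippet):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # 1. Find the first <p> that contains the known snippet text.
    first = next((n for n in builder.root.iter()
                  if n.tag == "p" and snippet in n.text()), None)
    if first is None:
        return []
    # 2. Walk up to the nearest enclosing div or tbody (or the root).
    container = first.parent
    while container.parent is not None and container.tag not in ("div", "tbody"):
        container = container.parent
    # 3. Keep only elements formatted like the first paragraph.
    wanted = signature(first)
    return [n.text() for n in container.iter() if signature(n) == wanted]


SAMPLE = """
<div id="story">
  <p class="body">Floods hit the coast on Monday.</p>
  <p class="ad">Subscribe now for 50% off!</p>
  <p class="body">Rescue teams arrived by <b>noon</b>.</p>
  <span class="body">Related: weather</span>
</div>
"""
paragraphs = extract_article(SAMPLE, "Floods hit")
```

On the sample page only the two `class="body"` paragraphs are returned; the ad paragraph and the `span` are dropped. Note that a parser only sees the `class` and `style` attributes, not the styles a browser computes from external stylesheets, so this kind of filter only helps when the site marks its article paragraphs consistently in the markup itself.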