Extracting the introduction of a Wikipedia article with Python

开发者 https://www.devze.com 2023-01-27 11:27 Source: web
I want to extract the introduction of a Wikipedia article (ignoring everything else, including tables, images, and the other sections). I looked at the HTML source of the articles, but I don't see any special tag that wraps this part.

Can anyone give me a quick solution to this? I'm writing Python scripts.

thanks


  1. You may want to look at mwlib for parsing the Wikipedia source
  2. Alternatively, use the wikidump lib
  3. Screen-scrape the HTML with BeautifulSoup

Ah, there are already questions on SO on this topic:

  1. Parsing a Wikipedia dump
  2. How to parse/extract data from a mediawiki marked-up article via python
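The screen-scraping option from the first list can be sketched roughly as follows. This is only an illustration: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs, and it assumes that the introduction is the run of <p> elements before the first <h2>, which is a guess about the page layout rather than something the thread states. The names IntroParser and extract_intro_paragraphs, and the sample HTML, are made up for the example.

```python
from html.parser import HTMLParser


class IntroParser(HTMLParser):
    """Collect text from <p> elements that appear before the first <h2>.

    Assumption (not from the thread): the intro ends at the first <h2>.
    """

    def __init__(self):
        super().__init__()
        self.paragraphs = []    # finished paragraph strings
        self._buffer = None     # text pieces of the <p> being read, if any
        self._table_depth = 0   # > 0 while inside a (possibly nested) table
        self._done = False      # set once the first <h2> is seen

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._table_depth += 1
        elif tag == "h2":
            self._done = True
        elif tag == "p" and self._table_depth == 0:
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
        elif tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        # Keep text only while inside a paragraph that is outside any table.
        if self._buffer is not None and not self._done:
            self._buffer.append(data)


def extract_intro_paragraphs(html):
    """Return the intro paragraphs of an HTML page as a list of strings."""
    parser = IntroParser()
    parser.feed(html)
    return parser.paragraphs


# Hypothetical, heavily simplified page used only to exercise the sketch.
sample = """
<div id="content">
<table class="infobox"><tr><td><p>Infobox text</p></td></tr></table>
<p><b>Python</b> is a programming language.</p>
<p>It was created by Guido van Rossum.</p>
<h2>History</h2>
<p>Later section text.</p>
</div>
"""

print(extract_intro_paragraphs(sample))
```

With BeautifulSoup the same idea would be a loop over the page's <p> elements that skips those inside tables and stops at the first section heading; the event-based parser above just spells that out by hand.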


I think you can often get to the intro text by taking the full page, stripping out all the tables, and then looking for the first sequence of <p>...</p> blocks after the <!-- bodytext --> marker. That last bit would be this regex:

/<!-- bodytext -->.*?(<p>.*?<\/p>\s*)+/

With the S (DOTALL) option, which is re.S in Python, to make . match newlines.
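In Python, the steps above might look roughly like the sketch below. It is not a definitive implementation: it assumes the page still contains the <!-- bodytext --> marker (current Wikipedia HTML may not), the table stripper is a naive non-greedy pattern that does not handle nested tables, and the repeated group from the regex is wrapped in an outer capture so that all consecutive paragraphs are kept instead of only the last one. The function name extract_intro and the sample HTML are invented for illustration.

```python
import re

# Naive table stripper: non-greedy, so nested tables are not handled correctly.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.S | re.I)

# The answer's pattern, with the repeated group wrapped in an outer capture
# so that every consecutive <p> block is kept, not just the last one.
INTRO_RE = re.compile(r"<!-- bodytext -->.*?((?:<p>.*?</p>\s*)+)", re.S)

# Removes any remaining inline tags (links, bold, ...) inside a paragraph.
TAG_RE = re.compile(r"<[^>]+>")


def extract_intro(html):
    """Return the intro paragraphs as plain text, or None if nothing matches."""
    without_tables = TABLE_RE.sub("", html)
    match = INTRO_RE.search(without_tables)
    if match is None:
        return None
    paragraphs = re.findall(r"<p>(.*?)</p>", match.group(1), re.S)
    return "\n\n".join(TAG_RE.sub("", p).strip() for p in paragraphs)


# Hypothetical, heavily simplified page used only to exercise the sketch.
sample_page = """<html><body>
<!-- bodytext -->
<table class="infobox"><tr><td><p>Infobox text</p></td></tr></table>
<p><b>Python</b> is a programming language.</p>
<p>It was created by Guido van Rossum.</p>
<h2>History</h2>
<p>Later section text.</p>
</body></html>"""

print(extract_intro(sample_page))
```

The pattern only matches bare <p> tags, as in the original regex; paragraphs that carry attributes would need <p\b[^>]*> instead.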
