Mining Groups of people from Wikipedia

Developer (开发者) https://www.devze.com 2022-12-24 04:59 Source: web
I am trying to get the list of people from http://en.wikipedia.org/wiki/Category:People_by_occupation . I have to go through all of its sections (the subcategories) and get the people from each one.

How should I go about it? Should I use a crawler to fetch the pages and search through them with BeautifulSoup?

Or is there some other way to get the same data from Wikipedia?


I would go with the Pywikipediabot Python project.

Have a look at category.py. You could use:

* tree        - show a tree of subcategories of a given category
* listify     - make a list of all of the articles that are in a category
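
For readers who want to see what a listify-style walk amounts to, here is a minimal sketch. It does not use Pywikipediabot itself; it assumes the public MediaWiki API on en.wikipedia.org and its `list=categorymembers` query. The names `fetch_members` and `collect_articles`, the User-Agent string, and the `max_depth` cutoff are made up for illustration.

```python
import json
import urllib.parse
import urllib.request
from collections import deque

API_URL = "https://en.wikipedia.org/w/api.php"
CATEGORY_NS = 14  # MediaWiki namespace number for "Category:" pages
ARTICLE_NS = 0    # main (article) namespace


def fetch_members(category, api_url=API_URL):
    """Yield (title, namespace) for each page or subcategory in `category`."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "page|subcat",
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        url = api_url + "?" + urllib.parse.urlencode(params)
        # Hypothetical User-Agent; replace with one that identifies your script.
        req = urllib.request.Request(
            url, headers={"User-Agent": "category-walk-sketch/0.1 (example)"}
        )
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        for member in data["query"]["categorymembers"]:
            yield member["title"], member["ns"]
        if "continue" not in data:
            break
        # The API returns the parameters needed to request the next batch.
        params.update(data["continue"])


def collect_articles(root, max_depth=1, fetch=fetch_members):
    """Breadth-first walk from `root`, descending at most `max_depth` levels."""
    seen = {root}          # categories already queued (the graph can have cycles)
    articles = set()
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        for title, ns in fetch(category):
            if ns == CATEGORY_NS:
                if depth < max_depth and title not in seen:
                    seen.add(title)
                    queue.append((title, depth + 1))
            elif ns == ARTICLE_NS:
                articles.add(title)
    return articles
```

Calling `collect_articles("Category:People_by_occupation", max_depth=1)` would request the top category and each direct subcategory. The depth limit matters because this category tree is very deep, and the `seen` set guards against categories that contain each other.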


If you want, you can just download an entire Wikipedia dump and work from there. The one you would probably want is the articles-only dump dated 3 Feb 2010. But beware: it's 5.6 GB in size.
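
A rough sketch of how one might work through such a dump without loading it all into memory. It assumes the dump is the usual bz2-compressed XML export, with `<page>` elements that each contain a `<title>` and a revision `<text>`, and it pulls category names out of `[[Category:...]]` links in the wikitext. The function name `iter_page_categories` and the file name in the usage comment are hypothetical. Categories added through templates would not be found this way, and subcategories would still have to be resolved separately.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Matches [[Category:Name]] and [[Category:Name|sort key]], case-insensitively.
CATEGORY_LINK = re.compile(r"\[\[\s*Category\s*:\s*([^\]|]+)", re.IGNORECASE)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_page_categories(stream):
    """Yield (title, set_of_category_names) for each <page> in the XML stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            categories = {c.strip() for c in CATEGORY_LINK.findall(text)}
            yield title, categories
            title, text = None, ""
            elem.clear()  # drop the finished page's subtree to limit memory use


# Usage (hypothetical file name):
# with bz2.open("enwiki-pages-articles.xml.bz2", "rb") as dump:
#     for title, categories in iter_page_categories(dump):
#         if "Actors" in categories:
#             print(title)
```

Streaming with `iterparse` is the design choice here: a multi-gigabyte file cannot reasonably be parsed into a single tree, so each page is processed and discarded as soon as its closing tag is seen.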


You can use the CatScan tool to search categories.

Instructions here
http://meta.wikimedia.org/wiki/CatScan

Example search. Note that the HTML format maxes out at 1000 results; choose CSV export to retrieve all of them. Also, be sure to adjust the category depth and other options as needed.

Pywikipediabot, already mentioned above, is another option.
