I don't know if this can be asked here, but I have looked hard for this and keep hitting dead ends. I'm working on an Information Retrieval research project. I've coded up my search engine but cannot test it because I need an XML corpus of Wikipedia. I found http://www-connex.lip6.fr/~denoyer/wikipediaXML/ but it turned out to be useless. Please let me know if anyone knows how I can get this corpus.
The page you provided appears to present the Wikipedia XML corpus used in the 2007 INEX workshop. I've found this site, which holds the Wikipedia dataset used in the 2009-2010 ad hoc track (and, I think, the clustering track too) in INEX. I think you can use it as well.
Just in case, you can also use the official Wikimedia XML dump: English Wikipedia Dumps. More information and other languages: Wikipedia Database Download
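If you go with the official dump, here is a rough sketch of how you might stream pages out of it for indexing without loading the whole file into memory. It assumes a bz2-compressed "pages-articles" style file with one revision per page and the usual <page>/<title>/<text> elements; the function names and the file path are just placeholders I made up, not part of any Wikimedia tool.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> read from a binary file-like object."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                # Assumption: one revision per page, so one <text> element.
                text = child.text or ""
        yield title, text
        elem.clear()  # release the subtree of the page just processed


def iter_dump(path):
    """Stream pages from a .xml.bz2 dump; `path` is a hypothetical local file."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

You would then loop with something like `for title, text in iter_dump("enwiki-pages-articles.xml.bz2"):` (placeholder file name) and feed each page to your indexer. Note that the text is raw wiki markup, so you will probably want to strip or parse it before indexing.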