I'm doing my project on Text Categorization. I have a text categorization test collection called Reuters-21578 for my Information Retrieval project. It is distributed in 22 files. Each of the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last (reut2-021.sgm) contains 578 documents. The files are in SGML format. Each of the 22 files begins with a document type declaration line, and the DTD file lewis.dtd is included in the distribution. Following the document type declaration line are individual Reuters articles marked up with SGML tags.
I need help writing a Java program to read those 21578 documents or to split them into 21578 separate text files.
Can somebody please help me?
From about five minutes of googling, it seems that there are no free SGML parsers for Java. This is rather surprising, but there you go.
I suggest you get hold of James Clark's SX tool, from the SP package, which is not Java but portable C++, and use it to convert the SGML to XML. You can then parse the XML with a Java XML parser.
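For the second step, here is a rough sketch of reading an already converted file with the JDK's built-in DOM parser. It rests on assumptions that are not confirmed anywhere in this thread: that the converter's output is well-formed XML with a single root element (called LEWIS here purely for illustration), and that element and attribute names come out in upper case (REUTERS, NEWID, BODY). The class name ConvertedReutersReader and the method readBodies are made up for this example; adjust the names to whatever your converted files actually contain.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: maps each article's NEWID to the text of its BODY element.
final class ConvertedReutersReader {

    static Map<String, String> readBodies(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            Map<String, String> bodies = new LinkedHashMap<>();
            // Assumption: every article is a REUTERS element with a NEWID attribute.
            NodeList articles = doc.getElementsByTagName("REUTERS");
            for (int i = 0; i < articles.getLength(); i++) {
                Element article = (Element) articles.item(i);
                NodeList body = article.getElementsByTagName("BODY");
                // Articles without a BODY element are kept with an empty text.
                String text = body.getLength() > 0
                    ? body.item(0).getTextContent().trim()
                    : "";
                bodies.put(article.getAttribute("NEWID"), text);
            }
            return bodies;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalStateException("Could not parse converted XML", e);
        }
    }
}
```

Each map entry can then be written out as its own text file, for example with Files.write.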
Lucene has such an extractor in org.apache.lucene.benchmark.utils.ExtractReuters.
I have not actually tried to run it from the jar file (Maven repo), but you can easily use (and modify) the Java source code found here, as it has no external dependencies.
Note that this code exports a large number of small files (21578 actually).
Though this is a very old post, my answer is for future readers, because I struggled a lot before getting it to work this way. I can't say that it is a suitable approach or a good solution, but it served the purpose, and it has been running continuously as a batch process for the last 6 months. I wrote some custom code to read and parse the SGML files, and it did the job successfully even for quite large files. The output format has a different structure, as required in my case. You can have a look, and if it seems useful you can tweak it to suit your needs. Please have a look here
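For readers who want a starting point for this kind of hand-written approach, below is a minimal sketch that splits the .sgm files into one text file per article using regular expressions. It is not the code referred to above. It assumes the markup I remember from the collection: each article wrapped in a REUTERS element with a NEWID attribute, and TITLE and BODY elements inside it. It also assumes that reading the files as ISO-8859-1 is acceptable and that only the entities &lt;, &gt; and &amp; need decoding. The class name ReutersSplitter and its methods are invented for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical splitter: one output file per article, named after its NEWID.
final class ReutersSplitter {
    // DOTALL lets '.' match line breaks, since articles span many lines.
    private static final Pattern ARTICLE =
        Pattern.compile("<REUTERS\\b([^>]*)>(.*?)</REUTERS>", Pattern.DOTALL);
    private static final Pattern NEWID = Pattern.compile("NEWID=\"(\\d+)\"");
    private static final Pattern TITLE =
        Pattern.compile("<TITLE>(.*?)</TITLE>", Pattern.DOTALL);
    private static final Pattern BODY =
        Pattern.compile("<BODY>(.*?)</BODY>", Pattern.DOTALL);

    // Returns NEWID -> "title, blank line, body" for every article in the input.
    static Map<String, String> split(String sgml) {
        Map<String, String> docs = new LinkedHashMap<>();
        Matcher article = ARTICLE.matcher(sgml);
        while (article.find()) {
            Matcher id = NEWID.matcher(article.group(1));
            if (!id.find()) {
                continue; // skip anything without a NEWID attribute
            }
            String content = article.group(2);
            String text = first(TITLE, content) + "\n\n" + first(BODY, content);
            docs.put(id.group(1), unescape(text).trim());
        }
        return docs;
    }

    // First match of the pattern's capture group, or "" if the element is absent.
    private static String first(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group(1) : "";
    }

    // Decodes only the three basic entities; &amp; goes last to avoid double decoding.
    private static String unescape(String s) {
        return s.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
    }

    // Usage: java ReutersSplitter <input-dir-with-sgm-files> <output-dir>
    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);
        Path out = Files.createDirectories(Paths.get(args[1]));
        try (DirectoryStream<Path> files = Files.newDirectoryStream(in, "*.sgm")) {
            for (Path file : files) {
                String sgml = new String(Files.readAllBytes(file),
                                         StandardCharsets.ISO_8859_1);
                for (Map.Entry<String, String> doc : split(sgml).entrySet()) {
                    Files.write(out.resolve(doc.getKey() + ".txt"),
                                doc.getValue().getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }
}
```

Regular expressions are not a real SGML parser, so this only works because the collection's markup is very regular. If you need the topic labels or other fields for categorization, you would have to add patterns for those elements as well.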