My end goal is to index documents using Lucene. Since Lucene doesn't support indexing other formats, I want to convert these files to TXT/HTML (file types Lucene can index). I have a set of almost 1000 documents: PPT, PDF, DOC, XLS, etc. Please help me.
You could run OpenOffice in headless mode to convert the files from one format to another, say Excel/Doc to TXT/HTML.
We use a similar process combined with ImageMagick to allow people to upload office documents into a presentation app.
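As a rough idea of what the headless conversion can look like for a whole folder, here is a minimal Python sketch. It assumes a `soffice` binary on your PATH that accepts the `--headless --convert-to --outdir` options (LibreOffice does; check your OpenOffice build). The function names, the `docs` and `converted` folder names and the 120 second timeout are made up for illustration, and PDF is left out of this sketch.

```python
import subprocess
from pathlib import Path

# Extensions to hand to the office suite; adjust to match your document set.
OFFICE_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}


def build_convert_command(src, out_dir, target="html", soffice="soffice"):
    """Return the argument list for converting one file in headless mode."""
    # Assumption: the binary understands --headless/--convert-to/--outdir.
    return [soffice, "--headless", "--convert-to", target,
            "--outdir", str(out_dir), str(src)]


def convert_tree(in_dir, out_dir, target="html"):
    """Convert every matching file under in_dir, one process per file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).rglob("*")):
        if src.suffix.lower() in OFFICE_EXTENSIONS:
            # check=True raises if the converter exits with an error code;
            # the timeout (an arbitrary value) guards against a hung process.
            subprocess.run(build_convert_command(src, out_dir, target),
                           check=True, timeout=120)


if __name__ == "__main__":
    # Hypothetical folder names.
    convert_tree("docs", "converted")
```

Note that all output lands in one folder, so two inputs with the same base name (say `report.doc` and `report.xls`) would overwrite each other; use a separate output folder per source folder if that can happen in your set.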
Below are a few examples/tutorials on how to achieve this:
Setup OpenOffice
http://code.google.com/p/openmeetings/wiki/OpenOfficeConverter
JOD Converter (Java)
http://artofsolving.com/opensource/jodconverter
PyOD Converter (Python)
http://artofsolving.com/opensource/pyodconverter
If you need any further help with OOo, feel free to ask.
Good luck :)
As of 2022, there is also an open-source Python project for this: https://github.com/shakiyam/pptx2txt