How to convert PDF, PPT, XLS, and DOC files to TXT/HTML files? Are any open-source tools or code in PHP/Python/Perl available?

Developer https://www.devze.com 2022-12-27 07:49 Source: Internet
My end objective is to index documents using Lucene. Since Lucene doesn't support indexing other formats, I want to convert these files to TXT/HTML (file types Lucene can index). I have a set of almost 1000 documents: PPT, PDF, DOC, XLS, etc. Please help me.


You could run OpenOffice in headless mode to convert the files from one format to another, for example Excel/Doc to TXT/HTML.

We use a similar process combined with ImageMagick to allow people to upload office documents into a presentation app.
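As a rough illustration of the headless approach, here is a minimal Python sketch that converts a folder of office documents in a batch. It assumes a LibreOffice-style `soffice` binary on the PATH that accepts `--headless`, `--convert-to` and `--outdir`; older OpenOffice builds may need the converter scripts linked below instead. The function names, the extension list and the default `html` target are assumptions made for this example, and which output formats are available depends on the document type and the installed filters.

```python
import subprocess
from pathlib import Path

# Extensions taken from the question; this set is an assumption, adjust as needed.
OFFICE_EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"}


def build_convert_command(src, out_dir, target="html"):
    """Return the soffice argument list that converts one file to `target`."""
    return [
        "soffice",               # assumed to be on PATH
        "--headless",            # run without a GUI
        "--convert-to", target,  # output format, e.g. "html"
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_tree(in_dir, out_dir, target="html"):
    """Convert every office document under in_dir, one soffice call per file.

    Returns the list of source paths whose conversion exited with a
    non-zero status.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(in_dir).rglob("*")):
        if src.suffix.lower() not in OFFICE_EXTENSIONS:
            continue
        result = subprocess.run(build_convert_command(src, out_dir, target))
        if result.returncode != 0:
            failed.append(src)
    return failed
```

Calling `convert_tree("docs", "converted")` (both directory names are placeholders) would write one HTML file per input document into `converted`, which can then be handed to the Lucene indexer. Starting one process per file is simple but slow for around 1000 files; a long-running headless instance, as used by the converters linked below, avoids the repeated start-up cost.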

Below are a few examples/tutorials on how to achieve this:

Setup OpenOffice

http://code.google.com/p/openmeetings/wiki/OpenOfficeConverter

JOD Converter (Java)

http://artofsolving.com/opensource/jodconverter

PyOD Converter (Python)

http://artofsolving.com/opensource/pyodconverter

If you need any further help with OOo, feel free to ask.

Good luck :)


As of 2022, there is also an open-source Python tool that does this: https://github.com/shakiyam/pptx2txt
