How to index PDF, PPT, and XLS files in Lucene (Java, Python, or PHP are all fine)?

Posted on https://www.devze.com, 2022-12-26 00:11. Source: the web

I also want to know how to add metadata while indexing so that I can boost some parameters.

There are several frameworks for extracting text suitable for Lucene indexing from rich document formats (PDF, PPT, etc.).

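One of them is Apache Tika, which started as a sub-project of Lucene and is now a top-level Apache project.
Apache POI is another Apache project; it reads and writes Microsoft Office formats (Word, Excel, PowerPoint) directly.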
There are also some commercial alternatives.

You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
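As a rough illustration of what that extraction step might look like in Java, the sketch below uses Tika's facade class to pull both the plain text and the metadata out of a file. It assumes the tika-core and tika-parsers artifacts are on the classpath; the class name TikaExtractionSketch and the Extracted holder are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;

public class TikaExtractionSketch {

    /** Hypothetical holder for the text and metadata Tika reported for one file. */
    public static final class Extracted {
        public final String text;
        public final Map<String, String> metadata;

        Extracted(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    /** Detects the file type, parses it, and returns plain text plus metadata. */
    public static Extracted extract(Path file) throws IOException, TikaException {
        Tika tika = new Tika();
        // By default parseToString truncates its output at 100,000 characters;
        // tika.setMaxStringLength(-1) lifts that limit if whole documents are needed.
        Metadata metadata = new Metadata();
        String text;
        try (InputStream stream = Files.newInputStream(file)) {
            text = tika.parseToString(stream, metadata);
        }

        // Copy the first value of every metadata key into a plain map.
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : metadata.names()) {
            fields.put(name, metadata.get(name));
        }
        return new Extracted(text, fields);
    }
}
```

Which metadata keys come back (title, author, content type, and so on) depends on the file format and the Tika version, so it is worth printing the map for a few sample files before deciding which keys to index.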

Supported Document Formats

HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image formats
Video formats
Java class files and archives
The mbox format

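The code will look something like this: Reader reader = new Tika().parse(stream); The returned Reader yields the extracted plain text, which can then be passed to a Lucene field.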
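To connect this to the boosting part of the question, here is a minimal sketch of how the extracted text and one metadata value might be indexed as separate fields and then weighted at search time. Index-time field boosts were removed in Lucene 7, so this sketch applies the boost in the query instead. It assumes lucene-core 9.5 or later; the class name BoostedIndexSketch and the field names "path", "title", and "body" are illustrative choices, not anything Lucene prescribes.

```java
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;

public class BoostedIndexSketch {

    /** Adds one document: a stored path, a metadata title, and the extracted body text. */
    static void addDocument(IndexWriter writer, String path, String title, Reader body)
            throws IOException {
        Document doc = new Document();
        doc.add(new StringField("path", path, Field.Store.YES));  // exact value, not tokenized
        doc.add(new TextField("title", title, Field.Store.YES));  // metadata kept in its own field
        doc.add(new TextField("body", body));                     // Reader fields are indexed, not stored
        writer.addDocument(doc);
    }

    /**
     * Matches the term in either field and multiplies the title score by titleBoost.
     * TermQuery does no analysis, so the term must already be lowercased
     * when the index was built with StandardAnalyzer.
     */
    static Query boostedQuery(String term, float titleBoost) {
        Query inTitle = new BoostQuery(new TermQuery(new Term("title", term)), titleBoost);
        Query inBody = new TermQuery(new Term("body", term));
        return new BooleanQuery.Builder()
                .add(inTitle, BooleanClause.Occur.SHOULD)
                .add(inBody, BooleanClause.Occur.SHOULD)
                .build();
    }

    /** Returns the stored path of the best-scoring hit, or null if nothing matched. */
    static String topPath(Directory dir, Query query) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] hits = searcher.search(query, 1).scoreDocs;
            if (hits.length == 0) {
                return null;
            }
            return searcher.storedFields().document(hits[0].doc).get("path");
        }
    }
}
```

With Tika in the picture, the body argument would be the Reader returned by new Tika().parse(stream), and the title would come from the Tika metadata. Raising titleBoost makes a match in the title count for more than a match in the body; the right value is something to tune against real queries.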

Lucene indexes text, not files, so you'll need a separate process that extracts the text from each file and then hands that text to Lucene.

See https://github.com/WolfgangFahl/pdfindexer for a Java solution that uses PDFBox and Apache Lucene. It splits the PDF files into text page by page, indexes these text pages, and creates an HTML index file that links to the pages in the PDF sources by using a corresponding open parameter.
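The page-by-page splitting can be sketched roughly as follows. This is not taken from the pdfindexer project; it is a minimal illustration that assumes PDFBox 2.x on the classpath (in PDFBox 3.x the document is loaded through a different entry point), and the class name PdfPageTextSketch is made up for this example.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfPageTextSketch {

    /** Returns the text of each page; element i holds the text of page i + 1. */
    public static List<String> pageTexts(File pdf) throws IOException {
        List<String> pages = new ArrayList<>();
        try (PDDocument document = PDDocument.load(pdf)) {
            PDFTextStripper stripper = new PDFTextStripper();
            // PDFTextStripper page numbers are 1-based and inclusive.
            for (int page = 1; page <= document.getNumberOfPages(); page++) {
                stripper.setStartPage(page);
                stripper.setEndPage(page);
                pages.add(stripper.getText(document));
            }
        }
        return pages;
    }
}
```

Each returned string could then become its own Lucene document, with the file path and page number stored alongside it so that a search hit can link back to the right page.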