How to properly configure Apache Tika for a few document types?

Developer https://www.devze.com 2023-03-22 06:47 Source: web
I've been using Tika for a while and I know that one is supposed to use only the Tika facade, with either the default or a custom TikaConfig that represents the org/apache/tika/mime/tika-mimetypes.xml file.

My application doesn't allow any document type other than html, doc, docx, odt, txt, rtf, srt, sub, pdf, odf, odp, xls, ppt and msg, and the default MediaTypes include tons of others.

Are we supposed to modify tika-mimetypes.xml to remove the MimeTypes that we don't need? Then, as I understand it, Tika will create composite parsers and detectors only for the remaining MimeTypes.

But what happens when it is given an unsupported type? Should I just catch TikaException or some SAXException and reject the file?

Also, how is one supposed to manually edit tika-mimetypes.xml? It has 1290 MimeTypes, mostly ridiculous third-party ones. Why are they there?


If you only want to accept certain types, you'll still want the full mimetypes set. Otherwise, how can you detect that the file someone's just given you is in fact an MP3, and not one of your approved formats? So, keep the full mimetypes set for detection.

Once you've done the detection step, and you've decided it's a valid mimetype, you could just pass the file on to the AutoDetectParser and be done with it. After all, you'd check the mimetype returned by the detector and bail out already if it isn't one you like.
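The "check the mimetype and bail out" step could look something like the sketch below. It deliberately uses only the JDK: the detected type is assumed to arrive as a string from the detector (for example from the Tika facade's detect method), and AllowedTypes, ALLOWED and isAllowed are made-up names for illustration. The MIME strings are my assumed mapping of the question's extensions, so check them against what your Tika version actually reports; srt and sub are left out because they may well come back as plain text.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: decides whether a detected MIME type is on the allow-list.
class AllowedTypes {
    // Assumed mapping of the question's extensions to MIME types; verify it
    // against the values your detector really returns.
    static final Set<String> ALLOWED = Set.of(
        "text/html", "text/plain", "application/rtf", "application/pdf",
        "application/msword",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "application/vnd.oasis.opendocument.text",
        "application/vnd.oasis.opendocument.formula",
        "application/vnd.oasis.opendocument.presentation",
        "application/vnd.ms-excel", "application/vnd.ms-powerpoint",
        "application/vnd.ms-outlook");

    // Strips parameters such as "; charset=UTF-8" and compares case-insensitively.
    static boolean isAllowed(String detected) {
        if (detected == null) {
            return false;
        }
        int semi = detected.indexOf(';');
        String base = (semi >= 0 ? detected.substring(0, semi) : detected)
            .trim().toLowerCase(Locale.ROOT);
        return ALLOWED.contains(base);
    }

    public static void main(String[] args) {
        // In a real application the type would come from the detector;
        // here it is simply read from the command line.
        for (String type : args) {
            System.out.println(type + " -> " + (isAllowed(type) ? "accept" : "reject"));
        }
    }
}
```

With this in place, a file is only handed to the AutoDetectParser when isAllowed returns true for its detected type; everything else is declined before any parsing happens.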

However, if you want an extra check, there are two ways to do it. One is to have a custom org.apache.tika.parser.Parser service file (the one under META-INF/services), which lists only the parsers for the formats you want to have used. This is the config file that's used to decide which parsers to make available to the AutoDetectParser, so if, for example, you removed the MP3Parser from that list, then the auto detect parser would stop handling MP3.
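As a rough illustration, a trimmed-down service file for the formats in the question might look like the fragment below. The class names are the ones I'd expect in a Tika 1.x parsers jar; treat them as assumptions, since packages have moved between versions (the RTF parser in particular), and check them against the org.apache.tika.parser.Parser file that ships with your release.

```
# META-INF/services/org.apache.tika.parser.Parser
# Only the parsers listed here are offered to the AutoDetectParser.
org.apache.tika.parser.html.HtmlParser
org.apache.tika.parser.microsoft.OfficeParser
org.apache.tika.parser.microsoft.ooxml.OOXMLParser
org.apache.tika.parser.odf.OpenDocumentParser
org.apache.tika.parser.pdf.PDFParser
org.apache.tika.parser.rtf.RTFParser
org.apache.tika.parser.txt.TXTParser
```

Your copy has to be the one that is actually picked up from the classpath, so it is worth confirming at runtime that types such as MP3 are no longer handled.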

The other way is just to have an explicit list of the parsers you wish to support. Then, rather than using the auto detect parser, simply iterate through all of them until you get to one that is able to work on the file, and directly call the parse method on that. This will give you the most control, but possibly with slightly more work.
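The iteration itself is simple; here is a minimal sketch of it. To keep it runnable without Tika on the classpath, TypedParser is a made-up stand-in, not the real API: with Tika you would ask each org.apache.tika.parser.Parser for getSupportedTypes(ParseContext) and then call its parse(InputStream, ContentHandler, Metadata, ParseContext) method instead. ExplicitParserList, find and parseOrReject are likewise hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Stand-in for the small part of a parser used here; NOT Tika's Parser interface.
interface TypedParser {
    Set<String> supportedTypes();
    String parse(byte[] content);
}

// Holds an explicit, ordered list of the parsers the application supports.
class ExplicitParserList {
    private final List<TypedParser> parsers;

    ExplicitParserList(List<TypedParser> parsers) {
        this.parsers = List.copyOf(parsers);
    }

    // Returns the first parser that claims the given MIME type, if any.
    Optional<TypedParser> find(String mimeType) {
        for (TypedParser parser : parsers) {
            if (parser.supportedTypes().contains(mimeType)) {
                return Optional.of(parser);
            }
        }
        return Optional.empty();
    }

    // Parses with the matching parser, or rejects the file outright.
    String parseOrReject(String mimeType, byte[] content) {
        TypedParser parser = find(mimeType).orElseThrow(
            () -> new IllegalArgumentException("Unsupported type: " + mimeType));
        return parser.parse(content);
    }

    public static void main(String[] args) {
        // A toy plain-text parser so the sketch can be run on its own.
        TypedParser plainText = new TypedParser() {
            public Set<String> supportedTypes() {
                return Set.of("text/plain");
            }
            public String parse(byte[] content) {
                return new String(content, StandardCharsets.UTF_8);
            }
        };
        ExplicitParserList list = new ExplicitParserList(List.of(plainText));
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(list.parseOrReject("text/plain", data));
    }
}
```

Because nothing outside the list can ever be invoked, an unsupported file fails fast with a clear error rather than surfacing as a TikaException from somewhere deep inside a parser.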
