Read Microsoft Word Documents into Plain Text (DOC, DOCX) in Java

Developer https://www.devze.com 2022-12-19 23:59 Source: Internet
I'm looking for something in Java to read in Word documents and process their text. All I need is their text, nothing fancy. I know about Apache POI, but it doesn't include support for DOCX right now. Is there anything else out there?


If you don't require formatting information, images, and all the other fancy stuff, then the job is a lot easier. Just some 5 to 10 lines of code will do.

  1. Treat the DOCX as a zip file. It consists of a bunch of files, including 'document.xml' (stored under the 'word/' folder). Use ZipInputStream and extract that file alone. (You can open the DOCX with your favorite zip utility and see for yourself!)
  2. Use a SAX parser and read the contents of the nodes at body/p/r/t (in the file: w:body/w:p/w:r/w:t) - voila, you've got the text!

This approach applies only if the plain text is all you need.
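
The two steps above can be sketched roughly as follows, using only the JDK. This is a minimal sketch rather than a definitive implementation: the class name DocxTextExtractor and the method extractText are made up for illustration, it assumes the usual WordprocessingML namespace, and it emits a newline at the end of each paragraph (tabs, line breaks, headers, footers and footnotes are ignored).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: the class and method names are illustrative only.
final class DocxTextExtractor {

    // Assumption: the document uses the common (transitional) WordprocessingML namespace.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Step 1: treat the DOCX as a zip and locate word/document.xml.
    static String extractText(InputStream docx)
            throws IOException, SAXException, ParserConfigurationException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return parseDocumentXml(zip);
                }
            }
        }
        throw new IOException("word/document.xml not found; is this a DOCX file?");
    }

    // Step 2: SAX-parse the XML and collect the character data inside w:t elements.
    private static String parseDocumentXml(InputStream xml)
            throws IOException, SAXException, ParserConfigurationException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

        StringBuilder text = new StringBuilder();
        factory.newSAXParser().parse(xml, new DefaultHandler() {
            private boolean inText;

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                if (W_NS.equals(uri) && "t".equals(localName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(localName)) {
                    inText = false;
                } else if ("p".equals(localName)) {
                    text.append('\n'); // one line per paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    text.append(ch, start, length);
                }
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DocxTextExtractor path/to/file.docx
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(extractText(in));
        }
    }
}
```

Note that this only works for DOCX. The older binary DOC format is not a zip archive, so it needs a library such as the ones mentioned in the other answers.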


With some googling I found OpenXML4J, which might solve your issue. I have not used it before, but I am sure someone in the community will have better insight.

Note: This is a duplicate question. The earlier question has the solution plus a bit of discussion. Link to the question.


Try Apache POI - it can handle DOC, DOCX, XLS, XLSX, PPT, and PPTX.

Another production-level solution is OpenOffice in headless mode, which can even be used in a server-side scenario.


You could try docx4j; see http://dev.plutext.org/svn/docx4j/trunk/docx4j/src/main/java/org/docx4j/TextUtils.java
