I want to programmatically parse a PDF file, look for certain phrases, and find out which page each phrase is on. Is this possible (I understand that a PDF is not like a text file)? If so, are there libraries out there that can help?
Apache Tika, which started as part of the Apache Lucene project and is now a top-level Apache project, includes PDFBox, which will extract the text so that you can work with it.
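Once you have the text, the page lookup itself is simple. Below is a rough sketch in Java, assuming you have already extracted the text one page at a time into a list (with PDFBox, one way to do that is a `PDFTextStripper` restricted to a single page via `setStartPage` and `setEndPage`; that step is left out here). `PhrasePageFinder` and `findPages` are names made up for this example; they are not part of Tika or PDFBox.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Tika or PDFBox.
final class PhrasePageFinder {

    // Assumption: pageTexts.get(i) holds the extracted text of page i + 1.
    // Returns, for each phrase, the 1-based page numbers it appears on.
    static Map<String, List<Integer>> findPages(List<String> pageTexts, List<String> phrases) {
        Map<String, List<Integer>> hits = new LinkedHashMap<>();
        for (String phrase : phrases) {
            hits.putIfAbsent(phrase, new ArrayList<>());
        }
        for (int i = 0; i < pageTexts.size(); i++) {
            String page = normalize(pageTexts.get(i));
            for (Map.Entry<String, List<Integer>> entry : hits.entrySet()) {
                if (page.contains(normalize(entry.getKey()))) {
                    entry.getValue().add(i + 1); // report 1-based page numbers
                }
            }
        }
        return hits;
    }

    // Lower-case and collapse whitespace so that line breaks introduced by
    // text extraction do not prevent a match.
    private static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }
}
```

Calling `PhrasePageFinder.findPages(pageTexts, List.of("some phrase"))` gives you a map from each phrase to the pages it was found on. Note that this simple approach will miss a phrase that is split across a page break, and that extracted PDF text does not always come out in reading order, so treat the matching as best effort.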