How to link scanned document with its text content to make it searchable?

Developer https://www.devze.com 2023-01-19 10:30 Source: web
I have PDF documents containing several images/pages of scanned documents. Their (OCR-produced) text content comes in separate XML files.

Is it possible to use/link the text content from XML somehow to my PDF files? (Ideally there would be no additional files left in the repository to confuse unaware users.)

I've been told there's a 65k limit on a text property, so I can't simply put the text content into a property on the node, as the text of a PDF might easily exceed that limit.

A suggestion has been made to pass a stream with the text content to the cm:content property of my PDF file. I'm kinda lost here, as IMO that means that either I'm providing a reference or I'm assigning a huge string again. The former would mean the text content has to be preserved somewhere as a separate document. The latter sounds like I would hit the 65k limit again.

Also I think setting cm:content would probably delete the PDF content itself. I need the PDF binary data to remain untouched.

This is where the suggestion is being discussed. I'm currently trying that anyways.


Soo, it is actually quite easy... What needs to be done is to define a property of type "d:content" on your document; I do that via an aspect...

model.xml:

<aspects>
    <aspect name="mm:my_aspect">
...
            <property name="mm:myTextContentProperty">
                <type>d:content</type>
            </property>
        </properties>
    </aspect>
</aspects>

Then, when I have both PDF and its text representation in the repository, I link those two by adding the aspect and populating the property...

getNodeService().addAspect(pdfNodeRef, myAspect, null);
getNodeService().setProperty(pdfNodeRef, MyModel.MY_TEXT_CONTENT_PROPERTY, new ContentData("store://....bin", "text/plain", size, "UTF-8"));

Now the PDF can be found via both of the following queries, even though it does not contain any text data itself...

"@\\{http\\://mymodel.ns/content/1.0\\}myTextContentProperty:\"" + string + "\""
"TEXT:\"" + string + "\""

The latter is also hinted here, and I guess that is how regular search in Alfresco Web Client works, because now the PDF is reachable using the regular search input.
There is one issue though: the search returns both the PDF document and the text document I link using the property. So now I need to hide the latter from search results...

(When searching using the first query only the PDF is found, as expected; but that approach is of little use to me.)
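For reference, here is a minimal sketch of how the two query strings above could be assembled without hand-writing the backslashes. The class and method names (TextQueries, propertyQuery, fullTextQuery) are made up for illustration and are not part of Alfresco. The sketch escapes only the characters that are escaped in the query above ('{', '}' and ':'), plus backslashes and double quotes inside the phrase, on the assumption that standard Lucene backslash escaping applies there.

```java
/** Hypothetical helper, not an Alfresco class: builds the two Lucene query strings. */
final class TextQueries {
    private TextQueries() {}

    /** Escapes '{', '}' and ':' in "{namespace}localName", as in the hand-written query. */
    static String escapeQName(String namespaceUri, String localName) {
        String full = "{" + namespaceUri + "}" + localName;
        StringBuilder sb = new StringBuilder();
        for (char c : full.toCharArray()) {
            if (c == '{' || c == '}' || c == ':') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Assumption: backslash-escaping '\' and '"' keeps user input from ending the phrase early. */
    static String escapePhrase(String phrase) {
        return phrase.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Query against the d:content property only. */
    static String propertyQuery(String namespaceUri, String localName, String phrase) {
        return "@" + escapeQName(namespaceUri, localName) + ":\"" + escapePhrase(phrase) + "\"";
    }

    /** Full-text query, as used by the regular search input. */
    static String fullTextQuery(String phrase) {
        return "TEXT:\"" + escapePhrase(phrase) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(propertyQuery("http://mymodel.ns/content/1.0", "myTextContentProperty", "invoice"));
        System.out.println(fullTextQuery("invoice"));
    }
}
```

The resulting strings would then be passed to the search service exactly like the hand-built ones.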

Hopefully this saves some time for other Alfresco newbies. :)


Another way to achieve what I need would be setting MY_TEXT_CONTENT_PROPERTY using contentService...

ContentWriter writer = getContentService().getWriter(pdfNodeRef, MyModel.MY_TEXT_CONTENT_PROPERTY, true);
writer.setMimetype("text/plain");
writer.setEncoding("UTF-8");
writer.putContent(stringFromXmlDescription); // the source XML gets thrown away

(The important thing seems to be to put the content only after the mimetype and encoding are set. Otherwise the content/property is not searchable.)

With this approach there's no need to hide the linked text documents, because there aren't any.
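The stringFromXmlDescription value above has to come from somewhere. Below is a rough sketch of one way to pull plain text out of an OCR XML file using only the JDK. The layout of the XML is not described in this post, so the element names in the example (page, line) are pure assumptions, and the class name OcrXmlText is hypothetical. The sketch simply concatenates every text node in document order, which may or may not suit a real OCR schema.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: flattens an OCR XML description into one searchable string. */
final class OcrXmlText {
    private OcrXmlText() {}

    static String extractText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted XML cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            // Collapse runs of whitespace left over from indentation and line breaks.
            return out.toString().trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse OCR XML", e);
        }
    }

    /** Appends every text node below 'node', separated by spaces so adjacent elements do not merge. */
    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        // Assumed sample layout; a real OCR file will look different.
        String xml = "<page><line>Hello</line><line>scanned  world</line></page>";
        System.out.println(extractText(xml));
    }
}
```

The returned string is what would be handed to writer.putContent(...). One limitation of joining text nodes with spaces: a word split across inline elements would end up split in the output too.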
