How To Search Domain Objects And The Physical Files They Point To Using Solr Or Searchable

Developer https://www.devze.com 2023-01-22 03:24 Source: Internet
I have a digital library system where I store metadata and the path to the physical file in the database. The files may be anything: plain text, Word, PDF, MP3, JPEG, MP4...

How can I provide full-text search over both my domain objects and the physical files (or some text extraction of the files)?

Is my only choice to store the document text in the domain object? I do need to be able to retrieve a list of domain objects regardless of whether the search results come from the domain object or the physical document. There is of course a possible connection through the file path, and I actually drop each document in a folder named by a GUID, so the connection is there.

I need to do this in Grails, ideally using the Solr or Searchable plugin, but a Java solution would help.


You don't need to store the content in the domain object; just associate the content with the domain object when creating the index entry. I used Apache POI to extract my content, but there are higher-level services like Apache Tika.

You could code it up in Java using Lucene directly, but I would suggest Solr instead.

The Grails Searchable plugin is based on Compass, which in turn is based on Lucene.
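
As a rough illustration of associating extracted text with the domain object at index time, here is a toy in-memory index in plain Java. It is only a stand-in for Lucene or Solr and does not use their APIs; the class name LibraryIndex and its methods are made up for this sketch, and the extracted text is assumed to come from something like Tika or POI.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index: maps each token to the GUIDs of the domain
// objects whose metadata OR extracted file text contains that token.
final class LibraryIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Index the metadata and the extracted file text under the same GUID,
    // so a hit in either one leads back to the same domain object.
    void add(String guid, String metadata, String extractedText) {
        indexField(guid, metadata);
        indexField(guid, extractedText);
    }

    private void indexField(String guid, String text) {
        if (text == null) {
            return; // e.g. a JPEG or MP3 with no extractable text
        }
        // Split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(guid);
            }
        }
    }

    // Returns the GUIDs of all domain objects matching a single term.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

The search result is a set of GUIDs, which can then be used to load the domain objects from the database, whether the match was in the metadata or in the file content.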


Have a look at this article that covers use cases like yours, based on Spring, Hibernate, Hibernate Search, and JSF. It comes with a comprehensive, well-documented, sample application.

The article focuses on separation of concerns and modularity, by the way, so its full-text search concepts should carry over well to Grails or other Java-based applications.

The main domain class is de.metagear.library.model.Media (there is an associated MetaData domain class, too). You'll be able to mix Hibernate and GORM classes; however, you'll need to use different APIs then.

The Media class contains a property plainText:

@Column(name = "plain_text", nullable = false)
@Field(index = Index.TOKENIZED, store = Store.YES)
@Lob
private String plainText;

That property holds the extracted text (i.e., from PDFs, etc.). I'm not sure whether it needs to be saved to the database or not (probably not, but it shouldn't do much harm otherwise). Either way, the database column is not what gets searched (see below): full-text search uses only the Lucene indexes.

Before a Media is created, the text content of the corresponding original document (possibly a binary one) is extracted. The de.metagear.library.model.factory.MediaFactory.getInstance(..) method extracts the text, stores the extracted text in a new Media object, and returns that Media.

In the sample, it simply stores the original document in a property of the domain object, but at that point you could instead save the document to a file and store a reference (the GUID you mentioned) in a property of the Media.
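
A minimal sketch of that alternative, assuming each document goes into a folder named by a freshly generated GUID, as described in the question. GuidFileStore and its methods are hypothetical names, not part of the sample application, and the file name is assumed to be trusted (no path sanitizing is done here).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical helper: stores each document under <root>/<guid>/<fileName>.
final class GuidFileStore {
    private final Path root;

    GuidFileStore(Path root) {
        this.root = root;
    }

    // Writes the content to disk and returns the GUID, which the domain
    // object can keep as its reference to the physical file.
    String store(String fileName, byte[] content) {
        String guid = UUID.randomUUID().toString();
        try {
            Path dir = Files.createDirectories(root.resolve(guid));
            Files.write(dir.resolve(fileName), content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return guid;
    }

    // Rebuilds the file location from the stored GUID and file name.
    Path resolve(String guid, String fileName) {
        return root.resolve(guid).resolve(fileName);
    }

    // Reads the document back, e.g. to feed it to a text extractor.
    byte[] load(String guid, String fileName) {
        try {
            return Files.readAllBytes(resolve(guid, fileName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The domain object then only needs a GUID property (plus the file name), while the extracted text goes to the index.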

The domain class is saved by the de.metagear.library.dao.MediaCrudDaoImpl class, which is a Spring bean. Internally, it's using an injected EntityManagerFactory that, in /applicationContext.xml, has been defined to use Hibernate under the hood.

Indexing happens automatically because of the Hibernate Search annotations on the domain class.

The full-text search itself is performed by the de.metagear.library.dao.MediaSearchDaoImpl.getSearchResults(..) method, which queries only the Lucene indexes, not the database.

The sample application contains a powerful query terms pre-processor that can combine AND, OR, and NOT operators on different indexes while preserving the comprehensive Lucene expression syntax.
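
The pre-processor itself is not shown here, so the following is only a sketch of the set semantics behind combining AND, OR, and NOT across hits from different indexes. BooleanHits and its method names are hypothetical, and the sets are assumed to hold the identifiers (for example, GUIDs) returned by per-field lookups.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: boolean combination of hit sets from different indexes.
final class BooleanHits {
    private BooleanHits() {
    }

    // a AND b: identifiers present in both hit sets.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // a OR b: identifiers present in either hit set.
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // a NOT excluded: identifiers in a that are absent from the excluded set.
    static Set<String> not(Set<String> a, Set<String> excluded) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(excluded);
        return result;
    }
}
```

In a real setup, Lucene's own query syntax handles these operators; the sketch only shows what the combined result means in terms of matching objects.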

By setting a custom org.hibernate.transform.ResultTransformer there, objects of any type (including domain classes, of course) can be returned.


I haven't looked into the Grails Searchable plugin yet, so I can't tell whether it's viable in terms of robustness, maintainability, ease of use, and, last but not least, extensibility with custom or third-party content extractors, parsers, and analyzers. It probably is.

My approach does require basic knowledge of Spring and (possibly) Hibernate. These are the very frameworks that Grails and GORM are built on, but this might still be a deciding factor for you.

At the very least, looking at the concepts above should be informative and help you evaluate other frameworks and approaches.

Thanks.
