Subfield searching in Solr on complex datatype that requires multiValued=true

开发者 https://www.devze.com 2023-04-03 03:36 Source: web
I'm building a solution that has a main LuceneDocument of type "Artifact".

In the artifact, we extract complex data types to further classify and organize the data. As part of our Solr indexing, our objective is to let users run search queries across the full text of the artifact as well as across these complex data types.

Example text:

"The young man, named John Doe, jumped over the lazy dog then had dinner at 123 Indiana Blvd, Pittsburgh, PA 15235 with the brown fox. After a some time, the brown fox flew to visit the President at 1600 Pennsylvania Ave, Washington, DC 20500"

Our extraction process will pull out three useful entities:

  1. Person - John Doe
  2. Address - 123 Indiana Blvd, Pittsburgh, PA 15235
  3. Address - 1600 Pennsylvania Ave, Washington, DC 20500

We need to further decompose #2 and #3 into:

  • streetAddressOne
  • city
  • state
  • zipCode
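To make the decomposition concrete, here is a minimal sketch of splitting an extracted address string into those four subfields. It is only an illustration, not our actual extraction code: the `AddressParser` class, the `Address` record, and the regular expression are hypothetical, and the pattern assumes the simple "street, city, ST 12345" layout of the two example addresses.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressParser {

    /** One extracted address, split into the four subfields (hypothetical type). */
    public record Address(String streetAddressOne, String city, String state, String zipCode) {}

    // Assumed layout: "<street>, <city>, <two-letter state> <five-digit ZIP>".
    // Real-world addresses are messier; this only covers the example text.
    private static final Pattern US_ADDRESS = Pattern.compile(
            "^\\s*([^,]+?)\\s*,\\s*([^,]+?)\\s*,\\s*([A-Z]{2})\\s+(\\d{5})\\s*$");

    /** Returns the parsed address, or an empty Optional if the layout does not match. */
    public static Optional<Address> parse(String raw) {
        Matcher m = US_ADDRESS.matcher(raw);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Address(m.group(1), m.group(2), m.group(3), m.group(4)));
    }

    public static void main(String[] args) {
        System.out.println(parse("123 Indiana Blvd, Pittsburgh, PA 15235"));
        System.out.println(parse("1600 Pennsylvania Ave, Washington, DC 20500"));
    }
}
```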

During Solr publishing, we will create an indexable object (using SolrJ) with the following fields:

@Field
String artifactBody;

@Field
List<String> streetAddressOne;

@Field
List<String> city;

@Field
List<String> state;

@Field
List<String> zipCode;

@Field
List<String> person;

All goes well and we publish these records to Solr with no issues.

When a user searches for "streetAddressOne:Indiana AND city:Washington", this artifact is returned as a false positive: "Indiana" matches the street of the Pittsburgh address, while "Washington" matches the city of the other address. The query itself is reasonable, since Washington, DC does in fact have an Indiana Avenue.
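The effect can be reproduced outside Solr. The sketch below is a plain-Java stand-in, not Solr code: the `Artifact` record and the matching helpers are hypothetical, matching is simplified to a case-insensitive substring test, and the two lists are assumed to be parallel (index i of each list belongs to the same address). It contrasts the flattened multi-valued behaviour with the per-address matching we actually want.

```java
import java.util.List;
import java.util.Locale;

public class MultiValuedFalsePositive {

    /** Stand-in for the indexed artifact: parallel multi-valued fields (hypothetical type). */
    public record Artifact(List<String> streetAddressOne, List<String> city) {}

    /** Simplified matching: case-insensitive substring test instead of real tokenization. */
    static boolean valueMatches(String value, String term) {
        return value.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
    }

    /** Mimics "streetAddressOne:X AND city:Y": each clause may match ANY value of its field. */
    public static boolean flattenedMatch(Artifact a, String street, String city) {
        boolean streetHit = a.streetAddressOne().stream().anyMatch(v -> valueMatches(v, street));
        boolean cityHit = a.city().stream().anyMatch(v -> valueMatches(v, city));
        return streetHit && cityHit;
    }

    /** Desired behaviour: both terms must match at the same index, i.e. the same address. */
    public static boolean sameAddressMatch(Artifact a, String street, String city) {
        int n = Math.min(a.streetAddressOne().size(), a.city().size());
        for (int i = 0; i < n; i++) {
            if (valueMatches(a.streetAddressOne().get(i), street)
                    && valueMatches(a.city().get(i), city)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Artifact artifact = new Artifact(
                List.of("123 Indiana Blvd", "1600 Pennsylvania Ave"),
                List.of("Pittsburgh", "Washington"));

        // The flattened query matches even though no single address has both values.
        System.out.println(flattenedMatch(artifact, "Indiana", "Washington"));   // prints true
        // Pairing the values per address removes the false positive.
        System.out.println(sameAddressMatch(artifact, "Indiana", "Washington")); // prints false
    }
}
```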

That is our use case overall. I'm looking for alternative approaches that guarantee this false positive is not returned while still allowing matching on the complex types.

I started down the PolyField path, but that doesn't seem to apply when you want to search only a subset of the fields in the set.

I've also investigated making the Address its own LuceneDocument item that is published to Solr. The challenge is that the result set needs to reflect the list of Artifacts, not the list of Addresses. This "join-like" capability is just not available in a search engine, so I moved away from this idea.
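For reference, this is roughly what the address-per-document idea would require, sketched in plain Java rather than against Solr. Everything here is an assumption for illustration: the `AddressDoc` record, the `artifactId` back-reference field, and the in-memory list standing in for the index. The point is the extra step after matching, where address hits have to be collapsed back into distinct artifacts.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AddressDocsToArtifacts {

    /** Hypothetical per-address document; artifactId is an assumed back-reference to the artifact. */
    public record AddressDoc(String artifactId, String streetAddressOne, String city) {}

    /** Simplified matching: case-insensitive substring test instead of real tokenization. */
    static boolean valueMatches(String value, String term) {
        return value.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
    }

    /** Matches on single address documents, then collapses the hits to distinct artifact IDs. */
    public static Set<String> findArtifactIds(List<AddressDoc> index, String street, String city) {
        Set<String> artifactIds = new LinkedHashSet<>(); // keeps first-hit order, drops duplicates
        for (AddressDoc doc : index) {
            if (valueMatches(doc.streetAddressOne(), street) && valueMatches(doc.city(), city)) {
                artifactIds.add(doc.artifactId());
            }
        }
        return artifactIds;
    }

    public static void main(String[] args) {
        List<AddressDoc> index = List.of(
                new AddressDoc("artifact-1", "123 Indiana Blvd", "Pittsburgh"),
                new AddressDoc("artifact-1", "1600 Pennsylvania Ave", "Washington"),
                new AddressDoc("artifact-2", "500 Indiana Ave", "Washington"));

        // Only artifact-2 has a single address with both values.
        System.out.println(findArtifactIds(index, "Indiana", "Washington")); // prints [artifact-2]
    }
}
```

Because each address is matched on its own, the false positive disappears, but the search then yields artifact IDs that still have to be resolved to artifacts in a second step, which is the "join-like" part.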
