Parsing XML by using nested loops and returning multiple values

I have a large XML file having the following structure:

<brain>
<q>
  <question> What are your hobbies? </question>
  <question> What do you do for sake of fun? </question>
  <question> How do you spend your spare time? </question>
  <question> What are your interests? </question>
  <question> What do you enjoy most? </question>
  <answer> I like [personal_info/hobby] </answer>
  <answer>[personal_info/hobby]</answer>
  <answer>I enjoy [personal_info/hobby] </answer>
</q>

<q>
  <question> Where do you live? </question>
  <question> What city do you live in? </question>
  <question> Where are you from? </question>
  <question> Where are you living? </question>
  <question> Where is your residence? </question>
  <answer> I live at [personal_info/loc] </answer>
  <answer> I am living in [personal_info/loc]</answer>
  <answer> At [personal_info/loc]</answer>
  <answer>   [personal_info/loc]</answer>
</q>
.
.
.
</brain>

As you might have guessed, it is a database for a chatbot. The idea is that the user enters a question (or any sentence, for that matter) and our Java-based chatbot runs an XQuery over this file. The XQuery implementation I am using (known as Nux) provides fuzzy sentence-similarity matching, so it returns sentences that partially match. Here is some code to illustrate this:

Nodes results = XQueryUtil.xquery(doc, "declare namespace lucene = \"java:nux.xom.pool.FullTextUtil\"; "
    + "for $q in /brain/q "
    + " for $question in $q/question"
    + " let $score := lucene:match($question, \"How are you\") "
    + " where $score > 0.1 "
    + " order by $score descending "
    + "return $q/answer");

This code is supposed to loop through each brain/q and then each q/question and, if a question's similarity score is greater than 0.1, return the <answer> elements in that <q>. The problem is that it returns ALL answer tags. For example, if "What are your hobbies?" is asked, it should return

  <answer> I like [personal_info/hobby] </answer>
  <answer>[personal_info/hobby]</answer>
  <answer>I enjoy [personal_info/hobby] </answer>

but it returns all the answer tags found in the file. It also repeats them an unpredictable number of times.

Can you please help me with this?

The dataset was generated by running various scripts, then collected and manually checked by me. If necessary, I can change the structure of the XML to solve this problem, but I would prefer not to if possible.

Thanks for taking the time to read my question and for thinking about how to help.

I know from experience with Lucene (or actually Solr) that a similarity score of 0.1 is reached very easily, which would explain why all answers in the file are returned.

In the search system I use Solr for, I use a threshold of about 0.4 to 0.6 (depending on the fields searched).

You could try to show the returned score for each question (if XQueryUtil allows that) to see how well the correct ones score for the sentences you want to match. That way you can choose a better threshold.
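One way to inspect the scores is to have the query return each question together with its score instead of the answers. The following is only a sketch: it reuses the `lucene:match` call and namespace declaration from the question, while the class `ScoreQuery`, its methods, and the `<hit>` element name are made up for illustration. It only builds the query string; running it would go through `XQueryUtil.xquery(doc, query)` as in the question.

```java
final class ScoreQuery {
    // Quote text as a double-quoted XQuery string literal:
    // '&' becomes an entity reference and '"' is doubled.
    static String xqueryLiteral(String text) {
        return "\"" + text.replace("&", "&amp;").replace("\"", "\"\"") + "\"";
    }

    // Build a query that lists every question with its score, best first.
    // There is deliberately no "where" clause, so low scores are visible too.
    static String build(String userSentence) {
        return "declare namespace lucene = \"java:nux.xom.pool.FullTextUtil\"; "
            + "for $q in /brain/q "
            + "for $question in $q/question "
            + "let $score := lucene:match($question, " + xqueryLiteral(userSentence) + ") "
            + "order by $score descending "
            + "return <hit score=\"{$score}\">{string($question)}</hit>";
    }
}
```

Printing the resulting nodes for a few typical inputs should show where the scores of good and bad matches fall.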

Another way would be to simply try some higher values, see whether you get fewer answers back, and find a suitable value by trial and error.

Important: scores in Solr are always relative. You should never compare them to a fixed value, since they differ from query to query and are not normalized.

Why not use a normal query, set rows to 1 or 10, and (automatically) order by score? You can make answers a multi-valued field and create one document per q.
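The "one document per q, ordered by score, limited to a few rows" idea can be sketched in plain Java without Solr. The class `TopAnswers`, its `Entry` type, and the scores in the usage note are hypothetical; in practice each entry's score would come from the search engine. Because each q appears exactly once, its answers cannot be repeated in the result.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TopAnswers {
    /** One entry per <q>: its best score and all of its answers (the multi-valued field). */
    static final class Entry {
        final double score;
        final List<String> answers;

        Entry(double score, List<String> answers) {
            this.score = score;
            this.answers = answers;
        }
    }

    /** Returns the answers of the `rows` best-scoring entries, best entry first. */
    static List<String> top(List<Entry> entries, int rows) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingDouble((Entry e) -> e.score).reversed());
        List<String> result = new ArrayList<>();
        for (Entry e : sorted.subList(0, Math.min(rows, sorted.size()))) {
            result.addAll(e.answers);
        }
        return result;
    }
}
```

For example, with one entry scored 0.9 for the hobby block and one scored 0.2 for the location block, `TopAnswers.top(entries, 1)` yields only the hobby answers, whatever the absolute score values are.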

You should run the best matches through your own quality function anyway, to ensure at least minimal semantic matching.
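What such a quality function looks like depends on the application. As a crude stand-in, and purely as an assumption for illustration, the sketch below accepts a match only if the user's sentence and the matched question share enough words (Jaccard overlap of their word sets). This is lexical rather than truly semantic; the class `OverlapCheck` and the `minOverlap` parameter are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class OverlapCheck {
    // Lowercase the sentence and split it into a set of alphanumeric words.
    static Set<String> words(String sentence) {
        Set<String> out = new HashSet<>();
        for (String w : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    /** Jaccard overlap of the two word sets, in the range [0, 1]. */
    static double overlap(String a, String b) {
        Set<String> wa = words(a);
        Set<String> wb = words(b);
        if (wa.isEmpty() && wb.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(wa);
        shared.retainAll(wb);
        Set<String> all = new HashSet<>(wa);
        all.addAll(wb);
        return (double) shared.size() / all.size();
    }

    /** Accepts the match only if the overlap reaches the caller's minimum. */
    static boolean accept(String userSentence, String matchedQuestion, double minOverlap) {
        return overlap(userSentence, matchedQuestion) >= minOverlap;
    }
}
```

With a minimum of 0.5, "How are you" would be rejected as a match for " Where do you live? " because the two sentences share only the word "you".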