XML parser + Indexing data_问答_开发者_运维开发者技术经验分享

开发者 https://www.devze.com 2023-03-13 22:52 出处：网络

I need to index some xml documents with Lucene, but before that, i need to parse those XML and extract some info inside their tags.

The XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<tt xml:lang="es" xmlns="http://www.w3.org/2006/04/ttaf1"  xmlns:tts="http://www.w3.org/2006/04/ttaf1#sty开发者_JAVA百科ling">
  <head>
        <styling>
            <style id="bl" tts:fontWeight="bold" tts:color="#FFFFFF" tts:fontSize="15" tts:fontFamily="sansSerif"/>
       </styling>
  </head>

  <body>
    <div xml:lang="es">
            <p begin="00:00.50" end="00:04.02" style="bl">Info</p>
            <p begin="00:04.32" end="00:07.68" style="bl">Different words,<br />and phrases to index</p>
            <p begin="00:11.76" end="00:16.04" style="bl">Text</p>
            <p begin="00:18.52" end="00:22.88" style="bl">More and<br />more text</p>
   </div>
  </body>
</tt>

I need to extract only the timestamps inside the tags begin and end, and then index the text inside the p tags. The goal is to query the text indexed and know in which timestamp gap are each hit.

For example, if i query the word "Text" the output should say something like: "2 hits, 00:11.76-00:16.04, 00:18.52-00:22.88"

I started indexing the entire XML with Lucene. Now i want to parse the file, but im not sure what is the best approximation to solve this problem.

Any help or advice is welcome :) Thank you all!

I used the SAX library (i.e., a subclass of org.xml.sax.helpers.DefaultHandler ) to parse XML files, extracted the desired information from each XML document into my own Document class, and then indexed that Document instance. (The indirection was due to having multiple document formats that had to be parsed separately, but indexed in the same index.) In your case, if the contents of each of your <body> elements represents a logical document, you can store the date information as payloads associated with specific tokens. Parse the XML to the <p> level, enumerate the paragraph instances, and for each instance, add a new Field instance with the same name, where the value is the text, and the payload is the date information, suitably represented. (Payloads are binary, so, for example, you could store the two long values corresponding to the start and end times.) When you add multiple field instances with the same name to a document, they get indexed as the same field, but you can assign different payloads to each instance, you can adjust the position of the start of the text, etc.

If you don't need the contents of each element as a single document, you can treat each <p> as a separate document, and then set the payload on that. Alternatively, you can store dates as a separate field.

I can highly recommend storing all your XML in an eXist database, which has Lucene built-in. I've been using this combination for a few months now and it solves a lot of search and retrieval problems quite easily.