XPath to exclude nodes after match

Developer https://www.devze.com 2023-03-20 04:48 Source: Internet

I'm trying to parse some data that has every cell of a table in a <text /> node. I need to ignore nodes that start with the star character * as well as the 4 nodes after each of them. Can this be done with XPath, or do I need to go about this in a different way?

EDIT: My XML looks like the following:

<page>
    <text attr="123" attr2="1234">ROW 1 CELL 1</text>
    <text attr="123" attr2="1234">ROW 1 CELL 2</text>
    <text attr="123" attr2="1234">ROW 1 CELL 3</text>
    <text attr="123" attr2="1234">ROW 1 CELL 4</text>
    <text attr="123" attr2="1234">ROW 1 CELL 5</text>
    <text attr="123" attr2="1234">* ROW 2 CELL 1</text>
    <text attr="123" attr2="1234">ROW 2 CELL 2</text>
    <text attr="123" attr2="1234">ROW 2 CELL 3</text>
    <text attr="123" attr2="1234">ROW 2 CELL 4</text>
    <text attr="123" attr2="1234">ROW 2 CELL 5</text>
    <text attr="123" attr2="1234">ROW 3 CELL 1</text>
    <text attr="123" attr2="1234">ROW 3 CELL 2</text>
    <text attr="123" attr2="1234">ROW 3 CELL 3</text>
    <text attr="123" attr2="1234">ROW 3 CELL 4</text>
    <text attr="123" attr2="1234">ROW 3 CELL 5</text>
</page>

The following expression:

 /*/text[not(starts-with(., '*')) and 
         not(preceding::*[position()<5][starts-with(., '*')])]

Selects the following against your input:

<root>
  <text attr="123" attr2="1234">ROW 1 CELL 1</text>
  <text attr="123" attr2="1234">ROW 1 CELL 2</text>
  <text attr="123" attr2="1234">ROW 1 CELL 3</text>
  <text attr="123" attr2="1234">ROW 1 CELL 4</text>
  <text attr="123" attr2="1234">ROW 1 CELL 5</text>
  <text attr="123" attr2="1234">ROW 3 CELL 1</text>
  <text attr="123" attr2="1234">ROW 3 CELL 2</text>
  <text attr="123" attr2="1234">ROW 3 CELL 3</text>
  <text attr="123" attr2="1234">ROW 3 CELL 4</text>
  <text attr="123" attr2="1234">ROW 3 CELL 5</text>
</root>

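All of ROW 2 is skipped. This works because preceding is a reverse axis: position()<5 counts backwards from the context node, so the predicate only tests the four nearest preceding elements.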

The following expression is equivalent (by De Morgan's laws):

/*/text[not(starts-with(., '*') or 
            preceding::*[position()<5][starts-with(., '*')])]
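
If XPath is not a hard requirement, the same filter can also be done with a plain loop. Below is a rough Python sketch using only the standard library's xml.etree.ElementTree. Its XPath subset has no starts-with() and no preceding axis, so the skipping is done by hand. The names keep_unstarred and skip_after are made up for this example, and the sample is the XML from the question.

```python
import xml.etree.ElementTree as ET

# Sample input copied from the question.
SAMPLE = """<page>
    <text attr="123" attr2="1234">ROW 1 CELL 1</text>
    <text attr="123" attr2="1234">ROW 1 CELL 2</text>
    <text attr="123" attr2="1234">ROW 1 CELL 3</text>
    <text attr="123" attr2="1234">ROW 1 CELL 4</text>
    <text attr="123" attr2="1234">ROW 1 CELL 5</text>
    <text attr="123" attr2="1234">* ROW 2 CELL 1</text>
    <text attr="123" attr2="1234">ROW 2 CELL 2</text>
    <text attr="123" attr2="1234">ROW 2 CELL 3</text>
    <text attr="123" attr2="1234">ROW 2 CELL 4</text>
    <text attr="123" attr2="1234">ROW 2 CELL 5</text>
    <text attr="123" attr2="1234">ROW 3 CELL 1</text>
    <text attr="123" attr2="1234">ROW 3 CELL 2</text>
    <text attr="123" attr2="1234">ROW 3 CELL 3</text>
    <text attr="123" attr2="1234">ROW 3 CELL 4</text>
    <text attr="123" attr2="1234">ROW 3 CELL 5</text>
</page>"""


def keep_unstarred(page, skip_after=4):
    """Return the <text> children of page, dropping every node whose text
    starts with '*' and the skip_after nodes that follow it."""
    kept = []
    remaining = 0  # nodes still to drop after the most recent '*' node
    for node in page.findall("text"):
        text = node.text or ""
        if text.startswith("*"):
            remaining = skip_after  # a new '*' restarts the skip window
            continue
        if remaining > 0:
            remaining -= 1
            continue
        kept.append(node)
    return kept


page = ET.fromstring(SAMPLE)
kept_texts = [node.text for node in keep_unstarred(page)]
for text in kept_texts:
    print(text)
```

On the sample this keeps the ROW 1 and ROW 3 cells, the same nodes the XPath expression selects.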

This will work for the input you posted:

//text[starts-with(.,"*")]/preceding-sibling::text 
| //text[starts-with(.,"*")]/following-sibling::text[position() > 4]

For the provided input this returns the desired nodes:

<text attr="123" attr2="1234">ROW 1 CELL 1</text>
<text attr="123" attr2="1234">ROW 1 CELL 2</text>
<text attr="123" attr2="1234">ROW 1 CELL 3</text>
<text attr="123" attr2="1234">ROW 1 CELL 4</text>
<text attr="123" attr2="1234">ROW 1 CELL 5</text>
<text attr="123" attr2="1234">ROW 3 CELL 1</text>
<text attr="123" attr2="1234">ROW 3 CELL 2</text>
<text attr="123" attr2="1234">ROW 3 CELL 3</text>
<text attr="123" attr2="1234">ROW 3 CELL 4</text>
<text attr="123" attr2="1234">ROW 3 CELL 5</text>

However, as @lwburk points out in the comments, this does not work in the general case where multiple nodes begin with *. Because | takes the union of the two node-sets, each starred node contributes everything before it and everything more than four siblings after it, so the combined result includes nodes that should have been skipped. His solution handles both the single-match and multiple-match cases correctly.
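
To make the difference concrete, here is a small Python sketch that models both selections by index over a flat list of sibling cells. It is only a model of the node-set semantics, not an XPath engine. The function names and the four-row sample with two starred rows are assumptions made up for illustration.

```python
def union_select(cells, skip_after=4):
    """Model of:
    //text[starts-with(.,"*")]/preceding-sibling::text
    | //text[starts-with(.,"*")]/following-sibling::text[position() > 4]
    """
    stars = [i for i, c in enumerate(cells) if c.startswith("*")]
    picked = set()
    for s in stars:
        # preceding-sibling::text of this starred node
        picked.update(range(0, s))
        # following-sibling::text[position() > 4] of this starred node
        picked.update(range(s + skip_after + 1, len(cells)))
    return [cells[i] for i in sorted(picked)]  # document order, no duplicates


def predicate_select(cells, skip_after=4):
    """Model of:
    /*/text[not(starts-with(., '*')) and
            not(preceding::*[position()<5][starts-with(., '*')])]
    """
    return [
        c for i, c in enumerate(cells)
        if not c.startswith("*")
        and not any(p.startswith("*") for p in cells[max(0, i - skip_after):i])
    ]


# Hypothetical input: four rows of five cells, with rows 2 and 4 starred.
cells = [f"ROW {r} CELL {c}" for r in range(1, 5) for c in range(1, 6)]
cells[5] = "* " + cells[5]
cells[15] = "* " + cells[15]

union_result = union_select(cells)
predicate_result = predicate_select(cells)
print(len(union_result), len(predicate_result))  # 20 10
```

With two starred rows the union returns all 20 cells, including both starred ones, because each starred node falls inside the other's "before" or "after" set. The predicate version keeps only the 10 cells of rows 1 and 3.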