I'm trying to parse some data in which every cell of a table is in a <text />
node. I need to ignore any node that starts with the star character (*),
as well as the 4 nodes after it. Can this be done with XPath, or do I need to go about this in a different way?
EDIT: My XML looks like the following:
<page>
<text attr="123" attr2="1234">ROW 1 CELL 1</text>
<text attr="123" attr2="1234">ROW 1 CELL 2</text>
<text attr="123" attr2="1234">ROW 1 CELL 3</text>
<text attr="123" attr2="1234">ROW 1 CELL 4</text>
<text attr="123" attr2="1234">ROW 1 CELL 5</text>
<text attr="123" attr2="1234">* ROW 2 CELL 1</text>
<text attr="123" attr2="1234">ROW 2 CELL 2</text>
<text attr="123" attr2="1234">ROW 2 CELL 3</text>
<text attr="123" attr2="1234">ROW 2 CELL 4</text>
<text attr="123" attr2="1234">ROW 2 CELL 5</text>
<text attr="123" attr2="1234">ROW 3 CELL 1</text>
<text attr="123" attr2="1234">ROW 3 CELL 2</text>
<text attr="123" attr2="1234">ROW 3 CELL 3</text>
<text attr="123" attr2="1234">ROW 3 CELL 4</text>
<text attr="123" attr2="1234">ROW 3 CELL 5</text>
</page>
The following expression:
/*/text[not(starts-with(., '*')) and
not(preceding::*[position()<5][starts-with(., '*')])]
selects the following nodes from your input:
<root>
<text attr="123" attr2="1234">ROW 1 CELL 1</text>
<text attr="123" attr2="1234">ROW 1 CELL 2</text>
<text attr="123" attr2="1234">ROW 1 CELL 3</text>
<text attr="123" attr2="1234">ROW 1 CELL 4</text>
<text attr="123" attr2="1234">ROW 1 CELL 5</text>
<text attr="123" attr2="1234">ROW 3 CELL 1</text>
<text attr="123" attr2="1234">ROW 3 CELL 2</text>
<text attr="123" attr2="1234">ROW 3 CELL 3</text>
<text attr="123" attr2="1234">ROW 3 CELL 4</text>
<text attr="123" attr2="1234">ROW 3 CELL 5</text>
</root>
All of ROW 2 is skipped: the starred node is excluded by the first condition, and the four nodes after it by the second.
The following expression is equivalent (by De Morgan's laws):
/*/text[not(starts-with(., '*') or
preceding::*[position()<5][starts-with(., '*')])]
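If full XPath 1.0 is not available to you (Python's standard xml.etree.ElementTree, for instance, supports only a limited XPath subset without starts-with() or the preceding axis), the same predicate can be written as a plain loop. The following is only a sketch under that assumption; the helper name keep_unstarred and the window parameter are made up for illustration, and the sample document is rebuilt from the question's input.

```python
import xml.etree.ElementTree as ET

# Rebuild the question's sample: 3 rows of 5 cells, with row 2 cell 1 starred.
cells = []
for row in (1, 2, 3):
    for cell in range(1, 6):
        star = "* " if (row, cell) == (2, 1) else ""
        cells.append(f'<text attr="123" attr2="1234">{star}ROW {row} CELL {cell}</text>')
XML = "<page>\n" + "\n".join(cells) + "\n</page>"


def keep_unstarred(texts, window=4):
    """Hypothetical helper mirroring the XPath predicate: drop an element if it
    starts with '*' or if any of the `window` elements just before it does."""
    kept = []
    for i, node in enumerate(texts):
        # The node itself plus up to `window` immediate predecessors.
        recent = texts[max(0, i - window):i + 1]
        if not any((n.text or "").startswith("*") for n in recent):
            kept.append(node)
    return kept


page = ET.fromstring(XML)
kept_cells = [n.text for n in keep_unstarred(page.findall("text"))]
print(kept_cells)
```

Because each element is checked only against its own four predecessors, this handles any number of starred nodes, just like the expression above.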
This will work for you:
//text[starts-with(.,"*")]/preceding-sibling::text
| //text[starts-with(.,"*")]/following-sibling::text[position() > 4]
For the provided input, this returns the desired nodes:
<text attr="123" attr2="1234">ROW 1 CELL 1</text>
<text attr="123" attr2="1234">ROW 1 CELL 2</text>
<text attr="123" attr2="1234">ROW 1 CELL 3</text>
<text attr="123" attr2="1234">ROW 1 CELL 4</text>
<text attr="123" attr2="1234">ROW 1 CELL 5</text>
<text attr="123" attr2="1234">ROW 3 CELL 1</text>
<text attr="123" attr2="1234">ROW 3 CELL 2</text>
<text attr="123" attr2="1234">ROW 3 CELL 3</text>
<text attr="123" attr2="1234">ROW 3 CELL 4</text>
<text attr="123" attr2="1234">ROW 3 CELL 5</text>
However, as @lwburk points out in the comments, it does not work in the general case where multiple nodes begin with *. This is because the | operator takes the union of the two paths, so it ends up selecting everything before and after each matching node, including nodes that another match should have excluded. His solution correctly handles both situations.
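To see the difference concretely, here is a rough Python model of both approaches. It does not evaluate XPath; it treats the sibling cells as a plain list and imitates what each expression selects. The function names union_select and predicate_select, and the five-row input with stars on rows 2 and 4, are assumptions made up for this illustration.

```python
def union_select(cells, skip=4):
    """Model of the union expression: preceding siblings of every starred
    node, plus its following siblings past the first `skip`."""
    selected = set()
    for i, cell in enumerate(cells):
        if cell.startswith("*"):
            selected.update(range(0, i))                      # preceding-sibling::text
            selected.update(range(i + skip + 1, len(cells)))  # following-sibling::text[position() > 4]
    return [cells[i] for i in sorted(selected)]


def predicate_select(cells, skip=4):
    """Model of the predicate expression: keep a cell only if neither it nor
    any of its `skip` nearest predecessors starts with '*'."""
    return [c for i, c in enumerate(cells)
            if not any(p.startswith("*") for p in cells[max(0, i - skip):i + 1])]


# Five rows of five cells; rows 2 and 4 begin with a starred cell.
cells = [("* " if (row in (2, 4) and col == 1) else "") + f"ROW {row} CELL {col}"
         for row in range(1, 6) for col in range(1, 6)]

union_result = union_select(cells)
predicate_result = predicate_select(cells)
print(len(union_result), len(predicate_result))
```

With two starred rows, the union model returns all 25 cells, because each starred row is picked up as "before" or "after" the other one, while the predicate model returns only the 15 cells of rows 1, 3 and 5.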