Python ElementTree: check the node / element type

Developer https://www.devze.com 2023-01-13 17:41 Source: Internet
I am using ElementTree and cannot figure out whether a child node is a text node or not. childelement.text does not seem to work: it gives false positives even on nodes that are not text nodes.

Any suggestions?

Example

<tr>
  <td><a href="sdas3">something for link</a></td>
  <td>tttttk</td>
  <td><a href="tyty">tyt for link</a></td>
</tr>

After parsing this XML file, I do this in Python:

for elem_main in container_trs:  # elem_main is each tr
    elem0 = elem_main[0]  # td[0]
    elem1 = elem_main[1]  # td[1]

    print(elem0.text)
    print(elem1.text)

The above code does not output elem0.text; it is blank. I do see the elem1.text (that is, tttttk) in the output.

Update 2

I am actually building a dictionary that pairs each <td> element with its text, so that I can sort the HTML table. How would I get the <td>s in this code?


How about using the iter method (called getiterator in older Python versions, and removed under that name in Python 3.9) to iterate through all the descendant nodes:

import xml.etree.ElementTree as xee

content='''
<tr>
  <td><a href="sdas3">something for link</a></td>
  <td>tttttk</td>
  <td><a href="tyty">tyt for link</a></td>
</tr>
'''

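def text_content(node):
    """Collect the non-blank text of node and all of its descendants."""
    result = []
    # iter() replaces getiterator(), which was removed in Python 3.9
    for elem in node.iter():
        text = elem.text
        if text and text.strip():
            result.append(text)
    return result

container_trs = xee.fromstring(content)
adict = {}
for elem in container_trs:
    adict[elem] = text_content(elem)
print(adict)
# Prints something like the following (addresses vary; Python 3.7+ keeps insertion order):
# {<Element 'td' at 0x...>: ['something for link'], <Element 'td' at 0x...>: ['tttttk'], <Element 'td' at 0x...>: ['tyt for link']}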

The loop for elem_main in container_trs: iterates only through the direct children of container_trs.

In contrast, the loop for elem_main in container_trs.iter(): (spelled getiterator() in older Python versions) iterates through container_trs itself, then its children, grandchildren, and so on.
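
To make the difference concrete, here is a small sketch that lists the tags each kind of loop visits. It reuses the table row from the question; the names row, direct_tags and all_tags are only illustrative.

```python
import xml.etree.ElementTree as ET

content = '''
<tr>
  <td><a href="sdas3">something for link</a></td>
  <td>tttttk</td>
  <td><a href="tyty">tyt for link</a></td>
</tr>
'''

row = ET.fromstring(content)

# Iterating an element directly yields only its immediate children.
direct_tags = [child.tag for child in row]

# iter() yields the element itself, then every descendant in document order.
all_tags = [elem.tag for elem in row.iter()]

print(direct_tags)  # ['td', 'td', 'td']
print(all_tags)     # ['tr', 'td', 'a', 'td', 'td', 'a']
```

The <a> elements show up only in the second list, which is why a loop over the direct children never sees the link text.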


elem0.text is None because the text is actually part of the <a> subelement. Just go one level deeper:

print(elem0[0].text)

Indexing the element, as in elem0[0], returns its first child directly. It is equivalent to the older elem0.getchildren()[0]; getchildren() was removed in Python 3.9, so indexing is the form to use.
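
If the goal is to tell a cell that holds plain text from one that wraps a child element, one possible approach is to check len() of the cell, which counts its direct children, and to gather the text with itertext(). This is only a sketch: cell_text and cells are hypothetical names, not something from the question, and itertext() is a different technique from the indexing shown above.

```python
import xml.etree.ElementTree as ET

content = '''
<tr>
  <td><a href="sdas3">something for link</a></td>
  <td>tttttk</td>
  <td><a href="tyty">tyt for link</a></td>
</tr>
'''

row = ET.fromstring(content)


def cell_text(td):
    """Return the text of a cell, whether it sits in the td or in a nested element."""
    # itertext() walks the td and its descendants, yielding each text fragment.
    return "".join(td.itertext()).strip()


# For each td: does it have child elements, what is its own .text, and its full text?
cells = [(len(td) > 0, td.text, cell_text(td)) for td in row]

for has_children, own_text, full_text in cells:
    print(has_children, repr(own_text), full_text)
# True None something for link
# False 'tttttk' tttttk
# True None tyt for link
```

A td whose len() is greater than zero contains at least one child element, so its own .text may be None even though the cell displays text in a browser.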
