xml parse without the recursive search python_问答_开发者

This is driving me mental, and I've probably been hacking away at it for to long so would appreciate some help to prevent lose of/restore my sanity! The food based xml is only an example of what I wish to achieve.

I have the following file which I am trying to put into a graph, so wheat and fruit are parents with a depth of 0. Indian is a child of wheat with a depth of 1 and so on and so on.

Each of the layers has some keywords. So what I want to get out is

layer, depth, parent, keywords
wheat, 1, ROOT, [bread, pita, narn, loaf]  
indian, 2, wheat [chapati]
mumbai, 3, indian, puri 
fruit, 1,ROOT, [apple, orange, pear, lemon]

This is a sample file -

<keywords>
    <layer id="wheat">
        <layer id="indian">
            <keyword>chapati</keyword>
            <layer id="mumbai">
                <keyword>puri</keyword>
            </layer>
        </layer>
        <keyword>bread</keyword>
        <keyword>pita</keyword>
        <keyword>narn</keyword>
        <keyword>loaf</keyword>
    </layer>
    <layer id="fruit">
        <keyword>apple</keyword>
        <keyword>orange</keyword>
        <keyword>pear</keyword>
        <keyword>lemon</keyword>
    </layer>

</keywords>

So this isnt a graph question, I can do that bit thats easy. What im struggling with is parsing the XML.

If I do a

xmldoc = minidom.parse(self.filename)

layers = xmldoc.getElementsByTagName('layer')

layers only returns all of the layer elements, which is to much and has not concept of depth/ hierachy as far as I can understand as it does a recursive search.

The following post is good, but doesnt provide the concepts I require. XML Parsing with Python and minidom. Can anyone help with how I might go about this? I can post my code but its so hacked tog开发者_JS百科ether/fundementally broken I don't think it would be use to man nor beast!

Cheers

Dave

Use lxml. In particular, XPath. You can get all layer elements, regardless of level, through "//layer" and the layer with the id id through "//layer[id='{}'][0]".format(id). The keyword elements directly under an element (or several elements) by ".../keyword" (where ... is a query that yields the nodes whose descendants should be searched).

Getting the depth of a given node is not quite as trivial, but still easy. I didn't find an existing function (afaik, this is outside the domain of XPath - athough you can check for the depth in a query, you only return elements, i.e. you can return nodes with a specific depth but not the depth itself), so here's a hand-rolled one (no recursion, since it's not necessary - but in general, working with XML means working with recursion, like it or not!):

def depth(node):
    depth = 0
    while node.getparent() is not None:
        node = node.getParent()
        depth += 1
    return depth

Something very similar is possible with DOM, if you should be foolish enough not to use the best Python XML library in existence ;)

Here's a solution with ElementTree:

from xml.etree import ElementTree as ET
from io import StringIO
from collections import defaultdict

data = '''\
<keywords>
    <layer id="wheat">
        <layer id="indian">
            <keyword>chapati</keyword>
            <layer id="mumbai">
                <keyword>puri</keyword>
            </layer>
        </layer>
        <keyword>bread</keyword>
        <keyword>pita</keyword>
        <keyword>narn</keyword>
        <keyword>loaf</keyword>
    </layer>
    <layer id="fruit">
        <keyword>apple</keyword>
        <keyword>orange</keyword>
        <keyword>pear</keyword>
        <keyword>lemon</keyword>
    </layer>
</keywords>
'''

path = ['ROOT']  # stack for layer names
items = defaultdict(list)  # key=layer, value=list of items @ layer

f = StringIO(data)
for evt,e in ET.iterparse(f,('start','end')):
    if evt == 'start':
        if e.tag == 'layer':
            path.append(e.attrib['id']) # new layer added to path
        elif e.tag == 'keyword':
            items[path[-1]].append(e.text) # add item to last layer in path
    elif evt == 'end':
        if e.tag == 'layer':
            layer = path.pop()
            parent = path[-1]
            print layer,len(path),parent,items[layer]

Output

mumbai 3 indian ['puri']
indian 2 wheat ['chapati']
wheat 1 ROOT ['bread', 'pita', 'narn', 'loaf']
fruit 1 ROOT ['apple', 'orange', 'pear', 'lemon']

You can either recursively walk the DOM treje (see kelloti's answer) or determine the info from the found nodes:

xmldoc = minidom.parse(filename)
layers = xmldoc.getElementsByTagName("layer")

def _getText(node):
    rc = []
    for n in node.childNodes:
        if n.nodeType == n.TEXT_NODE:
            rc.append(n.data)
    return ''.join(rc)

def _depth(n):
    res = -1
    while isinstance(n, minidom.Element):
        n = n.parentNode
        res += 1
    return res

for l in layers:
    keywords = [_getText(k) for k in l.childNodes
                if k.nodeType == k.ELEMENT_NODE and k.tagName == 'keyword']
    print("%s %s %s" % (l.getAttribute("id"), _depth(l), keywords))

Try iterating through all child nodes in a recursive function, checking each for tag name. i.e.

def findLayer(node):
    for n in node.childNodes:
        if n.localName == 'layer':
            findLayer(n)
            # do things here

Alternately, try using a different XML library like Amara or lxml that has XPath capabilities. With XPath you can have much more control for searching the DOM tree with very little code.