This is driving me mental, and I've probably been hacking away at it for to long so would appreciate some help to prevent lose of/restore my sanity! The food based xml is only an example of what I wish to achieve.
I have the following file which I am trying to put into a graph, so wheat and fruit are parents with a depth of 0. Indian is a child of wheat with a depth of 1 and so on and so on.
Each of the layers has some keywords. So what I want to get out is
layer, depth, parent, keywords
wheat, 1, ROOT, [bread, pita, narn, loaf]
indian, 2, wheat [chapati]
mumbai, 3, indian, puri
fruit, 1,ROOT, [apple, orange, pear, lemon]
This is a sample file -
<keywords>
<layer id="wheat">
<layer id="indian">
<keyword>chapati</keyword>
<layer id="mumbai">
<keyword>puri</keyword>
</layer>
</layer>
<keyword>bread</keyword>
<keyword>pita</keyword>
<keyword>narn</keyword>
<keyword>loaf</keyword>
</layer>
<layer id="fruit">
<keyword>apple</keyword>
<keyword>orange</keyword>
<keyword>pear</keyword>
<keyword>lemon</keyword>
</layer>
</keywords>
So this isnt a graph question, I can do that bit thats easy. What im struggling with is parsing the XML.
If I do a
xmldoc = minidom.parse(self.filename)
layers = xmldoc.getElementsByTagName('layer')
layers only returns all of the layer elements, which is to much and has not concept of depth/ hierachy as far as I can understand as it does a recursive search.
The following post is good, but doesnt provide the concepts I require. XML Parsing with Python and minidom. Can anyone help with how I might go about this? I can post my code but its so hacked tog开发者_JS百科ether/fundementally broken I don't think it would be use to man nor beast!
Cheers
Dave
Use lxml. In particular, XPath. You can get all layer
elements, regardless of level, through "//layer"
and the layer
with the id id
through "//layer[id='{}'][0]".format(id)
. The keyword
elements directly under an element (or several elements) by ".../keyword"
(where ...
is a query that yields the nodes whose descendants should be searched).
Getting the depth of a given node is not quite as trivial, but still easy. I didn't find an existing function (afaik, this is outside the domain of XPath - athough you can check for the depth in a query, you only return elements, i.e. you can return nodes with a specific depth but not the depth itself), so here's a hand-rolled one (no recursion, since it's not necessary - but in general, working with XML means working with recursion, like it or not!):
def depth(node):
depth = 0
while node.getparent() is not None:
node = node.getParent()
depth += 1
return depth
Something very similar is possible with DOM, if you should be foolish enough not to use the best Python XML library in existence ;)
Here's a solution with ElementTree:
from xml.etree import ElementTree as ET
from io import StringIO
from collections import defaultdict
data = '''\
<keywords>
<layer id="wheat">
<layer id="indian">
<keyword>chapati</keyword>
<layer id="mumbai">
<keyword>puri</keyword>
</layer>
</layer>
<keyword>bread</keyword>
<keyword>pita</keyword>
<keyword>narn</keyword>
<keyword>loaf</keyword>
</layer>
<layer id="fruit">
<keyword>apple</keyword>
<keyword>orange</keyword>
<keyword>pear</keyword>
<keyword>lemon</keyword>
</layer>
</keywords>
'''
path = ['ROOT'] # stack for layer names
items = defaultdict(list) # key=layer, value=list of items @ layer
f = StringIO(data)
for evt,e in ET.iterparse(f,('start','end')):
if evt == 'start':
if e.tag == 'layer':
path.append(e.attrib['id']) # new layer added to path
elif e.tag == 'keyword':
items[path[-1]].append(e.text) # add item to last layer in path
elif evt == 'end':
if e.tag == 'layer':
layer = path.pop()
parent = path[-1]
print layer,len(path),parent,items[layer]
Output
mumbai 3 indian ['puri']
indian 2 wheat ['chapati']
wheat 1 ROOT ['bread', 'pita', 'narn', 'loaf']
fruit 1 ROOT ['apple', 'orange', 'pear', 'lemon']
You can either recursively walk the DOM treje (see kelloti's answer) or determine the info from the found nodes:
xmldoc = minidom.parse(filename)
layers = xmldoc.getElementsByTagName("layer")
def _getText(node):
rc = []
for n in node.childNodes:
if n.nodeType == n.TEXT_NODE:
rc.append(n.data)
return ''.join(rc)
def _depth(n):
res = -1
while isinstance(n, minidom.Element):
n = n.parentNode
res += 1
return res
for l in layers:
keywords = [_getText(k) for k in l.childNodes
if k.nodeType == k.ELEMENT_NODE and k.tagName == 'keyword']
print("%s %s %s" % (l.getAttribute("id"), _depth(l), keywords))
Try iterating through all child nodes in a recursive function, checking each for tag name. i.e.
def findLayer(node):
for n in node.childNodes:
if n.localName == 'layer':
findLayer(n)
# do things here
Alternately, try using a different XML library like Amara or lxml that has XPath capabilities. With XPath you can have much more control for searching the DOM tree with very little code.
精彩评论