Most efficient data structure to store an XML tree in C++

Developer https://www.devze.com 2023-02-25 21:42 Source: the web
I'm doing some work with XML in C++, and I would like to know what the best data structure to store XML data is. Please don't just tell me what you've heard of in the past; I'd like to know what the most efficient structure is. I would like to be able to store any arbitrary XML tree (assuming it is valid), with minimal memory overhead and lookup time.

My initial thought was a hash, but I couldn't figure out how to handle multiple children of the same tag, as well as how attributes would be handled.

Qt solutions are acceptable, but I'm more concerned with the overall structure than the specific library. Thanks for your input.


The most efficient structure would be a set of classes derived from the DTD or schema that defines the particular XML instances you intend to process. (Surely you aren't going to process arbitrary XML?) Tags are represented by classes. Single children can be represented by fields. Children with min...max arity can be represented by a field containing an array. Children with indefinite arity can be represented by a dynamically allocated array. Attributes and children can be stored as fields, often with an inferred data type (if an attribute represents a number, why store it as a string?). Using this approach, you can often navigate to a particular place in an XML document using native C++ access paths, e.g., root->tag1.itemlist[1]->description.

All of this can be generated automatically from the schema or the DTD. There are tools to do this; Altova offers some. I have no specific experience with them (although I have built similar tools for Java and COBOL).
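As a rough hand-written sketch of what such schema-derived classes might look like: the schema below is hypothetical (a root element with one tag1 child that holds any number of item elements, each with a numeric id attribute and a description child), and the names Root, Tag1, Item and makeSample are made up for illustration. It is not the output of any particular generator.

```cpp
#include <string>
#include <vector>

// Hypothetical schema:
// <root><tag1><item id="N"><description>...</description></item>*</tag1></root>
struct Item {
    int id;                     // attribute stored as a number, not a string
    std::string description;    // single child element -> plain field
};

struct Tag1 {
    std::vector<Item> itemlist; // child with indefinite arity -> dynamic array
};

struct Root {
    Tag1 tag1;                  // exactly one child -> plain field
};

// Build a small instance by hand; a generated parser would fill this in instead.
inline Root makeSample() {
    Root root;
    root.tag1.itemlist.push_back(Item{1, "first"});
    root.tag1.itemlist.push_back(Item{2, "second"});
    return root;
}
```

With this layout, an access path such as root.tag1.itemlist[1].description is resolved by the compiler, with no name lookup at run time.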


You should first determine what the requirement for efficiency is, in terms of storage, speed etc. in concrete numbers. Without knowing this information, you can't tell if your implementation satisfies the requirement.

And, if you have this requirement, you will probably find that the DOM satisfies it, and has the advantage of zero code to maintain.

A custom structure will be a nightmare for future programmers, who will wonder why someone wrote an alternate implementation of the DOM.

Actually, pretty much anything you do will just be a DOM implementation, possibly incomplete, with optimizations for indexing and so on. My personal feeling is that re-inventing the wheel should be the last thing you consider.


There is a C++ XML library already built: Xerces. http://xerces.apache.org/xerces-c/install-3.html

There are some tree structures in \include\boost-1_46_1\boost\intrusive\: a red-black tree and an AVL tree. I haven't looked at them in a long time, so I don't know whether they are especially usable here; I think not.

XML is a tree structure. You don't know what the structure is going to be unless it has a DTD defined and included in a <!DOCTYPE> declaration (although the validator at validrome breaks on !DOCTYPEs and it shouldn't).

See http://w3schools.com/xml/xml_tree.asp for a tree example.

You may get something that doesn't follow a DTD or schema and is totally unstructured, like this:

<?xml version="1.0"?>
<a>
 <b>hello
  <e b="4"/>
  <c a="mailto:jeff@nowhere.com">text</c>
 </b>
 <f>zip</f>
 <z><b /><xy/></z>
 <zook flag="true"/>
 <f><z><e/></z>random</f>
</a>

I know that queryable XML databases exist, but I don't know much about them, except that they can handle unstructured data.

PHP has an XML parser that puts the document into what PHP calls an array (not quite like a C/C++ array, because these arrays can contain arrays). You can tinker with it to see an example of what an XML data structure should contain.

What you basically want is a very flexible tree where the root pointer points to a list, and each node in that list contains a pointer that can point to another list. It should be an ordered list, so make it a std::vector. If you need to be able to remove data, use a std::list instead of a std::vector - it's ordered, while being easy to manipulate.

Word of warning: .erase(iterator i) erases only the element at i and invalidates i. .erase(iterator i1, iterator i2) erases everything from i1 up to but not including i2, so .erase(i, .end()) is what erases everything starting at and after i. .end() is an iterator that points one past the end of the list, essentially at nothing. .begin() is an iterator that points to the start of the list.
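A quick way to check the two erase overloads on std::list, as a small sketch: the single-iterator form removes exactly one element, and the two-iterator form removes a half-open range. The helper names eraseOne and eraseFrom are made up for this example.

```cpp
#include <iterator>
#include <list>

// Erase only the element at position `index` (assumes 0 <= index < size).
inline std::list<int> eraseOne(std::list<int> values, int index) {
    std::list<int>::iterator it = values.begin();
    std::advance(it, index);
    values.erase(it);               // removes exactly one element
    return values;
}

// Erase everything from position `index` to the end (assumes 0 <= index <= size).
inline std::list<int> eraseFrom(std::list<int> values, int index) {
    std::list<int>::iterator it = values.begin();
    std::advance(it, index);
    values.erase(it, values.end()); // half-open range [it, end())
    return values;
}
```

Both helpers take the list by value, so the caller's list is left untouched and the results are easy to compare.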

Learn to use for_each(start, end, function) from <algorithm>, or use a regular for statement.

Iterators are like pointers. Treat them as such.

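#include <iostream>
#include <list>
#include <string>

// Minimal node type so the example compiles; a real node also holds attributes and children.
struct node {
    std::string data;
    const std::string& getData() const { return data; }
};

int main() {
    std::list<node> nodelist;
    nodelist.push_back(node{"hello"});
    // Iterators act like pointers: use -> to reach members and ++ to advance.
    for (std::list<node>::iterator nli = nodelist.begin(); nli != nodelist.end(); ++nli) {
        std::cout << nli->getData() << std::endl;
    }
}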

The nodes need an optional list of attributes. Note that the DTD may be contained within the XML document itself, so you have to be able to read it to parse the document (or you could throw it away). You may also run into XML Schema, the successor of the DTD.
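Putting the pieces of this answer together, a generic node for arbitrary XML might look roughly like the sketch below. The Node type and the addChild and countTag helpers are hypothetical names, text is modelled as a child node with an empty name (one possible choice for mixed content such as the <b> element in the sample above), and the code assumes C++17, which allows std::list<Node> as a member while Node is still incomplete.

```cpp
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical generic node; assumes C++17 (std::list of an incomplete type).
struct Node {
    std::string name;   // tag name; empty for a text node
    std::string text;   // character data, used by text nodes
    std::vector<std::pair<std::string, std::string>> attributes; // optional, in document order
    std::list<Node> children;                                    // ordered; repeated tags allowed

    // Append a child element and return a reference to it.
    // References into a std::list stay valid when more children are added.
    Node& addChild(const std::string& childName) {
        children.emplace_back();
        children.back().name = childName;
        return children.back();
    }
};

// Count the elements named `tag` in the subtree rooted at `node`.
inline int countTag(const Node& node, const std::string& tag) {
    int count = (node.name == tag) ? 1 : 0;
    for (const Node& child : node.children) {
        count += countTag(child, tag);
    }
    return count;
}
```

Because children are kept in a list rather than keyed by tag name, two <f> siblings are simply two entries, which sidesteps the duplicate-tag problem raised in the question. The price is that finding a child by name is a linear scan.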


I think the most efficient data structure to store XML in is probably the one used by VTD-XML, which uses an array of longs instead of lots of interconnected structs/classes. The main idea is that structs and classes rely on many small memory allocations, which normally incur significant overhead. See this article for further detail.

http://soa.sys-con.com/node/250512
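To illustrate the general idea of describing tokens with packed integers instead of allocated objects, here is a minimal sketch. The bit layout (32-bit offset, 24-bit length, 8-bit depth) and all function names are assumptions made up for this example; they are not VTD-XML's actual record format or API.

```cpp
#include <cstdint>
#include <string>

// Illustrative layout only (NOT VTD-XML's real format):
// bits 0-31 = byte offset, bits 32-55 = token length, bits 56-63 = nesting depth.
inline std::uint64_t packToken(std::uint32_t offset, std::uint32_t length, std::uint8_t depth) {
    return static_cast<std::uint64_t>(offset)
         | (static_cast<std::uint64_t>(length & 0xFFFFFFu) << 32)
         | (static_cast<std::uint64_t>(depth) << 56);
}

inline std::uint32_t tokenOffset(std::uint64_t rec) {
    return static_cast<std::uint32_t>(rec & 0xFFFFFFFFu);
}

inline std::uint32_t tokenLength(std::uint64_t rec) {
    return static_cast<std::uint32_t>((rec >> 32) & 0xFFFFFFu);
}

inline std::uint8_t tokenDepth(std::uint64_t rec) {
    return static_cast<std::uint8_t>(rec >> 56);
}

// Recover a token's text by slicing the original, unmodified document buffer.
inline std::string tokenText(const std::string& doc, std::uint64_t rec) {
    return doc.substr(tokenOffset(rec), tokenLength(rec));
}
```

A whole document then becomes one contiguous array of 64-bit records plus the original text buffer, so the per-node allocation cost disappears. The trade-off is that editing the tree is harder than with linked nodes.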


I'm not sure what the most efficient method is, but since the DOM already exists why re-invent the wheel?

It may make sense to hash all nodes by name for lookup, but you should still use the DOM as the basic representation.
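A side index of this kind could be sketched as follows. Element is a hypothetical stand-in for whatever node type your DOM library provides, and NameIndex and indexByName are made-up names. A multimap is used so that several elements with the same tag can share one key.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a DOM element; a real DOM library has its own node type.
struct Element {
    std::string name;
    std::vector<std::unique_ptr<Element>> children;
};

// Tag name -> every element with that name (duplicates allowed).
using NameIndex = std::unordered_multimap<std::string, const Element*>;

// Walk the tree once and record every element under its tag name.
inline void indexByName(const Element& element, NameIndex& index) {
    index.emplace(element.name, &element);
    for (const std::unique_ptr<Element>& child : element.children) {
        indexByName(*child, index);
    }
}
```

The tree remains the source of truth for order and nesting; the index only speeds up lookup by name, and it has to be rebuilt or updated whenever the tree changes.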


I've been exploring this problem myself, and these are my thoughts.

a) Every element in XML is either a node or a (key, value) pair.
b) Store every element in a hash, and assign each element a type, i.e. "node" or "key,value".
c) Every element has a parent; store a reference to it with each element.
d) Every element may or may not have children/references. Store the children in a B-tree, which defines the references.

Search time for any key will be O(1) on average. A reference traversal can then list all the children inside an element.
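One way to sketch points a) to d) is a flat table of entries plus a hash from key to entry index. This is only an approximation of the idea: it deliberately swaps the B-tree for a plain std::vector of child indices (which keeps document order), uses a multimap so repeated keys are allowed, and the names Kind, Entry and Store are made up.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

enum class Kind { Node, KeyValue };   // "node" vs "(key, value)" from point a)

struct Entry {
    Kind kind;
    std::string key;
    std::string value;                // empty for Kind::Node
    int parent;                       // index of the parent entry, -1 for the root
    std::vector<int> children;        // child indices in document order
};

struct Store {
    std::vector<Entry> entries;
    std::unordered_multimap<std::string, int> byKey; // average O(1) lookup by key

    // Add an entry under `parent` (-1 for the root) and return its index.
    int add(Kind kind, const std::string& key, const std::string& value, int parent) {
        int id = static_cast<int>(entries.size());
        entries.push_back(Entry{kind, key, value, parent, {}});
        if (parent >= 0) {
            entries[parent].children.push_back(id);
        }
        byKey.emplace(key, id);
        return id;
    }
};
```

Note that a lookup by key returns every entry with that key, so telling two same-named elements apart still requires checking their parents.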

Please review and suggest what I've missed.


Just use the DOM to store the parsed XML file. Surely there are C++ DOM libraries. You can query the DOM with XPath expressions.
