How to convert XML Data into a binary deliverable?

Developer https://www.devze.com 2022-12-13 23:16 Source: web
We have an application that requires loading A LOT of configuration data at startup. The data is stored in an XML file, which is currently 40 MB but will grow to 100 MB and more. This data will change during development but not between releases.

We are looking for a way to speed up the loading process for a "fixed" set of data and one idea is leading to this question:

What would be the easiest/most efficient way to convert the xml file into something which can be delivered as a binary?

For example, we could generate a static class with a lot of 'new objectFromXML (param1, param2, ..., paramn)' lines in its initialization method, or we could use one object with a gigantic array containing the data. All this can be done without too much trouble, but I suspect that there are more elegant solutions to our problem. Any comments would be highly appreciated.


protobuf-net can be compatible with both binary (Google's efficient "protocol buffers" format) and xml at the same time on the same class definitions*.

It can even work without any changes if your xml is element based and includes attributes like [XmlElement(Order = 1)] (to work, it needs to be able to find a unique number per property, you see). Note that if you use inheritance ([XmlInclude]) you'll need to add additional attributes (again, to nominate a number, via the similar [ProtoInclude]).

Otherwise, you can add additional attributes, and job done; just call Serializer.Serialize.

Result: smaller, faster serialization.
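
As a rough illustration of the "add attributes and call Serializer.Serialize" route, here is a minimal sketch. It assumes the protobuf-net NuGet package is referenced; ConfigRoot, ConfigEntry and ConfigBinary are hypothetical names invented for this example, not types from the question.

```csharp
using System.Collections.Generic;
using System.IO;
using ProtoBuf; // protobuf-net NuGet package (assumed to be referenced)

// Hypothetical configuration types; the real ones would mirror your XML.
[ProtoContract]
public class ConfigEntry
{
    [ProtoMember(1)] public string Key { get; set; }
    [ProtoMember(2)] public string Value { get; set; }
}

[ProtoContract]
public class ConfigRoot
{
    [ProtoMember(1)]
    public List<ConfigEntry> Entries { get; set; } = new List<ConfigEntry>();
}

public static class ConfigBinary
{
    // Build step: write the already-loaded object graph in protocol buffers format.
    public static void Save(ConfigRoot config, string path)
    {
        using (var file = File.Create(path))
        {
            Serializer.Serialize(file, config);
        }
    }

    // Startup: read the binary file back; no XML parsing is involved.
    public static ConfigRoot Load(string path)
    {
        using (var file = File.OpenRead(path))
        {
            return Serializer.Deserialize<ConfigRoot>(file);
        }
    }
}
```

The numbers passed to [ProtoMember] are the per-property identifiers mentioned above; they need to stay stable once binary files have been shipped.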

* As proof, this is actually how the codegen works: compile the ".proto" DSL to binary ("protoc"), load the binary into the object-model ("protobuf-net"), write as xml (XmlSerializer), run through xslt to get C#.


The alternative might be to run the xml through an xslt into C# and compile it, but... ugly. I've done this myself when absolutely needed; it was horrible enough to break reflector! (no, really).


My first response is: WHY??? An XML file of 40 MB is already huge. Why even store more data inside it? A good way to handle this much data would be by using a database. SQL Server Express is free to install and will work much faster. If you don't want a full server, the Compact edition of SQL Server might be an option, since it basically allows XCopy deployment.

The only advantage of XML is that it's readable for both machines and humans. With a binary format you will need some additional tool to make it human-readable.

Since you're using C#, I'd just go for the SQL Server Compact edition, with an SQL script that adds plenty of logical relations and constraints on the database. An additional Entity Framework class will make the data even more accessible, and the only thing you'd need in some XML configuration file would be the connection string...


But if you're stuck with this XML file, the use of ZLIB has already been suggested to compress the whole file.
And since you're dealing with lots of small configuration files inside a bigger structure, you could, as suggested, use ZLIB to create a ZIP file that contains all those small XML structures as separate files. The filename in the ZIP file would identify the class each one is for, and by reading only the specific XML file you need from the ZIP file, you will improve performance, since the XML parser only needs to parse a small document. Even if you needed to read 90% of all those XML files, performance would still be good since you're using lots of small XML documents, where the indices are smaller and searching will take less time.
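
A minimal sketch of the one-entry-per-class idea, using .NET's built-in System.IO.Compression classes rather than ZLIB directly (that substitution is mine). ZippedConfig and the entry naming scheme are hypothetical.

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

public static class ZippedConfig
{
    // Build step: store each small XML document as its own entry,
    // replacing any existing entry with the same name.
    public static void AddEntry(string zipPath, string entryName, XDocument doc)
    {
        using (var zip = ZipFile.Open(zipPath, ZipArchiveMode.Update))
        {
            zip.GetEntry(entryName)?.Delete();
            var entry = zip.CreateEntry(entryName);
            using (var stream = entry.Open())
            {
                doc.Save(stream);
            }
        }
    }

    // Runtime: decompress and parse only the requested entry.
    // Returns null when the archive has no entry with that name.
    public static XDocument LoadEntry(string zipPath, string entryName)
    {
        using (var zip = ZipFile.OpenRead(zipPath))
        {
            var entry = zip.GetEntry(entryName);
            if (entry == null) return null;
            using (var stream = entry.Open())
            {
                return XDocument.Load(stream);
            }
        }
    }
}
```

Whether this is actually faster than one large file depends on how much of the data is read at startup, so it is worth measuring with real data.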


The idea is to write the data in xml but transform that xml into a bytestream as a build step. You can do it by loading the xml into an in-memory object and then doing a binary serialization of that object to a file, for example. In production, just do a binary deserialization and skip the xml altogether.
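
A minimal sketch of such a build step, assuming a hypothetical layout of <settings><setting name="..." value="..."/></settings>. It uses a hand-written BinaryWriter/BinaryReader format, which is only one of several ways to do the binary serialization; SettingsCompiler is an invented name.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Linq;

public static class SettingsCompiler
{
    // Build step: parse the XML once and write a record count followed by
    // length-prefixed name/value strings.
    public static void Compile(string xmlPath, string binPath)
    {
        var doc = XDocument.Load(xmlPath);
        var items = new List<XElement>(doc.Root.Elements("setting"));
        using (var writer = new BinaryWriter(File.Create(binPath)))
        {
            writer.Write(items.Count);
            foreach (var item in items)
            {
                // Missing attributes are written as empty strings.
                writer.Write((string)item.Attribute("name") ?? "");
                writer.Write((string)item.Attribute("value") ?? "");
            }
        }
    }

    // Production: read the records back without touching an XML parser.
    public static Dictionary<string, string> Load(string binPath)
    {
        using (var reader = new BinaryReader(File.OpenRead(binPath)))
        {
            int count = reader.ReadInt32();
            var result = new Dictionary<string, string>(count);
            for (int i = 0; i < count; i++)
            {
                string name = reader.ReadString();
                result[name] = reader.ReadString();
            }
            return result;
        }
    }
}
```

Compile would run as part of the build, and only the generated binary file and the Load half would ship.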


If you want to speed up the loading process, compressing the XML is not going to help you. In fact, it will hurt you: instead of simply parsing the XML, your program will have to uncompress it and then parse it.

You really haven't provided very much information about what you're currently doing. Are you currently loading the XML into an XmlDocument or XDocument and then processing it? If so, the simplest way to speed up the load without changing anything else is to implement a load method that uses an XmlReader, which lets you parse and deserialize the data at the same time.
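
To illustrate the XmlReader approach, here is a minimal sketch that assumes a hypothetical layout of <item key="..." value="..."/> elements; StreamingLoader is an invented name and your real element names will differ.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class StreamingLoader
{
    // Reads <item key="..." value="..."/> elements in a single forward pass,
    // building the result directly instead of an intermediate XmlDocument.
    public static Dictionary<string, string> Load(string xmlPath)
    {
        var result = new Dictionary<string, string>();
        var settings = new XmlReaderSettings
        {
            IgnoreWhitespace = true,
            IgnoreComments = true
        };
        using (var reader = XmlReader.Create(xmlPath, settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                {
                    string key = reader.GetAttribute("key");
                    string value = reader.GetAttribute("value");
                    // Items without a key attribute are skipped.
                    if (key != null) result[key] = value;
                }
            }
        }
        return result;
    }
}
```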

Are you using XML serialization to produce the XML? If so, you can use protocol buffers, as Marc Gravell suggested, or you can implement binary serialization. This assumes that you don't need the XML for any other purpose.

Do you actually need to deserialize all of the configuration data before your program can function? Or is it possible to use some kind of lazy loading method? If you can do lazy loading, choosing some serialization format that lets you break the loading process into chunks that get performed when the program needs them can speed up the apparent performance of your program (if not the actual performance).
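
One way lazy loading could look, sketched with Lazy<T>. LazyConfig and its section names are hypothetical, and the loader delegates stand in for whatever per-chunk deserialization you end up with.

```csharp
using System;
using System.Collections.Generic;

public class LazyConfig
{
    private readonly Dictionary<string, Lazy<string[]>> _sections =
        new Dictionary<string, Lazy<string[]>>();

    // Registers a section; the loader delegate runs only on first access.
    public void Register(string name, Func<string[]> loader)
    {
        _sections[name] = new Lazy<string[]>(loader);
    }

    // Loads the section on the first call and returns the cached value afterwards.
    public string[] Get(string name) => _sections[name].Value;

    // True once the section's loader has run.
    public bool IsLoaded(string name) => _sections[name].IsValueCreated;
}
```

Startup then only pays for registering the sections; each chunk is deserialized when the program first asks for it.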

I guess the bottom line is: there are dozens of possible approaches to a problem that's defined as "I need to load a lot of data out of an XML document at startup." Define the problem more precisely, and you'll get more useful suggestions.


Ever thought of using a Resource file for this instead of your own home-rolled XML file? This is pretty much what they're made to do.


I ended up using zlib to create a compressed copy of an XML and XSD file in binary format.


If you are looking to turn the XML into some sort of object structure, you can hit it from one of two sides. First, you could create an XSD for the XML if you are mostly using nodes in the XML such as and then use the XSD.exe tool to generate the code to serialize/deserialize this. The second option would be to have simple POCO objects set up that match the format of the XML and just use the XmlSerializer to turn the XML into the objects.
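
A minimal sketch of the POCO option, assuming a hypothetical document shaped like <config><item key="..."><value>...</value></item></config>. AppConfig, ConfigItem and PocoLoader are invented names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical POCOs; shape them to match the real document.
[XmlRoot("config")]
public class AppConfig
{
    // Maps repeated <item> elements directly under <config>.
    [XmlElement("item")]
    public List<ConfigItem> Items { get; set; } = new List<ConfigItem>();
}

public class ConfigItem
{
    [XmlAttribute("key")] public string Key { get; set; }
    [XmlElement("value")] public string Value { get; set; }
}

public static class PocoLoader
{
    // Deserializes the whole document into the POCO graph.
    public static AppConfig Load(TextReader xml)
    {
        var serializer = new XmlSerializer(typeof(AppConfig));
        return (AppConfig)serializer.Deserialize(xml);
    }
}
```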


VTD-XML has a built-in indexing feature called vtd+xml. The basic idea is that you parse the XML into VTD, then persist the VTD along with the XML into an index file. The next time you load the indexed XML document, you don't have to parse it, which speeds up loading significantly. See the article below.

http://www.codeproject.com/KB/XML/VTD-XML-indexing.aspx
