I have a gigantic (4GB) XML file that I am currently breaking into chunks with the Linux "split" utility (every 25,000 lines, not by bytes). This usually works great (I end up with about 50 files), except that some of the data descriptions contain line breaks, so the chunk files frequently lack the proper closing tags, and my parser chokes halfway through processing.
Example file (note: normally each "listing" XML node is supposed to be on its own line):
<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><date>2009-09-22</date><desc>This is a description WITHOUT line breaks and works fine with split</desc><more_tags>stuff</more_tags></listing>
<listing><date>2009-09-22</date><desc>This is a really
annoying description field
WITH line breaks
that screw the split function</desc><more_tags>stuff</more_tags></listing>
</listings>
Then sometimes my split ends up like
<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><date>2009-09-22</date><desc>This is a description WITHOUT line breaks and works fine with split</desc><more_tags>stuff</more_tags></listing>
<listing><date>2009-09-22</date><desc>This is a really
annoying description field
WITH line breaks ...
EOF
So I have been reading about "csplit" and it sounds like it might solve this issue. I can't seem to get the regular expression right...
Basically I want the same output of roughly 50 files.
Something like:
csplit -k myfile.xml '/</listing>/' 25000 {50}
Any help would be great. Thanks!
You can't get a valid XML file this way. I would recommend writing a Java program using StAX, which, if you use the Woodstox implementation, will stream the XML in and out really quite fast.
I would recommend against trying to use regexps (or naive text matching) for any XML manipulation, including splitting. XML is tricky enough that a real parser should be used; and due to memory limitations, one that can do "streaming" (aka incremental / chunked) parsing. I am most familiar with Java, where you would use a StAX (or SAX) parser and a writer/generator to do this; most other languages have something similar. Or, if the input is regular enough, a data binding tool (JAXB) that can bind subtrees.
Doing it the right way may be a bit more work, but it would actually work, handling everything XML can contain (for example, CDATA sections cannot be split; regexp solutions invariably have cases they don't handle, until one has basically written a full XML parser).
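As an illustration of this streaming approach (in Python rather than Java/StAX; the function names, chunk size, and file-naming scheme below are illustrative, not anything the original answer specifies), `xml.etree.ElementTree.iterparse` can walk a huge file incrementally and emit a well-formed document per N listing elements:

```python
# Sketch: split <listings> into chunks of N <listing> elements using a
# streaming parser, so embedded line breaks and CDATA are handled correctly.
import xml.etree.ElementTree as ET

def split_listings(path, chunk_size=25000, prefix="chunk"):
    parts, count, chunk_no, files = [], 0, 0, []
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "listing":
            parts.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # free memory for the already-serialized subtree
            count += 1
            if count == chunk_size:
                files.append(_write_chunk(prefix, chunk_no, parts))
                parts, count = [], 0
                chunk_no += 1
    if parts:  # flush the final, possibly short, chunk
        files.append(_write_chunk(prefix, chunk_no, parts))
    return files

def _write_chunk(prefix, n, parts):
    name = f"{prefix}-{n:02d}.xml"
    with open(name, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n<listings>\n')
        f.writelines(parts)
        f.write("</listings>\n")
    return name
```

Unlike line-based splitting, every chunk this produces is a complete, parseable document, regardless of line breaks inside descriptions.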
Use perl:
perl -ne 'unless (defined $fname) { $fname = "xx00"; open $fh, ">", $fname } $size += length; print $fh $_; if ($size > %MAX% and m@</listing>@) { $fname++; $size = 0; open $fh, ">", $fname }' myfile.xml
Replace %MAX% with the maximum size of one file in bytes. A new chunk is started only once the current one exceeds %MAX% bytes and the line just written contains a closing </listing> tag, so no record is ever cut in half.
First of all, you have a slash inside the regexp. To be safe you should escape it so that it isn't confused with the end delimiter: /<\/listing>/.
However, in this case it is more convenient to split on the start tag rather than the end tag, since each chunk contains everything up to, but not including, the matching line. So you might try something like this:
csplit myfile.xml '/^<listing>/' '{*}'
The beginning-of-line anchor ^ is there to make sure it only splits before lines where the start tag appears at the beginning of the line.
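One caveat with this approach: the resulting pieces are bare fragments without the <listings> wrapper or XML declaration. A small sketch of the post-processing step (assuming csplit's default output names xx00, xx01, ...; the function name is illustrative) that wraps each piece back into a standalone document:

```python
# Sketch: csplit pieces (default names xx00, xx01, ...) are bare fragments;
# wrap each one in an XML declaration and a <listings> root so it parses
# as a standalone document.
import glob
import xml.etree.ElementTree as ET

def wrap_fragments(pattern="xx[0-9][0-9]"):
    wrapped = []
    for name in sorted(glob.glob(pattern)):
        with open(name, encoding="utf-8") as f:
            lines = f.read().splitlines()
        # Drop any declaration / root-tag lines carried over from the source
        # (csplit's first piece contains the original file header).
        body = "\n".join(l for l in lines
                         if not l.startswith(("<?xml", "<listings>", "</listings>")))
        out = name + ".xml"
        with open(out, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n'
                    f"<listings>\n{body}\n</listings>\n")
        ET.parse(out)  # raises ParseError if the piece is still malformed
        wrapped.append(out)
    return wrapped
```

The parse call at the end doubles as a sanity check that each wrapped chunk is well-formed before it reaches the downstream parser.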
I wouldn't use csplit; I would use the Perl program xml_split instead. It's very nice:
$ ls -h .
junk.xml
$ cat junk.xml
<?xml version="1.0" encoding="UTF-8"?>
<listings>
<listing><date>2009-09-22</date><desc>This is a description WITHOUT line breaks and works fine with split</desc><more_tags>stuff</more_tags></listing>
<listing><date>2009-09-22</date><desc>This is a really
annoying description field
WITH line breaks
that screw the split function</desc><more_tags>stuff</more_tags></listing>
</listings>
$ xml_split -s 20 junk.xml
$ ls -h .
junk-00.xml junk-01.xml junk-02.xml junk.xml
$ cat junk-00.xml
<listings>
<?merge subdocs = 0 :junk-01.xml?>
<?merge subdocs = 0 :junk-02.xml?>
</listings>
$ cat junk-02.xml
<?xml version="1.0" encoding="UTF-8"?>
<xml_split:root xmlns:xml_split="http://xmltwig.com/xml_split">
<listing><date>2009-09-22</date><desc>This is a really
annoying description field
WITH line breaks
that screw the split function</desc><more_tags>stuff</more_tags></listing>
</xml_split:root>
OK, so the -s option splits on size rather than on number of lines (elements), but https://metacpan.org/pod/distribution/XML-Twig/tools/xml_split/xml_split can easily be patched to split on every 25k sub-elements.
Having run into the same requirement (to split a big XML file on the closure of top-level child elements, but in chunks), I don't think csplit can achieve this if it only works as described in its man page.
To be able to do this, it would need:
- The ability to group patterns and repeat a group, not just a single pattern
- The ability to have a pattern that captured but did not split off a new file
That would enable a hypothetical invocation like:
tail -n +2 bigfile.xml | head -n -1 | csplit - '{ 25000 /<\/listing>/ }' '{*}'
I see neither of these features described in its man page (but I think they would be useful additions).
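Since csplit lacks these features, the grouped behavior described above (count lines, then cut only at the next closing tag) is easy to sketch in a few lines of e.g. Python; the function name, file prefix, and 25,000-line threshold here are illustrative:

```python
# Sketch: emulate the missing csplit "group" -- split after every `limit`
# lines, but defer the actual cut until a line containing the closing tag
# is seen, so multi-line records stay intact.
def split_on_close(path, limit=25000, close_tag="</listing>", prefix="part"):
    n, count = 0, 0
    out = open(f"{prefix}{n:03d}", "w", encoding="utf-8")
    with open(path, encoding="utf-8") as src:
        for line in src:
            out.write(line)
            count += 1
            if count >= limit and close_tag in line:
                out.close()
                n, count = n + 1, 0
                out = open(f"{prefix}{n:03d}", "w", encoding="utf-8")
    out.close()
    return n + 1  # number of chunk files written
```

Each chunk except the last then ends exactly at a record boundary, which is the property the plain split-by-line-count approach loses.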