I have a large XML file whose entities are in the format below. Could someone help me process it with XML::Twig?
<root>
  <entity id="1" last_modified="2011-10-1">
    <entity_title> title</entity_title>
    <entity_description>description </entity_description>
    <entity_x> x </entity_x>
    <entity_y> x </entity_y>
    <entity_childs>
      <child flag="1">
        <child_name>name</child_name>
        <child_type>type1</child_type>
        <child_x> some_text</child_x>
      </child>
      <child flag="1">
        <child_name>name1</child_name>
        <child_type>type2</child_type>
        <child_x> some_text</child_x>
      </child>
    </entity_childs>
    <entity_sibling>
      <family value="1" name="xc">fed</family>
      <family value="1" name="df">ff</family>
    </entity_sibling>
  </entity>
</root>
When I run the code below, I run out of memory:
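use XML::Twig;
use Data::Dumper;

my $file = shift || die "usage: $0 file.xml\n";   # $! is not set here, so give a usage message
my $twig = XML::Twig->new();
my $config = $twig->parsefile( $file )->simplify();   # loads the whole document in memory
print Dumper( $config );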
XML::Twig is able to run in two modes, for small or for large documents. You say it's large, so you want the second approach listed in the documentation synopsis.
The example for processing huge documents goes like this:
# at most one div will be loaded in memory
my $twig=XML::Twig->new(
  twig_handlers =>
    { title   => sub { $_->set_tag( 'h2') }, # change title tags to h2
      para    => sub { $_->set_tag( 'p')  }, # change para to p
      hidden  => sub { $_->delete;       },  # remove hidden elements
      list    => \&my_list_process,          # process list elements
      div     => sub { $_[0]->flush;     },  # output and free memory
    },
  pretty_print => 'indented',                # output will be nicely formatted
  empty_tags   => 'html',                    # outputs <empty_tag />
);
$twig->parsefile( 'my_big.xml');             # parse the file, firing the handlers
$twig->flush;                                # flush the end of the document
So I think you want to use that method, not the one you're currently using which is noted as only for small documents.
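Applied to the format in the question, that could look roughly like the sketch below. It is only an illustration: the inline sample, the `processed` attribute and the `$flushed` counter are made up for the example, and a real run would call parsefile on the big file instead of parse on a string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample in the question's format; replace with parsefile($file) for real data.
my $xml = <<'XML';
<root>
  <entity id="1" last_modified="2011-10-1">
    <entity_title> title</entity_title>
  </entity>
  <entity id="2" last_modified="2011-10-1">
    <entity_title>other title</entity_title>
  </entity>
</root>
XML

my $flushed = 0;    # hypothetical counter, only here to show the handler ran

my $twig = XML::Twig->new(
    twig_handlers => {
        entity => sub {
            my ( $t, $entity ) = @_;
            $entity->set_att( processed => 'yes' );   # example update (assumption)
            $t->flush;                                # print what was parsed so far and free it
            $flushed++;
        },
    },
);

$twig->parse($xml);
$twig->flush;       # flush the end of the document (closing tags)
print "\n";
```

The updated document goes to STDOUT as it is parsed, so only one entity at a time should be held in memory.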
Yep, there is no magic in XML::Twig: if you write $twig->parsefile( $file )->simplify(); then it will load the entire document into memory. I am afraid you will have to put some work into it to get just the bits you want and discard the rest. Look at the synopsis or the XML::Twig 101 section at the top of the docs for more information.
This is becoming a FAQ, so I have added the blurb above to the docs of the module.
In this particular case you probably want to set a handler (using the twig_handlers option) on entity, process each entity, and then discard it by using flush if you are updating the file, or purge if you just want to extract data from it.
So the architecture of the code should look like this:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my $file = shift;
my $twig = XML::Twig->new( twig_handlers => { entity => \&process_entity }, )
                    ->parsefile( $file );
exit;

sub process_entity
  { my( $t, $entity)= @_;
    # do what you have to do with $entity
    $t->purge;    # discard everything parsed so far
  }
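To make the "do what you have to do" part a bit more concrete, here is a minimal sketch of an extraction handler for the format in the question. The inline sample data (including the second entity), the tab-separated output and the `$count` variable are assumptions for illustration; with a real file you would call parsefile( $file) rather than parse on a string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample in the question's format.
my $xml = <<'XML';
<root>
  <entity id="1" last_modified="2011-10-1">
    <entity_title> title</entity_title>
    <entity_childs>
      <child flag="1"><child_name>name</child_name><child_type>type1</child_type></child>
      <child flag="1"><child_name>name1</child_name><child_type>type2</child_type></child>
    </entity_childs>
  </entity>
  <entity id="2" last_modified="2011-10-2">
    <entity_title>other title</entity_title>
    <entity_childs/>
  </entity>
</root>
XML

my $count = 0;    # number of entities handled so far (illustrative)

XML::Twig->new( twig_handlers => { entity => \&process_entity } )
         ->parse($xml);

print "processed $count entities\n";

sub process_entity {
    my ( $t, $entity ) = @_;

    # Pull out only the bits we care about.
    my $id    = $entity->att('id');
    my $title = $entity->first_child_trimmed_text('entity_title');
    my @names = map { $_->first_child_text('child_name') }
                $entity->descendants('child');

    # Write the record out right away instead of accumulating it.
    print "$id\t$title\t", join( ',', @names ), "\n";
    $count++;

    $t->purge;    # free the memory used by everything parsed so far
}
```

Printing each record inside the handler, rather than pushing it onto a big array, is what keeps memory use flat: after purge nothing from the entity is kept around.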