I have a bunch of XML files that are about 1-2 megabytes in size. Actually, more than a bunch, there are millions. They're all well-formed and many are even validated against their schema (confirmed with libxml2).
All were created by the same app, so they're in a consistent format (though this could theoretically change in the future).
I want to check the values of one element in each file from within a Perl script. Speed is important (I'd like to take less than a second per file) and as noted I already know the files are well-formed.
I am sorely tempted to simply 'open' the files in Perl and scan through until I see the element I am looking for, grab the value (which is near the start of the file), and close the file.
On the other hand, I could use an XML parser (which might protect me from future changes to the XML formatting) but I suspect it will be slower than I'd like.
Can anyone recommend an appropriate approach and/or parser?
Thanks in advance.
Update
Here's the structure/complexity of the data I am trying to pull out:
<doc>
...
<someparentnode attrib="notme" attrib2="5">
<node>Not this one</node>
</someparentnode>
<someparentnode attrib="pickme" attrib2="5">
<node>This is the data I want</node>
</someparentnode>
<someparentnode attrib="notme"
attrib2="reallyreallylonglineslikethisonearewrapped">
<node>Not this one either and it may be
wrapped too.</node>
</someparentnode>
...
</doc>
The hierarchy goes several levels deeper than that, but I think it covers the sorts of things I am trying to do.
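For context, the raw-scan idea from the question can be sketched in plain Perl with no modules. This is only an illustration (the sub name is made up, and it assumes the tags sit roughly one per line — the wrapped attributes and text in the third sample block would defeat it):

```perl
#!/usr/bin/perl
# Sketch of the "just open and scan" approach: read line by line,
# watch for the parent element with attrib="pickme", then grab the
# first <node>...</node> that follows and stop immediately.
use strict;
use warnings;

sub scan_for_pickme {
    my ($fh) = @_;
    my $in_target = 0;
    while ( my $line = <$fh> ) {
        $in_target = 1
            if $line =~ /<someparentnode\b[^>]*\battrib="pickme"/;
        if ( $in_target and $line =~ m{<node>(.*?)</node>} ) {
            return $1;    # early exit: we only want the first hit
        }
    }
    return undef;
}

# Demo on the sample structure from the question
my $sample = <<'XML';
<doc>
<someparentnode attrib="notme" attrib2="5">
<node>Not this one</node>
</someparentnode>
<someparentnode attrib="pickme" attrib2="5">
<node>This is the data I want</node>
</someparentnode>
</doc>
XML
open my $fh, '<', \$sample or die $!;
print scan_for_pickme($fh), "\n";
```

Because it stops at the first match, the per-file cost stays tiny when the element is near the start of the file, which is exactly the case described.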
Two stand-alone XML-aware options (which I wrote, so I might be biased ;--) are xml_grep (included with XML::Twig) and xml_grep2 (in App::xml_grep2). You would write

xml_grep -t '*[@attrib="pickme"]' *.xml

or

xml_grep2 -t '//*[@attrib="pickme"]' *.xml

(the -t option gives you the result as text instead of XML).
In both cases, though, every document is parsed in full; the next version of xml_grep will add an option to limit the number of results per file and to stop parsing each file as soon as that number is reached.
Otherwise, if you need speed and the code needs to be integrated, you can use XML::Twig with a handler triggered on the element(s) you want and a call to finish_now when you've found it, which aborts parsing and lets you go on to the next file.
XML::LibXML is also an option, although you will then have to either parse each document completely and use XPath (easy, but may be slower), use SAX (faster, but painful to code), or use the pull parser (probably the best option, but I have never used it).
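As a rough illustration of the XPath route (my sketch, not code from this answer; it assumes XML::LibXML is installed and uses the element names from the question's sample):

```perl
#!/usr/bin/perl
# XML::LibXML sketch: parse the whole document into a DOM, then pull
# the value with a single XPath expression. Simple to write, but every
# file is fully parsed, so it is likely the slowest option here.
use strict;
use warnings;
use XML::LibXML;

my $xml = <<'XML';
<doc>
<someparentnode attrib="notme" attrib2="5"><node>Not this one</node></someparentnode>
<someparentnode attrib="pickme" attrib2="5"><node>This is the data I want</node></someparentnode>
</doc>
XML

# For real files you would use load_xml( location => $file ) instead.
my $dom = XML::LibXML->load_xml( string => $xml );
my ($node) = $dom->findnodes('//someparentnode[@attrib="pickme"]/node');
print $node->textContent, "\n" if $node;
```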
Update after your update: the code with XML::Twig would look like this:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $twig= XML::Twig->new( twig_handlers => { '*[@attrib="pickme"]' => \&pickme });
foreach my $file (@ARGV)
{ $twig->parsefile( $file); }
sub pickme
{ my( $twig, $node)= @_;
print $node->text, "\n";
$twig->finish_now;
}
If you want to do it fast, I would recommend you use XML::Bare instead of XML::Simple or XML::Twig.
I'm using it to parse through several 2-5Mb XML files and the speedup is amazing: 0.2 seconds vs 4 minutes, in some cases. Details here: http://darkpan.com/files/xml-parsing-perl-gripes.txt.
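A minimal sketch of what the XML::Bare version might look like — my code, based on the module's documented hash-tree output, not the poster's actual script; the key names follow the sample in the question:

```perl
#!/usr/bin/perl
# XML::Bare sketch: the parser returns a plain hash tree. Repeated
# child elements become an array ref, and both attributes and element
# text are reached through a 'value' key.
use strict;
use warnings;
use XML::Bare;

my $xml = <<'XML';
<doc>
<someparentnode attrib="notme" attrib2="5"><node>Not this one</node></someparentnode>
<someparentnode attrib="pickme" attrib2="5"><node>This is the data I want</node></someparentnode>
</doc>
XML

# For real files you would use XML::Bare->new( file => $path ).
my $ob   = XML::Bare->new( text => $xml );
my $root = $ob->parse();
for my $p ( @{ $root->{doc}{someparentnode} } ) {
    if ( $p->{attrib}{value} eq 'pickme' ) {
        print $p->{node}{value}, "\n";
        last;    # first hit only
    }
}
```

XML::Bare gets its speed by doing almost no conformance checking, which is acceptable here since the files are already known to be well-formed.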
Awk
awk 'BEGIN{
  RS="</doc>"
  FS="</someparentnode>"
}
{
  for(i=1;i<=NF;i++){
    if( $i~/pickme/){
      m=split($i,a,"</node>")
      for(o=1;o<=m;o++){
        if(a[o]~/<node>/){
          gsub(/.*<node>/,"",a[o])
          print a[o]
        }
      }
    }
  }
}' file
Perl
#!/usr/bin/perl
use strict;
use warnings;

$/ = '</doc>';                  # read one <doc>...</doc> record at a time
my $FS = '</someparentnode>';

while (<>) {
    chomp;
    my @F = split /$FS/;
    for my $field (@F) {
        next unless $field =~ /pickme/;
        for my $chunk ( split m{</node>}, $field ) {
            if ( $chunk =~ /<node>/ ) {
                $chunk =~ s/.*<node>//s;   # keep only the text after <node>
                print $chunk, "\n";
            }
        }
    }
}
Output
$ perl script.pl file
This is the data I want
$ ./shell.sh
This is the data I want