Parsing an XML file with Perl - regex

Developer https://www.devze.com 2023-01-02 07:16 Source: Internet


Related topics: perl regex

I'm just a beginner in Perl and urgently need to prepare a small script that takes the first 3 items from an XML file and puts them in a new one. Here's an example of the XML file:

<article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>

What I'd like to do is get the first 3 items, along with all the tags in between, and put them into another file. Thanks in advance for all the help. Regards, Peter

Never ever use a regex to handle markup languages; use an XML parser instead.

The original version of this answer (see below) used XML::XPath. Grant McLean said in the comments:

XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.

so I made a new version that uses XML::LibXML (thanks, Grant):

use warnings;
use strict;
use XML::LibXML;

my $doc   = XML::LibXML->load_xml(location => 'articles.xml');
my $xp    = XML::LibXML::XPathContext->new($doc->documentElement);
my $xpath = '/articles/article[position() < 4]';

foreach my $article ( $xp->findnodes($xpath) ) {
  # now do something with $article
  print $article.": ".$article->getName."\n";
}

For me this prints:

XML::LibXML::Element=SCALAR(0x346ef90): article
XML::LibXML::Element=SCALAR(0x346ef30): article
XML::LibXML::Element=SCALAR(0x346efa8): article

Links to the relevant documentation:

The type of $doc will be XML::LibXML::Document.
The type of $xp is XML::LibXML::XPathContext.
The return type of $xp->findnodes() is XML::LibXML::NodeList.
The type of $article is XML::LibXML::Element.
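
The loop above only prints the matched nodes, while the question asks for a new file. One possible way to get there with XML::LibXML is to detach every article after the third and save what remains. This is only a minimal sketch: it assumes the articles sit under a single <articles> root element (which the XPath in the answer implies but the question's sample does not show), it uses inline sample data in place of articles.xml, and the output file name first-three.xml is made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumption: a single <articles> root wraps the <article> elements.
# Inline sample data stands in for the real articles.xml.
my $xml = <<'XML';
<articles>
  <article><title>one</title></article>
  <article><title>two</title></article>
  <article><title>three</title></article>
  <article><title>four</title></article>
</articles>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Detach every article after the third. The first three stay untouched,
# including all of their child elements.
$_->unbindNode for $doc->findnodes('/articles/article[position() > 3]');

# Hypothetical output file name.
$doc->toFile('first-three.xml');

# Count what is left in the document.
my $kept = $doc->findnodes('/articles/article')->size;
print "kept $kept articles\n";
```

With a real input file, load_xml(location => 'articles.xml') would replace the string => form, as in the answer's own code.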

Original version of the answer, based on the XML::XPath package:

use warnings;
use strict;
use XML::XPath;

my $xp    = XML::XPath->new(filename => 'articles.xml');
my $xpath = '/articles/article[position() < 4]';

foreach my $article ( $xp->findnodes($xpath)->get_nodelist ) {
  # now do something with $article
  print $article.": ".$article->getName ."\n";
}

which prints this for me:

XML::XPath::Node::Element=REF(0x38067b8): article
XML::XPath::Node::Element=REF(0x38097e8): article
XML::XPath::Node::Element=REF(0x3809ae8): article

The type of $xp is XML::XPath, obviously.
The return type of $xp->findnodes() is XML::XPath::NodeSet.
The type of $article will be XML::XPath::Node::Element in this case.

Have a look at the docs to find out what you can do with them.

Here:

# Copy the input line by line and stop after the third closing </article> tag.
open my $input, "<", "file.xml" or die $!;
open my $output, ">", "truncated-file.xml" or die $!;
my $n_articles = 0;
while (<$input>) {
    print $output $_;
    if (m:</article>:) {
        $n_articles++;
        last if $n_articles >= 3;
    }
}
close $input or die $!;
close $output or die $!;

You really don't need an XML parser to do such a simple job.
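
One caveat with the line-based approach: if the articles are wrapped in a root element, stopping at the third </article> leaves that root unclosed in the output. The following is a rough sketch of one way around that, not part of the original answer. It assumes each </article> ends its own line, as in the question's sample, and that the root is called <articles>. The helper name copy_first_articles is made up for illustration, and in-memory filehandles stand in for the real files.

```perl
use strict;
use warnings;

# Copy lines from $in to $out up to and including the line that holds the
# $limit-th closing </article> tag, then append $closing_root if the limit
# was reached. Returns the number of closing tags seen.
sub copy_first_articles {
    my ($in, $out, $limit, $closing_root) = @_;
    my $seen = 0;
    while (my $line = <$in>) {
        print {$out} $line;
        $seen++ if $line =~ m{</article>};
        last if $seen >= $limit;
    }
    # Only add the closing tag when the copy was cut short at the limit;
    # otherwise the input's own closing tag has already been copied.
    print {$out} "$closing_root\n" if defined $closing_root && $seen >= $limit;
    return $seen;
}

# Demo on in-memory filehandles instead of real files.
my $xml = "<articles>\n<article>a</article>\n<article>b</article>\n"
        . "<article>c</article>\n<article>d</article>\n</articles>\n";
my $result = '';
open my $in,  '<', \$xml    or die $!;
open my $out, '>', \$result or die $!;
my $copied = copy_first_articles($in, $out, 3, '</articles>');
close $in  or die $!;
close $out or die $!;
print $result;
```

This still treats the XML as plain text, so it breaks as soon as the layout changes, for example when two closing tags share one line.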