How can I read malformed XML (unencoded entities) with Perl?

Developer https://www.devze.com 2022-12-28 21:49 Source: Internet
I'm trying to parse an XML file I get from an external source, but I'm having problems because there are unencoded XML entities in the text nodes.

Essentially, I'm asking the same question as this, but for Perl instead of PHP.

<report>  
  <company>A & W</company>  
  <company>Some Other Company with a < in Inc.</company>
</report>  

I tried using something like this:

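use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);
use XML::Parser;

my $readAllRecordsURI = "http://mycompany.com/CompanyOnline/GetRecord";
my @form_array = ("action" => "readAll", "table" => "QOPIDINF");

my $ua = LWP::UserAgent->new;

# POST the form; the response body is the XML text itself, not a file name
my $cics_request  = POST $readAllRecordsURI, \@form_array;
my $cics_response = $ua->request($cics_request);
my $xmlfile = $cics_response->content;

my $parser = XML::Parser->new( Handlers => { Char => \&handle_char } );
# parse() takes XML text; parsefile() expects a path on disk
$parser->parse( $xmlfile );

# Called for character data inside text nodes
sub handle_char {
    my ($p, $string) = @_;

    # clean up text here...
}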


This isn't really an answer, but it solved my problem: I went back to the programmer who provided the XML and asked him to encode the text properly, which avoids all of this.


XML::Parser / Expat has always worked well for me, including with poorly formed XML.

Do NOT parse XML with a regex... unless your parser does not work >;-} Can you just delete the company name with a < in it before parsing?

Here are some regexes to try: "XML Shallow Parsing with regex". Near the bottom of that page I think there is a regex that matches only correct XML tags; could you invert it to find the poorly formed ones?
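As a rough illustration of the pre-parse cleanup idea, here is a minimal sketch that escapes bare ampersands and stray < characters before the text reaches a real parser. The helper name escape_stray_entities is hypothetical and not from the original post, and the approach is only a heuristic: it assumes a stray < is never immediately followed by a letter, /, ! or ?, so something like "a<b" would still be mistaken for a tag.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not part of any module):
# escape characters that would make the document not well-formed.
sub escape_stray_entities {
    my ($xml) = @_;

    # '&' that does not begin a named or numeric entity reference
    $xml =~ s/&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;

    # '<' that does not begin a tag, closing tag, comment, CDATA section, or PI
    $xml =~ s/<(?![A-Za-z_\/!?])/&lt;/g;

    return $xml;
}

# The sample document from the question
my $bad_xml = <<'XML';
<report>
  <company>A & W</company>
  <company>Some Other Company with a < in Inc.</company>
</report>
XML

print escape_stray_entities($bad_xml);
```

The cleaned string could then be passed to XML::Parser's parse method. Existing references such as &amp; or &#38; are left alone, so running the helper twice should not double-escape anything.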


Take a look at XML::Liberal. It appears to do just what you want. A very simple example (from one of the unit tests):

my $clean_xml = XML::Liberal->new('LibXML')->parse_string($bad_xml)->to_string();