I've got a Perl script which consumes an XML file on Linux and occasionally there开发者_开发百科 are CRLF (Hex 0D0A, Dos new lines) in some of the node values which.
The system which produces the XML file writes it all as a single line, and it looks as if it occasionally decides that this is too long and writes a CRLF into one of the data elements. Unfortunately there's nothing I can do about the providing system.
I just need to remove these from the string before I process it.
I've tried all sorts of regex replacement using the perl char classes, hex values, all sorts and nothing seems to work.
I've even run the input file through dos2unix before processing and I still can't get rid of the erroneous characters.
Does anyone have any ideas?
Many Thanks,
Typical, After battling for about 2 hours, I solved it within 5 minutes of asking the question..
$output =~ s/[\x0A\x0D]//g;
Finally got it.
$output =~ tr/\x{d}\x{a}//d;
These are both whitespace characters, so if the terminators are always at the end, you can right-trim with
$output =~ s/\s+\z//;
A few options:
1. Replace all occurrences of cr/lf with lf: $output =~ s/\r\n/\n/g; #instead of \r\n might want to use \012\015
2. Remove all trailing whitespace: output =~ s/\s+$//g;
3. Slurp and split:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
sub main{
createfile();
outputfile();
}
main();
sub createfile{
(my $file = $0)=~ s/\.pl/\.txt/;
open my $fh, ">", $file;
print $fh "1\n2\r\n3\n4\r\n5";
close $fh;
}
sub outputfile{
(my $filei = $0)=~ s/\.pl/\.txt/;
(my $fileo = $0)=~ s/\.pl/out\.txt/;
open my $fin, "<", $filei;
local $/; # slurp the file
my $text = <$fin>; # store the text
my @text = split(/(?:\r\n|\n)/, $text); # split on dos or unix newlines
close $fin;
local $" = ", "; # change array scalar separator
open my $fout, ">", $fileo;
print $fout "@text"; # should output numbers separated by comma space
close $fout;
}
精彩评论