Trying to understand Perl split() output

Developer (https://www.devze.com), 2023-03-07 22:11. Source: the web

Related topic: perl

I have a few lines of text that I'm trying to use Perl's split function to convert into an array. The problem is that I'm getting some unusual extra characters in the output, specifically the following string "\cM" (without the quotes). This string appears where there were line breaks in the original text; however, (I believe) those line breaks were removed in the text that I'm trying to split. Does anybody know what's going on with this phenomenon? I posted an example below. Thanks.

Here's the original plain text that I'm trying to split. I'm loading it from a file, in case that matters:

10b2obo12b2o2b$6b3obob3o8bob3o2b$2bobo10bo3b2obo4bo2b$2o4b2o5bo3b4obo
3b2o2b$2bob2o2bo4b3obo5b4obob$8bo4bo13b3o$2bob2o2bo4b3obo5b4obob$2o4b
2o5bo3b4obo3b2o2b$2bobo10bo3b2obo4bo2b$6b3obob3o8bob3o2b$10b2obo12b2o!

Here is my Perl code that is supposed to do the splitting:

while(<$FH>) {
    chomp;
    $string .= $_;
    last if m/!$/;
}

@rows = split(qr/\$/, $string);
print;          # a dummy line to provide a breakpoint for the debugger

This is what the debugger outputs when it gets to the "print" line. The issue I'm trying to deal with appears in elements 3, 7, and 10 of @rows:

DB<10> p $string
2o5bo3b4obo3b2o2b$2bobo10bo3b2obo4bo2b$6b3obob3o8bob3o2b$10b2obo12b2o!
DB<11> x @rows
0  '10b2obo12b2o2b'
1  '6b3obob3o8bob3o2b'
2  '2bobo10bo3b2obo4bo2b'
3  "2o4b2o5bo3b4obo\cM3b2o2b"
4  '2bob2o2bo4b3obo5b4obob'
5  '8bo4bo13b3o'
6  '2bob2o2bo4b3obo5b4obob'
7  "2o4b\cM2o5bo3b4obo3b2o2b"
8  '2bobo10bo3b2obo4bo2b'
9  '6b3obob3o8bob3o2b'
10  "10b2obo12b2o!\cM"

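You know, changing the input record separator ($/) would make this code a lot simpler. Note that any line break that falls inside a row stays in that row, so those still have to be stripped separately.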

$/ = '$';

my @rows = <$FH>;
chomp @rows;

print "@rows";
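
As a rough sketch of how the record-separator approach could be combined with removing the embedded line breaks: the sample data below is made up, and an in-memory filehandle stands in for $FH, so treat the names as assumptions rather than as the poster's actual setup.

```perl
use strict;
use warnings;

# Hypothetical sample: CR-LF line endings, with one row broken across two lines.
my $data = "3o\$2b\r\no\$4b!\r\n";

# An in-memory filehandle stands in for the real file here.
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my @rows;
{
    local $/ = '$';      # read up to each '$' instead of up to each newline
    @rows = <$fh>;
    chomp @rows;         # chomp removes a trailing '$' while $/ is '$'
}
close($fh);

s/[\r\n]+//g for @rows;  # drop CRs and LFs left inside (or after) a row

print join('|', @rows), "\n";   # prints 3o|2bo|4b!
```

Using local keeps the change to $/ confined to the block, so later reads elsewhere in the program still split on newlines.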

The debugger is probably using \cM to represent Ctrl-M, which is also known as a carriage return (also written \r or ^M). Text files from Windows use a CR-LF (carriage return, line feed) pair to represent the end of a line. If you read such a file on a Unix system, your chomp will strip off the Unix EOL (a single line feed) but leave the CR as is, and you end up with stray CRs in your string.
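
A small illustration of that behavior, assuming the default $/ of "\n" and a made-up line with a Windows-style ending:

```perl
use strict;
use warnings;

my $line = "4obo\r\n";             # hypothetical line ending in CR-LF

chomp(my $chomped = $line);        # removes only the trailing "\n"

# Make the leftover carriage return visible when printing.
(my $visible = $chomped) =~ s/\r/\\r/g;
print "after chomp: $visible\n";   # prints: after chomp: 4obo\r
```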

For a file like yours, you can just strip out all the trailing whitespace instead of using chomp:

while(defined(my $line = <$FH>)) {
    $line    =~ s/\s+$//;
    $string .= $line;
    last if($line =~ /!$/);
}

You don't say which OS you're on. Notice that the positions of the \cM characters coincide with the line endings of your input file, then check out binmode and what its documentation has to say about \cM:

http://perldoc.perl.org/functions/binmode.html
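
One possible way to apply this, sketched under the assumption that reading the file through Perl's :crlf I/O layer is acceptable. The temporary file and its contents are invented for the demonstration and are not the poster's data.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small hypothetical sample with CR-LF endings, byte for byte.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);                        # no newline translation on output
print {$out} "2o4b\r\n2o5bo!\r\n";
close($out) or die "Cannot close $path: $!";

# Read it back through the :crlf layer, which turns CR-LF into LF on input.
open(my $in, '<:crlf', $path) or die "Cannot open $path: $!";
my $string = '';
while (my $line = <$in>) {
    chomp $line;                      # the line now ends in a plain "\n"
    $string .= $line;
    last if $line =~ /!$/;
}
close($in);

print "$string\n";                    # prints 2o4b2o5bo!
```

With the layer in place the original chomp-based loop can stay as it was, since the carriage returns never reach it.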