How to use Perl to parse specified formatted text with regex?

Question abstract:

How do I parse a text file into two hashes in Perl, one storing the key-value pairs taken from the (X=Y) lines and the other those from the (X:Y) lines?

Both kinds of line are kept in one file, and only the symbol between the two digits tells them apart.

===============================================================================

I spent around 30 hours learning Perl last semester and managed to finish my Perl assignment in a "head first, ad hoc, ugly" way.

I just received my result for this section: 7/10. To be frank, I am not happy with it, particularly because it reminds me of how badly I struggled to use regular expressions on formatted data. The rules for the data are:

1= (the last digit in your student ID, or one if this digit is zero)
2= (the second last digit in your student ID, or one if this digit is zero)
3= (the third last digit in your student ID, or one if this digit is zero)
4= (the fourth last digit in your student ID, or one if this digit is zero)

2:1 
3:1  
4:1  
1:2  
1:3  
1:4  
2:3 (if the last digit in your student ID is between 0 and 4) OR
    3:4 (if the last digit in your student ID is between 5 and 9)
3:2 (if the second last digit in your student ID is between 0 and 4) OR
    4:3 (if the second last digit in your student ID is between 5 and 9)

An example of the above configuration file: if your student ID is 10926029, it has to be:

1=9  
2=2  
3=1  
4=6  
2:1  
3:1  
4:1  
1:2
1:3  
1:4  
3:4  
3:2

The assignment was about PageRank calculation, and the algorithm was simplified, so I came up with the answer to that part in 5 minutes. However, it was the text parsing part that took me heaps of time.

The first part of the text (Page=Pagerank) denotes the pages and their corresponding pageranks.

The second part (FromNode:ToNode) denotes the direction of a link between two pages.

For a better understanding, please go to my website and check the requirement file and my Perl script here

There are plenty of comments in the script, so I reckon it is not hard at all to see how stupid my solution was :(

If you are still on this page, let me explain why I am asking this question here on SO:

I got nothing else but "Result 7/10" with no comment from uni.

I am not studying for uni, I am learning for myself.

So, I hope the Perl gurus can at least point me in the right direction for solving this problem. My stupid solution was sort of "generic" and would probably work in Java, C#, etc. I am sure it is not even close to idiomatic Perl.

And, if possible, please let me know what level such a solution is at, e.g. whether I need to go through "Learning Perl ==> Programming Perl ==> Mastering Perl" to get there :)

Thanks for any hint and suggestion in advance.

Edit 1:

I have another question, posted here but closed, which pretty much describes how things go at my uni :(

Is this what you mean? The regex basically has three capture groups (denoted by the ()s). It should capture one digit, followed by either = or : (that's the capture group wrapping the character class [], which matches any character within it), followed by another single digit.

my ( %assign, %colon );

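while (<DATA>) {
    chomp;
    # Capture: left digit, separator (= or :), right digit; skip other lines.
    my ($l, $c, $r) = m/(\d)([=:])(\d)/ or next;

    if    ( q{=} eq $c ) { $assign{$l} = $r; }   # X=Y lines
    elsif ( q{:} eq $c ) { $colon{$l}  = $r; }   # X:Y lines (a repeated X overwrites)
}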

__DATA__
1=9  
2=2  
3=1  
4=6  
2:1  
3:1  
4:1  
1:2
1:3  
1:4  
3:4  
3:2
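
To see what each capture group holds, the same pattern can be wrapped in a small helper and tried on single lines. This is only a sketch: `parse_line` is a name made up for illustration, and anchoring the pattern with `^` and `\s*$` is an assumption about the file format, not something the assignment states.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer above): split one line into
# (left digit, separator, right digit). In list context it returns an
# empty list when the line does not have that shape.
sub parse_line {
    my ($line) = @_;
    return $line =~ m/^(\d)([=:])(\d)\s*$/;
}

for my $line ( "1=9  ", "3:4", "hello" ) {
    my ( $l, $c, $r ) = parse_line($line);
    if ( defined $c ) {
        print "left=$l separator='$c' right=$r\n";
    }
    else {
        print "no match: $line\n";
    }
}
```

The first two lines match and the third does not, so a caller can test the separator with `defined` before deciding which hash to fill.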

As for the recommendation, grab a copy of Mastering Regular Expressions if you can. It's very...thorough.

Well, if you don't want to validate any restrictions on the data file, you can parse this data pretty easily. The main issue lies in selecting the appropriate structure to store your data.

use strict;
use warnings;

use IO::File;

my $file_path = shift;  # Take file from command line

my %page_rank;
my %links;

my $fh = IO::File->new( $file_path, '<' )
    or die "Error opening $file_path - $!\n";

while ( defined( my $line = $fh->getline ) ) {
    chomp $line;

    # skip invalid lines; trailing whitespace (as in the sample data) is allowed
    next unless $line =~ /^(\d+)([=:])(\d+)\s*$/;

    my $page      = $1;
    my $delimiter = $2; 
    my $value     = $3;


    if( $delimiter eq '=' ) {

        $page_rank{$page} = $value;
    }
    elsif( $delimiter eq ':' ) {

        $links{$page} = [] unless exists $links{$page};

        push @{ $links{$page} }, $value;
    }

}

use Data::Dumper;
print Dumper \%page_rank;
print Dumper \%links;

The main way that this code differs from Pedro Silva's is that mine is more verbose and it also handles multiple links from one page properly. For example, my code preserves all values for links from page 1. Pedro's code discards all but the last.
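
The difference is easy to see on the three links that leave page 1 in the sample data. Here is a minimal sketch, with `%last_only` and `%all_links` as illustrative names rather than the variables from either answer:

```perl
use strict;
use warnings;

# The three links leaving page 1 in the sample data.
my @link_lines = ( '1:2', '1:3', '1:4' );

my ( %last_only, %all_links );

for my $line (@link_lines) {
    my ( $from, $to ) = split /:/, $line;

    # Plain scalar value: each assignment replaces the previous target.
    $last_only{$from} = $to;

    # Array reference value: push autovivifies the array on first use,
    # so every target is kept.
    push @{ $all_links{$from} }, $to;
}

print "scalar value: $last_only{1}\n";         # prints "scalar value: 4"
print "array value:  @{ $all_links{1} }\n";    # prints "array value:  2 3 4"
```

Because `push` autovivifies the array, the explicit `$links{$page} = [] unless exists $links{$page};` line in the script above is optional.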