开发者

How do I extract and parse quoted strings in Perl?

开发者 https://www.devze.com 2022-12-14 10:54 出处:网络
Good day. My text file content below. tmp.txt (a very big size file) constant fixup private AlarmFileName = <A \"C:\\\\TMP\\\\ALARM.LOG\">/* A Format */

Good day.

My text file content below. tmp.txt (a very big size file)

constant fixup private AlarmFileName = <A "C:\\TMP\\ALARM.LOG">  /* A Format */

constant fixup ConfigAlarms = <U1 0>         /*  U1 Format  */

constant fixup ConfigEvents = <U2 0>         /*  U2 Format  */

My parse code below. The code can't handle C:\\TMP\\ALARM.LOG (quoted string) here. I don't know how to replace the code "s+([a-zA-Z0-9])+>" to handle both the [a-zA-Z0-9] (0 above) string and the quated string ("C:\TMP\ALARM.LOG" above).

$source_file = "tmp.txt";
$dest_xml_file = "my.xml";

#Check existance of root directory
open(SOURCE_FILE, "$source_file") || die "Fail to open file $source_file";
open(DEST_XML_FILE, ">$dest_xml_file") || die "Coult not open output file $dest_xml_file";

$x = 0;

print DEST_XML_FILE  "<!-- from tmp.txt-->\n";
while (<SOURCE_FILE>) 
{
    &ConstantParseAndPrint;

}

sub ConstantParseAndPrint
{
 if ($x == 0)
 {

     if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9])+>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)
                {
                    $name1 = $1;
                    $name2 = $2;
                    $name3 = $3;
                    $name4 = $4;
                    $name5 = $5;
                    $name6 = $6;
                    $name7 = $7;
                    printf DEST_XML_FILE "\t\t$name1";
                    printf DEST_XML_FILE "\t\t$name2";
                    printf DEST_XML_FILE "\t\t$name3";
                    printf DEST_XML_FILE "\t\t$name4";
                    printf DEST_XML_FILE "开发者_Python百科\t\t$name5";
                    printf DEST_XML_FILE "\t\t$name6";
                    printf DEST_XML_FILE "\t\t$name7";
                    $x = 1;
  }
 }
}

Thank you for your input.

**HELLO ALL,

Thanks for so many great solutions. I am a newbew, i would like to do more study based on your post.

THANKS A LOT.**


I'm not going to write your regex for you or give you something to cut and paste into your code. At the rate your regex is going its going to break on the next special case anyway. What I will give you is a better approach.

Split each line into right and left hand side of the assignment.

my($lhs, $rhs) = split m{\s* = \s*}x, $line, 2;

Now its much easier to work with them individually. You can extract information from the left hand side by simply splitting it on whitespace to get all the flags (constant, fixup, etc...) and the last word will be the name being assigned to.

my @flags = split /\s+/, $lhs;
my $name  = pop @flags;

Then you can filter your lines by their @flags, if desired.

And the value, which is presumably inside brackets, can be gotten easily. Using a non-greedy regex ensures it correctly handles something like foo = <bar> /* comment <stuff> */.

my($value) = $rhs =~ /<(.*?)>/;

As you can see from this approach, it avoids having to guess what special keywords (constant, fixup, private) might appear in the file.

I have no idea what else might be in this file, you didn't say.


You've got some major design flaws in your code. I haven't addressed your problem, but I have cleaned up your code.

Most importantly, do not use global variables. In a relatively short chunk of code you are using 3 globals. This is BEGGING for mystery bugs that are impossible to track down. This becomes an even bigger issue as your project grows larger over time.

Look into using Perl::Critic. It will help you improve your code.

Here is an annotated, sanitized version of your code:

# Always use strict and warnings.
# It prevents bugs.
use strict;
use warnings;

my $source_file   = "tmp.txt";
my $dest_xml_file = "my.xml";

# You aren't checking the existence of anyting here:
#Check existance of root directory 
# Is this a TODO item?

# Use 3 argument open with a lexical filehandle.
# Adding $! to your error messages makes them more useful.
open my $source_fh, '<', $source_file
    or die "Fail to open file $source_file - $!";

open( my $dest_fh, '>', $dest_xml_file 
    or die "Coult not open output file $dest_xml_file - $!";

my $x = 0;  # What the heck does this do?  Give it a meaningful name or
            # delete it.

print $dest_fh  "<!-- from tmp.txt-->\n";
while (my $line = <$source_fh>)   
{

    # Don't use global variables.
    # Explicitly pass all data your sub needs.
    # Any values that need to be applied to external 
    # data should be applied by the calling function,
    # from data that is returned.

    $x = ConstantParseAndPrint( $line, $x, $dest_fh );

}

sub ConstantParseAndPrint {
    my $line          = shift;
    my $mystery_value = shift;
    my $fh            = shift;

    if($mystery_value == 0) {

        # qr{} is a handy way to build a regex.
        # using {} instead of // to mark the boundaries helps
        # cut down on the escaping required when your pattern
        # contains the '/' character.

        # Use the x regex modifier to allow whitespace and 
        # comments in your regex.
        # This very is important when you can't avoid using a big, complex regex.

        # But really don't do it this way at all.
        # Do what Schwern says.
        my $line_match = qr{
            ^                      \s*  # Skip leading spaces
            (constant)             \s*  # look for the constant keyword
            (fixup|/\*fixup\*/|)   \s*  # look for the fixup keyword
            (private|)             \s*  # look for the prive keyword
            (\w+)                  \s+  # Get parameter name
            =                      \s+  
            <                           # get bracketed values
            ([a-zA-Z0-9]+)         \s+  # First value 
            ([a-zA-Z0-9])+              # Second value
            >                      \s*
            (/\*\s*(.*?)\s*\*/|)        # Find any trailing comment
            (\r|\n|\s)                  # Trailing whitespace
        }x;


        if( $line =~ /$line_match/ ) {

            # Any time you find yourself making variables
            # with names like $foo1, $foo2, etc, use an array.

            my @names = ( $1, $2, $3, $4, $5, $6, $7 );

            # printf is for printing formatted data.  
            # If you aren't using any format codes, use print.

            # Using an array makes it easy to print all the tokens.
            print $fh "\t\t$_" for @names;

            $mystery_value = 1;

        }
    }

    return $mystery_value;
}

As to your parsing question, follow Schwern's advice. Big, complex regexes are a sign that you need to simplify. Break big problems into manageable tasks.


As was mentioned, you need some structure in your regex. In refatoring your code, I made a couple assumptions

  • You don't want to just print it out in a tabbed delimited format
  • The only reason for the $x variable is so that you only print one line. (although, a last at the end of the loop would have worked just fine.).

Having assumed these things, I decided that, in addressing your question, I would:

  1. Show you how to make a good modifiable regex.
  2. Code very simple "semantic actions" which store the data and let you use it as you please.

In addition is should be noted that I changed input to a __DATA__ section and output is restricted to STDERR--through the use of Smart::Comment comments, that hep me inspect my structures.

First the code preamble.

use strict;   # always in development!
use warnings; # always in development!
use English qw<$LIST_SEPARATOR>; # It's just helpful.
#use re 'debug';
#use Smart::Comments

Note the commented-out use re.... If you really want to see the way a regular expression gets parsed, it will put out a lot of information that you probably don't want to see (but can make your way through--with a little knowledge about regex parsing, nonetheless.) It's commented out because it is just not newbie friendly, and will monopolize your output. (For more about that see re.)

Also commented out is the use Smart::Comments line. I recommend it, but you can get by using Data::Dumper and print Dumper( \%hash ) lines. (See Smart::Comments.)

Specifying the Expression

But on to the regex. I used an exploded form of regex so that the parts of the whole are explained (see perlre). We want a single alphanumeric character OR a quoted string (with allowed escapes).

We also used a list of modifier names, so that the "language" can progress.

The next regex we make in a "do block" or as I like to call it a "localization block", so that I can localize $LIST_SEPARATOR (aka $") to be the regex alternation character. ('|'). Thus when I include the list to be interpolated, it is interpolated as an alternation.

I'll give you time to look at the second regex before talking about it.

# Modifiable list of modifiers
my @mod_names = qw<constant fixup private>;
# Break out the more complex chunks into separate expressions
my $arg2_regex 
    = qr{ \p{IsAlnum}             # accept a single alphanumeric character
        |                         # OR 
          "                       # Starts with a double quote
          (?>                     # -> We just want to group, not capture
                                  # the '?> controls back tracing
              [^\\"\P{IsPrint}]+  # any print character as long as it is not
                                  # a backslash or a double quote
          |   \\"                 # but we will accept a backslash followed by
                                  # a double quote
          |   (\\\\)+             # OR any amount of doubled backslashes
          )*                      # any number of these
          "
        }msx;

my $line_RE 
    = do { local $LIST_SEPARATOR = '|';
           qr{ \A                # the beginning
               \s*               # however much whitespace you need
               # A sequence of modifier names followed by space
               ((?: (?: @mod_names ) \s+ )*)
               ( \p{IsAlnum}+ )  # at least one alphanumeric character
               \s*               # any amount of whitespace
               =                 # an equals sign
               \s*               # any amount of whitespace
               <                 # open angle bracket
                 (\p{IsAlnum}+)  # Alphanumeric identifier
                 \s+             # required whitespace
                 ( $arg2_regex ) # previously specified arg #2 expression
                 [^>]*?
               >                 # close angle bracket
             }msx
             ;   
          }; 

The regex just says that we want any number of recognized "modifiers" separated by whitespace followed by an alphanumeric idenfier (I'm not sure why you don't want underscores; I don't include them, regardless.)

That is followed by any amount of whitespace and an equals sign. Since the sets of alphanumeric characters, whitespace, and the equals sign are all disjoint, there is no reason to require whitespace. On the other side of the equals sign, the value is delimited by angle brackets, so I don't see any reason to require whitespace on that side either. Before the equals all you've allowed is alphanumerics and whitespace and on the other side, it all has to be in angle brackets. Required whitespace gives you nothing, while not requiring it is more fault-tolerant. Ignore all that and change the *s to + if you are expecting a machine output.

On the other side of the equals sign, we require an angle bracket pair. The pair consists of an alphanumeric argument, with the second argument being EITHER a single alphanumeric character (based on your spec) OR a string which can contain escaped escapes or quotes and even the end angle bracket--as long as the string doesn't end.

Storing the Data

Once the specification has been made, here's just one of the things you can do with it. Because I don't know what you wanted to do with this besides print it out--which I'm going to assume is not the whole purpose of the script.

### $line_RE
my %fixup_map;
while ( my $line = <DATA> ) { 
    ### $line
    my ( $mod_text, $identifier, $first_arg, $second_arg ) 
        = ( $line =~ /$line_RE/ )
        ;
    die 'Did not parse!' unless $identifier;
    $fixup_map{$identifier}
        = { modifiers_for => { map { $_ => 1 } split /\s+/, $mod_text }
          , first_arg     => $first_arg
          , second_arg    => $second_arg
          };

    ### $fixup_map{$identifier} : $fixup_map{$identifier}
}
__DATA__
constant fixup ConfigAlarms  = <U1 0>
constant fixup ConfigAlarms2 = <U1 2>
constant fixup private AlarmFileName = <A "C:\\TMP\\ALARM.LOG">

At the end you can see the DATA section, when you're at the beginning stage as you seem to be here, it's most convenient to dispense with IO logic and use the builtin handle DATA as I do here.

I collect the modifiers in a hash, so that my semantic actions could be

#...
my $data = $fixup_map{$id};
#...
if ( $data->{modifiers_for}{public} ) {
    #...
}

Soap Box

The main problem however is that you don't seem to have a plan. For the second "argument" in the angle brakets, you have a regex that specifies only a single alphanumeric character, but want to expand it to allow escaped strings. I have to expect that you are implementing a small subset and gradually want expand it do do other things. If you neglect a good design from the beginning, it's only going to become more and more of a headache to implement the full-featured "parser".

You may want to implement multi-line values at some point. If you don't understand how to get from a single alphanumeric to a quote-delimited argument, the line-by-line method and the adjustments to the regex dwarf that complexity gap.

So I advise you to use the code here only as a guideline for expanding complexity. I'm answering a question and indicating a direction, not designing or coding a project, so my regex code isn't as expandable as it probably should be.

If the parsing job was complex enough, I would specify a minimal lookahead grammar for Parse::RecDescent, and stick to coding the semantic actions. That's another recommendation.


#!/usr/bin/perl


$source_file = "tmp.txt";
$dest_xml_file = "my.xml";

#Check existance of root directory
open(SOURCE_FILE, "$source_file") || die "Fail to open file $source_file";
open(DEST_XML_FILE, ">$dest_xml_file") || die "Coult not open output file $dest_xml_file";

$x = 0;

print DEST_CS_FILE  "<!-- from tmp.txt-->\n";
while (<SOURCE_FILE>)   
{
    &ConstantParseAndPrint;

}

sub ConstantParseAndPrint
{
    if ($x == 0)
    {

#        if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9])+>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)
        if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+(["']?)([a-zA-Z0-9.:\\]+)\6>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)

                {
                    $name1 = $1;
                    $name2 = $2;
                    $name3 = $3;
                    $name4 = $4;
                    $name5 = $5;
                    $name6 = $7;
                    $name7 = $8;
                    printf DEST_XML_FILE "\t\t$name1";
                    printf DEST_XML_FILE "\t\t$name2";
                    printf DEST_XML_FILE "\t\t$name3";
                    printf DEST_XML_FILE "\t\t$name4";
                    printf DEST_XML_FILE "\t\t$name5";
                    printf DEST_XML_FILE "\t\t$name6";
                    printf DEST_XML_FILE "\t\t$name7\n";
#                    $x = 1;
        }
    }
}



Use the following parse code:

if(/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+(["']?)([a-zA-Z0-9.:\\]+)\6>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/) 

I have added handling of both the single and double quotes. I use back-reference for quotes matching. Also I have updated the character class for path. i.e. it now includes the colon(:), dot(.), and backslash() along with alpha-numeric characters.


I have intentionally removed match captures (you can add them if you want):

m{^\s*constant\s+fixup\s+(?:private\s+)?\w+\s*=\s*<[^>]+>(?:\s*/\*(?:\s*\w*)+\*/)?$};


unify first!

$yourstring =~ s,\\,/,g;  # transform '\' into '/'
$yourstring =~ s,/+,/,g;  # transform multiple '/' into one '/'
0

精彩评论

暂无评论...
验证码 换一张
取 消