Split line by 2 rules using the same syntax

Developer (开发者) https://www.devze.com 2023-04-03 21:12 Source: web
I have a line:

Jon Favreau, Stan Lee, Justin Theroux, Robert Downey Jr. (Tony Stark) Gwyneth Paltrow (Pepper Potts) Don Cheadle (James Rhodes)

I want to split the line on commas and closing parentheses, to get this result:

Jon Favreau
Stan Lee
Justin Theroux
Robert Downey Jr. (Tony Stark)
Gwyneth Paltrow (Pepper Potts)
Don Cheadle (James Rhodes)

Edit: Special situation

line: Jon Favreau, Stan Lee, Justin Theroux, Robert Downey (Jr.) (Tony Stark) Gwyneth Paltrow (Pepper Potts) Don Cheadle (James Rhodes)

with the word (Jr.) in brackets. Expected output:

Jon Favreau
Stan Lee
Justin Theroux
Robert Downey (Jr.) (Tony Stark)
Gwyneth Paltrow (Pepper Potts)
Don Cheadle (James Rhodes)


When using split, you have to decide whether to throw away the delimiters or preserve them. In your case, you want to preserve one delimiter (the close parenthesis) and throw away the other (the comma). In addition, you probably want to throw away any spaces following those delimiters.

Delimiters can be preserved by:

  1. Enclosing the split pattern in capturing parentheses. In this case, the delimiters themselves will end up as separate strings, interspersed with your result, which isn't quite what you want.

  2. Specifying the delimiter within a zero-width assertion (look-behind, look-ahead, etc). This excludes the delimiter from the matched string, thus preventing it from being discarded.

The second approach will work well for you.

my @actors = split /(?<=\)) *|, */, $line;
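To make the difference between the two approaches concrete, here is a minimal sketch that runs both on the line from the question. The variable names $line and @with_delims are only illustrative, and the capturing pattern in the first split is an assumption about what "approach 1" would look like here.

```perl
use strict;
use warnings;

my $line = 'Jon Favreau, Stan Lee, Justin Theroux, Robert Downey Jr. (Tony Stark) '
         . 'Gwyneth Paltrow (Pepper Potts) Don Cheadle (James Rhodes)';

# Approach 1: capturing parentheses keep the delimiters,
# but each one comes back as a separate field.
my @with_delims = split /(\)|,) */, $line;

# Approach 2: the look-behind leaves ")" attached to the preceding field,
# while the comma alternative is consumed and discarded.
my @actors = split /(?<=\)) *|, */, $line;

print "$_\n" for @actors;
```

With approach 1 the list alternates between text and delimiter, so it would need a second pass; approach 2 yields the six names directly.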

To handle the more complicated scenario in your edited question, such as "Robert Downey (Jr.) (Tony Stark)", you could add another zero-width assertion:

my $actor_regex = qr'
    (?<=     \) )  # Look-behind: close paren.
    \s*
    (?!  \s* \( )  # Negative look-ahead: opening paren.
    |
    , \s*          # Or the other delimiter.
'x;

my @items = split $actor_regex, $line;
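As a quick check, the following sketch applies the same pattern to the line from the edited question. The variable name $special is illustrative, and the regex is written with qr/.../x rather than qr'...'x, which should behave the same here because nothing in it interpolates.

```perl
use strict;
use warnings;

my $special = 'Jon Favreau, Stan Lee, Justin Theroux, Robert Downey (Jr.) (Tony Stark) '
            . 'Gwyneth Paltrow (Pepper Potts) Don Cheadle (James Rhodes)';

my $actor_regex = qr/
    (?<= \) )      # preceded by a close paren
    \s*
    (?! \s* \( )   # but not followed by another open paren
    |
    , \s*          # or a comma with trailing whitespace
/x;

my @items = split $actor_regex, $special;
print "[$_]\n" for @items;
```

The negative look-ahead rejects the position after "(Jr.)" because an opening parenthesis follows, so "Robert Downey (Jr.) (Tony Stark)" stays in one piece.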


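First add a comma after each ), then split on (and discard) the commas. The result is assigned to an array explicitly, because modern Perl no longer populates @_ implicitly when split is called in void context:

perl -e '$_="Jon Favreau, ...";s/\)/),/g;my @names=split /,/;foreach (@names) {s/^ //;print "$_\n"}'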

Yields:

Jon Favreau
Stan Lee
Justin Theroux
Robert Downey Jr. (Tony Stark)
Gwyneth Paltrow (Pepper Potts)
Don Cheadle (James Rhodes)


A helpful rule of thumb attributed to Randal Schwartz is to use split when you know what you want to throw away or m// and capturing parentheses when you know what you want to keep. Applying it to your question, however, is a little tricky because you want to do both. That is, either

  • throw away the terminating comma, or
  • keep the right-parenthesis

The program below uses m// and capturing, so it defines the problem in terms of what it wants to keep. The ) on the end is easy, of course. To keep the comma out of the capture buffer, the code uses a positive look-ahead assertion: the capture should stop on the character just before a comma.

A possibility that's easy to miss is that a name should also be permitted to terminate at end-of-string. Say Stan Lee had been the last name rather than the second. Without the $ alternative, Stan would have been left out.

The code uses DEFINE and named subpatterns to help the reader understand the regex. The downside of this approach is it generates extra capture buffers, so you have to use a loop instead of @names = /$name_pattern/g.

As written, it accepts a slightly larger language than what you specified in your question, viz., it permits and discards a comma between two actors who both also have character names.

#! /usr/bin/env perl

use warnings;
use strict;

*ARGV = *DATA; # for demo only

my $name_pattern = qr/
  ( # capture into $1
    (?&name) (?: (?&comma_terminated) | \) | $ )
  )

  # discard trailing whitespace and optional comma
  (?: \s* (?: , \s*)? )

  (?(DEFINE)
    (?<name>             .+?    )
    (?<comma_terminated> (?= ,) )
  )
/x;

while (<>) {
  my @names;
  push @names, $1 while /$name_pattern/gx;

  print "[$_]\n" for @names;
}

__DATA__
Jon Favreau, Stan Lee, Justin Theroux, Robert Downey Jr. (Tony Stark) Gwyneth Paltrow (Pepper Potts) Don Cheadle (James Rhodes) foo

Output:

[Jon Favreau]
[Stan Lee]
[Justin Theroux]
[Robert Downey Jr. (Tony Stark)]
[Gwyneth Paltrow (Pepper Potts)]
[Don Cheadle (James Rhodes)]
[foo]


One way of doing this could be:

my @items = split(/(\)|,)/, $line);

If you print that list out, you'll get something like:

Jon Favreau
,
 Stan Lee
,
 Justin Theroux
,
 Robert Downey Jr. (Tony Stark
)
 Gwyneth Paltrow (Pepper Potts
)
 Don Cheadle (James Rhodes
)

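The individual items sit at the even-numbered indices of that list (0, 2, 4, ...) and the delimiters at the odd ones. All you need then is to re-assemble the items: re-attach each ")" to the item before it, drop the commas, and trim the leading spaces.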
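One possible way to do that re-assembly is sketched below. It walks the list in text/delimiter pairs; the names @pieces, $text and $delim are illustrative, and the sketch only covers the simple case, not the "(Jr.) (Tony Stark)" variant from the edited question.

```perl
use strict;
use warnings;

my $line = 'Jon Favreau, Stan Lee, Justin Theroux, Robert Downey Jr. (Tony Stark) '
         . 'Gwyneth Paltrow (Pepper Potts) Don Cheadle (James Rhodes)';

# Capturing parentheses make split return the delimiters as separate fields.
my @pieces = split /(\)|,)/, $line;

my @items;
while (@pieces) {
    my $text  = shift @pieces;           # even index: the text
    my $delim = shift(@pieces) // '';    # odd index: the delimiter, if any
    $text .= $delim if $delim eq ')';    # keep ")" but drop ","
    $text =~ s/^\s+//;                   # trim the space left after the delimiter
    push @items, $text if length $text;
}

print "$_\n" for @items;
```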


Mat already hit the spot; I just added some cleanup in my version:

my $names =
"Jon Favreau, Stan Lee, Justin Theroux, Robert Downey Jr. (Tony Stark) Gwyneth Paltrow (Pepper Potts) Don Cheadle (James Rhodes)";

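# Split on either delimiter. Inside a character class "|" is a literal pipe, so it is left out.
my @names = split( /[,)]/, $names );
foreach my $name (@names) {
    $name .= ")" if $name =~ /\(/;    # restore the ")" consumed by the split
    $name =~ s/^ //;                  # strip the leading space
}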