How do I split Chinese characters one by one?

开发者 https://www.devze.com 2022-12-28 15:29 Source: the web
If there is no special character (such as whitespace, ":", etc.) between the first name and the last name, how can I split the Chinese characters below?

use strict; 
use warnings; 
use Data::Dumper;  

my $fh = \*DATA;  
my $fname; # 小三; 
my $lname; # 张 ;
while(my $name = <$fh>)
{

    $name =~ ??? ;
    print "$fname\n";
    print $lname;

}

__DATA__  
张小三

Output

小三
张

[Update]

I am using ActivePerl 5.10.1 on Windows XP.


Your problem is that you do not decode the binary input into Perl character strings when reading, and do not encode Perl strings back into binary data when writing. This matters because regular expressions and the related split function only work character by character on decoded Perl strings; on undecoded input they operate on individual bytes.
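As a small illustration of the difference between bytes and characters, the sketch below builds the UTF-8 bytes of 张小三 by hand and compares the undecoded and decoded strings. The variable names are made up for this example, and it assumes the input is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The raw UTF-8 bytes of 张小三, as they would arrive from a file or STDIN.
my $bytes = "\xE5\xBC\xA0\xE5\xB0\x8F\xE4\xB8\x89";

# Decoding turns the 9 bytes into a Perl string of 3 characters.
my $chars = decode('UTF-8', $bytes);

# The same split behaves differently on the two strings:
# on bytes it cuts after the first byte, on characters after the first character.
my ($first_raw)  = split(/(?<=.)/, $bytes, 2);
my ($first_char) = split(/(?<=.)/, $chars, 2);

printf "undecoded: length %d, first piece is %d byte(s)\n",
    length($bytes), length($first_raw);
printf "decoded:   length %d, first piece is U+%04X\n",
    length($chars), ord($first_char);
```

On the undecoded string the first piece is a single byte, which is only one third of 张, so the output would be broken text.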

The pattern (?<=.) matches any position that is preceded by a character; with a limit of 2, split therefore cuts exactly once, right after the first character. As such, this program will not work correctly on 复姓 (compound family names); they are rare, but they do exist. To always split a name correctly into family name and given name, you need a dictionary of family names.

Linux version:

use strict;
use warnings;
use Encode qw(decode encode);

while (my $full_name = <DATA>) {
    $full_name = decode('UTF-8', $full_name);
    chomp $full_name;
    my ($family_name, $given_name) = split(/(?<=.)/, $full_name, 2);
    print encode('UTF-8',
        sprintf('The full name is %s, the family name is %s, the given name is %s.', $full_name, $family_name, $given_name)
    );

}

__DATA__
张小三

Output:

The full name is 张小三, the family name is 张, the given name is 小三.

Windows version:

use strict;
use warnings;
use Encode qw(decode encode);
use Encode::HanExtra qw();

while (my $full_name = <DATA>) {
    $full_name = decode('GB18030', $full_name);
    chomp $full_name;
    my ($family_name, $given_name) = split(/(?<=.)/, $full_name, 2);
    print encode('GB18030',
        sprintf('The full name is %s, the family name is %s, the given name is %s.', $full_name, $family_name, $given_name)
    );

}

__DATA__
张小三

Output:

The full name is 张小三, the family name is 张, the given name is 小三.
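The dictionary lookup mentioned in this answer might be sketched roughly as follows. This is only an illustration: the helper name split_name is hypothetical, the three compound surnames are a tiny sample rather than a complete list, and the script is assumed to be saved as UTF-8 (hence use utf8).

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# Deliberately tiny sample; a real list of compound surnames is much longer.
my @compound_surnames = ('欧阳', '司马', '诸葛');

# Takes a decoded name and returns (family name, given name).
sub split_name {
    my ($full_name) = @_;
    for my $surname (@compound_surnames) {
        # Use the compound surname only if a given name remains after it.
        if (index($full_name, $surname) == 0
            && length($full_name) > length($surname)) {
            return ($surname, substr($full_name, length($surname)));
        }
    }
    # Fallback heuristic: the first character is the family name.
    return (substr($full_name, 0, 1), substr($full_name, 1));
}

binmode STDOUT, ':encoding(UTF-8)';
printf "%s / %s\n", split_name($_) for ('张小三', '欧阳修');
```

A lookup like this is still a heuristic: a person with the single-character surname 欧 and a given name starting with 阳 would be split incorrectly, so ambiguous names need extra information.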


You'll need some kind of heuristic to separate the first and last names. Here's some working code that assumes that the last name (surname) is one character (the first) and all the remaining characters (at least one) belong to the first name (given name):

EDIT: Changed program to ignore invalid lines rather than dying.

use strict;
use utf8;

binmode STDOUT, ":utf8";

while (my $name = <DATA>) {
    my ($lname, $fname) = $name =~ /^(\p{Han})(\p{Han}+)$/ or next;
    print "First name: $fname\nLast name: $lname\n";
}

__DATA__  
张小三

When I run this program from the command line, I get this output:

First name: 小三
Last name: 张


This splits the string into individual characters (\X matches one extended grapheme cluster) and assigns the first two of them to $fname and $lname.

my ($fname, $lname) = $name =~ m/ ( \X ) /gx;

Though I think your example and your question don't really match (one of the two names in your example has two characters).
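To cover the two-character case, one option is to collect every character first and then join everything after the first one. The sketch below assumes the script is saved as UTF-8 and that the surname is the first character; the helper name graphemes is made up for this example.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

binmode STDOUT, ':encoding(UTF-8)';

# Return the characters (extended grapheme clusters) of a decoded string.
# Must be called in list context.
sub graphemes {
    my ($text) = @_;
    return $text =~ /(\X)/g;
}

my $name = '张小三';

# First character is taken as the last name; the rest form the first name.
my ($lname, @rest) = graphemes($name);
my $fname = join '', @rest;

print "$_\n" for graphemes($name);    # one character per line
print "Last name: $lname\nFirst name: $fname\n";
```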
