I am trying to perform some composition-based filtering on a large collection of strings (protein sequences).
I wrote a group of three subroutines in order to take care of it, but I'm running into trouble in two ways - one minor, one major. The minor trouble is that when I use List::MoreUtils 'pairwise' I get warnings about using$a
and $b
only once and them being uninitialized. But I believe I'm calling this method properly (based on CPAN's entry for it and some examples from the web).
The major trouble is an error "Can't use string ("17/32") as HASH ref while "strict refs" in use..."
It seems like this can only happen if the foreach
loop in &comp
is giving the hash values as a string instead of evaluating the division operation. I'm sure I've made a rookie mistake, but can't find the answer on the web. The first time I even looked at perl code was last Wednesday...
use List::Util;
use List::MoreUtils;
my @alphabet = (
'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V'
);
my $gapchr = '-';
# Takes a sequence and returns letter => occurrence count pairs as hash.
sub getcounts {
my %counts = ();
foreach my $chr (@alphabet) {
$counts{$chr} = ( $_[0] =~ tr/$chr/$chr/ );
}
$counts{'gap'} = ( $_[0] =~ tr/$gapchr/$gapchr/ );
return %counts;
}
# Takes a sequence and returns letter => fractional composition pairs as a hash.
sub comp {
my %comp = getcounts( $_[0] );
foreach my $chr (@alphabet) {
$comp{$chr} = $comp{$chr} / ( length( $_[0] ) - $comp{'gap'} );
}
return %comp;
}
# Takes two sequences and returns a measure of the composition difference between them, as a scalar.
# Originally all on one line but it was unreadable.
sub dcomp {
my @dcomp = pairwise { $a - $b } @{ values( %{ comp( $_[0] ) } ) }, @{ values( %{ comp( $_[1] ) } ) };
@dcomp = apply { $_ ** 2 } @dcomp;
m开发者_开发知识库y $dcomp = sqrt( sum( 0, @dcomp ) ) / 20;
return $dcomp;
}
Much appreciation for any answers or advice!
There are a few bugs in your code. First, note from perldoc perlop:
Because the transliteration table is built at compile time, neither the
SEARCHLIST
nor theREPLACEMENTLIST
are subjected to double quote interpolation.
So your counting method is incorrect. I also believe you are misusing pairwise
. It is hard to evaluate what constitutes correct usage because you do not give examples of what output you should get with some simple inputs.
In any case, I would re-write this script as (there are some debugging statements sprinkled in):
#!/usr/bin/perl
use List::AllUtils qw( sum );
use YAML;
our ($a, $b);
my @alphabet = ('A' .. 'Z');
my $gap = '-';
my $seq1 = 'ABCD-EFGH--MNOP';
my $seq2 = 'EFGH-ZZZH-KLMN';
print composition_difference($seq1, $seq2);
sub getcounts {
my ($seq) = @_;
my %counts;
my $pattern = join '|', @alphabet, $gap;
$counts{$1} ++ while $seq =~ /($pattern)/g;
warn Dump \%counts;
return \%counts;
}
sub fractional_composition_pairs {
my ($seq) = @_;
my $comp = getcounts( $seq );
my $denom = length $seq - $comp->{$gap};
$comp->{$_} /= $denom for @alphabet;
warn Dump $comp;
return $comp;
}
sub composition_difference {
# I think your use of pairwise in the original script
# is very buggy unless every sequence always contains
# all the letters in the alphabet and the gap character.
# Is the gap character supposed to factor in the computations here?
my ($comp1, $comp2) = map { fractional_composition_pairs($_) } @_;
my %union;
++ $union{$_} for (keys %$comp1, keys %$comp2);
my $dcomp;
{
no warnings 'uninitialized';
$dcomp = sum map {
($comp1->{$_} - $comp2->{$_}) ** 2
} keys %union;
}
return sqrt( $dcomp ) / 20; # where did 20 come from?
}
%{ $foo }
will treat $foo
as a hash reference and dereference it; similarly, @{}
will dereference array references. Since comp
returns a hash as a list (hashes becomes lists when passed to and from functions) and not a hash reference, the %{}
is wrong. You could potentially leave off the %{}
, but values
is a special form and needs a hash, not a hash passed as a list. To pass the result of comp
to values
, comp
needs to return a hash ref that then gets dereferenced.
There's another problem with your dcomp
, namely that the order of values
(as the documentation says) "are returned in an apparently random order", so the values passed to the pairwise
block aren't necessarily for the same character. Instead of values
, you can use hash slices. We're now back to comp
returning a hash (as a list).
sub dcomp {
my %ahisto = comp($_[0]);
my %bhisto = comp($_[1]);
my @keys = uniq keys %ahisto, keys %bhisto;
my @dcomp = pairwise { $a - $b } , @ahisto{@keys}, @bhisto{@keys};
@dcomp = apply { $_ ** 2 } @dcomp;
my $dcomp = sqrt( sum( 0, @dcomp ) ) / 20;
return $dcomp;
}
This doesn't address what happens if a character appears in only one of $_[0]
and $_[1]
.
uniq
left as an exercise for the reader.
Just going through the code that you provided, this is how I would have written it. I don't know if this will work the way you wanted it to work though.
use strict;
use warnings;
our( $a, $b );
use List::Util;
use List::MoreUtils;
my @alphabet = split '', 'ARNDCQEGHILKMFPSTWYV';
my $gapchr = '-';
# Takes a sequence and returns letter => occurrence count pairs as hash.
sub getcounts {
my( $sequence ) = @_;
my %counts;
for my $chr (@alphabet) {
$counts{$chr} = () = $sequence =~ /($chr)/g;
# () = forces list context
}
$counts{'gap'} = () = $sequence =~ /($gapchr)/g;
return %counts if wantarray; # list context
return \%counts; # scalar context
# which is what happens inside of %{ }
}
# Takes a sequence and returns letter => fractional composition pairs as a hash
sub comp {
my( $sequence ) = @_;
my %counts = getcounts( $sequence );
my %comp;
for my $chr (@alphabet) {
$comp{$chr} = $comp{$chr} / ( length( $sequence ) - $counts{'gap'} );
}
return %comp if wantarray; # list context
return \%comp; # scalar context
}
# Takes two sequences and returns a measure of the composition difference
# between them, as a scalar.
sub dcomp {
my( $seq1, $seq2 ) = @_;
my @dcomp = pairwise { $a - $b }
@{[ values( %{ comp( $seq1 ) } ) ]},
@{[ values( %{ comp( $seq2 ) } ) ]};
# @{[ ]} makes a list into an array reference, then dereferences it.
# values always returns a list
# a list, or array in scalar context, returns the number of elements
# ${ } @{ } and %{ } forces their contents into scalar context
@dcomp = apply { $_ ** 2 } @dcomp;
my $dcomp = sqrt( sum( 0, @dcomp ) ) / 20;
return $dcomp;
}
One of the most important things you need to know are the differences between scalar, list, and void contexts. This is because everything behaves different, in the different contexts.
re: Minor problem
This is fine and a common problem with (some) of the List::Util
and List::MoreUtils
modules.
One way to remove warnings is simply to declare those special variables
in advance like so:
our ($a, $b);
Another would be to precede the pairwise
with:
no warnings 'once';
See perlvar for more info about $a and $b
/I3az/
精彩评论