开发者

How do I find the count and location of two substrings in two different files?

开发者 https://www.devze.com 2023-02-08 10:55 出处:网络
From two seque开发者_StackOverflow社区nces given I need to check for every three codons and if the changes are same as in the following list, then I have to check out the location of changes and the c

From two seque开发者_StackOverflow社区nces given I need to check for every three codons and if the changes are same as in the following list, then I have to check out the location of changes and the codons which are changed and count their number of occurrences.

For example:

sequence 1 - TTCAUUUCCCAU
sequence 2 - TTTAUAUCGCAC

The output which I need to get is

TTC->TTT considered/location-1/count-1
AUU->AUA considered/location-2/count-1
UCC->UCG considered/location-3/count-1

NOTE: CAU->CAC not considered because it is not there in the following list. LIST:-> The direction of changes should also be considered.

first sequence->second sequence
TTC->TTT
CTG->UUA
AUU->AUA
GUG->GUA
UCC->UCG
CCC->CCG
ACC->ACG
GCC->GCG
UAC->UAU
UGA->UAG
CAC->CAU
CAG->CAA
AAC->AAU
AAG->AAA
GAC->GAU
GAG->GAA
UGC->UGU
CGG->CGU
AGC->AGU
AGG->CGU
AGA->CGU
UAA->UAG
GGC->GGU

The code which I have written until now is:

print "Enter the sequence:";
$a = <>;

print "Enter the mutated sequence:";
$b = <>;

chomp($a);
chomp($b);

my @codon = split(/(\w{3})/, $a);
my @codon1 = split(/(\w{3})/, $b);

open(OUT, ">output.txt") or die;
$count = 0;
@new = ();
@new1 = ();
for ($i = 0; $i <= $#codon; $i++) {
    for ($j = 0; $j <= $#codon1; $j++) {
        if ($codon[$i] = {TTC}) || ($codon1[$j] = {TTT}) {
            $count++;
        }
    }
}
print OUT " @new";
close OUT;


#!/usr/bin/env perl
use strict;

my %seq_map = (
    "TTC"=>"TTT",
    "CTG"=>"UUA",
    "AUU"=>"AUA",
    "GUG"=>"GUA",
    "UCC"=>"UCG",
    "CCC"=>"CCG",
    "ACC"=>"ACG",
    "GCC"=>"GCG",
    "UAC"=>"UAU",
    "UGA"=>"UAG",
    "CAC"=>"CAU",
    "CAG"=>"CAA",
    "AAC"=>"AAU",
    "AAG"=>"AAA",
    "GAC"=>"GAU",
    "GAG"=>"GAA",
    "UGC"=>"UGU",
    "CGG"=>"CGU",
    "AGC"=>"AGU",
    "AGG"=>"CGU",
    "AGA"=>"CGU",
    "UAA"=>"UAG",
    "GGC"=>"GGU"
);

my %seq_count = ();

my $seq1 = "TTCAUUUCCCAU";
my $seq2 = "TTTAUAUCGCAC";

my $max = int(length($seq1) / 3);
for(my $i=0;$i<$max;$i++) {
    my $c1 = substr($seq1, $i*3, 3);
    my $c2 = substr($seq2, $i*3, 3);
    my $found = $seq_map{$c1};
    if ($found && ($found eq $c2)) {
        $seq_count{$c1} ||= 0;
        my $count = ++$seq_count{$c1};
        my $loc = $i+1;
        print "${c1}->${c2} considered / location ${loc} / count ${count}\n";
    }
}


There are many ways to accomplish this, as is the case typically in Perl.

If the file is not large, you can read in the file line by line into an array (or if it is already one entry per line, then just slurp the whole file into an array). Then use a while loop (and the second file's file handle) to compare the position of the dinucleotides.

Because this is a bioinformatics problem, and the files are typically large, I would be smart and look into reading from each file handle, line by line, and doing comparasons.

For the 3 character split you are trying to do, I would use a for loop, going until the length of the string you are checking divided by 3 -1. Then create a regex as you go on to grab the first three letters, then the next, and so on…

Something like /\d{$count}(\w{3})/

The while loop could look something like this:

#!/usr/bin/perl -w
use strict;

open FILE1, "file1.txt" or die "Cannot open file1.txt: $!\n";
open FILE2, "file2.txt" or die "Cannot open file2.txt: $!\n";

my $count = 0;
while (<FILE1>) {
    chomp(my $lineF1 = $_);
    chomp(my $lineF2 = <FILE2>);

    # some changes may need to be made to this if statement
    if ($lineF1 eq $lineF2) {
        # do something important here
        print "$lineF1\n";
    } else {
        print "Line $count mismatch\n";
    }
    $count++;
}

close(FILE1);
close(FILE2);


Can you consider that the codons in the two files are "aligned"? If that's the case, the problem is simple: you load the list of valid transitions in a 2-level hash:

 # of course, you load this from a file...
 $transitions{TTC}{TTT} = 1;
 $transitions{CTG}{UUA} = 1;
 ...

Then, reading both files, line by line (or are they just one string?):

# of course, I'm leaving out all the file manipulation...
my $line1 = <FILE1>;
my $line2 = <FILE2>;

my $maxlen1 = length($line1);
my $maxlen2 = length($line2);
my $i = 0;

while($i < $maxlen1 && $i < $maxlen2){
  my $codon1 = substr($line1, $i, $i+3);
  if(exists($transitions{$codon1}){
    my $codon2 = substr($line2, $i, $i+3);
    if(exists($transitions{$codon1}{$codon2}){
      print "we have a match $codon1 -> $codon2 at index $i\n";
    }
  }
  $i += 3;
}

NOTE use 'exists() instead of defined() as it will save you some extra computation. If you don't want to have nexted if(), you can compute $codon1 and $codon2 and then check for if(exists($transitions{$codon1}{$codon2})) {} Using 'exists' avoids the autovivification problem...

0

精彩评论

暂无评论...
验证码 换一张
取 消