开发者

Fast algorithms for finding unique sets in two very long sequences of text

开发者 https://www.devze.com 2023-01-13 02:13 出处:网络
I need to compare the DNA sequences of X and Y chromosomes, and find patterns (composed of around 50-75 base pairs) that are unique to the Y chromosome. Note that these sequence parts can repeat in th

I need to compare the DNA sequences of X and Y chromosomes, and find patterns (composed of around 50-75 base pairs) that are unique to the Y chromosome. Note that these sequence parts can repeat in the chromosome. This needs to be done quickly (BLAST takes 47 days, need a few hours or less). Are there any algorithms or programs in particular suited to this kind of comparison? Again, speed is the key here.

One of t开发者_开发百科he reasons I put this on SO was to get perspective from people outside the specific application domain, who can put forth algorithms they use in string comparison in their daily use, that may apply to our use. So don't be shy!


  1. Build a suffix tree S on sequence X.
  2. For each starting position i in sequence Y, look for the string Y[i..i+75] in S. If no match can be found starting at position i (i.e. if lookup fails after j < 75 nucleotides matched) then you have found a length-j string unique to Y.
  3. The smallest such string over all starting positions i is the shortest unique string (or just halt after you find any such string if you aren't interested in minimising the length).

Total time: O(|X| + m|Y|) where m is the maximum string length (e.g. m = 75).

There are probably even more efficient algorithms based on generalised suffix trees.


I am assuming that you have a single X and a single Y to compare. Concatenate them, separated by a marker character that does not appear in either, to form e.g. XoY. Now form the http://en.wikipedia.org/wiki/Suffix_array in linear time.

What you get is an array of pointers to positions in the concatenated string, where the pointers are arranged so that the substrings they point to appear in alphabetical order in the array. You also get an LCP array, giving the length of the longest common prefix shared between a suffix and the suffix directly before it in the array, which is the suffix that sorts just less than it. This is in fact the longest common prefix shared between that position and ANY substring less than it, because anything with a longer common prefix and less than string[i] would sort between it and the current string[i - 1].

You can tell which original string a pointer points into from its position, because X comes before Y. You can cut the array up into alternating sub-sections of X and Y pointers. The length of the common prefix shared between pos[i] and pos[i - 1] is lcp[i]. The length of the prefix shared between pos[i] and pos[i-2] is min(lcp[i], lcp[i-1]). So if you start at the Y value just before a range of Xs you can work out the number of characters of prefix between that Y and all of the Xs in turn by stepping down the section, doing a min operation at each step. Similarly you can work out the number of characters of prefix shared between all of those Xs and the Y that appears next in the suffix array at the cost of one min per X. Ditto, of course for the Y ranges. Now do a max per entry to work out the longest prefix shared between each position in X (or Y) and any position in Y (or X).

I think you want the substrings within either X or Y which have small prefixes shared between it and any other substring of the other sex, because the strings one character longer than this starting from the same position do not appear in the other sex at all. I think once you have done the min() calculations above you can extract the N smallest prefix substrings using a heap to keep track of the N smallest entries. I think everything here takes time linear in |X| + |Y| (unless N is comparable to |X| or |Y|).


This paper might have some alternatives for adapting BLAST to improve its performance (by sub-dividing the problem space AFAIKS).


I have an interesting answer, it will be technological one. Main idea is that comparisons of sub-sequences should be done on GPU, because GPU of modern video cards is highly parallel processing environment (like small supercomputer). So we can encode base pair as one pixel, given that X chromosome is 154 million pairs- we get an image for X chromosome that consists of 154 million pixels - this image size will be about 500 MB. For Y chromosome we get an image of 160 MB in size. So these two (500+160) MB images could be handled by descent video card very effectively. (Just get a video card with >= 1 GB of video ram).

Next step is to write GPU program, maybe with the help of Pixel Shader or Cuda or OpenCL

GPU program would be simple - basically it will compare 50-75 neighboring pixels in Y chromosome image to all pixels of X chromosome image. So this GPU program should have maximum of 75*154 million operations, which will be computed on modern GPU in an HOUR or so. Because all sub-sequences of Y will be tested in parallel.

hope that helps

0

精彩评论

暂无评论...
验证码 换一张
取 消