Heavy computations analysis/optimization


First of all, I don't have multiplication or division operations for which I could use shifting/adding, overflow multiplication, precalculations, etc. I'm just comparing one N-bit binary number to another, but according to the algorithm the quantity of such comparisons seems to be huge. Here it is:

  1. There is a given sequence of 0's and 1's that is divided into blocks. Let the length of the sequence be S and the length of a block be N, where N is some power of two (4, 8, 16, 32, etc.). The quantity of blocks is n = S/N; no rocket science here.
  2. According to the chosen N, I build the set of all possible N-bit binary numbers, which is a collection of 2^N objects (0 through 2^N - 1).
  3. After this I need to compare each binary number with each block from the source sequence and count how many times each binary number was matched, for example:

    S : 000000001111111100000000111111110000000011111111... (0000000011111111 is repeated 6 times, 16bit x 6 = 96bits overall)

    N : 8

    blocks : {00000000, 11111111, 00000000, 11111111, ...}

    calculations:


// _n = S/N;
// _N2 = Math.Pow(2,N)-1
// S=96, N=8, _n=12, 2^N-1=255 for this specific case
// sourceEpsilons = array of blocks from the input (string[_n])
var X = new int[_n]; // result array of frequencies
for (var i = 0; i < X.Length; i++) X[i] = 0; // setting up

for (ulong l = 0; l <= _N2; l++) // loop from 0 to the max N-bit binary number
{
    var currentl = l.ToBinaryNumberString(_N / 8); // converting the counter to a string: the "current binary number as string"
    var sum = 0; // quantity of occurrences of currentl in the blocks array
    for (long i = 0; i < sourceEpsilons.LongLength; i++)
    {
        if (currentl == sourceEpsilons[i]) sum++; // comparing strings; comparing numbers (longs) takes the same time
    }
    // sum is different each time, != blocks quantity
    for (var j = 0; j < X.Length; j++)
        if (sum - 1 == j) X[j]++; // further processing
}
// result : 00000000 was matched 6 times, 11111111 6 times, X[6]=2. Don't ask me why I need this >_<

Even with a small S I seem to have (2^N)(S/N) iterations, and with N = 64 that grows to 2^64 (past the range of long), so that isn't pretty. I'm sure the loops need to be optimized, and maybe the approach has to change fundamentally (the C# implementation for N = 32 takes 2 hours on a dual-core PC with Parallel.For). Any ideas how to make the above scheme less time- and resource-consuming? It seems like I would have to precompute the binary numbers and get rid of the first loop by reading them from a file and evaluating them against the blocks on the fly, but the file size would be (2^N)*N bytes (((2^N - 1) + 1)*N), which is somehow unacceptable too.


It seems like what you want is a count of how many times each specific block occurred in your sequence; if that's the case, comparing every block to all possible blocks and then tallying is a horrible way to go about it. You're much better off making a dictionary that maps blocks to counts; something like this:

var dict = new Dictionary<int, int>();
for (int j=0; j<blocks_count; j++)
{
    int count;
    if (dict.TryGetValue(block[j], out count)) // block seen before, so increment
    {
        dict[block[j]] = count + 1;
    }
    else // first time seeing this block, so set count to 1
    {
        dict[block[j]] = 1; 
    }
}

After this, the count for any particular block will be in dict[the_block]; if that key doesn't exist, the count is 0.
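
For completeness, a minimal sketch of how the block strings from the question could be fed into such a dictionary and queried afterwards. This assumes sourceEpsilons is the question's array of '0'/'1' block strings and N <= 64 so a block fits in a ulong; the names here are illustrative, not from the original code:

var dict = new Dictionary<ulong, int>();
foreach (var block in sourceEpsilons)        // sourceEpsilons: the block strings from the question
{
    ulong key = Convert.ToUInt64(block, 2);  // parse e.g. "00001111" as a base-2 number
    int count;
    dict[key] = dict.TryGetValue(key, out count) ? count + 1 : 1;
}

// query: how many times did 11111111 occur? (0 if it never appeared)
int times;
ulong wanted = Convert.ToUInt64("11111111", 2);
Console.WriteLine(dict.TryGetValue(wanted, out times) ? times : 0);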


I'm just comparing one n-bit binary number to another

Isn't that what memcmp is for?

You're looping through every possible integer value, and it's taking 2 hours, and you're surprised at this? There's not much you can do to streamline things if you need to iterate that much.


Are you trying to get the number of unique messages in S? For instance, in your given example, for N = 2 you get 2 messages (00 and 11), for N = 4 you get 2 messages (0000 and 1111), and for N = 8 you also get 2 messages (00000000 and 11111111). If that's the case, then the dictionary approach suggested by tzaman is one way to go. Another would be to sort the list first, then run through it and count each message. A third, naive, approach would be to use a sentinel message, all 0's for instance, and run through looking for messages that are not the sentinel. When you find one, destroy all its copies by setting them to the sentinel. For instance:

#include <stdlib.h>
#include <string.h>

int CountMessages(char *S, int SLen, int N) {
    int rslt = 0;
    int i, j;
    char *sentinel;

    sentinel = calloc(N + 1, sizeof(char));

    for (i = 0; i < N; i++)
        sentinel[i] = '0';

    //first, is there a sentinel (all-zeros) message?
    for (i = 0; ((i < SLen) && (rslt == 0)); i += N) {
        if (strncmp(S + i, sentinel, N) == 0)
            rslt++;
    }

    //now destroy the list and count only the unique messages
    for (i = 0; i < SLen; i += N) {
        if (strncmp(S + i, sentinel, N) != 0) { //first instance of a given message
            rslt++;
            for (j = i + N; j < SLen; j += N) { //look for all remaining instances of this message and destroy them
                if (strncmp(S + i, S + j, N) == 0)
                    strncpy(S + j, sentinel, N); //destroy message
            }
        }
    }

    free(sentinel);
    return rslt;
}

The first means using either a pre-written dictionary or writing your own. The second and third destroy the list, meaning you have to use a copy for each 'N' you want to test, but are pretty easy. As for parallelization, the dictionary is the easiest, since you can break the string into as many sections as you have threads, do a dictionary for each, then combine the dictionaries themselves to get the final counts. For the second, I imagine the sort itself can be made parallel easily, then there's a final pass to get the count. The third would require you to do the sentinel-ization on each substring, then redo it on the final recombined string.

Note the big idea here though: rather than looping through all the possible answers, you only loop over all the data!
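
As a rough sketch of the second (sort-then-count) approach in C#, assuming S is a string of '0'/'1' characters, N divides its length, and the usual System / System.Collections.Generic usings are in place; the method name is illustrative:

static Dictionary<string, int> CountBySorting(string S, int N)
{
    // split S into N-character blocks
    var blocks = new List<string>();
    for (int i = 0; i + N <= S.Length; i += N)
        blocks.Add(S.Substring(i, N));

    // sort so equal blocks become adjacent
    blocks.Sort(StringComparer.Ordinal);

    // one pass: count the length of each run of equal neighbours
    var counts = new Dictionary<string, int>();
    int run = 0;
    for (int i = 0; i < blocks.Count; i++)
    {
        run++;
        if (i == blocks.Count - 1 || blocks[i] != blocks[i + 1]) // end of a run
        {
            counts[blocks[i]] = run;
            run = 0;
        }
    }
    return counts;
}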


Instead of a dictionary, you can also use a flat file of 2^N entries, each entry the size of, for example, an integer.

This would be your counting pad. Instead of looping through all possible numbers in a collection and comparing each to the currently viewed block, you iterate through S forward-only, like so:

procedure INITIALIZEFLATFILE is
    allocate 2^N * sizeof(integer) bytes to FLATFILE
end procedure

procedure COUNT is
    while STREAM is not at END
        from FLATFILE at address STREAM.CURRENTVALUE read integer into COUNT
        with FLATFILE at address STREAM.CURRENTVALUE write integer COUNT+1
        increment STREAM
    end while
end procedure

A dictionary is conservative on space in the beginning, and requires a lookup to find the proper entry later on. If you expect all possible integers to show up eventually, you can keep a fixed-size "scorecard" from the get-go.
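
A rough C# sketch of such a counting pad, assuming N is small enough (say N <= 24) for a 2^N-entry array to fit in memory; for larger N the same layout would be backed by a file as described above. The method name is illustrative:

static int[] CountWithPad(string S, int N)
{
    var pad = new int[1 << N];              // one counter per possible N-bit value
    for (int i = 0; i + N <= S.Length; i += N)
    {
        int value = Convert.ToInt32(S.Substring(i, N), 2); // block -> array index
        pad[value]++;                       // single forward pass over S, no inner loop
    }
    return pad;                             // pad[v] == how many blocks equal v
}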
