Manipulating very large binary strings in c#_问答_开发者

I'm working on a gene开发者_运维百科tic algorithm project where I encode my chromosome in a binary string where each 32 bits represents a value. The problem is that the functions I'm optimizing has over 3000 parameters which implies that I have over 96000 bits in my bit string and the manipulations i do on this are simply to slow...

I have proceeded as following: I have a binary class where I'm creating a boolean array that represents my big binary string. Then I manipulate this binary string with various shifts and moves and such.

My question is, is there a better way to do this? The speed is just killing. I'm sure it would be fine if i only did this on one bit string but i have to do the manipulations on 25 bit strings for way over 1000 generations.

What I would do is take a step back. Your analysis seems to be wedded to an implementation detail, namely that you have chosen bool[] as how you represent a string of bits.

Clear your mind of bools and arrays and make a complete list of the operations you actually need to perform, how frequently they happen, and how fast they have to be. Ideally consider whether your speed requirement is average speed or worst case speed. (There are many data structures that attain high average speed by having one expensive operation for every thousand cheap operations; if having any expensive operations is unacceptable then you need to know that up front.)

Once you have that list, you can then do research on what data structures work well.

For example, suppose your list of operations is:

construct bit sequences on the order of 32 bits
concatenate on the order of 3000 bit sequences together to form new bit sequences
insert new bit sequences into existing long bit sequences at specific locations, quickly

Given just that list of operations, I'd think that the data structure you want is a catenable deque. Catenable deques support fast insertion on either end, and can be broken up into two deques efficiently. Inserting stuff into the middle of a deque is then easily done: break the deque up, insert the item into the end of one half, and join them back up again.

However, if you then add another operation to the problem, say, "search for a particular bit string anywhere in the 90000-bit sequence, and find the result in sublinear time" then just a catenable deque isn't going to do it. Searching a deque is slow. There are other data structures that support that operation.

If I understood correctly you are encoding the bit array in a bool[]. The first obvious optimisation would be to change this to int[] (or even long[]) and take advantage of bit operations on a whole machine word, where possible.

For example, this would make shifts more efficient by ~ a factor 4.

Is the BitArray class no help?

A BitArray would probably be faster than a boolean array but you would still not get built-in support to shift 96000 bits.

GPUs are extremely good at massive bit operations. Maybe Brahma, CUDA.NET, or Accelerator could be of service?

Let me understand; currently, you're using a sequence of 32-bit values for a "chromosome". Are we talking about DNA chromosomes or neuroevolutionary algorithmic chromosomes?

If it's DNA, you deal with 4 values; A,C,G,T. That can be coded in 2 bits, making a byte able to hold 4 values. Your 3000-element chromosome sequence can be stored in a 750-element byte array; that's nothing, really.

Your two most expensive operations are to and from the compressed bitstream. I would recommend a byte-keyed enum:

public enum DnaMarker : byte { A, C, G, T };

Then, you go from 4 of these to a byte with one operation:

public static byte ToByteCode(this DnaMarker[] markers)
{
   byte output = 0;
   for(byte i=0;i<4;i++)
      output = (output << 2) + (byte)markers[i];
}

... and parse them back out with something like this:

public static DnaMarker[] ToMarkers(this byte input)
{
   var result = new byte[4];
   for(byte i=0;i<4;i++)
       result[i] = (DnaMarker)(input - (input >> (2*(i+1))));

   return result;
}

You might see a slight performance increase using four parameters (output if necessary) versus allocating and using an array in the heap. But, you lose the iteration which makes the code more compact.

Now, because you're packing them into four-byte "blocks", if your sequence length isn't always an exact multiple of four you'll end up "padding" the end of your block with zero values (A). Working around this is messy, but if you had a 32-bit integer that told you the exact number of markers, you can simply discard anything more you find in the bytestream.

From here, possibilities are endless; you can convert the enum array to a string by simply calling ToString() on each one, and likewise you can feed in a string and get an enum array by iterating through using Enum.Parse().

And always remember, unless memory is at a premium (usually it isn't), it's almost always faster to deal with the data in an easily-usable format instead of the most compact format. The one big exception is in network transmission; if you had to send 750 bytes vs 12KB over the Internet, there's an obvious advantage in the smaller size.