开发者

strstr() equivalent in C#

开发者 https://www.devze.com 2023-01-13 00:47 出处:网络
I have two byte[] and I want to find the first occurrence of the second byte[] in the first byte[] (or a range in it).

I have two byte[] and I want to find the first occurrence of the second byte[] in the first byte[] (or a range in it).

I don't want to use strings for efficiency (translating the first byte[] to a string will be inefficient).

Basically I believe that's what strstr() does in C.

What is the best way to do that (so it be efficient and easy to use)?

This is how it should look like:

int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, int bigArrayCount, byte[] smallArray);

Thanks!

UPDATE:

I want a solution that would be more efficient than a simple search. This means that using the fact that comparing buffers can be more efficient should be used - memcmp() is more efficient than iterating over the bytes.

Also, I know there are algorithms that optimize scenario开发者_如何学编程s like this one:

  • big array: "12312351231234"
  • small array: "1231234"
  • Naive algorithm: 7 compares to find that "1231235" is different than "1231234", 2 compares to find the next "1", 4 compares to find that "1235" is different than "1231", 3 compares to find the next "1", 7 compares to find match. A total of 7+2+4+3+7=23 compares.
  • Smart algorithm: 7 compares to find that "1231235" is different than "1231234", directly jumps to the next "1" (without comparing), 4 compares to find that "1235" is different than "1231", directly jumps beyond the "5", 7 compares to find the match. A total of 7+4+7=18 compares.


I don't have any code for you but the name of the fastest solution you will find is the Boyer-Moore algorithm. It can do better than O(n).

Here is an implementation for strings on CodeProject. Looks like a conversion to byte[] should not be too difficult.


int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, 
                               int bigArrayCount, byte[] smallArray)
{
     byte first = smallArray[0];
     bool cont= true;
     while (cont && 
            bigArrayOffset=Array.IndexOf(bigArray, first, bigArrayOffset) != -1)
     {
         if (bigArrayOffset + smallArray.Length > bigArray.Length)
         {
              bigArrayOffset = -1;
              break;
         }
         cont= false;
         for(int i=1; i< smallArray.Length; ++i)
         {
              if (bigArray[bigArrayOffset+i] != smallArray[i])
              { 
                 ++bigArrayOffset;
                 cont = true;
                 break;
              }
         }
     }
     return bigArrayOffset;
}

UPDATED; (hopefully) Fixed problem Henk alerted me to.

UPDATE 2: Addressing update to original question:

int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, 
                               int bigArrayCount, byte[] smallArray)
{
     int bigArrayEnd = Math.Min(bigArrayCount + bigArrayOffset, bigArray.Length)
     byte first = smallArray[0];
     bool cont= true;
     while (cont && 
            bigArrayOffset=Array.IndexOf(bigArray, first, bigArrayOffset) != -1)
     {
         int bookmark = bigArrauOffset + 1;
         bool bookmarkset = false;
         if (bigArrayOffset + smallArray.Length > bigArrayEnd )
         {
              bigArrayOffset = -1;
              break;
         }
         cont= false;
         for(int i=1; i< smallArray.Length; ++i)
         {
              if (!bookmarkset && bigArray[bigArrayOffset+i] == first)
              {
                   bookmark = bigArrayOffset+i;
                   bookmarkset = true;
              }
              if (bigArray[bigArrayOffset+i] != smallArray[i])
              { 
                 bigArrayOffset = bookmark;
                 cont = true;
                 break;
              }
         }
     }
     return bigArrayOffset;
}


In algorithm theory, it is well known that optimizing for speed costs memory and vice versa. My algorithm uses a bit more memory (not much) but in return only scans the big array once.

public static int GetOffsetOfArrayInArray(byte[] bigArray, int bigArrayOffset, int bigArrayCount, byte[] smallArray)
{
    // TODO: Check whether none of the variables are null or out of range.
    if (smallArray.Length == 0)
        return 0;

    List<int> starts = new List<int>();    // Limited number of elements.

    int offset = bigArrayOffset;
    // A single pass through the big array.
    while (offset < bigArrayOffset + bigArrayCount)
    {
        for (int i = 0; i < starts.Count; i++)
        {
            if (bigArray[offset] != smallArray[offset - starts[i]])
            {
                // Remove starts[i] from the list.
                starts.RemoveAt(i);
                i--;
            }
            else if (offset - starts[i] == smallArray.Length - 1)
            {
                // Found a match.
                return starts[i];
            }
        }
        if (bigArray[offset] == smallArray[0] &&
            offset <= (bigArrayOffset + bigArrayCount - smallArray.Length))
        {
            if (smallArray.Length > 1)
                // Add the start to the list.
                starts.Add(offset);
            else
                // Found a match.
                return offset;
        }
        offset++;
    }
    return -1;
}

The list starts is used to track potential start offsets of smallArray in bigArray. It will never contain more elements than the number of occurrences of smallArray[0] in smallArray (which may be calculated in advance to optimize the list and reduce the number of memory reallocations). When there are not enough bytes left in bigArray to contain smallArray, it is not tried, and when smallArray has been found, the algorithm stops. It also stops when the end of bigArray has been reached. Therefor, the worst case running time would be O(1), and the memory usage would be O(1).

Further possible optimizations include using pointers in unsafe code, and replacing the list by a fixed array whose size can be calculated in advance (as noted before). However, because in the list wrong offsets are removed (smaller inner loop) and in an array wrong offsets have to be skipped (fixed size inner loop but possibly faster element access), you'd have to benchmark which one is faster.

It also matters whether you expect smallArray to be large or not. When you do, you could add a check to the while loop which checks whether starts.Length != 0 || offset <= (bigArrayOffset + bigArrayCount - smallArray.Length). Otherwise the loop may stop and no occurrences have been found.


Here's my take on a solution. It's split in two. The first part mainly searches for a potential start. If it finds one it the compares the list from both ends (to lower the loop count which is basically a micro optimization with out a profiler but usually it's faster)

int GetOffsetOfArrayInArray(byte[] bigArray,
                        int bigArrayOffset,
                        int bigArrayCount,
                        byte[] smallArray)
    {
        var length = smallArray.Length;
        var lastPossibleStart = bigArray.Length - length;
        var startByte = smallArray[0];

        for (var first = bigArrayOffset; first < lastPossibleStart; first++)
        {
           if (bigArray[first] == startByte &&
               check(bigArray, smallArray, first, length))
           {
              return first;
           }
        }
        return -1;
    }

    bool check(byte[] bigArray, byte[] smallArray, int first, int length)
    {
        var smallIndex = 0;
        var smallLast = length - 1;
        var last = first + length - 1;
        for (var i = first; smallIndex <= smallLast; i++)
        {
            if (bigArray[i] != smallArray[smallIndex] ||
                 bigArray[last] != smallArray[smallLast])
            {
                return false;
            }
            smallIndex = i - first + 1;
            last--;
            smallLast--;
        }
        return true;
    }
}
0

精彩评论

暂无评论...
验证码 换一张
取 消