Danger of C# Substring method?_问答_开发者_运维开发者技术经验分享

Recently I have been reading up on some of the flaws with the Java substring method - specifically relating to memory, and how java keeps a reference to the original string. Ironically I am also developing a server application that uses C# .Net's implementation of substring many tens of times in开发者_Go百科 a second. That got me thinking...

Are there memory issues with the C# (.Net) string.Substring?
What is the performance like on string.Substring? Is there a faster way to split a string based on start/end position?

Looking at .NET's implementation of String.Substring, a substring does not share memory with the original.

private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
{
    if (((startIndex == 0) && (length == this.Length)) && !fAlwaysCopy)
    {
        return this;
    }

    // Allocate new (separate) string
    string str = FastAllocateString(length);

    // Copy chars from old string to new string
    fixed (char* chRef = &str.m_firstChar)
    {
        fixed (char* chRef2 = &this.m_firstChar)
        {
            wstrcpy(chRef, chRef2 + startIndex, length);
        }
    }
    return str;
}

Every time you use substring you create a new string instance - it has to copy the character from the old string to the new, along with the associated new memory allocation — and don't forget that these are unicode characters. This may or not be a bad thing - at some point you want to use these characters somewhere anyway. Depending on what you're doing, you might want your own method that merely finds the proper indexes within the string that you can then use later.

Just to add another perspective on this.

Out of memory (most times) does not mean you've used up all the memory. It means that your memory has been fragmented and the next time you want to allocate a chunk the system is unable to find a contiguous chunk of memory to fit your needs.

Frequent allocations/deallocations will cause memory fragmentation. The GC may not be in a position to de-fragment in time sue to the kinds of operations you do. I know the Server GC in .NET is pretty good about de-fragmenting memory but you could always starve (preventing the GC from doing a collect) the system by writing bad code.

it is always good to try it out & measure the elapsed milliseconds.

Stopwatch watch = new Stopwatch();
watch.Start();
// run string.Substirng code
watch.Stop();
watch.ElapsedMilliseconds();

In the case of the Java memory leak one may experience when using subString, it's easily fixed by instantiating a new String object with the copy constructor (that is a call of the form "new String(String)"). By using that you can discard all references to the original (and in the case that this is actually an issue, rather large) String, and maintain only the parts of it you need in memory.

Not ideal, in theory the JVM could be more clever and compress the String object (as was suggested above), but this gets the job done with what we have now.

As for C#, as has been said, this problem doesn't exist.

The CLR (hence C#'s) implementation of Substring does not retain a reference to the source string, so it does not have the "memory leak" problem of Java strings.

most of these type of string issues are because String is immutable. The StringBuilder class is intended for when you are doing a lot of string manipulations:

http://msdn.microsoft.com/en-us/library/2839d5h5(VS.71).aspx

Note that the real issue is memory allocation rather than CPU, although excessive memory alloc does take CPU...

I seem to recall that the strings in Java were stored as the actual characters along with a start and length.

This means that a substring string can share the same characters (since they're immutable) and only have to maintain a separate start and length.

So I'm not entirely certain what your memory issues are with the Java strings.

Regarding that article posted in your edit, it seems a bit of a non-issue to me.

Unless you're in the habit of making huge strings, then taking a small substring of them and leaving those lying around, this will have near-zero impact on memory.

Even if you had a 10M string and you made 400 substrings, you're only using that 10M for the underlying char array - it's not making 400 copies of that substring. The only memory impact is the start/length bit of each substring object.

The author seems to be complaining that they read a huge string into memory then only wanted a bit of it, but the entire thing was kept - my suggestion would be they they might want to rethink how they process their data :-)

To call this a Java bug is a huge stretch as well. A bug is something that doesn't work to specification. This was a deliberate design decision to improve performance, running out of memory because you don't understand how things work is not a bug, IMNSHO. And it's definitely not a memory leak.

There was one possible good suggestion in the comments to that article, that the GC could more aggressively recover bits of unused strings by compressing them.

This is not something you'd want to do on a first pass GC since it would be relatively expensive. However, where every other GC operation had failed to reclaim enough space, you could do it.

Unfortunately it would almost certainly mean that the underlying char array would need to keep a record of all the string objects that referenced it, so it could both figure out what bits were unused and modify all the string object start and length fields.

This in itself may introduce unacceptable performance impacts and, on top of that, if your memory is so short for this to be a problem, you may not even be able to allocate enough space for a smaller version of the string.

I think, if the memory's running out, I'd probably prefer not to be maintaining this char-array-to-string mapping to make this level of GC possible, instead I would prefer that memory to be used for my strings.

Since there is a perfectly acceptable workaround, and good coders should know about the foibles of their language of choice, I suspect the author is right - it won't be fixed.

Not because the Java developers are too lazy, but because it's not a problem.

You're free to implement your own string methods which match the C# ones (which don't share the underlying data except in certain limited scenarios). This will fix your memory problems but at the cost of a performance hit, since you have to copy the data every time you call substring. As with most things in IT (and life), it's a trade-off.

For profiling memory while developing you can use this code:

bool forceFullCollection = false;

Int64 valTotalMemoryBefore = System.GC.GetTotalMemory(forceFullCollection);

//call String.Substring

Int64 valTotalMemoryAfter = System.GC.GetTotalMemory(forceFullCollection);

Int64 valDifferenceMemorySize = valTotalMemoryAfter - valTotalMemoryBefore;

About parameter forceFullCollection: "If the forceFullCollection parameter is true, this method waits a short interval before returning while the system collects garbage and finalizes objects. The duration of the interval is an internally specified limit determined by the number of garbage collection cycles completed and the change in the amount of memory recovered between cycles. The garbage collector does not guarantee that all inaccessible memory is collected." GC.GetTotalMemory Method

Good luck!;)