As we all know, strings in .NET are immutable. (Well, not 100% totally immutable, but immutable by design and used as such by any reasonable person, anyway.)
This makes it basically OK that, for example, the following code just stores a reference to the same string in two variables:
string x = "shark";
string y = x.Substring(0);
// Proof (requires an unsafe context, i.e. compiling with /unsafe):
fixed (char* c = y)
{
    c[4] = 'p';
}
Console.WriteLine(x);
Console.WriteLine(y);
The above outputs:
sharp
sharp
Clearly x and y refer to the same string object. So here's my question: why wouldn't Substring always share state with the source string? A string is essentially a char* pointer with a length, right? So it seems to me the following should, at least in theory, be allowed to allocate a single block of memory to hold 5 characters, with two variables simply pointing to different locations within that (immutable) block:
string x = "shark";
string y = x.Substring(1);
// Does c[0] point to the same location as x[1]?
fixed (char* c = y)
{
    c[0] = 'p';
}
// Apparently not...
Console.WriteLine(x);
Console.WriteLine(y);
The above outputs:
shark
park
For two reasons:
- The string metadata (e.g. the length) is stored in the same memory block as the characters. Allowing one string to use part of the character data of another would mean allocating two memory blocks for most strings instead of one. As most strings are not substrings of other strings, that extra allocation would cost more memory than you could gain by reusing parts of strings.
- There is an extra NUL character stored after the last character of the string, so that the string is also usable by system functions that expect a null-terminated string. You can't put that extra NUL character after a substring that is part of another string.
I believe C# strings are null-terminated. While this is an implementation detail that shouldn't concern managed consumers, there are some cases (e.g. marshaling) where it's important.
Also, if a substring shared a buffer with a much longer string, a reference to the short substring would prevent the longer string from being collected. It would also open up the possibility of a rat's nest of string references that all refer to the same buffer.
To add to the other answers:
Apparently, the Java standard classes do this: the string returned by String.substring() reuses the internal character array of the original string (source, or look at the JDK sources by Sun). (This applies to older JDKs; since Java 7 update 6, substring() copies the characters instead.)
The problem is that this means that the original String cannot be GCed until all the substrings are eligible for GC as well (as they share the backing character array). This can lead to wasted memory if you start out with a large string, and extract some smaller strings out of it, then discard the big string. That would be common when parsing an input file, for example.
Of course, a clever GC might work around this by copying the character array when it is worth it (the Sun JVM may do this, I don't know), but the added complexity might be a reason not to implement this sharing behaviour at all.
There are a number of ways something like String could be implemented:
- Have a "String" object effectively contain an array, with the implication that all characters in the array are in the string. This is what .NET actually does.
- Have every "String" be a class which contains an array reference along with a starting offset and length. Problem: Creating most strings would require instantiating two objects rather than one.
- Have every "String" be a structure which contains an array reference along with a starting offset and length. Problem: Assignments to string type fields would no longer be atomic.
- Have two or more types of "String" objects--those which contain all the characters in an array, and those which contain a reference to another string along with an offset and length. Problem: This would require many methods of string to be virtual.
- Have every "String" be a special class which includes a starting offset and length, an object reference to what may or may not be the same object, and a built-in array of characters. This would waste a little space in the common case where a string contains its own characters (because every string would carry the extra fields), but would allow the same code to work with strings that contain their own characters or strings that 'borrow' from others.
- Have a general-purpose ImmutableArray<T> type (which would inherit ReadableArray<T>), and have an ImmutableArray<Char> be interchangeable with String. There are many uses for immutable arrays; String is probably the most common usage case, but hardly the only one.
- Have a general-purpose ImmutableArray<T> type as above, but also an ImmutableArraySegment<T> class, both inheriting from ImmutableArrayBase<T>. This would require many methods to be virtual, and would probably be my favorite possibility.
Note that most of these approaches have significant limitations in at least some usage scenarios.
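As a rough illustration of the second option in the list (a class holding an array reference plus a starting offset and length), here is a minimal sketch. SharedString and all of its members are hypothetical names invented for this example and are not part of .NET. It shows why taking a substring becomes cheap under that design, and also why every string then costs two objects (the wrapper and the array) and why a short slice keeps the whole buffer reachable.

```csharp
using System;

// Hypothetical type for illustration only; not part of .NET.
public sealed class SharedString
{
    private readonly char[] _buffer; // backing array, possibly shared with other instances
    private readonly int _offset;    // index of this string's first character in _buffer

    public int Length { get; }

    public SharedString(string source)
        : this(source.ToCharArray(), 0, source.Length) { }

    private SharedString(char[] buffer, int offset, int length)
    {
        _buffer = buffer;
        _offset = offset;
        Length = length;
    }

    public char this[int index]
    {
        get
        {
            if (index < 0 || index >= Length) throw new IndexOutOfRangeException();
            return _buffer[_offset + index];
        }
    }

    // O(1): no characters are copied; the new instance points into the same array.
    public SharedString Substring(int startIndex)
    {
        if (startIndex < 0 || startIndex > Length)
            throw new ArgumentOutOfRangeException(nameof(startIndex));
        return new SharedString(_buffer, _offset + startIndex, Length - startIndex);
    }

    // True when both instances are backed by the same array.
    public bool SharesBufferWith(SharedString other) => ReferenceEquals(_buffer, other._buffer);

    public override string ToString() => new string(_buffer, _offset, Length);
}

public static class SharedStringDemo
{
    public static void Main()
    {
        var x = new SharedString("shark");
        var y = x.Substring(1);
        Console.WriteLine(y);                     // hark
        Console.WriteLine(y.SharesBufferWith(x)); // True
    }
}
```

As long as y is reachable in this sketch, the full five-character array stays alive, even if x itself is no longer referenced.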
I believe these are CLR optimisations that programmers shouldn't rely on, and you shouldn't be mutating strings the way you are doing in the first place. As a programmer, you should assume that you get a new string every time.
After reviewing the Substring method with Reflector, I found that if you pass 0 as the start index (so that the requested length equals the full length), it returns the same object:
[SecurityCritical]
private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
{
    if (((startIndex == 0) && (length == this.Length)) && !fAlwaysCopy)
    {
        return this;
    }
    string str = FastAllocateString(length);
    fixed (char* chRef = &str.m_firstChar)
    {
        fixed (char* chRef2 = &this.m_firstChar)
        {
            wstrcpy(chRef, chRef2 + startIndex, length);
        }
    }
    return str;
}
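The same behaviour can be observed without unsafe code by comparing references. This is a small sketch, assuming a runtime whose Substring has the shortcut shown in the decompiled code; SubstringIdentityDemo and ReturnsSameInstance are names made up for the example, and the result is an implementation detail rather than something to rely on.

```csharp
using System;

public static class SubstringIdentityDemo
{
    // Returns true when Substring handed back the very same object it was called on.
    public static bool ReturnsSameInstance(string s, int startIndex)
        => ReferenceEquals(s, s.Substring(startIndex));

    public static void Main()
    {
        string x = "shark";
        Console.WriteLine(ReturnsSameInstance(x, 0)); // whole string requested
        Console.WriteLine(ReturnsSameInstance(x, 1)); // proper substring requested
    }
}
```

With the implementation decompiled above, the first call should report the same instance (the early "return this" path) and the second a different one (the FastAllocateString path).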
This would add complexity (or at least more smarts) to the intern table. Imagine you already have two entries in the intern table, "pending" and "bending", and the following code:
var x = "pending";
var y = x.Substring(1);
Which entry in the intern table would be considered a hit?
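For comparison, here is a sketch of how the lookup behaves today, where the pool is keyed by the whole value of a string and the question never arises. string.Intern and string.IsInterned are the real BCL methods; InternLookupDemo and IsInPool are hypothetical names used only for this example.

```csharp
using System;

public static class InternLookupDemo
{
    // Looks a string up in the intern pool by its full value.
    // string.IsInterned returns the pooled instance, or null when there is none.
    public static bool IsInPool(string s) => string.IsInterned(s) != null;

    public static void Main()
    {
        string x = string.Intern("pending");
        string.Intern("bending");

        string y = x.Substring(1); // a separate copy of the last six characters

        Console.WriteLine(IsInPool(x));
        Console.WriteLine(IsInPool(y)); // depends on whether that value was interned elsewhere

        // Interning y yields an entry for its own value; neither existing entry is reused.
        string pooled = string.Intern(y);
        Console.WriteLine(ReferenceEquals(pooled, x));
    }
}
```

If substrings shared storage with their source, the pool would have to decide whether y belongs to the buffer of "pending", the buffer of "bending", or an entry of its own.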