开发者

What is a null-terminated string?

开发者 https://www.devze.com 2022-12-16 11:23 出处:网络
How does it differ from std::strin开发者_运维技巧g?A \"string\" is really just an array of chars; a null-terminated string is one where a null character \'\\0\' marks the end of the string (not necess

How does it differ from std::strin开发者_运维技巧g?


A "string" is really just an array of chars; a null-terminated string is one where a null character '\0' marks the end of the string (not necessarily the end of the array). All strings in code (delimited by double quotes "") are automatically null-terminated by the compiler.

So for example, "hi" is the same as {'h', 'i', '\0'}.


A null-terminated string is a contiguous sequence of characters, the last one of which has the binary bit pattern all zeros. I'm not sure what you mean by a "usual string", but if you mean std::string, then a std::string is not required (until C++11) to be contiguous, and is not required to have a terminator. Also, a std::string's string data is always allocated and managed by the std::string object that contains it; for a null-terminated string, there is no such container, and you typically refer to and manage such strings using bare pointers.

All of this should really be covered in any decent C++ text book - I recommend getting hold of Accelerated C++, one of the best of them.


There are two main ways to represent a string:

1) A sequence of characters with an ASCII null (nul) character, 0, at the end. You can tell how long it is by searching for the terminator. This is called a null-terminated string, or sometimes nul-terminated.

2) A sequence of characters, plus a separate field (either an integer length, or a pointer to the end of the string), to tell you how long it is.

I'm not sure about "usual string", but what quite often happens is that when talking about a particular language, the word "string" is used to mean the standard representation for that language. So in Java, java.lang.String is a type 2 string, so that's what "string" means. In C, "string" probably means a type 1 string. The standard is quite verbose in order to be precise, but people always want to leave out what's "obvious".

In C++, unfortunately, both types are standard. std::string is a type 2 string[*], but standard library functions inherited from C operate on type 1 strings.

[*] Actually, std::string is often implemented as an array of characters, with a separate length field and a nul terminator. That's so that the c_str() function can be implemented without ever needing to copy or re-allocate the string data. I can't remember off-hand whether it's legal to implement std::string without storing a length field: the question is what complexity guarantees are required by the standard. For containers in general size() is recommended to be O(1), but isn't actually required to be. So even if it is legal, an implementation of std::string that just uses nul-terminators would be surprising.


'\0' 

is an ASCII character with code 0, null terminator, null character, NUL. In C language it serves as a reserved character used to signify the end of a string. Many standard functions such as strcpy, strlen, strcmp among others rely on this. Otherwise, if there was no NUL, another way to signal end of string must have been used:

This allows the string to be any length with only the overhead of one byte; the alternative of storing a count requires either a string length limit of 255 or an overhead of more than one byte.

from wikipedia

C++ std::string follows this other convention and its data is represented by a structure called _Rep:

// _Rep: string representation
      //   Invariants:
      //   1. String really contains _M_length + 1 characters: due to 21.3.4
      //      must be kept null-terminated.
      //   2. _M_capacity >= _M_length
      //      Allocated memory is always (_M_capacity + 1) * sizeof(_CharT).
      //   3. _M_refcount has three states:
      //      -1: leaked, one reference, no ref-copies allowed, non-const.
      //       0: one reference, non-const.
      //     n>0: n + 1 references, operations require a lock, const.
      //   4. All fields==0 is an empty string, given the extra storage
      //      beyond-the-end for a null terminator; thus, the shared
      //      empty string representation needs no constructor.

      struct _Rep_base
      {
    size_type       _M_length;
    size_type       _M_capacity;
    _Atomic_word        _M_refcount;
      };

struct _Rep : _Rep_base
      {
    // Types:
    typedef typename _Alloc::template rebind<char>::other _Raw_bytes_alloc;

    // (Public) Data members:

    // The maximum number of individual char_type elements of an
    // individual string is determined by _S_max_size. This is the
    // value that will be returned by max_size().  (Whereas npos
    // is the maximum number of bytes the allocator can allocate.)
    // If one was to divvy up the theoretical largest size string,
    // with a terminating character and m _CharT elements, it'd
    // look like this:
    // npos = sizeof(_Rep) + (m * sizeof(_CharT)) + sizeof(_CharT)
    // Solving for m:
    // m = ((npos - sizeof(_Rep))/sizeof(CharT)) - 1
    // In addition, this implementation quarters this amount.
    static const size_type  _S_max_size;
    static const _CharT _S_terminal;

    // The following storage is init'd to 0 by the linker, resulting
        // (carefully) in an empty string with one reference.
        static size_type _S_empty_rep_storage[];

        static _Rep&
        _S_empty_rep()
        { 
      // NB: Mild hack to avoid strict-aliasing warnings.  Note that
      // _S_empty_rep_storage is never modified and the punning should
      // be reasonably safe in this case.
      void* __p = reinterpret_cast<void*>(&_S_empty_rep_storage);
      return *reinterpret_cast<_Rep*>(__p);
    }

        bool
    _M_is_leaked() const
        { return this->_M_refcount < 0; }

        bool
    _M_is_shared() const
        { return this->_M_refcount > 0; }

        void
    _M_set_leaked()
        { this->_M_refcount = -1; }

        void
    _M_set_sharable()
        { this->_M_refcount = 0; }

    void
    _M_set_length_and_sharable(size_type __n)
    {
#ifndef _GLIBCXX_FULLY_DYNAMIC_STRING
      if (__builtin_expect(this != &_S_empty_rep(), false))
#endif
        {
          this->_M_set_sharable();  // One reference.
          this->_M_length = __n;
          traits_type::assign(this->_M_refdata()[__n], _S_terminal);
          // grrr. (per 21.3.4)
          // You cannot leave those LWG people alone for a second.
        }
    }

    _CharT*
    _M_refdata() throw()
    { return reinterpret_cast<_CharT*>(this + 1); }

    _CharT*
    _M_grab(const _Alloc& __alloc1, const _Alloc& __alloc2)
    {
      return (!_M_is_leaked() && __alloc1 == __alloc2)
              ? _M_refcopy() : _M_clone(__alloc1);
    }

    // Create & Destroy
    static _Rep*
    _S_create(size_type, size_type, const _Alloc&);

    void
    _M_dispose(const _Alloc& __a)
    {
#ifndef _GLIBCXX_FULLY_DYNAMIC_STRING
      if (__builtin_expect(this != &_S_empty_rep(), false))
#endif
        if (__gnu_cxx::__exchange_and_add_dispatch(&this->_M_refcount,
                               -1) <= 0)
          _M_destroy(__a);
    }  // XXX MT

    void
    _M_destroy(const _Alloc&) throw();

    _CharT*
    _M_refcopy() throw()
    {
#ifndef _GLIBCXX_FULLY_DYNAMIC_STRING
      if (__builtin_expect(this != &_S_empty_rep(), false))
#endif
            __gnu_cxx::__atomic_add_dispatch(&this->_M_refcount, 1);
      return _M_refdata();
    }  // XXX MT

    _CharT*
    _M_clone(const _Alloc&, size_type __res = 0);
      };

the actual data might be obtained with:

_Rep* _M_rep() const
      { return &((reinterpret_cast<_Rep*> (_M_data()))[-1]); }

this code snippet comes from file basic_string.h which on my machine is located in usr/include/c++/4.4/bits/basic_string.h

So as you can see, the difference is significant.


A null-terminated string means that the end of your string is defined through the occurrence of a null-char (all bits are zero).

"Other strings" e.g. have to store their own lenght.


A null-terminated string is a native string format in C. String literals, for example, are implemented as null-terminated. As a result, a whole lot of code (C run-time library to begin with) assumes that strings are null-terminated.


A null terminated string (c-string) is an array of char's, and the last element of the array being a 0x0 value. The std::string is essentially a vector, in that it is an auto-resizing container for values. It does not need a null terminator since it must keep track of size to know when a resize is needed.

Honestly, I prefer c-strings over std ones, they just have more applications in the basic libraries, the ones with minimal code and allocations, and the harder to use because of that.

0

精彩评论

暂无评论...
验证码 换一张
取 消