Trimming UTF8 buffer

Developer https://www.devze.com 2023-03-08 03:01 Source: Internet

I have a buffer with UTF-8 data. I need to remove the leading and trailing spaces. Here is the C code that does it (in place) for an ASCII buffer:




char *trim(char *s)
{
  while( isspace(*s) )
    memmove( s, s+1, strlen(s) );
  while( *s && isspace(s[strlen(s)-1]) )
    s[strlen(s)-1] = 0;
  return s;
}

How do I do the same for a UTF-8 buffer in C/C++?

P.S. Thanks for the performance tip regarding strlen(). Back to the UTF-8 specifics: what if I need to remove all spaces altogether, not only at the beginning and at the end? I may also need to remove all characters with an ASCII code below 32. Is there anything specific to the UTF-8 case here, such as using mbstowcs()?

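Do you want to remove all of the various Unicode spaces too, or just ASCII spaces? In the latter case you don't need to modify the code at all: every byte of a multi-byte UTF-8 sequence is 0x80 or above, so an ASCII space byte can never appear inside another character.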

In any case, the method you're using that repeatedly calls strlen is extremely inefficient. It turns a simple O(n) operation into at least O(n^2).
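If the various Unicode spaces do need to go as well, one option is to recognise their UTF-8 byte sequences directly. The following is only a sketch: the names utf8_space_len and utf8_trim are made up for illustration, the table covers the code points carrying the Unicode White_Space property (U+0009..U+000D, U+0020, U+0085, U+00A0, U+1680, U+2000..U+200A, U+2028, U+2029, U+202F, U+205F, U+3000), and the input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length in bytes of the whitespace character encoded
   at p, or 0 if p does not start one. Relies on short-circuit evaluation so
   it never reads past the terminating NUL. */
static size_t utf8_space_len(const unsigned char *p)
{
  if (p[0] == 0x20 || (p[0] >= 0x09 && p[0] <= 0x0D))
    return 1;                                   /* ASCII space, \t \n \v \f \r */
  if (p[0] == 0xC2 && (p[1] == 0x85 || p[1] == 0xA0))
    return 2;                                   /* U+0085, U+00A0 */
  if (p[0] == 0xE1 && p[1] == 0x9A && p[2] == 0x80)
    return 3;                                   /* U+1680 */
  if (p[0] == 0xE2 && p[1] == 0x80 &&
      ((p[2] >= 0x80 && p[2] <= 0x8A) ||        /* U+2000..U+200A */
       p[2] == 0xA8 || p[2] == 0xA9 ||          /* U+2028, U+2029 */
       p[2] == 0xAF))                           /* U+202F */
    return 3;
  if (p[0] == 0xE2 && p[1] == 0x81 && p[2] == 0x9F)
    return 3;                                   /* U+205F */
  if (p[0] == 0xE3 && p[1] == 0x80 && p[2] == 0x80)
    return 3;                                   /* U+3000 */
  return 0;
}

/* Hypothetical function: trims leading and trailing whitespace in place
   using a single forward pass. */
char *utf8_trim(char *s)
{
  unsigned char *first = (unsigned char *)s;
  unsigned char *p, *end;
  size_t n;

  /* Skip leading whitespace characters. */
  while ((n = utf8_space_len(first)) != 0)
    first += n;

  /* Walk forward; end tracks one past the last non-whitespace byte. */
  for (p = end = first; *p; ) {
    n = utf8_space_len(p);
    if (n)
      p += n;
    else
      end = ++p;
  }

  memmove(s, first, (size_t)(end - first));
  s[end - first] = '\0';
  return s;
}
```

Stepping one byte at a time over non-space characters is safe here because continuation bytes (0x80..0xBF) never match any of the lead bytes tested in utf8_space_len.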

Edit: Here's some code for your updated problem, assuming you only want to strip ASCII spaces and control characters:

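unsigned char *in, *out;  /* in must already point at the start of the NUL-terminated buffer */
for (out = in; *in; in++) if (*in > 32) *out++ = *in;  /* keep only bytes above the space character (32) */
*out = 0;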
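For reference, the same loop can be wrapped into a complete function. This is a minimal sketch; the name strip_ascii_space_ctrl is an assumption made for illustration. It drops every byte in the range 1..32 and keeps everything else, which includes DEL (127) and all bytes of multi-byte UTF-8 sequences, since those are 0x80 or above.

```c
/* Hypothetical function: removes ASCII spaces and control characters
   (bytes 1..32) from a NUL-terminated buffer, in place. */
char *strip_ascii_space_ctrl(char *s)
{
  /* unsigned char so that UTF-8 bytes >= 0x80 compare as greater than 32
     even where plain char is signed. */
  unsigned char *in = (unsigned char *)s;
  unsigned char *out = in;

  for (; *in; in++)
    if (*in > 32)
      *out++ = *in;   /* copy kept bytes down over the removed ones */
  *out = '\0';
  return s;
}
```

The buffer is read and written once, so the whole operation is O(n).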

strlen() scans to the end of the string, so calling it multiple times, as in your code, is very inefficient.

Try looking for the first non-space and the last non-space and then memmove the substring:

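#include <ctype.h>
#include <string.h>

char *trim(char *s)
{
  char *first = s;
  char *end;

  /* Cast to unsigned char: UTF-8 bytes >= 0x80 are negative where char is
     signed, and passing a negative value to isspace() is undefined. */
  while (isspace((unsigned char)*first))
    ++first;

  /* end points one past the last character, so an empty or all-space
     string never produces a pointer before the buffer. */
  end = first + strlen(first);
  while (end > first && isspace((unsigned char)end[-1]))
    --end;

  memmove(s, first, (size_t)(end - first));
  s[end - first] = '\0';

  return s;
}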

Also remember that the code modifies its argument, so it must not be called on a string literal.
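When the buffer must stay untouched (a string literal or other read-only data, for example), a variant can report where the trimmed text starts and how long it is instead of moving bytes. This is only a sketch under that assumption; trim_view is a hypothetical name, and it strips ASCII whitespace only, as classified by isspace() in the default "C" locale.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the first non-space byte of s
   and stores the length of the trimmed text in *len. Nothing is written
   to s, and the result is not NUL-terminated at that length. */
const char *trim_view(const char *s, size_t *len)
{
  const char *end;

  while (isspace((unsigned char)*s))
    ++s;

  end = s + strlen(s);   /* one past the last character */
  while (end > s && isspace((unsigned char)end[-1]))
    --end;

  *len = (size_t)(end - s);
  return s;
}
```

The result can be printed with printf("%.*s", (int)len, p) or copied elsewhere with memcpy.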