开发者

javascript and string manipulation w/ utf-16 surrogate pairs

开发者 https://www.devze.com 2023-03-23 20:39 出处:网络
I\'m working on a twitter app and just stumbled into the world of utf-8(16). It seems the majority of javascript string functions are as blind to surrogate pairs as I was. I\'ve got to recode some stu

I'm working on a twitter app and just stumbled into the world of utf-8(16). It seems the majority of javascript string functions are as blind to surrogate pairs as I was. I've got to recode some stuff to make it wide character aware.

I've got this function to parse strings into arrays while preserving the surrogate pairs. Then I'll recode several functions to deal with the arrays rather than strings.

function sortSurrogates(str){
  var cp = [];                 // array to hold code points
  while(str.length){           // loop till we've done the whole string
    if(/[\uD800-\uDFFF]/.test(str.substr(0,1))){ // test the first character
                               // High surrogate found low surrogate follows
      cp.push(str.substr(0,2)); // push the two onto array
      str = str.substr(2);     // clip the two off the string
    }else{                     // else BMP code point
      cp.push(str.substr(0,1)); // push one onto array
      str = str.substr(1);     // clip one from string 
    }
  }                            // loop
  return cp;                   // return the array
}

My question is, is there something simpler I'm missing? I see so many people reiterating that javascript deals with utf-16 natively, yet my testing leads me to believe, that may be the data format, but the functions don't know it yet. Am I missing something simple?

EDIT: To help illustrate the issue:

var a = "0123456789"; // U+0030 - U+003开发者_JAVA百科9 2 bytes each
var b = "
0

精彩评论

暂无评论...
验证码 换一张
取 消