
Character decoding Conversion Function Implementation

Developer https://www.devze.com 2022-12-11 11:37 Source: Internet
I need to implement a character encoding conversion function in C++ or C (C preferred) that converts from a custom encoding scheme (designed to support multiple languages in a single encoding) to UTF-8.

Our encoding is pretty random; it looks like this:

Because of the randomness of this mapping, I am thinking of using std::map to map our encoding to UTF-8 and vice versa, with two separate maps, and using these maps for the conversion. Is there an optimized data structure or a better way to do it?


If your code points are contiguous, just make a big char * array and translate using that. I don't really understand what you mean by UTF-8 codepoint. UTF-8 has representations, and Unicode has codepoints. If you want code points, use an array of ints.

const int mycode_to_unicode [] = {
   0x00ff,
   0x0102,
   // etc.
 };

You could put a value like -1 if there are holes in your encoding to catch errors.
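As a rough sketch of the forward direction, the lookup can be combined with a small UTF-8 encoder: index the table with the custom code, reject holes marked with -1, then turn the resulting Unicode code point into its UTF-8 byte sequence. The table contents and the names `demo_to_unicode`, `utf8_encode` and `mycode_to_utf8` are made up for illustration and are not the asker's real mapping. The code also assumes `int` is at least 32 bits wide.

```c
#include <stddef.h>

/* Hypothetical 4-entry table: index = custom code, value = Unicode code
   point, -1 marks a hole in the custom encoding. */
static const int demo_to_unicode[] = { 0x00ff, 0x0102, -1, 0x20ac };
#define DEMO_COUNT (sizeof demo_to_unicode / sizeof demo_to_unicode[0])

/* Encode one code point as UTF-8 into out (room for at least 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(int cp, unsigned char *out)
{
    if (cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
        return 0;                       /* out of range or surrogate */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    }
    out[0] = (unsigned char)(0xf0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
    out[3] = (unsigned char)(0x80 | (cp & 0x3f));
    return 4;
}

/* Convert one custom code to UTF-8.
   Returns the number of bytes written, or 0 if the code is unmapped. */
size_t mycode_to_utf8(unsigned code, unsigned char *out)
{
    if (code >= DEMO_COUNT || demo_to_unicode[code] < 0)
        return 0;                       /* past the table or a hole */
    return utf8_encode(demo_to_unicode[code], out);
}
```

Calling `mycode_to_utf8` once per input code and appending the returned bytes to an output buffer would convert a whole string.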

Going the other way just means making an array of structs of the same size, something like

struct {
   int mycode;
   int unicode;
};

copying the keys of the array into mycode and the values into unicode, and running it through qsort with a function which compares the values of unicode, then using bsearch with the same function to go from code point to your encoding.

This is assuming you want to use C.
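The reverse lookup described above might look roughly like this. The forward table `fwd_table` and the names `build_reverse_table` and `unicode_to_mycode` are placeholders chosen for the sketch. The sample table has no holes; with a real table, entries marked -1 would need to be skipped when filling the reverse array.

```c
#include <stdlib.h>

/* Hypothetical forward table: index = custom code, value = code point. */
static const int fwd_table[] = { 0x0102, 0x00ff, 0x20ac, 0x0041 };
#define FWD_COUNT (sizeof fwd_table / sizeof fwd_table[0])

struct pair {
    int mycode;
    int unicode;
};

static struct pair rev_table[FWD_COUNT];

/* Order pairs by Unicode code point; shared by qsort and bsearch. */
static int compare_unicode(const void *a, const void *b)
{
    const struct pair *pa = a;
    const struct pair *pb = b;
    return (pa->unicode > pb->unicode) - (pa->unicode < pb->unicode);
}

/* Copy the forward table into pairs and sort them once at startup. */
void build_reverse_table(void)
{
    for (size_t i = 0; i < FWD_COUNT; i++) {
        rev_table[i].mycode = (int)i;
        rev_table[i].unicode = fwd_table[i];
    }
    qsort(rev_table, FWD_COUNT, sizeof rev_table[0], compare_unicode);
}

/* Return the custom code for a code point, or -1 if it is not mapped. */
int unicode_to_mycode(int cp)
{
    struct pair key;
    const struct pair *hit;

    key.mycode = 0;                     /* ignored by the comparison */
    key.unicode = cp;
    hit = bsearch(&key, rev_table, FWD_COUNT, sizeof rev_table[0],
                  compare_unicode);
    return hit ? hit->mycode : -1;
}
```

Each lookup costs O(log n) comparisons, and the sort happens only once, so this trades a little speed for a table that needs no more memory than the mapping itself.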


A hash table would surely be the fastest solution.

If the table is known upfront and never changes (as I understand is the case here), you can determine a perfect hash for it, meaning that you will have no collisions and guaranteed constant retrieval time (possibly at the expense of some space).

I've used gperf a couple of times, but I suggest you check Bob Jenkins' great page on hashing (and on minimal perfect hashing as well).
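To illustrate the idea of a perfect hash for a fixed key set, here is a toy sketch that searches by brute force for a multiplier under which every key lands in its own slot. This is not how gperf or Bob Jenkins' tools work internally; it only shows the "no collisions, one probe per lookup" property. The key set, the slot count and all names (`find_perfect_hash`, `perfect_lookup`, and so on) are assumptions for the example.

```c
#include <stdint.h>

#define SLOT_BITS 3
#define SLOT_COUNT (1u << SLOT_BITS)    /* 8 slots for 5 keys */
#define KEY_COUNT 5

/* Hypothetical fixed mapping: Unicode code point -> custom code. */
static const uint32_t keys[KEY_COUNT]  = { 0x00ff, 0x0102, 0x20ac, 0x0041, 0x4e2d };
static const int      codes[KEY_COUNT] = { 0, 1, 2, 3, 4 };

static uint32_t multiplier;             /* set by find_perfect_hash() */
static int slot_to_key[SLOT_COUNT];     /* index into keys[], or -1 if empty */

/* Multiplicative hash: keep the top SLOT_BITS bits of the 32-bit product. */
static uint32_t hash_slot(uint32_t key, uint32_t mult)
{
    return (uint32_t)(key * mult) >> (32 - SLOT_BITS);
}

/* Try candidate multipliers until every key gets its own slot.
   Returns 1 on success, 0 if none of the candidates works. */
int find_perfect_hash(void)
{
    for (uint32_t t = 1; t <= 100000u; t++) {
        uint32_t mult = (t * 0x9e3779b1u) | 1u;   /* scrambled odd candidate */
        int collision = 0;

        for (unsigned s = 0; s < SLOT_COUNT; s++)
            slot_to_key[s] = -1;
        for (int i = 0; i < KEY_COUNT && !collision; i++) {
            uint32_t s = hash_slot(keys[i], mult);
            if (slot_to_key[s] != -1)
                collision = 1;          /* two keys share a slot, try next */
            else
                slot_to_key[s] = i;
        }
        if (!collision) {
            multiplier = mult;
            return 1;
        }
    }
    return 0;
}

/* One hash, one comparison: return the custom code or -1 if unmapped. */
int perfect_lookup(uint32_t cp)
{
    int i = slot_to_key[hash_slot(cp, multiplier)];
    return (i >= 0 && keys[i] == cp) ? codes[i] : -1;
}
```

In practice the search would run offline and the resulting multiplier and slot table would be compiled in as constants, which is essentially the workflow a generator like gperf automates.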


As you build the constant mappings upfront and use them only for lookups, a hash table might be a better fit than std::map. The C++ standard library had no hash table before C++11 (which added std::unordered_map), but many free implementations are available, both in C and C++.

These are C implementations:

http://www.cl.cam.ac.uk/~cwc22/hashtable/

http://wiki.portugal-a-programar.org/c:snippet:hash_table_c

Glibc hash tables.
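For a sense of what such a C hash table boils down to for this use case, here is a minimal open-addressing sketch with linear probing that maps Unicode code points to custom codes. It is not taken from any of the libraries listed above; the table size, the hash constant and the names `table_init`, `table_put` and `table_get` are assumptions for the example.

```c
#include <stdint.h>

#define TABLE_SIZE 16u                  /* power of two, larger than the key count */
#define EMPTY 0xffffffffu               /* not a valid code point, marks a free slot */

struct entry {
    uint32_t unicode;                   /* key */
    int mycode;                         /* value */
};

static struct entry table[TABLE_SIZE];

/* Multiplicative hash; the shift by 28 keeps 4 bits, matching 16 slots. */
static uint32_t slot_for(uint32_t key)
{
    return (uint32_t)(key * 0x9e3779b1u) >> 28;
}

/* Mark every slot as free. Call once before inserting. */
void table_init(void)
{
    for (uint32_t i = 0; i < TABLE_SIZE; i++)
        table[i].unicode = EMPTY;
}

/* Insert or overwrite a pair. The caller must keep at least one slot free. */
void table_put(uint32_t unicode, int mycode)
{
    uint32_t i = slot_for(unicode);
    while (table[i].unicode != EMPTY && table[i].unicode != unicode)
        i = (i + 1) & (TABLE_SIZE - 1); /* linear probing */
    table[i].unicode = unicode;
    table[i].mycode = mycode;
}

/* Return the custom code, or -1 if the code point is not in the table. */
int table_get(uint32_t unicode)
{
    uint32_t i = slot_for(unicode);
    while (table[i].unicode != EMPTY) {
        if (table[i].unicode == unicode)
            return table[i].mycode;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return -1;
}
```

Because the mapping never changes after startup, the table needs no deletion or resizing logic, which keeps the code this short.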


Not sure if I understand the question, but if the 1:1 mapping is not too big, using a preinitialized struct array may be the way to go (depending on the code, you could write a program that emits the contents of the init table once):

struct MAP { int from, to; };

struct MAP somemapping[MAXMAP]= {
    { 0x101,  0x01 },
    { 0x102,  0x02 },
    /* etc. */
};

Using bsearch() would be a reasonably quick way to do lookups, provided the table is kept sorted by the from field.

If the code is extremely performance sensitive, you could build an index-based lookup table:

int lookup[65536];


/* build the lookup table once, before the first conversion */
void init(void)
{
  for (int i= 0; i<MAXMAP; i++) {
     lookup[somemapping[i].from]= somemapping[i].to;
  }
}



void foo(void)
{
  ....
   /* quick lookup; from must be in the range 0..65535 */
  to= lookup[from];
  ....
}
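One caveat with the plain array is that unmapped entries read back as 0 and an out-of-range index reads past the array. A slightly more defensive variant, sketched here with made-up sample data and hypothetical names (`lookup_init`, `lookup_checked`, `UNMAPPED`), fills the table with a sentinel first and checks the index on every lookup.

```c
#include <stddef.h>

#define LOOKUP_SIZE 65536
#define UNMAPPED (-1)

struct map_entry { int from, to; };

/* Hypothetical sample data standing in for the real mapping. */
static const struct map_entry sample_map[] = {
    { 0x101, 0x01 },
    { 0x102, 0x02 },
};
#define SAMPLE_COUNT (sizeof sample_map / sizeof sample_map[0])

static int lookup_table[LOOKUP_SIZE];

/* Fill the table with the sentinel, then copy in the known pairs. */
void lookup_init(void)
{
    for (size_t i = 0; i < LOOKUP_SIZE; i++)
        lookup_table[i] = UNMAPPED;
    for (size_t i = 0; i < SAMPLE_COUNT; i++) {
        int from = sample_map[i].from;
        if (from >= 0 && from < LOOKUP_SIZE)
            lookup_table[from] = sample_map[i].to;
    }
}

/* Return the mapped value, or UNMAPPED for holes and out-of-range input. */
int lookup_checked(int from)
{
    if (from < 0 || from >= LOOKUP_SIZE)
        return UNMAPPED;
    return lookup_table[from];
}
```

The bounds check costs one comparison per lookup; if the input is already known to be a 16-bit value, it could be dropped.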