Compressing strings with common parts

Developer (开发者) https://www.devze.com 2023-04-12 22:03 Source: Internet
I have an application that manages a large number of strings. Strings are in a path-like format and have many common parts, but without a clear rule. They are not paths on the file-system but can be considered like so. I clearly need to optimize memory consumption but without a big performance sacrifice.

I am considering 2 options:

- Implement a compressed_string class that stores data zipped, but I need a fixed dictionary and I can't find a library for this right now. I don't want Huffman coding on bytes; I want it on words.

- Implement some kind of flyweight pattern on string parts.

The problem looks like a common one, and I'm wondering what the best solution is, or whether someone knows a library that targets this issue.

thanks


Although it might be tempting to tune a specific algorithm for your problem, it is likely to require an unreasonable amount of time and effort, while standard compression techniques will immediately provide you a great boost to solve your memory consumption problem.

The "standard" way to handle this issue is to chunk the source data into small blocks (such as 256 KB) and compress each block individually. To access data inside a block, you need to decode that block first. The optimal block size therefore depends on your application: the more the application streams through the data, the larger the blocks can be; the more random the access pattern, the smaller the blocks should be.
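The block layout described above can be sketched as follows. This is a minimal illustration rather than a finished design: `Codec`, `BlockStore`, and all member names are hypothetical, and the compressor is left as a pluggable pair of functions so that the sketch does not depend on any particular library.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical codec interface: any byte-oriented compressor can be plugged in.
struct Codec {
    std::function<std::string(const std::string &)> compress;
    std::function<std::string(const std::string &)> decompress;
};

// Stores strings in size-limited blocks. Each block is compressed on its own,
// so reading one string only requires decoding the block that contains it.
class BlockStore {
  public:
    BlockStore(Codec codec, std::size_t block_bytes)
        : codec_(std::move(codec)), block_bytes_(block_bytes) {}

    // Appends a string and returns its id. Call seal() before reading.
    std::size_t add(const std::string &s) {
        // Start a new block if this string would overflow the open one.
        if (!open_.empty() && open_.size() + s.size() > block_bytes_) seal();
        index_.push_back({blocks_.size(), open_.size(), s.size()});
        open_ += s;
        return index_.size() - 1;
    }

    // Compresses the block currently being filled, if any.
    void seal() {
        if (open_.empty()) return;
        blocks_.push_back(codec_.compress(open_));
        open_.clear();
    }

    // Decodes the containing block, then copies the requested string out of it.
    std::string get(std::size_t id) const {
        assert(id < index_.size());
        const Entry &e = index_[id];
        assert(e.block < blocks_.size());  // the block must have been sealed
        const std::string raw = codec_.decompress(blocks_[e.block]);
        return raw.substr(e.offset, e.length);
    }

    std::size_t block_count() const { return blocks_.size(); }

  private:
    struct Entry { std::size_t block, offset, length; };
    Codec codec_;
    std::size_t block_bytes_;
    std::string open_;                 // uncompressed block being filled
    std::vector<std::string> blocks_;  // compressed blocks
    std::vector<Entry> index_;         // id -> (block, offset, length)
};
```

Every `get` pays for decoding one whole block, which is why the block size trades compression ratio against random-access cost. With LZ4, the two `Codec` functions would presumably wrap `LZ4_compress_default` and `LZ4_decompress_safe`; that wiring is omitted here, and a real version would also need to record each block's uncompressed size.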

If you are worried about compression/decompression speed, use a high-speed algorithm. If decompression speed is the most important metric (for access time), something like LZ4 will give you about 1 GB/s of decoding performance per core, which gives you an idea of how many blocks per second you can decode.
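As a rough back-of-the-envelope check, the two figures quoted in this answer (about 1 GB/s decoding and 256 KB blocks) can be combined as below. The function names are made up for illustration, and the result ignores cache effects and anything else that happens on the access path.

```cpp
#include <cassert>

// Estimated number of blocks one core can decode per second.
double blocks_per_second(double decode_bytes_per_sec, double block_bytes) {
    assert(decode_bytes_per_sec > 0 && block_bytes > 0);
    return decode_bytes_per_sec / block_bytes;
}

// Estimated time, in milliseconds, to decode a single block.
double ms_per_block(double decode_bytes_per_sec, double block_bytes) {
    assert(decode_bytes_per_sec > 0 && block_bytes > 0);
    return 1000.0 * block_bytes / decode_bytes_per_sec;
}
```

Calling `blocks_per_second(1e9, 256.0 * 1024)` gives roughly 3,800 blocks per second, and `ms_per_block` with the same arguments gives about 0.26 ms per block. A random-access read that has to decode a whole block pays that full cost, so halving the block size roughly halves the per-access latency.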

If only decompression speed matters, you may use the high-compression variant LZ4-HC, which improves the compression ratio by about a further 30% while also improving decompression speed. The price is much slower compression.


Strings are in a path-like format and have many common parts, but without a clear rule.

In the sense that they are locators in a hierarchy of the form name, (separator, name)*? If so, you can use interning: store the name parts as char const * elements that point into a pool of strings. That way, you effectively compress a name that is used n times to just over n * sizeof(char const *) + strlen(name) bytes. The full path would become a sequence of interned names, e.g. an std::vector.
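A minimal sketch of this interning scheme, assuming '/' as the separator and a `std::unordered_set<std::string>` as the pool. The names `NamePool`, `InternedPath`, `split_and_intern`, and `join` are illustrative and do not come from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Pool that owns exactly one copy of each distinct name.
class NamePool {
  public:
    // Returns a stable pointer to the pooled copy of `name`.
    char const *intern(const std::string &name) {
        // Elements of std::unordered_set are not moved by rehashing,
        // so the returned pointer stays valid as long as the pool lives.
        return pool_.insert(name).first->c_str();
    }

    std::size_t size() const { return pool_.size(); }

  private:
    std::unordered_set<std::string> pool_;
};

// A path is stored as a sequence of pointers into the pool.
using InternedPath = std::vector<char const *>;

// Splits `path` on `sep` and interns every component.
InternedPath split_and_intern(NamePool &pool, const std::string &path, char sep = '/') {
    InternedPath parts;
    std::size_t start = 0;
    while (start <= path.size()) {
        std::size_t end = path.find(sep, start);
        if (end == std::string::npos) end = path.size();
        parts.push_back(pool.intern(path.substr(start, end - start)));
        start = end + 1;
    }
    return parts;
}

// Rebuilds the original string from its interned components.
std::string join(const InternedPath &parts, char sep = '/') {
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        assert(parts[i] != nullptr);
        if (i != 0) out += sep;
        out += parts[i];
    }
    return out;
}
```

Interning "root/usr/lib" and "root/usr/bin" through the same pool stores "root" and "usr" once each, and the two paths share the pointers for those components. A side benefit is that components can be compared by pointer equality instead of `strcmp`.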

It might seem that sizeof(char const *) is big on 64-bit hardware, but you also save some of the allocation overhead. Or, if you know for some reason that you'll never need more than, say, 65536 strings, you might store them as

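#include <cstdint>
#include <unordered_map>

// Maps a 16-bit index to the interned, null-terminated name.
static std::unordered_map<std::uint16_t, char const *> NAME_TABLE;

class interned_name
{
    std::uint16_t tab_idx;  // 2-byte index instead of a full pointer

  public:
    explicit interned_name(std::uint16_t idx) : tab_idx(idx) {}

    char const *c_str() const
    {
        // at() throws std::out_of_range for an index that is not in the table
        return NAME_TABLE.at(tab_idx);
    }
};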

where NAME_TABLE is a static std::unordered_map<uint16_t, char const *>.
