Charset detection in PHP

开发者 https://www.devze.com 2023-02-20 22:16 Source: web
// I've added a new take on this, please see "Cheating PHP integers". Any help will be much appreciated. My idea is to try to hack the storage of the arrays by packing the integers into unsigned bytes (I only need 8- or 16-bit integers, which should reduce the memory use considerably).
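To make the packing idea concrete, here is a minimal sketch that stores small integers in a binary string instead of a PHP array, using the built-in pack(), unpack() and ord(). The helper names (packBytes, readByte, packWords, readWord) are hypothetical and only for illustration, and the little-endian layout for 16-bit values is an arbitrary assumption.

```php
<?php
// Hypothetical helpers: keep small integers in a binary string rather than a PHP array.

/** Pack integers in the range 0..255 into one byte each. */
function packBytes(array $values): string
{
    return pack('C*', ...$values);
}

/** Read the unsigned 8-bit integer stored at position $index. */
function readByte(string $packed, int $index): int
{
    return ord($packed[$index]);
}

/** Pack integers in the range 0..65535 into two bytes each (little-endian, an assumption). */
function packWords(array $values): string
{
    return pack('v*', ...$values);
}

/** Read the unsigned 16-bit integer stored at position $index. */
function readWord(string $packed, int $index): int
{
    // The third argument of unpack() is a byte offset (available since PHP 7.1).
    return unpack('v', $packed, $index * 2)[1];
}
```

Each value then costs one or two bytes of string storage rather than a full array slot, at the price of a function call on every read.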

Hi

I'm currently working on a custom charset detection library. I created a port of Mozilla's charset detection algorithm and used chardet (the Python port) as a reference. However, it is extremely memory intensive in PHP (around 30 MB of memory if I load only the Western language detection). I've optimised all I can without rewriting it from scratch to load each piece on demand (this would reduce memory use but make it a lot slower).

My question is: do you know of any LGPL PHP libraries that do charset detection? This would be purely for research, to point me in the right direction.

I already know of mb_detect_encoding, but it's far too limited and produces far too many false positives with the text files I have (yet Python's chardet detects them perfectly).
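For reference, this is roughly how mb_detect_encoding can be constrained: a minimal sketch, assuming the mbstring extension is loaded, that passes an explicit candidate list and enables strict mode. The wrapper name guessEncoding is hypothetical. It also hints at where false positives come from: a single-byte encoding such as ISO-8859-1 accepts every byte sequence, so once it is in the candidate list the function always finds a "match".

```php
<?php
/**
 * Hypothetical wrapper around mb_detect_encoding().
 * Returns the name of a candidate encoding in which $bytes is valid,
 * or null when none of the candidates fits.
 */
function guessEncoding(string $bytes, array $candidates): ?string
{
    // Strict mode (third argument) rejects candidates in which the input is invalid.
    $result = mb_detect_encoding($bytes, $candidates, true);

    return $result === false ? null : $result;
}

// "\xE9t\xE9" is "été" in ISO-8859-1, but it is not valid UTF-8.
$latin1 = "\xE9t\xE9";
echo guessEncoding($latin1, ['UTF-8', 'ISO-8859-1']) ?? 'unknown', "\n";
```

This only validates byte sequences against each candidate. It does none of the statistical analysis that chardet performs, so it cannot tell one single-byte encoding from another.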


I created a method that converts content to UTF-8 correctly. It was hard to figure out what the current encoding is, so I came to this solution:

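<?php
// Return $content as UTF-8, converting it only when it does not look like valid UTF-8.
function _convert($content) {
    // Valid UTF-8 survives a UTF-8 -> UTF-32 -> UTF-8 round trip unchanged.
    if (!mb_check_encoding($content, 'UTF-8')
        || $content !== mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')) {

        // No source encoding is given, so mb_internal_encoding() is used as the source.
        $content = mb_convert_encoding($content, 'UTF-8');

        if (mb_check_encoding($content, 'UTF-8')) {
            // log('Converted to UTF-8');
        } else {
            // log('Could not convert to UTF-8');
        }
    }
    return $content;
}
?>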

As you can see, I do a round-trip conversion (UTF-8 to UTF-32 and back) to check whether the content is still the same, and convert it if it is not. Maybe you can use some of this code.
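One caveat: when mb_convert_encoding is called without a source encoding, it treats the input as being in mb_internal_encoding(), which may not match the real encoding of the file. A possible variation, sketched below, validates UTF-8 first and otherwise converts from an explicitly named fallback. The function name toUtf8 and the Windows-1252 default are assumptions for illustration, not something detected from the data.

```php
<?php
/**
 * Hypothetical helper: return $content as UTF-8.
 * $fallback is an assumed source encoding, used only when $content is not valid UTF-8.
 */
function toUtf8(string $content, string $fallback = 'Windows-1252'): string
{
    // Already valid UTF-8: leave the bytes untouched.
    if (mb_check_encoding($content, 'UTF-8')) {
        return $content;
    }

    // Otherwise convert from the explicitly named fallback encoding.
    return mb_convert_encoding($content, 'UTF-8', $fallback);
}

// "caf\xE9" is "café" in Windows-1252; the result is the UTF-8 byte sequence for "café".
echo bin2hex(toUtf8("caf\xE9")), "\n";
```

The fallback still has to be chosen by someone who knows where the files come from, so this complements a real detector rather than replacing it.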


First of all, that's an interesting project you are working on! I'm curious how the end product will turn out.

Have you taken a look at the ICU project already?
