
Convert encoding via iconv on Linux

开发者 https://www.devze.com 2023-02-03 05:02 Source: web

I usually convert encodings with iconv, but today I ran into something new to me.

I made a test case to make my question clear.

The goal is to convert &#1575;&#1604;&#1581;&#1604;&#1602;&#1577; &#1575;&#1604;&#1579;&#1575;&#1604;&#1579;&#1577; to its UTF-8 version: الحلقة الثالثة

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title> this text is from arabic language   </title>
</head>
<body>
<p><span> &#1575;&#1604;&#1581;&#1604;&#1602;&#1577; &#1575;&#1604;&#1579;&#1575;&#1604;&#1579;&#1577;</span></p>
</body>
</html>

I tried source encodings such as ASCII, LATIN1 and windows-1252, but with no luck. How do I tell what kind of encoding this is so that I can convert it? Both Google Translate and the Stack Overflow editor were able to detect it and convert it.

Another example: the website http://kanjidict.stc.cx/recode.php converts the text correctly if I check "Assume HTML" (default: handle as plain text).

What am I missing, and what do those three websites do to convert it correctly?


Well,

after a day of work I found the command I was missing. It comes from a package I had installed called ascii2uni.

Install it with: sudo apt-get install ascii2uni

After some testing I was able to convert a file to Unicode with this command (the -a D option selects decimal HTML numeric character references):

ascii2uni -a D source.html > target.html

So the whole conversion can be done from the command line.

Cheers
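
If ascii2uni is not available, roughly the same file conversion can be sketched in Python. This is only an illustration, not what ascii2uni does internally: the names decode_decimal_refs and convert_file are made up for this example, the file names are the ones from the command above, and the source file is assumed to be plain ASCII. References to ASCII characters (such as &#60; for "<") are left alone so the HTML markup is not changed.

```python
import re
from pathlib import Path

# Matches decimal references such as &#1575; (hex forms like &#x627; are not handled here).
DECIMAL_REF = re.compile(r"&#(\d+);")


def decode_decimal_refs(text):
    """Replace decimal references to non-ASCII characters with the characters themselves."""
    def repl(match):
        code = int(match.group(1))
        # Leave ASCII references (e.g. &#60;), surrogates and out-of-range numbers untouched.
        if code < 128 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)
    return DECIMAL_REF.sub(repl, text)


def convert_file(source, target, source_encoding="ascii"):
    """Read `source`, decode its decimal references, and write `target` as UTF-8.

    source_encoding="ascii" is an assumption; pass the real encoding if the
    file contains non-ASCII bytes.
    """
    text = Path(source).read_text(encoding=source_encoding)
    Path(target).write_text(decode_decimal_refs(text), encoding="utf-8")
```

Calling convert_file("source.html", "target.html") would then play the role of the ascii2uni command. The charset declared in the meta tag (windows-1252 in the test case) is not touched by either approach and should be changed to utf-8 afterwards, because the output now contains raw UTF-8 text.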


The idea is string substitution. Here it is in Python 3.

Parse decimal references only:

>>> import re
>>> s = r'&#65;&#223;&#254;'
>>> r = re.compile(r'&#(\d+);')
>>> r.sub(lambda m:chr(int(m.group(1))), s)
'Aßþ'

Parse hex and decimal references:

>>> import re
>>> s = r'&#x41;&#223;&#xFE;'
>>> # group 1 is 'x' or 'X' for a hex reference and empty for a decimal one
>>> r = re.compile(r'&#([xX]?)([0-9a-fA-F]+);')
>>> r.sub(lambda m: chr(int(m.group(2), 16 if m.group(1) else 10)), s)
'Aßþ'


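Those sequences are HTML numeric character references: each number is the decimal Unicode code point of one character. The file itself is plain ASCII, which is why changing the source encoding in iconv makes no difference. Most languages have functions in their URL and HTML handling libraries that decode these references; which one to use depends on the language you are working in.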
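
As a small illustration in Python, the standard library function html.unescape decodes these references. The sketch below uses the string from the question; the variable names are just for this example. It also shows that the input is pure ASCII, so reading it as latin-1 or windows-1252 gives exactly the same text.

```python
import html

encoded = ("&#1575;&#1604;&#1581;&#1604;&#1602;&#1577; "
           "&#1575;&#1604;&#1579;&#1575;&#1604;&#1579;&#1577;")

# The markup is plain ASCII, so decoding its bytes with any ASCII-compatible
# charset returns the same text; a charset conversion has nothing to change.
raw = encoded.encode("ascii")
assert raw.decode("latin-1") == raw.decode("cp1252") == encoded

# html.unescape turns numeric (and named) character references into characters.
decoded = html.unescape(encoded)
print(decoded)

# Encoding the result as UTF-8 gives the bytes to write to the output file.
utf8_bytes = decoded.encode("utf-8")
```

Note that html.unescape also decodes references such as &lt; and &amp;, so it is best applied to text content rather than to a whole HTML file.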


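In PHP, there is http://www.php.net/manual/en/function.htmlspecialchars-decode.php, although htmlspecialchars_decode only decodes a handful of special characters such as &amp; and &lt;. For numeric references like &#1575;, html_entity_decode is the function to use. Other languages should have similar functions.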


recode html..utf8

This should work too, but please read the recode manual first: when given file names, it recodes the files in place unless told otherwise.
