In python how do you deal with other encodings in domain names_问答_开发者

In python how do you deal with other encodings in domain names

开发者 https://www.devze.com 2023-02-28 13:54 出处：网络

I\'m trying to parse domain names from the Message-ID field of an email that\'s been loaded from a file and compare it to the domain of the from field to see how well it matches up. Then I compare the

I'm trying to parse domain names from the Message-ID field of an email that's been loaded from a file and compare it to the domain of the from field to see how well it matches up. Then I compare the distance using nltk.edit_distance().

I'm using

re.search('@[\[\]\w+\.]+',mail['Message-ID']).group()[1:]

but one spam message has the following

mail2['Message-ID']
'<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>开发者_如何学JAVA;'

So when I try and match that it doesn't return a match in group()

I can decode it in Shift_JIS, but don't know what to do with it from there <2011315123.04C6DACE618A7C2763810@これから見えるだろう>

I don't want to try and check for every possible character encoding.

Any ideas of what I should do with it?

You can try the chardet project, which uses an algorithm to guess the character encoding:

import chardet

text = '<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7' + \
    '\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>'
cset = chardet.detect(text)
print cset
encoding = cset['encoding']
print encoding, text.decode(encoding)

Output:

{'confidence': 1, 'encoding': 'SHIFT_JIS'}
SHIFT_JIS <2011315123.04C6DACE618A7C2763810@これから見えるだろう>

In python how do you deal with other encodings in domain names

精彩评论

关注公众号

热门标签

图文推荐

In python how do you deal with other encodings in domain names

更多 问答 相关资讯：

精彩评论

关注公众号

热门标签

图文推荐

更多问答相关资讯：