How to filter Chinese (ONLY Chinese)

Developer https://www.devze.com 2023-03-25 19:20 Source: the web
I want to convert text that includes punctuation and full-width symbols into pure Chinese text.

import re
maybe_re = re.compile("xxxxxxxxxxxxxxxxx")  # TODO: the pattern I'm looking for
print("".join(maybe_re.findall("你好,这只是一些中文文本..,.,全角")))

# Desired output
你好这只是一些中文文本全角


I don't know of a good way to separate Chinese characters from other letters, but you can distinguish letters from other characters. With a regex, you can use r"\w" (compiled with the re.UNICODE flag if you're on Python 2). That matches digits and the underscore as well as letters, but not punctuation.
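
Here is a minimal sketch of the \w approach, assuming Python 3, where str patterns are Unicode-aware by default so no flag is needed. The name word_re is just for illustration. Note that Latin letters and digits survive this filter too, so it only gives "pure Chinese" when the input has nothing but Chinese and punctuation.

```python
import re

# In Python 3, \w on a str pattern matches Unicode letters (including
# Chinese characters), digits and the underscore, but not punctuation.
word_re = re.compile(r"\w+")

sample = "你好,这只是一些中文文本..,.,全角"
result = "".join(word_re.findall(sample))
print(result)  # 你好这只是一些中文文本全角

# Latin letters and digits are kept as well:
mixed = "".join(word_re.findall("abc, 123 你好!"))
print(mixed)  # abc123你好
```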

unicodedata.category(c) tells you the Unicode general category of the character c. Your Chinese characters are "Lo" (Letter, other: letters without case), while the punctuation is "Po" (Punctuation, other).
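
A rough sketch of filtering on that category, using only the standard library. The helper name keep_lo_chars is made up for this example. Keep in mind that "Lo" is not specific to Chinese: Japanese kana, Korean hangul and other caseless scripts fall into it as well, so this is an approximation rather than a strict Chinese-only filter.

```python
import unicodedata

def keep_lo_chars(text):
    """Keep only characters whose Unicode general category is "Lo"."""
    return "".join(c for c in text if unicodedata.category(c) == "Lo")

print(unicodedata.category("你"))  # Lo
print(unicodedata.category(","))   # Po

sample = "你好,这只是一些中文文本..,.,全角"
filtered = keep_lo_chars(sample)
print(filtered)  # 你好这只是一些中文文本全角

# Latin letters (Lu/Ll) and digits (Nd) are dropped as well:
print(keep_lo_chars("abc你好123"))  # 你好
```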


The Zhon library provides you with a list of Chinese punctuation marks: https://pypi.python.org/pypi/zhon

# assumes "import re" and "import zhon.unicode" (the module name used in this answer)
text = re.sub('[%s]' % zhon.unicode.PUNCTUATION, "", "你好,这只是一些中文文本..,.,全角")

This does almost what you want, but not exactly, because the sentence you provided contains some very non-standard punctuation marks, such as ".". Still, I think Zhon might be useful to others with a similar issue.
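
If the marks missing from a fixed punctuation list are the problem, one possible workaround (not part of Zhon, just a standard-library sketch) is to drop every character whose Unicode category starts with "P". The helper name strip_punctuation is hypothetical. This removes both ASCII and full-width punctuation, but it leaves digits, Latin letters, symbols and whitespace alone.

```python
import unicodedata

def strip_punctuation(text):
    """Drop every character in a Unicode punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    return "".join(c for c in text if not unicodedata.category(c).startswith("P"))

sample = "你好,这只是一些中文文本..,.,全角"
cleaned = strip_punctuation(sample)
print(cleaned)  # 你好这只是一些中文文本全角

# Full-width marks are covered by the same check:
print(strip_punctuation("你好。世界!"))  # 你好世界
```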
