
How to remove extended ASCII using Python?

In trying to fix up a PML (Palm Markup Language) file, it appears that my test file has non-ASCII characters, which cause MakeBook to complain. The solution would be to strip out all the non-ASCII characters in the PML.

So, in attempting to fix this in Python, I have:

import unicodedata, fileinput

for line in fileinput.input():
    print unicodedata.normalize('NFKD', line).encode('ascii','ignore')

However, this results in an error saying that line must be "unicode, not str". Here's a file fragment:

\B1a\B \tintense, disordered and often destructive rage†.†.†.\t

I'm not quite sure how to properly pass line in to be processed at this point.


Try print line.decode('iso-8859-1').encode('ascii', 'ignore') -- that should be much closer to what you want.
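A minimal sketch of that fix dropped into the original loop (Python 2; it assumes the file really is ISO-8859-1/Latin-1 encoded):

import fileinput

for line in fileinput.input():
    # Decode the raw bytes to unicode, then drop anything outside ASCII.
    # The trailing comma avoids printing a second newline.
    print line.decode('iso-8859-1').encode('ascii', 'ignore'),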


You would like to treat line as ASCII-encoded data, so the answer is to decode it to text using the ascii codec:

line.decode('ascii')

This will raise errors for data that is not in fact ASCII-encoded. This is how to ignore those errors:

line.decode('ascii', 'ignore')

This gives you text, in the form of a unicode instance. If you would rather work with (ASCII-encoded) data than text, you can re-encode it to get back a str or bytes instance (depending on your version of Python):

line.decode('ascii', 'ignore').encode('ascii')
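For example, a small sketch (Python 2) of that round trip on a hypothetical byte string:

line = 'caf\xe9 rage'                    # hypothetical byte string with one non-ASCII byte (0xE9)
text = line.decode('ascii', 'ignore')    # u'caf rage' -- unicode, offending byte dropped
data = text.encode('ascii')              # 'caf rage'  -- a plain str again
print data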


To drop non-ASCII characters use line.decode(your_file_encoding).encode('ascii', 'ignore'). But it would probably be better to use PML escape sequences for them:

import re

def escape_unicode(m):
    # Replace a single non-ASCII character with its PML \Uxxxx escape.
    return '\\U%04x' % ord(m.group())

# Match any character outside the 7-bit ASCII range (up to U+FFFF).
non_ascii = re.compile(u'[\x80-\uFFFF]', re.U)

line = u'\\B1a\\B \\tintense, disordered and often destructive rage\u2020.\u2020.\u2020.\\t'
print non_ascii.sub(escape_unicode, line)

This outputs \B1a\B \tintense, disordered and often destructive rage\U2020.\U2020.\U2020.\t.

Dropping non-ASCII and control characters with a regular expression is easy too (this can safely be used after escaping):

# Anything that is not a tab, newline, carriage return, or ASCII 0x20-0x7F.
regexp = re.compile('[^\x09\x0A\x0D\x20-\x7F]')
regexp.sub('', line)
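Putting the two steps together on the snippet above (a sketch reusing the line, escape_unicode, non_ascii and regexp names defined earlier):

escaped = non_ascii.sub(escape_unicode, line)   # non-ASCII characters become \Uxxxx escapes
cleaned = regexp.sub('', escaped)               # strip any remaining control characters
print cleaned.encode('ascii')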


When reading from a file in Python you get byte strings, aka "str" in Python 2.x and earlier. You need to convert these to the "unicode" type using the decode method, e.g.:

line = line.decode('latin1')

Replace 'latin1' with the correct encoding.
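As a sketch (Python 2), codecs.open can do the decoding for you as each line is read; the file name 'book.pml' and the latin1 encoding are assumptions here:

import codecs

for line in codecs.open('book.pml', 'r', 'latin1'):
    # line is already unicode; drop anything that won't fit in ASCII.
    print line.encode('ascii', 'ignore'),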
