Removing non-breaking spaces from strings using Python

Developer https://www.devze.com 2022-12-25 19:06 Source: Internet
I am having some trouble with a very basic string issue in Python (that I can't figure out). Basically, I am trying to do the following:

# read file into a string
myString = file.read()

# attempt to remove non-breaking spaces
myString = myString.replace("\u00A0", " ")

# however, when I print my string to the console, I get:
Foo **<C2><A0>** Bar

I thought that "\u00A0" was the escape code for a Unicode non-breaking space, but apparently I am not doing this properly. Any ideas on what I am doing wrong?


You don't have a Unicode string, but a UTF-8 encoded sequence of bytes (which is what a str is in Python 2.x).

Try

myString = myString.replace("\xc2\xa0", " ")

Better would be to switch to unicode -- see this article for ideas. Thus you could say

uniString = unicode(myString, "UTF-8")
uniString = uniString.replace(u"\u00A0", " ")

and it should also work (caveat: I don't have Python 2.x available right now), although you will need to encode it back to bytes when writing it to a file or printing it to the screen.
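
For readers on Python 3, where str is already Unicode and data read in binary mode is bytes, the same decode, replace, encode round trip might look like the minimal sketch below. The sample bytes and the variable names are made up for illustration and are not from the question.

```python
# Hypothetical sample data: UTF-8 bytes containing a non-breaking space (0xC2 0xA0).
raw_bytes = b"Foo\xc2\xa0Bar"

# Decode the bytes into a Unicode string before searching for U+00A0.
text = raw_bytes.decode("utf-8")

# Replace the non-breaking space with an ordinary space.
cleaned = text.replace("\u00a0", " ")

# Encode back to bytes only when writing to a binary file or socket.
cleaned_bytes = cleaned.encode("utf-8")

print(repr(cleaned))        # 'Foo Bar'
print(repr(cleaned_bytes))  # b'Foo Bar'
```

Keeping the text as str for as long as possible and encoding only at the output boundary avoids having to remember the byte sequence of each character.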


I hesitate before adding another answer to an old question, but since Python3 counts a Unicode "non-break space" character as a whitespace character, and since strings are Unicode by default, you can get rid of non-break spaces in a string s using join and split, like this:

s = ' '.join(s.split())

This will, of course, also change any other whitespace (tabs, newlines, etc.). The table on Wikipedia's Whitespace character page lists the Unicode characters that would be changed.

And note that this is Python3 only.
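
To make the side effect concrete, here is a small Python 3 sketch with an invented sample string. It compares the split/join approach with a plain replace that touches only U+00A0.

```python
# Hypothetical sample: a non-breaking space, a tab and a newline.
s = "Foo\u00a0Bar\tBaz\nQux"

# split() with no arguments splits on any run of Unicode whitespace,
# so the tab and the newline are collapsed as well.
collapsed = " ".join(s.split())
print(repr(collapsed))  # 'Foo Bar Baz Qux'

# If only the non-breaking space should change, replace() leaves the rest alone.
targeted = s.replace("\u00a0", " ")
print(repr(targeted))   # 'Foo Bar\tBaz\nQux'
```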


No, u"\u00A0" is the escape code for a non-breaking space. In a Python 2 byte string, "\u00A0" is just 6 characters that are not any sort of escape code. Read this.


Please note that, provided myString is a Unicode string, a simple myString.strip() will remove not only spaces but also non-breaking spaces from the beginning and end of myString. Not exactly what the OP asked for, but still very handy in many cases.
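
A quick Python 3 check of that behaviour, using a made-up string: the non-breaking spaces at the ends are removed, while the one in the middle stays.

```python
# Hypothetical sample with non-breaking spaces at both ends and in the middle.
padded = "\u00a0Foo\u00a0Bar\u00a0"

# strip() trims Unicode whitespace from the ends only.
trimmed = padded.strip()

# ascii() escapes non-ASCII characters so the remaining U+00A0 is visible.
print(ascii(trimmed))  # 'Foo\xa0Bar'
```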


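You can work around this issue by forcing an ASCII encoding. Be aware that this drops every non-ASCII character, not only non-breaking spaces, and deletes them rather than replacing them with a space.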

 cleaned_string = myString.encode('ascii', 'ignore')
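
The sketch below, written for Python 3 with an invented sample string, shows what the 'ignore' error handler actually does. Note that on Python 3 the result of encode() is a bytes object.

```python
# Hypothetical sample: "Caf", e-acute (U+00E9), a non-breaking space, "Bar".
my_string = "Caf\u00e9\u00a0Bar"

# 'ignore' silently drops every character that ASCII cannot represent.
cleaned_string = my_string.encode("ascii", "ignore")
print(cleaned_string)  # b'CafBar'

# Replacing U+00A0 explicitly keeps the accented letter and the word gap.
replaced = my_string.replace("\u00a0", " ")
print(ascii(replaced))  # 'Caf\xe9 Bar'
```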


Also note that Python's regex whitespace class (\s) matches non-breaking spaces when matching Unicode strings.

The following code will replace one or more spaces or non-breaking spaces with a single space:

import re

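# re.UNICODE is required on Python 2 for \s to match U+00A0;
# on Python 3 it is already the default for str patterns.
re.sub(r'\s+', ' ', u"String with    spaces and non\u00A0breaking\u00A0spaces", flags=re.UNICODE)
# 'String with spaces and non breaking spaces'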
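
Because \s also matches tabs and newlines, a narrower character class may be preferable when line breaks must survive. The following Python 3 sketch uses an invented sample string and a class that contains only the ordinary space and U+00A0.

```python
import re

# Hypothetical sample: two non-breaking spaces plus a space, then a newline.
text = "Line one\u00a0\u00a0 with gaps\nLine two"

# The class holds an ordinary space and U+00A0 only, so "\n" is untouched.
result = re.sub("[ \u00a0]+", " ", text)
print(repr(result))  # 'Line one with gaps\nLine two'
```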


There is no indication in what you write that you're necessarily doing anything wrong: if the original string had a non-breaking space between 'Foo' and 'Bar', you now have a normal space there instead. This assumes that at some point you've decoded your input string into a Unicode string. I imagine the input is a bytestring, unless you're on Python 3 or the file was opened with the open function from the codecs module. Without that decoding step you're unlikely to locate a Unicode character in a non-Unicode string of bytes for the purposes of the replace. But still, there are no clear indications of problems in what you write.

Can you clarify what's the input (print repr(myString) just before the replace) and what's the output (print repr(myString) again just after the replace) and why you think that's a problem? Without the repr, strings that are actually different might look the same, but repr helps there.
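
As an illustration of why the escaped form helps, here is a Python 3 sketch with a made-up string. It uses ascii(), which escapes every non-ASCII character, to show two strings that look identical on screen but are not equal.

```python
# Hypothetical sample: the same text before and after the replace.
before = "Foo\u00a0Bar"
after = before.replace("\u00a0", " ")

# Printed normally, both look like "Foo Bar"; the escaped form differs.
print(ascii(before))    # 'Foo\xa0Bar'
print(ascii(after))     # 'Foo Bar'
print(before == after)  # False
```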
