I was trying to remove duplicate lines from my file when I noticed the following:
word1 word2
word1 word2
I did not understand why these lines were not combined, so I opened the file in vim and used :set list
to see whether there were any special characters, and I found this:
word1 <feff>word2
word1 word2
I am not sure how to clean this word in Python. Any suggestions on what this character might be and how it can be removed?
U+FEFF is the Byte Order Mark (BOM) character, which should only occur at the start of a document. Anywhere else in a document, it should be treated as a ZERO WIDTH NON-BREAKING SPACE, which is why it is invisible but still makes the two lines compare as different. If this causes issues, you can remove it like any other character:
>>> s = u'word1 \ufeffword2'
>>> s = s.replace(u'\ufeff', '')
>>> s
u'word1 word2'
(In Python 3, drop the u in front of the strings; the prefix is a syntax error in 3.0 through 3.2 and optional from 3.3 onward.)
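For the original goal of combining duplicate lines, the same replace call can be applied to each line before comparing. This is only a rough Python 3 sketch: the helper name unique_lines and the in-memory sample are made up for illustration, and it assumes the file has already been decoded to text.

```python
import io

BOM = "\ufeff"  # U+FEFF, shown as <feff> by vim's :set list


def unique_lines(lines):
    """Return lines with U+FEFF removed, keeping the first copy of each."""
    seen = set()
    result = []
    for line in lines:
        # Strip the stray BOM and the trailing newline before comparing.
        cleaned = line.replace(BOM, "").rstrip("\n")
        if cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


# Hypothetical stand-in for the real file: two lines that only differ by U+FEFF.
sample = io.StringIO("word1 \ufeffword2\nword1 word2\n")
print(unique_lines(sample))
```

If the BOM only ever appears at the very start of the file, opening it with encoding="utf-8-sig" removes that leading one during decoding, but it does not touch a U+FEFF in the middle of a line like the one here.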
Have you tried mytext.split(string.whitespace)?
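A quick way to check this suggestion on the string from the question, as a Python 3 sketch (the variable names are just for illustration). Passing string.whitespace as the separator makes split look for that exact six-character sequence, and a bare split() splits on whitespace, which U+FEFF is not:

```python
import string

s = "word1 \ufeffword2"

# The separator is the literal string ' \t\n\r\x0b\x0c', which does not occur in s.
parts_literal = s.split(string.whitespace)

# With no argument, split() breaks on runs of whitespace characters.
parts_default = s.split()

# U+FEFF is not classified as whitespace, so it stays attached to the second word.
has_bom = "\ufeff" in parts_default[1]

print(parts_literal)
print(parts_default)
print(has_bom)
```

So splitting alone leaves the character in place here, and the replace call is still needed.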