What am I doing wrong/what can I do?
    import sys
    import string

    def remove(file):
        punctuation = string.punctuation
        for ch in file:
            if len(ch) > 1:
                print('error - ch is larger than 1 --| {0} |--'.format(ch))
            if ch in punctuation:
                ch = ' '
                return ch
            else:
                return ch

    ref = (open("ref.txt","r"))
    test_file = (open("test.txt", "r"))
    dictionary = ref.read().split()
    file = test_file.read().lower()
    file = remove(file)
    print(file)
This is in Python 3.1.2
In this code:
    for ch in file:
        if len(ch) > 1:
the oddly named file (a name that shadowed a builtin in Python 2, which is best avoided) is not a file, it's a string. In Python 3 that means a unicode string, but that makes no difference here: the loop yields single characters (unicode characters, not bytes, in Python 3), so len(ch) == 1 is absolutely guaranteed by the rules of the Python language. I'm not sure what you're trying to accomplish with that test (rule out some subset of unicode characters?), but whatever you think you're achieving, I assure you that you're not achieving it, and you should recode that part.
Apart from this, you're returning from inside the loop on its very first iteration, so the function exits immediately and returns just one character (the first one in the file, or a space if that first one was a punctuation character).
The suggestion to use the translate method, which I saw in another answer, is the right one, but that answer used the wrong version of translate (the one that applies to byte strings, not to unicode strings as you need for Python 3). The proper unicode version is simpler, and reduces the whole body of your function to just two statements:
    trans = dict.fromkeys(map(ord, string.punctuation), ' ')
    return file.translate(trans)
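For reference, here is a minimal sketch of how those two statements might sit in a complete function. The name remove_punctuation and the sample input are only illustrative and are not from the original question.

```python
import string

def remove_punctuation(text):
    # str.translate accepts a dict keyed by code points (ordinals);
    # here every ASCII punctuation character maps to a single space.
    table = dict.fromkeys(map(ord, string.punctuation), ' ')
    return text.translate(table)

if __name__ == "__main__":
    # Illustrative input only.
    print(remove_punctuation("hello, world! it's 3.1."))
```

Each punctuation character is replaced one-for-one, so the result has the same length as the input.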
In Python, strings are immutable, so you need to create a new string with your changes.
There are a few ways to do this:
One is to use a generator expression to inspect the characters and keep only the non-punctuation ones (note that this drops punctuation rather than replacing it with a space):
    def remove(file):
        return ''.join(ch for ch in file if ch not in string.punctuation)
You could also call your own functions to test or translate each character; these could raise a "weird character" exception or do whatever other processing you need (CheckCh and TranslateCh below are placeholders for functions you would write yourself):
    def remove(file):
        return ''.join(TranslateCh(ch) for ch in file if CheckCh(ch))
Another alternative is the str methods replace or translate (in Python 3 these are methods on the string itself rather than functions in the string module). translate provides a nice mechanism for this, and a more efficient one than building a list; see Alex's answer.
Or you could collect the characters in a list inside a for loop and join the list at the end, but that's a little "unpythonic".
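A rough sketch of that loop-and-join approach, for comparison. The function name remove_with_loop is made up for illustration, and this version assumes you want each punctuation character replaced with a space, as in the original question, rather than dropped.

```python
import string

def remove_with_loop(text):
    pieces = []
    for ch in text:
        # Keep ordinary characters; swap punctuation for a space.
        if ch in string.punctuation:
            pieces.append(' ')
        else:
            pieces.append(ch)
    # Unlike the original code, return only after the whole string is processed.
    return ''.join(pieces)
```

The key difference from the code in the question is that the return sits outside the loop, so every character is visited.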
Check out the re (regular expression) module. It has a "sub" function to replace strings that match regular expressions.
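As a sketch of the regular expression route, something like the following should work. The names PUNCT_RE and remove_with_regex are illustrative, and the pattern assumes that "punctuation" means the ASCII characters in string.punctuation.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
# re.escape keeps characters such as ] \ ^ and - from being read as class syntax.
PUNCT_RE = re.compile('[' + re.escape(string.punctuation) + ']')

def remove_with_regex(text):
    # Replace every match with a single space.
    return PUNCT_RE.sub(' ', text)
```

For plain character-for-character replacement, translate is likely the simpler choice; a regex becomes more useful if you later want to collapse runs of punctuation or match wider character categories.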