I'm having trouble getting a replace() to work
I've tried my_string.replace('\\', '')
and re.sub('\\', '', my_string)
, but neither one works.
I thought \\ was the escape code for a backslash. Am I wrong?
The string in question looks like
'<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>'
or print my_string
<2011315123.04C6DACE618A7C2763810@???ꂩ?猩???邾?낤>
Yes, it's supposed to look like garbage, but I'd rather get
'<2011315123.04C6DACE618A7C2763810@82b182ea82a982e78ca982a682e982be82eb82a4>'
You don't have any backslashes in your string. What you don't have, you can't remove.
Consider what you are showing as '\x82' ... this is a one-byte string.
>>> s = '\x82'
>>> len(s)
1
>>> ord(s)
130
>>> hex(ord(s))
'0x82'
>>> print s
é # my sys.stdout.encoding is 'cp850'
>>> print repr(s)
'\x82'
>>>
What you'd "rather get" ('x82') is meaningless.
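A minimal sketch of the same point, written for Python 3 (an assumption; the transcripts in this thread are Python 2, where a plain str is already a byte string). It shows that the backslash exists only in the repr text, not in the data itself:

```python
# Python 3 sketch: a bytes object plays the role of the Python 2 str above.
s = b'\x82'                          # one byte with value 0x82

print(len(s))                        # 1
print(s[0])                          # 130
print(b'\\' in s)                    # False: there is no backslash byte (0x5c)
print(s.replace(b'\\', b'') == s)    # True: replace() found nothing to remove

# The backslash appears only when the byte is rendered for display.
print('\\' in repr(s))               # True
```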
Update: The "non-ascii" part of the string (bounded by @ and >) is actually Japanese text written mostly in Hiragana and encoded using shift_jis. Transcript of IDLE session:
>>> y = '\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4'
>>> print y.decode('shift_jis')
これから見えるだろう
Google Translate produces "Can not you see the future" as the English translation.
In a comment on another answer, you say:
I just need ascii
and
What I'm doing with it is seeing how far apart the two strings are using nltk.edit_distance(), so this will give me a multiple of the true distance. Which is good enough for me.
Why do you think you need ASCII? Edit distance is defined quite independently of any alphabet.
For a start, doing nonsensical transformations of your strings won't give you a consistent or predictable multiple of the true distance. Secondly, out of the following:
x
repr(x)
repr(x).replace('\\', '')
repr(x).replace('\\x', '') # if \ is noise, so is x
x.decode(whatever_the_encoding_is)
why do you choose the third?
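To illustrate the first point (the multiple is not consistent), here is a rough Python 3 sketch. The edit_distance helper is a plain textbook Levenshtein written for this example, not nltk's function, and the three two-character strings are made up from characters that occur in your data:

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (every edit costs 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete x
                            curr[j - 1] + 1,            # insert y
                            prev[j - 1] + (x != y)))    # substitute or match
        prev = curr
    return prev[-1]

# Hypothetical test strings, shift_jis encoded, each two characters long.
kore = b'\x82\xb1\x82\xea'
kare = b'\x82\xa9\x82\xea'
mire = b'\x8c\xa9\x82\xea'

def distances(x, y):
    """Return (distance on decoded text, distance on hex digits)."""
    true = edit_distance(x.decode('shift_jis'), y.decode('shift_jis'))
    hexed = edit_distance(x.hex(), y.hex())
    return true, hexed

print(distances(kore, kare))   # (1, 2)
print(distances(kore, mire))   # (1, 3)
```

Both pairs differ by exactly one character, yet the hex form reports 2 for one pair and 3 for the other.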
Update 2 in response to comments:
(1) You still haven't said why you think you need "ascii". nltk.edit_distance doesn't require "ascii" -- the args are said to be "strings" (whatever that means) but the code will work with any 2 sequences of objects for which != works. In other words, why not just use the first of the above 5 options?
(2) Accepting up to 100% inflation of the edit distance is somewhat astonishing. Note that your currently chosen method will use 4 symbols (hex digits) per Japanese character. repr(x) uses 8 symbols per character. x (the first option) uses 2.
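The counts in (2) can be checked with a short Python 3 sketch (Python 3 is an assumption here; the bytes literal stands in for the Python 2 str, and bytes.hex() stands in for "repr with the \x noise removed"):

```python
# The shift_jis bytes between '@' and '>' in the question.
y = (b'\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9'
     b'\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4')

n_chars = len(y.decode('shift_jis'))   # number of Japanese characters
raw_len = len(y)                       # option 1: the bytes themselves
repr_len = len(repr(y)[2:-1])          # repr text without the b'...' wrapper
hex_len = len(y.hex())                 # hex digits only

print(n_chars)                         # 10
print(raw_len // n_chars)              # 2 symbols per character
print(repr_len // n_chars)             # 8 symbols per character
print(hex_len // n_chars)              # 4 symbols per character
```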
(3) You can mitigate the inflation effect by normalising your edit distance. Instead of comparing distance(s1, s2) with a number_of_symbols threshold, compare distance(s1, s2) / float(max(len(s1), len(s2))) with a fraction threshold. Note normalisation is usually used anyway ... the rationale being that the dissimilarity between 20-symbol strings with an edit distance of 4 is about the same as that between 10-symbol strings with an edit distance of 2, not twice as much.
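A minimal sketch of the normalisation in (3). The names edit_distance and normalised_distance are made up for this example; the distance function is a plain textbook Levenshtein, and nltk.edit_distance could presumably be dropped in instead:

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (every edit costs 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete x
                            curr[j - 1] + 1,            # insert y
                            prev[j - 1] + (x != y)))    # substitute or match
        prev = curr
    return prev[-1]

def normalised_distance(s1, s2):
    """Edit distance divided by the longer length, giving a value in 0.0-1.0."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 0.0                     # two empty sequences are identical
    return edit_distance(s1, s2) / float(longest)

short_a, short_b = 'abcdefghij', 'abXdefgYij'    # 10 symbols, distance 2
long_a, long_b = short_a * 2, short_b * 2        # 20 symbols, distance 4

print(edit_distance(short_a, short_b), normalised_distance(short_a, short_b))  # 2 0.2
print(edit_distance(long_a, long_b), normalised_distance(long_a, long_b))      # 4 0.2
```

The raw distances differ by a factor of two, but the normalised values are the same, so a single fraction threshold treats both pairs alike.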
(4) nltk.edit_distance is the most shockingly inefficient pure-Python implementation of edit_distance that I've ever seen. This implementation by Magnus Lie Hetland is much better, but still capable of improvement.
This works, I think, if you really just want to strip the "\":
>>> a = '<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>'
>>> repr(a).replace("\\","")[1:-1]
'<2011315123.04C6DACE618A7C2763810@x82xb1x82xeax82xa9x82xe7x8cxa9x82xa6x82xe9x82xbex82xebx82xa4>'
>>>
But like the answer above, what you get is pretty much meaningless.
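If the exact output shown in the question is really what is wanted, one possible sketch is to hex-encode only the non-ASCII bytes rather than editing the repr. This is Python 3 (an assumption; the session above is Python 2), and hex_non_ascii is a name invented for this example. The result is no more meaningful as text than before:

```python
def hex_non_ascii(data):
    """Keep ASCII bytes as characters; render every other byte as two hex digits."""
    return ''.join(chr(b) if b < 0x80 else '%02x' % b for b in data)

a = (b'<2011315123.04C6DACE618A7C2763810@'
     b'\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9'
     b'\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>')

print(hex_non_ascii(a))
# <2011315123.04C6DACE618A7C2763810@82b182ea82a982e78ca982a682e982be82eb82a4>
```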