I've read a lot about Python encoding (maybe not enough, but I've been working on this for two days and still have nothing) and I'm still running into trouble. I'll try to be as clear as I can. The main thing is that I'm trying to remove all accents and characters such as #, !, %, &...
The thing is, I do a query search on Twitter Search API with this call:
query = urllib2.urlopen(settings.SEARCH_URL + '?%s' % params)
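For context, params is built roughly like this (the endpoint and query values below are placeholders, since the real ones come from my settings and the request):

# -*- coding: utf-8 -*-
import urllib
import urllib2
import simplejson

# Placeholder for settings.SEARCH_URL (the old Twitter Search API endpoint)
SEARCH_URL = 'http://search.twitter.com/search.json'

# 'q' is the search string, 'rpp' the number of results per page (up to 100)
params = urllib.urlencode({'q': 'speedy', 'rpp': 100})
query = urllib2.urlopen(SEARCH_URL + '?%s' % params)
dados = simplejson.loads(query.read())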
Then I call a method, avaliar_pesquisa(), to evaluate the results I got, based on the tags (or terms) of the input:
dados = avaliar_pesquisa(simplejson.loads(query.read()), str(tags))
In avaliar_pesquisa(), the following happens:
def avaliar_pesquisa(dados, tags):
    resultados = []
    # Iterate over the results
    for i in dados['results']:
        resultados.append({'texto' : i['text'],
                           'imagem' : i['profile_image_url'],
                           'classificacao' : avaliar_texto(i['text'], tags),
                           'timestamp' : i['created_at'],
                           })
Note avaliar_texto(), which evaluates the tweet text. The problem is exactly in the following lines:
def avaliar_texto(texto, tags):
    # Remove accents
    from unicodedata import normalize
    def strip_accents(txt):
        return normalize('NFKD', txt.decode('utf-8'))

    # Split
    texto_split = strip_accents(texto)
    texto_split = texto.lower().split()

    # Remove non-alpha characters
    import re
    pattern = re.compile('[\W_]+')
    texto_aux = []
    for i in texto_split:
        texto_aux.append(pattern.sub('', i))
    texto_split = texto_aux
The split doesn't really matter here.
The thing is, if I print the type of the variable texto in this last method, I get either str or unicode. If the text has any kind of accent, it comes in as unicode.
So, I get this error when running the application, which receives at most 100 tweets as a response:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 17: ordinal not in range(128)
For the following text:
Text: Agora o problema é com o speedy. <type 'unicode'>
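Just to isolate it, the same error can be reproduced with nothing but the decode call in a Python 2 shell:

texto = u'Agora o problema \xe9 com o speedy.'
# Decoding something that is already unicode makes Python 2 first encode it
# with the default ascii codec, which chokes on the \xe9 (the accented e):
texto.decode('utf-8')
# UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 17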
Any ideas?
See this page.
The decode() method is meant to be applied to a str object, not a unicode object. Given a unicode string as input, it first tries to encode it to a str using the ascii codec, then decode it as UTF-8, which fails.
Try return normalize('NFKD', unicode(txt)).
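A minimal sketch of that idea, extended so it accepts both input types (the isinstance check, and the assumption that incoming str values are UTF-8, are additions of mine):

from unicodedata import normalize

def strip_accents(txt):
    # decode() only applies to str; unicode input passes through untouched.
    # Assumption: any str reaching this point is UTF-8 encoded.
    if isinstance(txt, str):
        txt = txt.decode('utf-8')
    return normalize('NFKD', txt)

print repr(strip_accents(u'Agora o problema \xe9 com o speedy.'))
# u'Agora o problema e\u0301 com o speedy.' -- the accent is split off as a
# combining mark, which the question's [\W_]+ substitution then removes.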
This is what I used in my code to discard accents, etc.
text = unicodedata.normalize('NFD', text).encode('ascii','ignore')
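Applied to the text from the question, with a type check added (the check and the helper name are assumptions of mine, so the call also works when the text arrives as a UTF-8 str):

import unicodedata

def discard_accents(text):
    # Assumption: str input is UTF-8; decode it before normalizing.
    if isinstance(text, str):
        text = text.decode('utf-8')
    # NFD splits accented characters; the ASCII encode then drops the marks.
    return unicodedata.normalize('NFD', text).encode('ascii', 'ignore')

print discard_accents(u'Agora o problema \xe9 com o speedy.')
# Agora o problema e com o speedy.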
Try placing:
# -*- coding: utf-8 -*-
at the beginning of the Python script containing the code.