开发者

Some characters (trademark sign, etc) unable to write to a file but is printable on the screen

开发者 https://www.devze.com 2023-02-13 22:22 出处:网络
I\'ve been trying to scrape data from a website and write out the data that I find to a file. More than 90% of the time, I don\'t run into Unicode errors but when the data has the following characters

I've been trying to scrape data from a website and write out the data that I find to a file. More than 90% of the time, I don't run into Unicode errors but when the data has the following characters such as "Burger King®, Hans Café", it doesn't like writing that into the file so my error handling prints it to the screen as is and without any further errors.

I've tried the encode and decode functions and the various encodings but to no avail.

Please find an excerpt of the current code that I've written below:

import urllib2,sys
import re
import os
import urllib
import string
import time
from BeautifulSoup import BeautifulSoup,NavigableString, SoupStrainer
from stri开发者_StackOverflow中文版ng import maketrans
import codecs

f=codecs.open('alldetails7.txt', mode='w', encoding='utf-8', errors='replace')
...


soup5 = BeautifulSoup(html5)
enc_s5 = soup5.originalEncoding

for company in iter(soup5.findAll(height="20px")):
    stream = ""
    count_detail = 1
    for tag in iter(company.findAll('td')):
        if count_detail > 1:
           stream = stream + tag.text.replace(u',',u';')
           if count_detail < 4 :
              stream=stream+","
        count_detail = count_detail + 1
    stream.strip()
    try:
        f.write(str(stnum)+","+br_name_addr+","+stream.decode(enc_s5)+os.linesep)
    except:
        print "Unicode error ->"+str(storenum)+","+branch_name_address+","+stream


Your f.write() line doesn't make sense to me - stream will be a unicode since it's made indirectly from from tag.text and BeautifulSoup gives you Unicode, so you shouldn't call decode on stream. (You use decode to turn a str with a particular character encoding into a unicode.) You've opened the file for writing with codecs.open() and told it to use UTF-8, so you can just write() a unicode and that should work. So, instead I would try:

f.write(unicode(stnum)+br_name_addr+u","+stream+os.linesep)

... or, supposing that instead you had just opened the file with f=open('alldetails7.txt','w'), you would do:

line = unicode(stnum)+br_name_addr+u","+stream+os.linesep
f.write(line.encode('utf-8'))


Have you checked the encoding of the file you're writing to, and made sure the characters can be shown in the encoding you're trying to write to the file? Try setting the character encoding to UTF-8 or something else explicitly to have the characters show up.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号