TCL-script (Eggdrop) has problems with special characters

I have installed Eggdrop on a new Debian server, but it keeps having issues with processing special characters.

Eggdrop is running with UTF-8. I have even manually forced the Tcl encoding to utf-8 in the script, and I have tried recompiling Eggdrop following the instructions at http://eggwiki.org/Utf-8.

22:00 <@me> !tr fr I have prepared lots of cookies for the entire family.
22:00 <@bot> J'ai prÃÂ©parÃÂ© beaucoup de biscuits pour toute la famille.
22:00 <@me> !tr ar The special characters are processed.
22:00 <@bot> ÃÂªÃÂªÃE ÃEÃÂ¹ÃÂ§ÃDÃÂ¬ÃÂ© ÃÂ§ÃDÃÂ£ÃÂÃÂ±ÃA ÃÂ§ÃDÃÂ®ÃÂ§ÃÂµÃÂ©.

(Also see a previous question that was never resolved: Issues with TCL encoding on Eggdrop)

namespace eval gTranslator {

# Factor this out into a helper
proc getJson url {
  set tok [http::geturl $url]
  set res [json::json2dict [http::data $tok]]
  http::cleanup $tok
  return $res
}
# How to decode _decimal_ entities; WARNING: high magic factor within!
proc decodeEntities str {
  set str [string map {\[ {\[} \] {\]} \$ {\$} \\ \\\\} $str]
  subst [regsub -all {&#(\d+);} $str {[format %c \1]}]
}

bind pub - !tr gTranslator::translate
proc translate { nick uhost handle chan text } {
  package require http
  package require json
  set lngto [string tolower [lindex [split $text] 0]]
  set text [http::formatQuery q [join [lrange [split $text] 1 end]]]
  # $text already carries the "q=" prefix added by http::formatQuery above
  set dturl "http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&$text"

  set lng [dict get [getJson $dturl] responseData language]

  if { $lng == $lngto } {
    putserv "PRIVMSG $chan :\002Error\002 translating $lng to $lngto."
    return 0
  }
  set trurl "http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&langpair=$lng%7c$lngto&$text"
  putlog $trurl

  set res [getJson $trurl]

  putlog $res
  #putserv "PRIVMSG $chan :Language detected: $lng"

  set translated [decodeEntities [dict get $res responseData translatedText]]

  putserv "PRIVMSG $chan :[encoding convertto utf-8 $translated]"
}
}

That ugly mess you are seeing is UTF-8 interpreted as ISO 8859-1. It indicates that somewhere there's a misinterpretation of what characters mean, and can be caused by either getting wires crossed over a communication channel, or by an extra round of encoding being applied. Because there are rather a lot of moving parts involved (IRC client, IRC server, eggdrop, your script, Google translate) it is necessary to talk you through debugging.
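To make the two failure modes concrete, here is a minimal sketch in plain Tcl (no eggdrop needed) that reproduces the mangling on purpose. The variable names are mine, and the characters are written as \u escapes so that the encoding of the script file itself cannot interfere. Assuming the bot output quoted in the question was copied faithfully, its pattern looks like the two-round case.

```tcl
# "preparé"-style test word: é is U+00E9, written as an escape.
set original "pr\u00e9par\u00e9"

# Step 1: a sender encodes the characters as UTF-8 bytes.
set utf8Bytes [encoding convertto utf-8 $original]

# Step 2: a receiver wrongly decodes those bytes as ISO 8859-1.
set mangledOnce [encoding convertfrom iso8859-1 $utf8Bytes]

# Step 3: the same mistake applied a second time (an extra round of encoding).
set mangledTwice [encoding convertfrom iso8859-1 \
    [encoding convertto utf-8 $mangledOnce]]

puts [string length $original]      ;# 7
puts [string length $mangledOnce]   ;# 9
puts [string length $mangledTwice]  ;# 13
```

Each round turns every accented character into two characters, so the length of the string you end up with is a quick hint at how many rounds of misinterpretation have happened.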

Tcl and Google communicate correctly with each other (I've double-checked the code) so we can eliminate that possibility. The problem is therefore between your IRC client, the IRC server, and eggdrop; if they don't agree on what the interpretation of the bytes “on the wire” is, you get mangling.

You can add (or remove) mangling in the script through the use of encoding convertto (and encoding convertfrom) but it is necessary to be clear what you are doing in order to get it right. In memory, Tcl represents strings as sequences of abstract Unicode characters; the way in which they are “written down” in memory is not your business (and in fact varies from time to time in a complex way that's almost always highly efficient in terms of run-time). If there is a general agreement that the IRC server's channel will be passing through UTF-8, your requirement then is to:

1. Make sure that the eggdrop script writes UTF-8-encoded characters to the channel.
2. Make sure that your client reads UTF-8-encoded characters from the channel.
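The difference between characters in memory and bytes on the wire can be seen directly in a plain tclsh. This is only a sketch with variable names of my own choosing; it assumes nothing about eggdrop and simply shows what encoding convertto and encoding convertfrom do to a short Arabic string.

```tcl
# Two Arabic letters (U+062A, U+0645), written as escapes.
set chars "\u062a\u0645"

# Characters -> bytes: this is what should travel "on the wire".
set wire [encoding convertto utf-8 $chars]

# Render the byte values as hex so they can be inspected safely.
binary scan $wire H* hex

puts [string length $chars]    ;# 2 (characters)
puts [string length $wire]     ;# 4 (bytes)
puts $hex                      ;# d8aad985

# Bytes -> characters: what a UTF-8-aware reader does with those bytes.
set back [encoding convertfrom utf-8 $wire]
puts [expr {$back eq $chars}]  ;# 1
```

Both ends have to perform the matching half of this round trip exactly once; anything else produces the kind of output shown in the question.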

Dealing with the first point, I can't remember if eggdrop handles encodings automatically for you or not. If it does, you just do this in the final stage of your binding:

putserv "PRIVMSG $chan :$translated"

If it does not, you do this:

putserv "PRIVMSG $chan :[encoding convertto utf-8 $translated]"

Experiment. Use the right one.
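One way to run that experiment is to have the bot send the same fixed test string both ways and see which line arrives intact in your client. The following is a rough sketch, not a tested eggdrop script: the proc names and the !enctest trigger are hypothetical, and the bind and putserv calls simply follow the pattern already used in the question.

```tcl
# Hypothetical helper: build both candidate payloads for one message.
proc encodingVariants {text} {
    list plain $text encoded [encoding convertto utf-8 $text]
}

# Hypothetical pub handler: sends one line per variant to the channel.
proc encTestPub {nick uhost handle chan text} {
    # "café" followed by a check mark, written as escapes.
    foreach {label payload} [encodingVariants "caf\u00e9 \u2713"] {
        putserv "PRIVMSG $chan :$label: $payload"
    }
    return 0
}

# Register the trigger only when running inside eggdrop, where "bind" exists.
if {[llength [info commands bind]]} {
    bind pub - !enctest encTestPub
}
```

Whichever of the two lines displays correctly tells you which form of the final putserv to keep in the translation binding.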

On the second point (the client), explore its settings and get it right. Be aware that there can be additional problems if the client is running in a situation where it cannot display all Unicode characters correctly (a common problem if running in a terminal). There's nothing that your eggdrop script can do to fix that.

It may be worth noting that, if the creator of the data encodes it in "encoding a" and it's read in "encoding b", then the text is already broken by the time you're looking at it. You can't just tell Tcl to encode it in another encoding and expect it to work.
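A small sketch of that point, in plain Tcl with made-up variable names: once text has been decoded with the wrong encoding, encoding it "properly" afterwards only preserves the damage. A repair is possible only by undoing the wrong step, and only if you know which step it was and it lost no information (which happens to hold for ISO 8859-1 in this example).

```tcl
# The character the creator intended: é (U+00E9).
set intended "\u00e9"

# The creator wrote UTF-8, but the reader decoded it as ISO 8859-1.
set broken [encoding convertfrom iso8859-1 \
    [encoding convertto utf-8 $intended]]

# Re-encoding the broken text as UTF-8 and reading it back changes nothing.
set reencoded [encoding convertfrom utf-8 \
    [encoding convertto utf-8 $broken]]
puts [expr {$reencoded eq $broken}]    ;# 1: still broken

# Undo the wrong decode first, then decode with the right encoding.
set repaired [encoding convertfrom utf-8 \
    [encoding convertto iso8859-1 $broken]]
puts [expr {$repaired eq $intended}]   ;# 1: recovered
```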

Consider it something like: