Exceptions with Unicode what()

Or, "how do Russians throw exceptions?"

The definition of std::exception is:

namespace std {
  class exception {
  public:
    exception() throw();
    exception(const exception&) throw();
    exception& operator=(const exception&) throw();
    virtual ~exception() throw();
    virtual const char* what() const throw();
  };
}

A popular school of thought for designing exception hierarchies is to derive from std::exception:

Generally, it's best to throw objects, not built-ins. If possible, you should throw instances of classes that derive (ultimately) from the std::exception class. By making your exception class inherit (ultimately) from the standard exception base-class, you are making life easier for your users (they have the option of catching most things via std::exception), plus you are probably providing them with more information (such as the fact that your particular exception might be a refinement of std::runtime_error or whatever).

But in the face of Unicode, it seems to be impossible to design an exception hierarchy that achieves both of the following:

Derives ultimately from std::exception for ease of use at the catch site
Provides Unicode compatibility so that diagnostics are not sliced or gibberish

Coming up with an exception class that can be constructed with Unicode strings is simple enough. But the standard dictates that what() must return a const char*, so at some point the input strings must be converted to ASCII. Whether that is done at construction time or when what() is called, it might be impossible to format the message without loss of fidelity if the source string uses characters not representable in 7-bit ASCII.

How do you design an exception hierarchy that combines the seamless integration of a std::exception-derived class with lossless Unicode diagnostics?

char* does not mean ASCII. You could use an 8-bit Unicode encoding like UTF-8. char could also be 16 bits or more, in which case you could use UTF-16.

Returning UTF-8 is an obvious choice. If the application that uses your exceptions uses a different multibyte encoding, though, it might have a hard time displaying the string. (It can't know it's UTF-8, can it?) On the other hand, for the ISO-8859-* 8-bit encodings (Western European, Cyrillic, etc.), displaying a UTF-8 string will "just" show some gibberish, and you (or your user) might be fine with that if you cannot distinguish between a char* in the locale character set and one in UTF-8.

Personally, I think only low-level error messages should go into what() strings, and these should be in English anyway. (Maybe combined with some error number or whatnot.)

The worst problem I see with what() is that it is not uncommon to include some contextual details in the message, for example a filename. Filenames are quite often non-ASCII, so you are left with no choice but to use UTF-8 as the what() encoding.

Note also that your exception class (that's derived from std::exception) can obviously provide any access methods you like and so it might make sense to add an explicit what_utf8() or what_utf16() or what_iso8859_5().
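As a rough sketch of such accessors, assume the message is stored as UTF-8 and what() returns those bytes unchanged. The class name Utf8Error is made up for illustration, and the small decoder does no validation, so it is only meaningful for well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical exception type: the message is stored as UTF-8 bytes.
class Utf8Error : public std::runtime_error {
public:
    // Assumption: utf8_msg is already valid UTF-8.
    explicit Utf8Error(const std::string& utf8_msg)
        : std::runtime_error(utf8_msg) {}

    // Same bytes as what(), but the name documents the encoding.
    std::string what_utf8() const { return what(); }

    // Decodes the UTF-8 bytes into UTF-16 code units (no validation).
    std::u16string what_utf16() const {
        const std::string s = what();
        std::u16string out;
        for (std::size_t i = 0; i < s.size();) {
            unsigned char b = static_cast<unsigned char>(s[i]);
            char32_t cp;
            int extra;  // number of continuation bytes that follow
            if (b < 0x80)      { cp = b;        extra = 0; }
            else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
            else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
            else               { cp = b & 0x07; extra = 3; }
            ++i;
            for (; extra > 0 && i < s.size(); --extra, ++i)
                cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
            if (cp >= 0x10000) {  // outside the BMP: emit a surrogate pair
                cp -= 0x10000;
                out += static_cast<char16_t>(0xD800 + (cp >> 10));
                out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
            } else {
                out += static_cast<char16_t>(cp);
            }
        }
        return out;
    }
};
```

Code that only knows std::exception still gets the UTF-8 bytes through what(), while code that catches Utf8Error can ask for the encoding it needs.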

Edit: Regarding John's comment on how to return UTF-8:

If you have a const char* what() function, it essentially returns a bunch of bytes. On a Western European Windows platform, these bytes would usually be encoded as Windows-1252, but on a Russian Windows it might as well be Windows-1251.

What the returned bytes signify depends on their encoding, and their encoding depends on where they "came from" (and who is interpreting them). A string literal's encoding is defined at compile time, but at runtime it's still up to the application how to interpret the bytes.

So, to have your exception return UTF-8 strings with what() (or what_utf8()) you have to make sure that:

The input message to your exception has a well-defined encoding.
You have a well-defined encoding for the string member you use to hold the message.
You appropriately convert the encoding when what() is called.

Example:

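#include <exception>
#include <string>

// Assumed to be defined elsewhere: converts ISO-8859-1 bytes to UTF-8.
std::string convert_iso8859_1_to_utf8(const std::string& s);

struct MyExc : virtual public std::exception {
  // std::exception has no portable constructor taking a message, so store it.
  explicit MyExc(const char* msg) : msg_(msg) { }
  const char* what() const noexcept override { return msg_.c_str(); }
  // const, so it can be called on the const reference in the catch clause.
  std::string what_utf8() const { return convert_iso8859_1_to_utf8(what()); }
private:
  std::string msg_;
};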

// In an ISO-8859-1 encoded source file
const char* my_err_msg = "ISO-8859-1 ... äöüß ...";
...
throw MyExc(my_err_msg);
...
catch(MyExc const& e) {
  std::string iso8859_1_msg = e.what();
  std::string utf_msg = e.what_utf8();
...
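The example calls convert_iso8859_1_to_utf8() without defining it. One possible implementation is sketched here; the signature is an assumption, since only the call appears above. It relies on the fact that ISO-8859-1 byte values are identical to the first 256 Unicode code points, so every byte at or above 0x80 becomes a two-byte UTF-8 sequence.

```cpp
#include <cstddef>
#include <string>

// Assumed signature; the example above only shows the call site.
inline std::string convert_iso8859_1_to_utf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size() * 2);  // worst case: every byte doubles
    for (std::size_t i = 0; i < latin1.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(latin1[i]);
        if (b < 0x80) {
            out += static_cast<char>(b);                 // ASCII stays as-is
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));   // lead byte
            out += static_cast<char>(0x80 | (b & 0x3F)); // continuation byte
        }
    }
    return out;
}
```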

The conversion could also be placed in the (overridden) what() member function of MyExc, or you could define the exception to take an already UTF-8 encoded string, or you could convert (from an expected input encoding, maybe wchar_t/UTF-16) in the constructor.
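For the last variant, converting in the constructor, a sketch might look like the following. It takes std::u16string rather than wchar_t so that the input encoding is UTF-16 on every platform; FileError and utf16_to_utf8 are hypothetical names, and the encoder assumes well-formed input (unpaired surrogates are not detected).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Encodes UTF-16 code units as UTF-8. Assumes well-formed UTF-16.
inline std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // Combine a high/low surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical exception: accepts UTF-16, stores and reports UTF-8.
class FileError : public std::runtime_error {
public:
    explicit FileError(const std::u16string& msg)
        : std::runtime_error(utf16_to_utf8(msg)) {}
};
```

Converting once at construction keeps what() trivial, at the cost of an allocation at the throw site.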

The first question is what do you intend to do with the what() string?

Do you plan to log the information somewhere?

If so, you should not be using the content of the what() string directly; you should be using that string as a key to look up the correct locale-specific logging message. So to me the content of what() is not for logging purposes (or any form of display); it is a means of looking up the actual logging string (which can be any Unicode string).
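A minimal sketch of that lookup, assuming what() carries a plain ASCII key and the real text lives in a catalog of UTF-8 strings. The catalog contents, the key "file_not_found", and the function names are invented for illustration; a real application would more likely load the catalog from resource files.

```cpp
#include <exception>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical catalog: ASCII key from what() -> UTF-8 text per language.
inline std::string localized_message(const std::string& key,
                                     const std::string& lang) {
    typedef std::map<std::string, std::string> ByLang;
    static const std::map<std::string, ByLang> catalog = {
        {"file_not_found", {
            {"en", "File not found"},
            // UTF-8 bytes of the Russian text "Файл не найден"
            {"ru", "\xD0\xA4\xD0\xB0\xD0\xB9\xD0\xBB \xD0\xBD\xD0\xB5 "
                   "\xD0\xBD\xD0\xB0\xD0\xB9\xD0\xB4\xD0\xB5\xD0\xBD"},
        }},
    };
    auto entry = catalog.find(key);
    if (entry == catalog.end()) return key;  // unknown key: show it as-is
    auto text = entry->second.find(lang);
    if (text == entry->second.end()) text = entry->second.find("en");
    return text != entry->second.end() ? text->second : key;
}

// Turns any std::exception into display text for the given language.
inline std::string describe(const std::exception& e, const std::string& lang) {
    return localized_message(e.what(), lang);
}
```

At the catch site this would be used as describe(e, "ru"), so the exception itself never needs to hold anything beyond ASCII.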

Now, it can be useful for the what() string to contain a human-readable message for developers to help with quick debugging (but highly readable, polished text is not required for this). As a result, there is no reason to support anything more than ASCII. Obey the KISS principle.

A const char* doesn't have to point to an ASCII string; it can be in a multi-byte encoding such as UTF-8. One option is to use wcstombs() and friends to convert wstrings to strings, but you may have to convert the result of what() back to wstring before printing. It also involves more copying and memory allocation than you may be comfortable with in an exception handler.
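A sketch of the wcstombs() route, under the assumption that the conversion happens once in the constructor. WideError and narrow are hypothetical names. The target encoding is whatever the current C locale uses, so the result depends on an earlier std::setlocale() call; in the default "C" locale, non-ASCII characters typically cannot be converted.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Converts a wide string to the multibyte encoding of the current C locale.
// Returns false if some character has no representation in that encoding.
inline bool narrow(const std::wstring& ws, std::string& out) {
    // Worst case: every wide character needs MB_CUR_MAX bytes.
    std::vector<char> buf(ws.size() * MB_CUR_MAX + 1);
    std::size_t n = std::wcstombs(buf.data(), ws.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1)) return false;
    out.assign(buf.data(), n);
    return true;
}

// Hypothetical exception constructed from a wide string.
class WideError : public std::runtime_error {
public:
    explicit WideError(const std::wstring& msg)
        : std::runtime_error(narrow_or_fallback(msg)) {}

private:
    static std::string narrow_or_fallback(const std::wstring& msg) {
        std::string s;
        if (narrow(msg, s)) return s;
        return "(message not representable in the current locale)";
    }
};
```

The fallback string illustrates the loss-of-fidelity problem from the question: when the locale cannot represent the text, something has to give.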

I usually just define my own base exception class, which uses wstring instead of string in the constructor and returns a const wstring& from what(). It's not that big of a deal. The lack of a standard one is a pretty big oversight.
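Such a base class could look roughly like this; WException and ParseError are illustrative names, not taken from the answer.

```cpp
#include <string>

// Hypothetical stand-alone base class: wide strings in, wide strings out.
class WException {
public:
    explicit WException(const std::wstring& msg) : msg_(msg) {}
    virtual ~WException() {}
    virtual const std::wstring& what() const { return msg_; }

private:
    std::wstring msg_;
};

// Example of a derived exception type.
class ParseError : public WException {
public:
    explicit ParseError(const std::wstring& msg) : WException(msg) {}
};
```

Because this hierarchy does not derive from std::exception, a catch (const std::exception&) handler will not see these exceptions, which is exactly the trade-off the question wants to avoid.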

Another valid opinion is that exception strings should never be presented to the user, so localizing them isn't necessary and so you don't have to worry about any of the above.

The standard doesn't specify the encoding of the string returned by what(), nor is there any de facto standard. In my projects I just encode it as UTF-8 and return that from what(). Of course, there may be incompatibilities with other libraries.

See also: https://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful for why UTF-8 is a good choice.

A better way is to deal with Unicode at the point where the error is processed:

try
{
    // some code
}
catch (std::exception const & ex)
{
    report_problem(ex.what());
}

And:

void report_problem(char const * const)
{
   // here we can convert char to wchar_t or do some more else
   // log it, save to file or message to user
}
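One way the body of report_problem() could be filled in, as a sketch: the bytes are decoded with std::mbstowcs(), which interprets them in the current C locale's encoding, and the result is written to std::wcerr. The helper name widen is an assumption, as is the choice of std::wcerr as the destination.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

// Decodes bytes in the current C locale's multibyte encoding.
// Returns an empty string if the bytes are not valid in that encoding.
inline std::wstring widen(const char* bytes) {
    // A multibyte string never has more characters than bytes.
    std::vector<wchar_t> buf(std::strlen(bytes) + 1);
    std::size_t n = std::mbstowcs(buf.data(), bytes, buf.size());
    if (n == static_cast<std::size_t>(-1)) return std::wstring();
    return std::wstring(buf.data(), n);
}

inline void report_problem(char const* const what) {
    // Convert once, here, instead of inside every exception class.
    std::wcerr << L"error: " << widen(what) << L'\n';
}
```

This only gives correct text if the what() bytes really are in the locale's encoding, so the program would normally call std::setlocale(LC_ALL, "") at startup and agree on that encoding for all messages.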

what() is generally not meant to display a message to a user. Among other things the text it returns is not localizable (even if it was Unicode). I'd just use what() to display something of value to you as the developer (like the source file and line number of the place where the exception was raised) and for that sort of text, ASCII is usually more than enough.
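A sketch of that developer-oriented style, with the file and line prepended to an ASCII message. LocatedError and THROW_LOCATED are hypothetical names.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical exception whose what() reads "file:line: message".
class LocatedError : public std::runtime_error {
public:
    LocatedError(const char* file, int line, const std::string& msg)
        : std::runtime_error(std::string(file) + ":" +
                             std::to_string(line) + ": " + msg) {}
};

// Captures the location of the throw site automatically.
#define THROW_LOCATED(msg) throw LocatedError(__FILE__, __LINE__, (msg))
```

A call such as THROW_LOCATED("cache index out of range") then produces a what() string that points straight at the offending line.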

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Edit: Made CW, commenters may edit in why this link is relevant if they wish