开发者

How to write a std::string to a UTF-8 text file

开发者 https://www.devze.com 2023-01-02 18:28 出处:网络
I just want to write some fe开发者_如何学JAVAw simple lines to a text file in C++, but I want them to be encoded in UTF-8. What is the easiest and simple way to do so?The only way UTF-8 affects std::s

I just want to write some fe开发者_如何学JAVAw simple lines to a text file in C++, but I want them to be encoded in UTF-8. What is the easiest and simple way to do so?


The only way UTF-8 affects std::string is that size(), length(), and all the indices are measured in bytes, not characters.

And, as sbi points out, incrementing the iterator provided by std::string will step forward by byte, not by character, so it can actually point into the middle of a multibyte UTF-8 codepoint. There's no UTF-8-aware iterator provided in the standard library, but there are a few available on the 'Net.

If you remember that, you can put UTF-8 into std::string, write it to a file, etc. all in the usual way (by which I mean the way you'd use a std::string without UTF-8 inside).

You may want to start your file with a byte order mark so that other programs will know it is UTF-8.


There is nice tiny library to work with utf8 from c++: utfcpp


libiconv is a great library for all our encoding and decoding needs.

If you are using Windows you can use WideCharToMultiByte and specify that you want UTF8.


What is the easiest and simple way to do so?

The most intuitive and thus easiest handling of utf8 in C++ is for sure using a drop-in replacement for std::string. As the internet still lacks of one, I went to implement the functionality on my own:

tinyutf8 (EDIT: now Github).

This library provides a very lightweight drop-in preplacement for std::string (or std::u32string if you will, because you iterate over codepoints rather that chars). Ity is implemented succesfully in the middle between fast access and small memory consumption, while being very robust. This robustness to 'invalid' UTF8-sequences makes it (nearly completely) compatible with ANSI (0-255).

Hope this helps!


If by "simple" you mean ASCII, there is no need to do any encoding, since characters with an ASCII value of 127 or less are the same in UTF-8.


std::wstring text = L"Привет";
QString qstr = QString::fromStdWString(text);
QByteArray byteArray(qstr.toUtf8());    
std::string str_std( byteArray.constData(), byteArray.length());


Use Glib::ustring from glibmm.

It is the only widespread UTF-8 string container (AFAIK). While glyph (not byte) based, it has the same method signatures as std::string so the port should be simple search and replace (just make sure that your data is valid UTF-8 before loading it into a ustring).


My preference is to convert to and from a std::u32string and work with codepoints internally, then convert to utf8 when writing out to a file using these converting iterators I put on github.

#include <utf/utf.h>

int main()
{
    using namespace utf;

    u32string u32_text = U"ɦΈ˪˪ʘ";
    // do stuff with string
    // convert to utf8 string
    utf32_to_utf8_iterator<u32string::iterator> pos(u32_text.begin());
    utf32_to_utf8_iterator<u32string::iterator> end(u32_text.end());

    u8string u8_text(pos, end);

    // write out utf8 to file.
    // ...
}


As to UTF-8 is multibite characters string and so you get some problems to work and it's a bad idea/ Instead use normal Unicode.

So by my opinion best is use ordinary ASCII char text with some codding set. Need to use Unicode if you use more than 2 sets of different symbols (languages) in single.

It's rather rare case. In most cases enough 2 sets of symbols. For this common case use ASCII chars, not Unicode.

Effect of using multibute chars like UTF-8 you get only China traditional, arabic or some hieroglyphic text. It's very very rare case!!!

I don't think there are many peoples needs that. So never use UTF-8!!! It's avoid strong headache of manipulate such strings.

0

精彩评论

暂无评论...
验证码 换一张
取 消