STL and UTF-8 file input/output. How to do it?

Developer https://www.devze.com 2023-01-22 03:19 Source: web

I use wchar_t for internal strings and UTF-8 for storage in files. I need to use the STL to read and write text on the screen, with support for the full Lithuanian character set.

For the screen this is all fine, because I'm not forced to do the same for files; the following example does the job:

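#include <cstdio>
#include <io.h>
#include <fcntl.h>
#include <iostream>
using namespace std;
int main () {
    _setmode (_fileno (stdout), _O_U16TEXT); // Visual C++ only: put stdout into UTF-16 text mode
    wcout << L"AaĄąﬂ" << endl;
}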

But I became curious and attempted to do the same with file streams, without success. Of course I could use C-style formatted input/output, but that is... discouraged.

    FILE* fp;
    _wfopen_s (&fp, L"utf-8_out_test.txt", L"w");
    _setmode (_fileno (fp), _O_U8TEXT);   // wide output is written as UTF-8
    _fwprintf_p (fp, L"AaĄą\nﬂ");
    fclose (fp);
    _wfopen_s (&fp, L"utf-8_in_test.txt", L"r");
    _setmode (_fileno (fp), _O_U8TEXT);   // UTF-8 input is read as wide characters
    wchar_t text[256];
    fseek (fp, 0, SEEK_SET);              // the offset is a long, not a pointer
    fwscanf (fp, L"%255ls", text);        // bounded read of one wide word
    wcout << text << endl;
    fwscanf (fp, L"%255ls", text);
    wcout << text << endl;
    fclose (fp);

This snippet works perfectly (although I am not sure how it handles malformed characters). So, is there any way to:

get a FILE* or an integer file handle from a std::basic_*fstream?
simulate _setmode () on it?
extend std::basic_*fstream so it handles UTF-8 I/O?

Yes, I am studying at a university and this is somewhat related to my assignments, but I am trying to figure this out for myself. It won't influence my grade or anything like that.

Use the std::codecvt facet template to perform the conversion.

You may use the standard std::codecvt_byname, or a non-standard codecvt facet implementation.

#include <iostream>
#include <locale>
using namespace std;
typedef codecvt_byname<wchar_t, char, mbstate_t> Cvt;
int main () {
    // The locale takes ownership of the facet; the name must be a locale known to the system.
    locale utf8locale (locale (), new Cvt ("en_US.UTF-8"));
    wcout.imbue (utf8locale);
    wcout << L"Hello, wide to multibyte world!" << endl;
}

Beware that on some platforms codecvt_byname can only provide conversions for locales that are installed on the system.
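One way to avoid the dependency on installed locales, as a minimal sketch: imbue the file stream with std::codecvt_utf8, a standard facet from <codecvt> that this thread does not mention (added in C++11, deprecated since C++17 but still shipped by the major standard libraries). The helper names write_utf8 and read_utf8_line are made up for illustration.

```cpp
#include <codecvt>   // std::codecvt_utf8 (C++11; deprecated since C++17)
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes `text` to `path` as UTF-8 without a BOM.
// Returns false if the file could not be opened or the write failed.
bool write_utf8 (const char* path, const std::wstring& text) {
    std::wofstream out;
    // Imbue before opening so the file buffer uses the UTF-8 facet from the start.
    out.imbue (std::locale (std::locale::classic (), new std::codecvt_utf8<wchar_t>));
    out.open (path, std::ios::binary);
    out << text;
    out.close ();
    return !out.fail ();
}

// Hypothetical helper: reads the first line of a UTF-8 file as a wide string.
std::wstring read_utf8_line (const char* path) {
    std::wifstream in;
    in.imbue (std::locale (std::locale::classic (), new std::codecvt_utf8<wchar_t>));
    in.open (path, std::ios::binary);
    std::wstring line;
    std::getline (in, line);
    return line;
}
```

Where wchar_t is 16 bits (as on Windows), std::codecvt_utf8<wchar_t> only covers the Basic Multilingual Plane, which is enough for Lithuanian text; std::codecvt_utf8_utf16 would be the variant to look at for characters beyond it.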

Well, after some testing I figured out that a FILE* is accepted where the w*fstream constructor takes an _iobuf*. So the following code does what I need.

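#include <cstdio>
#include <iostream>
#include <fstream>
#include <io.h>
#include <fcntl.h>
using namespace std;
//For writing (the FILE* constructor is a Visual C++ extension)
    FILE* fp;
    _wfopen_s (&fp, L"utf-8_out_test.txt", L"w");
    _setmode (_fileno (fp), _O_U8TEXT);
    wofstream fs (fp);
    fs << L"ąﬂ";
    fs.flush ();            //Push buffered output to the FILE before closing it
    fclose (fp);
//And reading (a separate snippet, so fp and fs are declared again)
    FILE* fp;
    _wfopen_s (&fp, L"utf-8_in_test.txt", L"r");
    _setmode (_fileno (fp), _O_U8TEXT);
    wifstream fs (fp);
    wchar_t array[6];
    fs.getline (array, 6);  //The count includes the terminating null
    wcout << array << endl; //For debug
    fclose (fp);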

This sample reads and writes valid UTF-8 files (without a BOM) on Windows when compiled with Visual Studio 2008.

Can someone give any comments about portability? Improvements?

The easiest way would be to do the conversion to UTF-8 yourself before trying to output. You might get some inspiration from this question: UTF8 to/from wide char conversion in STL
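A rough sketch of what "doing the conversion yourself" could look like, for illustration only: the functions append_utf8 and to_utf8 are hypothetical names, not taken from the linked question. The code assumes the wide string holds UTF-32 code points (or UTF-16 where wchar_t is 16 bits) and does no validation of malformed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one code point to `out`.
void append_utf8 (std::string& out, unsigned long cp) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char> (cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char> (0xC0 | (cp >> 6));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char> (0xE0 | (cp >> 12));
        out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char> (0xF0 | (cp >> 18));
        out += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts a wide string to a UTF-8 byte string.
std::string to_utf8 (const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size (); ++i) {
        unsigned long cp = static_cast<unsigned long> (in[i]);
        // Where wchar_t is 16 bits, combine a surrogate pair into one code point.
        if (sizeof (wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size ()) {
            unsigned long low = static_cast<unsigned long> (in[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        append_utf8 (out, cp);
    }
    return out;
}
```

The resulting std::string can then be written through an ordinary narrow std::ofstream opened in binary mode, so no locale or facet is involved.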

get a FILE* or an integer file handle from a std::basic_*fstream?

Answered elsewhere.

You can't make the STL work directly with UTF-8. The basic reason is that the STL indirectly forbids multi-char characters: each character has to be exactly one char or wchar_t.
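To illustrate the point about one element per character, here is a small sketch (the function name utf8_length is made up, and the input is assumed to be well-formed UTF-8). std::string::size() counts char elements, so a string holding UTF-8 reports bytes rather than characters, and counting characters takes extra work:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes (those of the form 10xxxxxx). No validation is done.
std::size_t utf8_length (const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size (); ++i) {
        if ((static_cast<unsigned char> (s[i]) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the two-character text "Aą" stored as the bytes 0x41 0xC4 0x85, size() is 3 while utf8_length gives 2.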

Microsoft actually breaks the standard with their UTF-16 encoding, so maybe you can get some inspiration there.