Parsing in any encoding

Developer https://www.devze.com 2023-03-27 18:27 Source: Internet
I use the file_get_contents function to fetch and parse remote pages. The problem is encoding. When I parse a site in UTF-8 everything works, but when the encoding is cp1251 I get the following result:

�����.UA / ������� ������: ������, ����, ���������, ������, ������, �������, �������� � ��., ������, ������, �����, ����, ����� � ������ ������

This function works like Facebook's link publishing: the user enters a link and gets a result. I need some function or method to parse sites in any encoding. The script's encoding is UTF-8.


You can try mb_check_encoding() against a few candidate encodings until one fits.
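A minimal sketch of that trial-and-error idea, assuming the mbstring extension is loaded. The helper name to_utf8_by_guessing and the default candidate list are made up for illustration and are not part of any library. UTF-8 is tried first on purpose: single-byte encodings such as cp1251 accept almost any byte sequence, so they only make sense as a fallback, and the result is a guess rather than a real detection.

```php
<?php
// Hypothetical helper (not a built-in): convert $raw to UTF-8 by trying
// candidate encodings in order and using the first one that validates.
function to_utf8_by_guessing(string $raw, array $candidates = ['UTF-8', 'Windows-1251']): ?string
{
    foreach ($candidates as $encoding) {
        // mb_check_encoding() returns true when the bytes are valid in $encoding.
        if (mb_check_encoding($raw, $encoding)) {
            return mb_convert_encoding($raw, 'UTF-8', $encoding);
        }
    }
    return null; // none of the candidates matched
}
```

For example, to_utf8_by_guessing(file_get_contents($url)) would return the page as UTF-8 if it was served as either UTF-8 or cp1251.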

However, a more reliable approach is to work with the stream context of file_get_contents(), or to use cURL, when fetching the site. This way you also get the response headers, and among them the Content-Type header, which usually declares the encoding used for the document. Once you know the encoding, it should be easy to convert it to UTF-8.
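Here is a rough sketch of the header-based approach using file_get_contents() with a stream context. The function names charset_from_headers and fetch_as_utf8, the 10-second timeout, and the fallback to UTF-8 when no charset is declared are all assumptions made for this example. It relies on $http_response_header, which the HTTP wrapper fills in the local scope after the request. Pages that declare their charset only in a meta tag are not handled here.

```php
<?php
// Hypothetical helper: extract the charset from raw response header lines.
// Returns null when no Content-Type header declares one.
function charset_from_headers(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([\w.:-]+)/i', $line, $m)) {
            $charset = $m[1]; // after redirects, the last response wins
        }
    }
    return $charset;
}

// Hypothetical helper: fetch $url and return its body converted to UTF-8,
// or null if the request fails or the declared charset is unknown to mbstring.
function fetch_as_utf8(string $url): ?string
{
    $context = stream_context_create(['http' => ['timeout' => 10, 'follow_location' => 1]]);
    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        return null;
    }
    // Assumption: treat the page as UTF-8 when the server declares nothing.
    $charset = charset_from_headers($http_response_header ?? []) ?? 'UTF-8';
    try {
        $converted = mb_convert_encoding($body, 'UTF-8', $charset);
    } catch (\ValueError $e) {
        return null; // PHP 8 throws on an unknown encoding name
    }
    return $converted === false ? null : $converted;
}
```

If the header carries no charset, the guessing approach with mb_check_encoding() can serve as a fallback.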
