How can I convert German characters during XML read and PHP write into mysql?

Developer https://www.devze.com 2023-01-03 11:07 Source: Internet
Morning,

I am importing data from an XML file into my database, but I have an issue with German words (which are in the XML by mistake).

For example, the word für appears in my XML as fÃ¼r and thus appears the same in my database.

I know I could do a simple search/replace for that exact string, but I was wondering if there is a smarter way to do it, as I can't predict whether other German words may one day appear in the XML.

ADDING SOME MORE DETAIL

The XML source says:

<?xml version="1.0" encoding="UTF-8" ?> 

and in my PHP I have

$domString = utf8_encode($dom->saveXML($element));

If I look into the XML file before I start reading it, it has -

 <title> - <![CDATA[ CoPilot Live v8 Europa für Android 8.0.0.644 ]]> </title> 

Thanks.

Greg


This normally happens when UTF-8 data is decoded as ISO-8859-1, for example. In UTF-8 the German umlaut ü is represented by two bytes (0xC3 0xBC); in ISO-8859-1 it is one byte. The two bytes get decoded one by one, resulting in an Ã and a ¼. Your task would be this:

  • read the XML's bytes
  • decode them using UTF-8

Check http://www.utf8-zeichentabelle.de/ for byte values.
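To make the byte-level mechanism concrete, here is a minimal sketch in PHP. It assumes the mbstring extension is available, and the variable names are only illustrative. It reproduces the garbling by treating the UTF-8 bytes of "für" as ISO-8859-1, then reverses that one step.

```php
<?php
// "ü" (U+00FC) in UTF-8 is the two bytes 0xC3 0xBC.
$utf8 = "f\xC3\xBCr"; // "für"

// Misreading those bytes as ISO-8859-1 and re-encoding them as UTF-8
// turns each byte into its own character: 0xC3 -> "Ã", 0xBC -> "¼".
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Converting back from UTF-8 to ISO-8859-1 undoes exactly that one step.
$repaired = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";     // original bytes
echo $garbled, "\n";           // the mojibake form
echo bin2hex($repaired), "\n"; // bytes after the repair
```

The repair only works when you know that exactly one wrong conversion happened, which is the guessing problem described next.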

However, all in all, trying to fix this after the fact is a pretty bad idea. You end up guessing the encoding, not to mention that wrongly encoded or decoded characters may get encoded or decoded yet again. Good luck!

EDIT:

In the meantime I have used juniversalchardet, a Java library for guessing character encodings, and it seems to work fine. Maybe give it a try.


Use the same encoding everywhere and there will be no such problems. And if you have to choose an encoding: use UTF-8!

If you can't change it (for whatever reason), you have to use utf8_decode to get the correct values.
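If you do have to repair strings after the fact, it is safer to guard the conversion than to decode blindly. The following is only a sketch: repairDoubleEncodedUtf8 is a hypothetical helper, not a PHP built-in, and it assumes the mbstring extension. It undoes a single ISO-8859-1 round of double encoding, and only when the round trip is lossless and the result is valid UTF-8.

```php
<?php
/**
 * Hypothetical helper: undo ONE round of ISO-8859-1 -> UTF-8 double
 * encoding, leaving the input alone whenever the repair looks unsafe.
 */
function repairDoubleEncodedUtf8(string $text): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        return $text; // not valid UTF-8 at all, so nothing to undo
    }

    // Candidate repair: map every character back to a single byte.
    $candidate = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

    // Lossless only if converting forward again gives the input back
    // (this rejects text with characters outside ISO-8859-1, e.g. "€").
    $lossless = mb_convert_encoding($candidate, 'UTF-8', 'ISO-8859-1') === $text;

    // Accept the candidate only if it changed something and is valid UTF-8.
    $looksFixed = $candidate !== $text && mb_check_encoding($candidate, 'UTF-8');

    return ($lossless && $looksFixed) ? $candidate : $text;
}

echo repairDoubleEncodedUtf8("f\xC3\x83\xC2\xBCr"), "\n"; // garbled input
echo repairDoubleEncodedUtf8("f\xC3\xBCr"), "\n";         // already correct input
```

This is still a heuristic. A string that legitimately contains "Ã¼" would be "repaired" as well, so fixing the encoding at the source remains the better option.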


Don't forget that if you are using DOMDocument, then no matter what encoding your script is in, it converts everything internally to UTF-8.
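A small sketch of that behaviour, assuming PHP's dom extension is loaded. The sample XML is made up for illustration and is deliberately declared and encoded as ISO-8859-1.

```php
<?php
// Sample document in ISO-8859-1: here "ü" is the single byte 0xFC.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><title>f\xFCr</title>";

$dom = new DOMDocument();
$dom->loadXML($xml);

// DOM hands text back as UTF-8, so "ü" is now the two bytes 0xC3 0xBC.
$title = $dom->documentElement->textContent;

echo bin2hex($title), "\n";
```

Since text read from the DOM is already UTF-8, running it through an ISO-8859-1 to UTF-8 converter such as utf8_encode encodes it a second time. That is likely what the utf8_encode call in the question is doing.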

Also, if you are using htmlentities, unless you specifically tell it otherwise, it will use ISO-8859-1 encoding by default (this applies to PHP versions before 5.4; later versions default to UTF-8). Took me a while to figure this out!
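As a quick illustration, here is the same UTF-8 word passed to htmlentities with the encoding stated explicitly, once correctly and once wrongly. The variable names are just for this sketch.

```php
<?php
$word = "f\xC3\xBCr"; // "für" as UTF-8 bytes

// Correct encoding: the two bytes are treated as one character.
$ok = htmlentities($word, ENT_QUOTES, 'UTF-8');

// Wrong encoding: each byte is treated as its own ISO-8859-1 character.
$wrong = htmlentities($word, ENT_QUOTES, 'ISO-8859-1');

echo $ok, "\n";
echo $wrong, "\n";
```

Passing the encoding every time avoids depending on whichever default the installed PHP version happens to use.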

Useful comment here, also from a German-language perspective.


For some things utf8_decode would work. You might want to have a look at this function as well: http://www.php.net/manual/en/normalizer.normalize.php#92592
