mbstring comes to the rescue
I've been working with SimpleXML a fair amount lately, and have run into an issue a number of times with character encodings. Basically, if a string has a mixture of UTF-8 and non-UTF-8 characters, SimpleXML barfs, claiming the "String could not be parsed as XML."
I tried a number of solutions, hoping actually to automate it via mbstring INI settings; these schemes all failed. iconv didn't work properly. The only thing that did work was to convert the encoding to latin1 — but this wreaked havoc with actual UTF-8 characters.
Then, through a series of trial-and-error, all-or-nothing shots, I stumbled on a simple solution. Basically, I needed to take two steps:
- Detect the current encoding of the string
- Convert that encoding to UTF-8
which is accomplished with:
$enc = mb_detect_encoding($xml);
$xml = mb_convert_encoding($xml, 'UTF-8', $enc);
The conversion is performed even if the detected encoding is UTF-8; the conversion ensures that all characters in the string are properly encoded when done.
It's a non-intuitive solution, but it works! QED.