A common problem faced is that the user wants to copy & paste their article or part of it from Microsoft Word, directly into a textarea on a page. The problem is word uses non UTF-8 characters. Once the page is submitted, PHP gets it and the characters are encoded differently and they display weirdly when displayed back via an echo.
A Commonly available solution for this is to create a function which parses the input and cleans it up either by removing the non standard characters or converting them into standard UTF-8 characters
Here is an example of one such
function fWordCharacterConverter($str)
{
$invalid = array('Š'=>'S', 'š'=>'s', 'Ð'=>'Dj', 'd'=>'dj', 'Ž'=>'Z', 'ž'=>'z',
'C'=>'C', 'c'=>'c', 'C'=>'C', 'c'=>'c', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A',
'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E', 'Ê'=>'E', 'Ë'=>'E',
'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y',
'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a',
'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i',
'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b',
'ÿ'=>'y', 'R'=>'R', 'r'=>'r', "`" => "'", "´" => "'", "„" => ",", "`" => "'",
"´" => "'", "“" => "\"", "”" => "\"", "´" => "'", "’" => "'", "{" => "",
"~" => "", "–" => "-", "’" => "'");
$str = str_replace(array_keys($invalid), array_values($invalid), $str);
return $str;
}
{
$invalid = array('Š'=>'S', 'š'=>'s', 'Ð'=>'Dj', 'd'=>'dj', 'Ž'=>'Z', 'ž'=>'z',
'C'=>'C', 'c'=>'c', 'C'=>'C', 'c'=>'c', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A',
'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E', 'Ê'=>'E', 'Ë'=>'E',
'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y',
'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a',
'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i',
'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b',
'ÿ'=>'y', 'R'=>'R', 'r'=>'r', "`" => "'", "´" => "'", "„" => ",", "`" => "'",
"´" => "'", "“" => "\"", "”" => "\"", "´" => "'", "’" => "'", "{" => "",
"~" => "", "–" => "-", "’" => "'");
$str = str_replace(array_keys($invalid), array_values($invalid), $str);
return $str;
}
Another solution which is much less known but works well is to use the php function iconv
$str= iconv('UTF-8', 'ASCII//TRANSLIT', $str);
Details from the PHP Manual:
string iconv
( string
$in_charset
, string $out_charset
, string $str
)Parameters
-
in_charset
- The input charset.
-
out_charset
- The output charset.If you append the string //TRANSLIT to
out_charset
transliteration is activated. This means that when a character can't be represented in the target charset, it can be approximated through one or several similarly looking characters. If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded. Otherwise,str
is cut from the first illegal character and anE_NOTICE
is generated. -
str
- The string to be converted.
Return Values
Returns the converted string or
FALSE
on failure.
str
fromin_charset
toout_charset
.