Regarding the previous post about converting GB2312 text to Unicode, which showed the following function:
<?
// Program by sadly (www.phpx.com)
function gb2unicode($gb)
{
if(!trim($gb))
return $gb;
$filename="gb2312.txt";
$tmp=file($filename);
$codetable=array();
while(list($key,$value)=each($tmp))
$codetable[hexdec(substr($value,0,6))]=substr($value,9,4);
$utf="";
while($gb)
{
if (ord(substr($gb,0,1))>127)
{
$this=substr($gb,0,2);
$gb=substr($gb,2,strlen($gb));
$utf.="&#x".$codetable[hexdec(bin2hex($this))-0x8080].";";
}
else
{
$gb=substr($gb,1,strlen($gb));
$utf.=substr($gb,0,1);
}
}
return $utf;
}
?>
I found that a small change was needed in the code to properly handle Latin characters embedded in the middle of GB2312 text, as when the text includes a URL or email address. Just reverse the two lines in the else branch of the function above, the branch that handles bytes whose ord() value is not greater than 127.
Change:
$gb=substr($gb,1,strlen($gb));
$utf.=substr($gb,0,1);
to:
$utf.=substr($gb,0,1);
$gb=substr($gb,1,strlen($gb));
In the original function, the first Latin character was dropped and the first non-Latin character after the Latin text was not converted (everything was shifted one character too far to the right). Reversing those two lines makes it work correctly in every example I have tried.
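For anyone on a newer PHP version, here is a rough sketch of the same idea with the fix applied. It is not the original author's code: the names load_gb2312_table() and gb2unicode_fixed() are made up for this example, the table is passed in as an array instead of being read inside the function, each() is replaced by foreach, and the variable $this is avoided because newer PHP versions do not allow assigning to it. Emitting "?" for a code that is missing from the table is also my own assumption.

```php
<?php
// Hypothetical helper: turns mapping lines such as "0x3021<TAB>0x554A<TAB># ..."
// into an array like [0x3021 => "554A"]. Comment and blank lines are skipped.
function load_gb2312_table(array $lines): array
{
    $table = [];
    foreach ($lines as $line) {
        if (preg_match('/^0x([0-9A-Fa-f]{4})\s+0x([0-9A-Fa-f]{4})/', $line, $m)) {
            $table[hexdec($m[1])] = strtoupper($m[2]);
        }
    }
    return $table;
}

// Hypothetical rewrite of gb2unicode() that walks the string by index.
function gb2unicode_fixed(string $gb, array $table): string
{
    $out = '';
    $len = strlen($gb);
    for ($i = 0; $i < $len; ) {
        $byte = ord($gb[$i]);
        if ($byte > 127 && $i + 1 < $len) {
            // Two-byte GB2312 character: same 0x8080 offset as the original.
            $code = (($byte << 8) | ord($gb[$i + 1])) - 0x8080;
            // Assumption: unmapped codes become "?" instead of an empty entity.
            $out .= isset($table[$code]) ? '&#x' . $table[$code] . ';' : '?';
            $i += 2;
        } else {
            // Single byte: copy it first, then advance (the corrected order).
            // A lone high byte at the very end is also copied unchanged.
            $out .= $gb[$i];
            $i += 1;
        }
    }
    return $out;
}

// Usage with a one-line sample table; normally you would pass file('gb2312.txt').
$table = load_gb2312_table(["0x3021\t0x554A\t# sample line"]);
echo gb2unicode_fixed("a\xB0\xA1b", $table); // prints a&#x554A;b
```

Loading the table once and passing it in also avoids re-reading gb2312.txt on every call, which the original function does.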
Also, the location of the gb2312.txt file needed for this to work has changed. You can find it in a couple of places:
http://tcl.apache.org/sources/tcl/tools/encoding/gb2312.txt
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT