PHP - Manual: utf8_encode

2025-12-06


utf8_encode

(PHP 4, PHP 5, PHP 7, PHP 8)

utf8_encode — Converts a string from ISO-8859-1 to UTF-8

Warning

This function has been DEPRECATED as of PHP 8.2.0. Relying on this function is highly discouraged.

Description

#[\Deprecated]
utf8_encode(string $string): string

This function converts the string string from the ISO-8859-1 encoding to UTF-8.

Note:
This function does not attempt to guess the current encoding of the provided string, it assumes it is encoded as ISO-8859-1 (also known as "Latin 1") and converts to UTF-8. Since every sequence of bytes is a valid ISO-8859-1 string, this never results in an error, but will not result in a useful string if a different encoding was intended.

Many web pages marked as using the ISO-8859-1 character encoding actually use the similar Windows-1252 encoding, and web browsers will interpret ISO-8859-1 web pages as Windows-1252. Windows-1252 features additional printable characters, such as the Euro sign (€) and curly quotes (“ ”), instead of certain ISO-8859-1 control characters. This function will not convert such Windows-1252 characters correctly. Use a different function if Windows-1252 conversion is required.

Parameters

string: An ISO-8859-1 string.

Return Values

Returns the UTF-8 translation of string.

Changelog

Version	Description
8.2.0	This function has been deprecated.
7.2.0	This function has been moved from the XML extension to the core of PHP. In previous versions, it was only available if the XML extension was installed.

Examples

Example #1 Basic example

<?php
// Convert the string 'Zoë' from ISO 8859-1 to UTF-8
$iso8859_1_string = "\x5A\x6F\xEB";
$utf8_string = utf8_encode($iso8859_1_string);
echo bin2hex($utf8_string), "\n";
?>

The above example will output:

5a6fc3ab

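Notes

Note: Deprecation and alternatives
As of PHP 8.2.0, this function is deprecated and will be removed in a future version. Existing uses should be checked and replaced with appropriate alternatives.

Similar functionality can be achieved with mb_convert_encoding(), which supports ISO-8859-1 and many other character encodings.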
<?php
$iso8859_1_string = "\xEB"; // 'ë' (e with diaeresis) in ISO-8859-1
$utf8_string = mb_convert_encoding($iso8859_1_string, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";

$iso8859_7_string = "\xEB"; // the same string in ISO-8859-7 represents 'λ' (Greek lower-case lambda)
$utf8_string = mb_convert_encoding($iso8859_7_string, 'UTF-8', 'ISO-8859-7');
echo bin2hex($utf8_string), "\n";

$windows_1252_string = "\x80"; // '€' (Euro sign) in Windows-1252, but not in ISO-8859-1
$utf8_string = mb_convert_encoding($windows_1252_string, 'UTF-8', 'Windows-1252');
echo bin2hex($utf8_string), "\n";
?>

The above example will output:
c3ab
cebb
e282ac
Depending on the extensions installed, other valid options are UConverter::transcode() and iconv().

The following all give the same result:
<?php
$iso8859_1_string = "\x5A\x6F\xEB"; // 'Zoë' in ISO-8859-1

$utf8_string = utf8_encode($iso8859_1_string);
echo bin2hex($utf8_string), "\n";

$utf8_string = mb_convert_encoding($iso8859_1_string, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";

$utf8_string = UConverter::transcode($iso8859_1_string, 'UTF8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";

$utf8_string = iconv('ISO-8859-1', 'UTF-8', $iso8859_1_string);
echo bin2hex($utf8_string), "\n";
?>

The above example will output:
5a6fc3ab
5a6fc3ab
5a6fc3ab
5a6fc3ab

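See Also

utf8_decode() - Converts a string from UTF-8 to ISO-8859-1, replacing invalid or unrepresentable characters
mb_convert_encoding() - Convert a string from one character encoding to another
UConverter::transcode() - Convert a string from one character encoding to another
iconv() - Convert a string from one character encoding to another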


User Contributed Notes 24 notes

deceze at gmail dot com ¶

13 years ago

Please note that utf8_encode only converts a string encoded in ISO-8859-1 to UTF-8. A more appropriate name for it would be "iso88591_to_utf8". If your text is not encoded in  ISO-8859-1, you do not need this function. If your text is already in UTF-8, you do not need this function. In fact, applying this function to text that is not encoded in ISO-8859-1 will most likely simply garble that text.

If you need to convert text from any encoding to any other encoding, look at iconv() instead.
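
The garbling described above is easy to reproduce. A minimal sketch, assuming the mbstring extension is available and using mb_convert_encoding() in place of the deprecated utf8_encode() (the variable names are illustrative only):

```php
<?php
// 'ë' already encoded as UTF-8 (two bytes: C3 AB)
$already_utf8 = "\xC3\xAB";

// Treating those bytes as ISO-8859-1 converts each byte separately,
// which produces the familiar mojibake 'Ã«'
$double_encoded = mb_convert_encoding($already_utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($already_utf8), "\n";   // c3ab
echo bin2hex($double_encoded), "\n"; // c383c2ab
```

The conversion never fails, which is why double encoding tends to go unnoticed until the text is displayed.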

Aidan Kehoe <php-manual at parhasard dot net> ¶

20 years ago

Here's some code that addresses the issue that Steven describes in the previous comment; 

<?php

/* This structure encodes the difference between ISO-8859-1 and Windows-1252,
   as a map from the UTF-8 encoding of some ISO-8859-1 control characters to
   the UTF-8 encoding of the non-control characters that Windows-1252 places
   at the equivalent code points. */

$cp1252_map = array(
"\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
"\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
"\xc2\x83" => "\xc6\x92",     /* LATIN SMALL LETTER F WITH HOOK */
"\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
"\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
"\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
"\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
"\xc2\x88" => "\xcb\x86",     /* MODIFIER LETTER CIRCUMFLEX ACCENT */
"\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
"\xc2\x8a" => "\xc5\xa0",     /* LATIN CAPITAL LETTER S WITH CARON */
"\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
"\xc2\x8c" => "\xc5\x92",     /* LATIN CAPITAL LIGATURE OE */
"\xc2\x8e" => "\xc5\xbd",     /* LATIN CAPITAL LETTER Z WITH CARON */
"\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
"\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
"\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
"\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
"\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
"\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
"\xc2\x97" => "\xe2\x80\x94", /* EM DASH */

"\xc2\x98" => "\xcb\x9c",     /* SMALL TILDE */
"\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
"\xc2\x9a" => "\xc5\xa1",     /* LATIN SMALL LETTER S WITH CARON */
"\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION*/
"\xc2\x9c" => "\xc5\x93",     /* LATIN SMALL LIGATURE OE */
"\xc2\x9e" => "\xc5\xbe",     /* LATIN SMALL LETTER Z WITH CARON */
"\xc2\x9f" => "\xc5\xb8"      /* LATIN CAPITAL LETTER Y WITH DIAERESIS*/
);

function cp1252_to_utf8($str) {
        global $cp1252_map; 
        return  strtr(utf8_encode($str), $cp1252_map);
}

?>
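
On current PHP versions the same result can probably be had without a hand-written map. A minimal sketch, assuming the mbstring extension is installed; it contrasts reading the same bytes as ISO-8859-1 and as Windows-1252:

```php
<?php
// "\x93Hi\x94" is “Hi” (curly double quotes) in Windows-1252
$cp1252 = "\x93Hi\x94";

// Read as ISO-8859-1, bytes 0x93 and 0x94 become C1 control characters
$as_latin1 = mb_convert_encoding($cp1252, 'UTF-8', 'ISO-8859-1');

// Read as Windows-1252, they become U+201C and U+201D
$as_cp1252 = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');

echo bin2hex($as_latin1), "\n"; // c2934869c294
echo bin2hex($as_cp1252), "\n"; // e2809c4869e2809d
```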

Pini ¶

9 years ago

My version of utf8_encode_deep, 
In case you need one that returns a value without changing the original.

        /**
        * Convert Anything To UTF-8
        * @param mixed $var The variable you want to convert.
        * @param boolean $deep Deep conversion? (*Default: TRUE).
        * @return mixed
        */
        function anything_to_utf8($var,$deep=TRUE){
            if(is_array($var)){
                foreach($var as $key => $value){
                    if($deep){
                        $var[$key] = anything_to_utf8($value,$deep);
                    }elseif(!is_array($value) && !is_object($value) && !mb_detect_encoding($value,'utf-8',true)){
                         $var[$key] = utf8_encode($value); // encode the element, not the whole array
                    }
                }
                return $var;
            }elseif(is_object($var)){
                foreach($var as $key => $value){
                    if($deep){
                        $var->$key = anything_to_utf8($value,$deep);
                    }elseif(!is_array($value) && !is_object($value) && !mb_detect_encoding($value,'utf-8',true)){
                         $var->$key = utf8_encode($value); // encode the property, not the whole object
                    }
                }
                return $var;
            }else{
                return (!mb_detect_encoding($var,'utf-8',true))?utf8_encode($var):$var;
            }
        }

a dot rueedlinger at gmail dot com ¶

11 years ago

If you need a function which converts a string array into a utf8 encoded string array then this function might be useful for you:

<?php
function utf8_string_array_encode(&$array){
    $result = array();
    foreach ($array as $key => $value) {
        // array_walk() cannot change keys, so the array is rebuilt instead
        if (is_string($key)) {
            $key = utf8_encode($key);
        }
        if (is_string($value)) {
            $value = utf8_encode($value);
        } elseif (is_array($value)) {
            utf8_string_array_encode($value);
        }
        $result[$key] = $value;
    }
    $array = $result;
    return $array;
}
?>

bisqwit at iki dot fi ¶

19 years ago

For reference, it may be insightful to point out that:
  utf8_encode($s)
is actually identical to:
  recode_string('latin1..utf8', $s)
and:
  iconv('iso-8859-1', 'utf-8', $s)
That is, utf8_encode is a specialized case of character set conversions.

If your string to be converted to utf-8 is something other than iso-8859-1 (such as iso-8859-2 (Polish/Croatian)), you should use recode_string() or iconv() instead rather than trying to devise complex str_replace statements.
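
As an illustration of why the source encoding matters, here is a minimal sketch using iconv() (it assumes the iconv extension is available and that the underlying library knows ISO-8859-2). The same byte yields different characters depending on the encoding it is read as:

```php
<?php
// "\xB3" is 'ł' (l with stroke) in ISO-8859-2, but '³' in ISO-8859-1
$latin2 = "\xB3";

$right = iconv('ISO-8859-2', 'UTF-8', $latin2); // U+0142
$wrong = iconv('ISO-8859-1', 'UTF-8', $latin2); // U+00B3, the result utf8_encode() would give

echo bin2hex($right), "\n"; // c582
echo bin2hex($wrong), "\n"; // c2b3
```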

Oscar Broman ¶

12 years ago

Walk through nested arrays/objects and utf8 encode all strings.

<?php
// Usage
class Foo {
    public $somevar = 'whoop whoop';
}

$structure = array(
    'object' => (object) array(
        'entry' => 'hello wörld',
        'another_array' => array(
            'string',
            1234,
            'another string'
        )
    ),
    'string' => 'foo',
    'foo_object' => new Foo
);

utf8_encode_deep($structure);

// $structure is now utf8 encoded
print_r($structure);

// The function
function utf8_encode_deep(&$input) {
    if (is_string($input)) {
        $input = utf8_encode($input);
    } else if (is_array($input)) {
        foreach ($input as &$value) {
            utf8_encode_deep($value);
        }

        unset($value);
    } else if (is_object($input)) {
        $vars = array_keys(get_object_vars($input));

        foreach ($vars as $var) {
            utf8_encode_deep($input->$var);
        }
    }
}
?>
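
For plain nested arrays, a similar deep conversion can be sketched without the deprecated function. This is only an illustration, assuming the mbstring extension and ISO-8859-1 input; unlike the function above it does not descend into objects, and array keys are left unchanged:

```php
<?php
$data = [
    'name' => "Zo\xEB",                // 'Zoë' in ISO-8859-1
    'tags' => ["caf\xE9", 1234, null], // non-string leaves are left untouched
];

// array_walk_recursive() visits every non-array leaf by reference
array_walk_recursive($data, function (&$value) {
    if (is_string($value)) {
        $value = mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
    }
});

echo bin2hex($data['name']), "\n";    // 5a6fc3ab
echo bin2hex($data['tags'][0]), "\n"; // 636166c3a9
```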

rocketman ¶

18 years ago

If you are looking for a function to replace special characters with the hex-utf-8 value (e.g. for Webservice-Security/WSS4J compliance) you might use this:

$txt = "Größe"; // assumed to be stored as ISO-8859-1 bytes
$utf8 ='';
$max = strlen($txt);

for ($i = 0; $i < $max; $i++) {

if ($txt[$i] == "&"){
$neu = "&#x26;";
}
elseif ((ord($txt[$i]) < 32) or (ord($txt[$i]) > 127)){
$neu = urlencode(utf8_encode($txt[$i]));
$neu = preg_replace('#\%(..)\%(..)\%(..)#','&#x\1;&#x\2;&#x\3;',$neu);
$neu = preg_replace('#\%(..)\%(..)#','&#x\1;&#x\2;',$neu);
$neu = preg_replace('#\%(..)#','&#x\1;',$neu);
}
else {
$neu = $txt[$i];
}

$utf8 .= $neu;
} // for $i

$textnew = $utf8;

In this example $textnew will be "Gr&#xC3;&#xB6;&#xC3;&#x9F;e"

Janci ¶

19 years ago

I was searching for a function similar to JavaScript's unescape(). In most cases it is OK to use the urldecode() function, but not if you have Unicode characters in the strings: they are converted into %uXXXX sequences that urldecode() cannot handle.
I searched the net and found a function which actually converts these sequences into HTML entities (&#xxx;) that your browser can show correctly. If you're OK with that, the function can be found here: http://pure-essence.net/stuff/code/utf8RawUrlDecode.phps

But that was not OK for me because I needed a string in my charset to make some comparisons and other things. So I modified the above function and, in conjunction with the code2utf() function mentioned in another note here, managed to achieve my goal:

<?php
/**
 * Converts a JavaScript-escaped string back into a string in the specified charset (default is UTF-8).
 * Modified function from http://pure-essence.net/stuff/code/utf8RawUrlDecode.phps
 *
 * @param string $source string escaped with JavaScript's escape() function
 * @param string $iconv_to destination character set, used as the second parameter of iconv(). Default is UTF-8.
 * @return string
 */
function unescape($source, $iconv_to = 'UTF-8') {
    $decodedStr = '';
    $pos = 0;
    $len = strlen ($source);
    while ($pos < $len) {
        $charAt = substr ($source, $pos, 1);
        if ($charAt == '%') {
            $pos++;
            $charAt = substr ($source, $pos, 1);
            if ($charAt == 'u') {
                // we got a unicode character
                $pos++;
                $unicodeHexVal = substr ($source, $pos, 4);
                $unicode = hexdec ($unicodeHexVal);
                $decodedStr .= code2utf($unicode);
                $pos += 4;
            }
            else {
                // %XX holds a code point below 256; encode it as UTF-8 too,
                // otherwise values above 127 would leave invalid bytes in the result
                $hexVal = substr ($source, $pos, 2);
                $decodedStr .= code2utf(hexdec ($hexVal));
                $pos += 2;
            }
        }
        else {
            $decodedStr .= $charAt;
            $pos++;
        }
    }

    if ($iconv_to != "UTF-8") {
        $decodedStr = iconv("UTF-8", $iconv_to, $decodedStr);
    }

    return $decodedStr;
}

/**
 * Function converts a Unicode code point number into the corresponding UTF-8 character.
 * Function taken from: http://sk2.php.net/manual/en/function.utf8-encode.php#49336
 *
 * @param int $num
 * @return utf8char
 */
function code2utf($num){
    if($num<128)return chr($num);
    if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
    if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
    if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
    return '';
}
?>
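
Since PHP 7.2 the mbstring extension provides mb_chr(), which performs the same code point to UTF-8 conversion as code2utf(). A minimal sketch, assuming mbstring is installed (the sample code points are chosen only for illustration):

```php
<?php
// U+10D0 GEORGIAN LETTER AN, U+20AC EURO SIGN, U+1F600 GRINNING FACE
$encoded = [];
foreach ([0x10D0, 0x20AC, 0x1F600] as $code_point) {
    $encoded[] = bin2hex(mb_chr($code_point, 'UTF-8'));
}

echo implode("\n", $encoded), "\n";
// e18390
// e282ac
// f09f9880
```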

rogeriogirodo at gmail dot com ¶

15 years ago

This function may be useful to encode array keys and values [it checks first to see if the input is already in UTF-8]:

<?php
function to_utf8($in)
{
    if (is_array($in)) {
        $out = array();
        foreach ($in as $key => $value) {
            $out[to_utf8($key)] = to_utf8($value);
        }
        return $out;
    } elseif (is_string($in)) {
        if (mb_detect_encoding($in) != "UTF-8")
            return utf8_encode($in);
        else
            return $in;
    } else {
        return $in;
    }
}
?>



Hope this may help.


[NOTE BY danbrown AT php DOT net: Original function written by (cmyk777 AT gmail DOT com) on 28-JAN-09.]

powtac 4t gmx d0t de ¶

14 years ago

I tried a lot of things, but this seems to be the final fail-safe method to convert any string to proper UTF-8.

<?php
function _convert($content) {
    if (!mb_check_encoding($content, 'UTF-8')
        OR !($content === mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32'))) {

        $content = mb_convert_encoding($content, 'UTF-8');

        if (mb_check_encoding($content, 'UTF-8')) {
            // log('Converted to UTF-8');
        } else {
            // log('Could not be converted to UTF-8');
        }
    }

    return $content;
}
?>

Anonymous ¶

19 years ago

// Reads a file story.txt ascii (as typed on keyboard) 
// converts it to Georgian character using utf8 encoding
// if I am correct(?) just as it should be when typed on Georgian computer
// it outputs it as an html file
// 
// http://www.comweb.nl/keys_to_georgian.html
// http://www.comweb.nl/keys_to_georgian.php
// http://www.comweb.nl/story.txt

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

<HTML>
<HEAD>
<TITLE>keys to unicode code</TITLE>

// this meta tag is needed
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >

// note: the Sylfaen font seems to be installed by default on Windows XP
// It supports Georgian

<style TYPE="text/css">
<!--
body {font-family:sylfaen; }
-->
</style>
</HEAD>

<BODY>

<?php
$eng=array(97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,
112,113,114,115,116,117,118,119,120,121,122,87,82,84,83,
67,74,90);
$geo=array(4304,4305,4330,4307,4308,4324,4306,4336,4312,4335,4313,
4314,4315,4316,4317,4318,4325,4320,4321,4322,4323,4309,
4332,4334,4327,4310,4333,4326,4311,4328,4329,4319,4331,
91,93,59,39,44,46,96);

$fc=file("story.txt");
foreach($fc as $line)
{
   $spacestart=1;
   for ($i=0; $i<strlen($line); $i+=1)
   {
      $character=ord(substr($line,$i,1));
      $found=0;
      for ($k=0; $k<count($eng); $k+=1)
      {
         if ($eng[$k]==$character)
         {
             print code2utf( $geo[$k] );
             $found=1;
         }
      }
      if ($found==0) 
      {
         if ($character==126 || $character==32 || $character==10 || $character==9)
         {
            if ($character==9)  { print '&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'; }
            if ($character==10) { print "<BR>\n"; }
            if ($character==32) 
            { 
               if ($spacestart==1) {print '&nbsp;'; } else { print " "; }
            }
            if ($character==126){ print "~";      }
         } else
         { 
            print substr($line,$i,1);
         } 
      }
      if ($character!=32) { $spacestart=0; }
   }
}

/**
 * Function converts a Unicode code point number into the corresponding UTF-8 character.
 * Function taken from: http://sk2.php.net/manual/en/function.utf8-encode.php#49336
 *
 * @param int $num
 * @return utf8char
*/
function code2utf($num)
{
   if($num<128)return chr($num);
   if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
   if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
   if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
   return '';
}
?>

</BODY>
</HTML>

hrpeters (at) gmx (dot) net ¶

20 years ago

// Validate Unicode UTF-8 Version 4

// This function takes as reference the table 3.6 found at http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf

// It also flags overlong bytes as error


function is_validUTF8($str)

{

    // values of -1 represent disallowed values for the first byte in current UTF-8

    static $trailing_bytes = array (

        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

        0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

        -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,

        -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,

        -1,-1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,

        2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1

    );


    $ups = unpack('C*', $str);

    if (!($aCnt = count($ups))) return true; // Empty string *is* valid UTF-8 

    for ($i = 1; $i <= $aCnt;)

    {

        if (!($tbytes = $trailing_bytes[($b1 = $ups[$i++])])) continue;

        if ($tbytes == -1) return false;


        $first = true;

        while ($tbytes > 0 && $i <= $aCnt)

        {

            $cbyte = $ups[$i++];

            if (($cbyte & 0xC0) != 0x80) return false;


            if ($first)

            {

                switch ($b1)

                {

                    case 0xE0:

                        if ($cbyte < 0xA0) return false;

                        break;

                    case 0xED:

                        if ($cbyte > 0x9F) return false;

                        break;

                    case 0xF0:

                        if ($cbyte < 0x90) return false;

                        break;

                    case 0xF4:

                        if ($cbyte > 0x8F) return false;

                        break;

                    default:

                        break;

                }

                $first = false;

            }

            $tbytes--;

        }

        if ($tbytes) return false; // incomplete sequence at EOS

    }        

    return true;

}
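
On current PHP versions, mb_check_encoding() from the mbstring extension covers the same ground as a built-in. A minimal sketch, assuming mbstring is installed (the sample strings are illustrative):

```php
<?php
$checks = [
    'valid'    => mb_check_encoding("Zo\xC3\xAB", 'UTF-8'), // well-formed two-byte sequence
    'latin1'   => mb_check_encoding("Zo\xEB", 'UTF-8'),     // lone ISO-8859-1 byte
    'overlong' => mb_check_encoding("\xC0\x80", 'UTF-8'),   // overlong encoding of U+0000
    'empty'    => mb_check_encoding('', 'UTF-8'),           // empty string
];

var_dump($checks);
// 'valid' and 'empty' are true; 'latin1' and 'overlong' are false
```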

Mark AT modernbill DOT com ¶

20 years ago

If you haven't guessed already: when converting back with utf8_decode(), any UTF-8 character that has no representation in the ISO-8859-1 code page is returned as a ?. You might want to wrap a function around this to make sure you aren't saving a bunch of ???? into your database.
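
One way to sketch such a wrapper is a round-trip check: convert to ISO-8859-1 and back, and compare with the input. The helper name fits_in_latin1() is hypothetical, and the sketch assumes the mbstring extension rather than the deprecated utf8_decode():

```php
<?php
// Hypothetical helper: true if every character of a UTF-8 string
// also exists in ISO-8859-1, so nothing would be replaced or lost
function fits_in_latin1(string $utf8): bool
{
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') === $utf8;
}

var_dump(fits_in_latin1("Zo\xC3\xAB"));     // bool(true)
var_dump(fits_in_latin1("5 \xE2\x82\xAC")); // bool(false): the Euro sign has no ISO-8859-1 code
```

Strings that fail the check can then be rejected or stored in a UTF-8 column instead of being silently damaged.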

www.tricinty.com ¶

16 years ago

<?php
/**
 * Encodes an ISO-8859-1 mixed variable to UTF-8 (PHP 4, PHP 5 compat)
 * @param    mixed    $input An array, associative or simple
 * @param    boolean  $encode_keys optional
 * @return   mixed    ( utf-8 encoded $input)
 */
function utf8_encode_mix($input, $encode_keys=false)
{
    if(is_array($input))
    {
        $result = array();
        foreach($input as $k => $v)
        {
            $key = ($encode_keys)? utf8_encode($k) : $k;
            $result[$key] = utf8_encode_mix( $v, $encode_keys);
        }
    }
    else
    {
        $result = utf8_encode($input);
    }

    return $result;
}
?>

mailing at jcn50 dot com ¶

19 years ago

I recommend using this alternative for every language:

$new=mb_convert_encoding($s,"UTF-8","auto");

Don't forget to set all your pages to "utf-8" encoding, otherwise just use HTML entities.

jcn50.

Yumok ¶

14 years ago

Avoiding use of preg_match to detect if utf8_encode is needed:


<?php
$string = $string_input; // avoid being destructive

$string = preg_replace("#[\x09\x0A\x0D\x20-\x7E]#"          ,"",$string); // ASCII
$string = preg_replace("#[\xC2-\xDF][\x80-\xBF]#"           ,"",$string); // non-overlong 2-byte
$string = preg_replace("#\xE0[\xA0-\xBF][\x80-\xBF]#"       ,"",$string); // excluding overlongs
$string = preg_replace("#[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}#","",$string); // straight 3-byte
$string = preg_replace("#\xED[\x80-\x9F][\x80-\xBF]#"       ,"",$string); // excluding surrogates
$string = preg_replace("#\xF0[\x90-\xBF][\x80-\xBF]{2}#"    ,"",$string); // planes 1-3
$string = preg_replace("#[\xF1-\xF3][\x80-\xBF]{3}#"        ,"",$string); // planes 4-15
$string = preg_replace("#\xF4[\x80-\x8F][\x80-\xBF]{2}#"    ,"",$string); // plane 16

$rc = ($string == ""?true:false);
?>

suttichai at ceforce dot com ¶

19 years ago

I use this function to convert Thai text (ISO-8859-11) to UTF-8. In my case it works properly. Try it if you have trouble converting ISO-8859-11 to UTF-8.

function iso8859_11toUTF8($string) {

     if ( ! preg_match('/[\xA1-\xFF]/', $string) ) // ereg() was removed in PHP 7
         return $string;

     $iso8859_11 = array(
"\xa1" => "\xe0\xb8\x81",
"\xa2" => "\xe0\xb8\x82",
"\xa3" => "\xe0\xb8\x83",
"\xa4" => "\xe0\xb8\x84",
"\xa5" => "\xe0\xb8\x85",
"\xa6" => "\xe0\xb8\x86",
"\xa7" => "\xe0\xb8\x87",
"\xa8" => "\xe0\xb8\x88",
"\xa9" => "\xe0\xb8\x89",
"\xaa" => "\xe0\xb8\x8a",
"\xab" => "\xe0\xb8\x8b",
"\xac" => "\xe0\xb8\x8c",
"\xad" => "\xe0\xb8\x8d",
"\xae" => "\xe0\xb8\x8e",
"\xaf" => "\xe0\xb8\x8f",
"\xb0" => "\xe0\xb8\x90",
"\xb1" => "\xe0\xb8\x91",
"\xb2" => "\xe0\xb8\x92",
"\xb3" => "\xe0\xb8\x93",
"\xb4" => "\xe0\xb8\x94",
"\xb5" => "\xe0\xb8\x95",
"\xb6" => "\xe0\xb8\x96",
"\xb7" => "\xe0\xb8\x97",
"\xb8" => "\xe0\xb8\x98",
"\xb9" => "\xe0\xb8\x99",
"\xba" => "\xe0\xb8\x9a",
"\xbb" => "\xe0\xb8\x9b",
"\xbc" => "\xe0\xb8\x9c",
"\xbd" => "\xe0\xb8\x9d",
"\xbe" => "\xe0\xb8\x9e",
"\xbf" => "\xe0\xb8\x9f",
"\xc0" => "\xe0\xb8\xa0",
"\xc1" => "\xe0\xb8\xa1",
"\xc2" => "\xe0\xb8\xa2",
"\xc3" => "\xe0\xb8\xa3",
"\xc4" => "\xe0\xb8\xa4",
"\xc5" => "\xe0\xb8\xa5",
"\xc6" => "\xe0\xb8\xa6",
"\xc7" => "\xe0\xb8\xa7",
"\xc8" => "\xe0\xb8\xa8",
"\xc9" => "\xe0\xb8\xa9",
"\xca" => "\xe0\xb8\xaa",
"\xcb" => "\xe0\xb8\xab",
"\xcc" => "\xe0\xb8\xac",
"\xcd" => "\xe0\xb8\xad",
"\xce" => "\xe0\xb8\xae",
"\xcf" => "\xe0\xb8\xaf",
"\xd0" => "\xe0\xb8\xb0",
"\xd1" => "\xe0\xb8\xb1",
"\xd2" => "\xe0\xb8\xb2",
"\xd3" => "\xe0\xb8\xb3",
"\xd4" => "\xe0\xb8\xb4",
"\xd5" => "\xe0\xb8\xb5",
"\xd6" => "\xe0\xb8\xb6",
"\xd7" => "\xe0\xb8\xb7",
"\xd8" => "\xe0\xb8\xb8",
"\xd9" => "\xe0\xb8\xb9",
"\xda" => "\xe0\xb8\xba",
"\xdf" => "\xe0\xb8\xbf",
"\xe0" => "\xe0\xb9\x80",
"\xe1" => "\xe0\xb9\x81",
"\xe2" => "\xe0\xb9\x82",
"\xe3" => "\xe0\xb9\x83",
"\xe4" => "\xe0\xb9\x84",
"\xe5" => "\xe0\xb9\x85",
"\xe6" => "\xe0\xb9\x86",
"\xe7" => "\xe0\xb9\x87",
"\xe8" => "\xe0\xb9\x88",
"\xe9" => "\xe0\xb9\x89",
"\xea" => "\xe0\xb9\x8a",
"\xeb" => "\xe0\xb9\x8b",
"\xec" => "\xe0\xb9\x8c",
"\xed" => "\xe0\xb9\x8d",
"\xee" => "\xe0\xb9\x8e",
"\xef" => "\xe0\xb9\x8f",
"\xf0" => "\xe0\xb9\x90",
"\xf1" => "\xe0\xb9\x91",
"\xf2" => "\xe0\xb9\x92",
"\xf3" => "\xe0\xb9\x93",
"\xf4" => "\xe0\xb9\x94",
"\xf5" => "\xe0\xb9\x95",
"\xf6" => "\xe0\xb9\x96",
"\xf7" => "\xe0\xb9\x97",
"\xf8" => "\xe0\xb9\x98",
"\xf9" => "\xe0\xb9\x99",
"\xfa" => "\xe0\xb9\x9a",
"\xfb" => "\xe0\xb9\x9b"
 );

     $string=strtr($string,$iso8859_11);
     return $string;
 }

Suttichai Mesaard-www.ceforce.com

emze at donazga dot net ¶

18 years ago

/*
Every function seen so far is incomplete or resource-consuming. Here are two: integer to UTF-8 sequence (i3u) and UTF-8 sequence to integer (u3i). Below is a code snippet that checks their behavior at the range boundaries.

Someday they might be hardcoded into PHP...
*/

function i3u($i) { // returns UCS-16 or UCS-32 to UTF-8 from an integer
  $i=(int)$i; // integer?
  if ($i<0) return false; // positive?
  if ($i<=0x7f) return chr($i); // range 0
  if (($i & 0x7fffffff) <> $i) return '?'; // 31 bit?
  if ($i<=0x7ff) return chr(0xc0 | ($i >> 6)) . chr(0x80 | ($i & 0x3f));
  if ($i<=0xffff) return chr(0xe0 | ($i >> 12)) . chr(0x80 | ($i >> 6) & 0x3f)
      . chr(0x80  | $i & 0x3f);
  if ($i<=0x1fffff) return chr(0xf0 | ($i >> 18)) . chr(0x80 | ($i >> 12) & 0x3f)
      . chr(0x80 | ($i >> 6) & 0x3f) . chr(0x80  | $i & 0x3f);
  if ($i<=0x3ffffff) return chr(0xf8 | ($i >> 24)) . chr(0x80 | ($i >> 18) & 0x3f)
      . chr(0x80 | ($i >> 12) & 0x3f) . chr(0x80 | ($i >> 6) & 0x3f) . chr(0x80  | $i & 0x3f);
  return chr(0xfc | ($i >> 30)) . chr(0x80 | ($i >> 24) & 0x3f) . chr(0x80 | ($i >> 18) & 0x3f)
      . chr(0x80 | ($i >> 12) & 0x3f) . chr(0x80 | ($i >> 6) & 0x3f) . chr(0x80  | $i & 0x3f);
}

function u3i($s,$strict=1) { // returns integer on valid UTF-8 seq, NULL on empty, else FALSE
  // NOT strict: takes only DATA bits, present or not; strict: length and bits checking
  if ($s=='') return NULL;
  $l=strlen($s); $o=ord($s[0]);
  if ($o <= 0x7f && $l==1) return $o;
  if ($l>6 && $strict) return false;
  if ($strict) for ($i=1;$i<$l;$i++) if (ord($s[$i]) > 0xbf || ord($s[$i])< 0x80) return false;
  if ($o < 0xc2) return false; // no-go even if strict=0
  // in strict mode the length must match the lead byte exactly
  if ($o <= 0xdf) return ($strict && $l!=2) ? false : (($o & 0x1f) << 6 | (ord($s[1]) & 0x3f));
  if ($o <= 0xef) return ($strict && $l!=3) ? false : (($o & 0x0f) << 12 | (ord($s[1]) & 0x3f) << 6
     |  (ord($s[2]) & 0x3f));
  if ($o <= 0xf7) return ($strict && $l!=4) ? false : (($o & 0x07) << 18 | (ord($s[1]) & 0x3f) << 12
     | (ord($s[2]) & 0x3f) << 6 |  (ord($s[3]) & 0x3f));
  if ($o <= 0xfb) return ($strict && $l!=5) ? false : (($o & 0x03) << 24 | (ord($s[1]) & 0x3f) << 18
     | (ord($s[2]) & 0x3f) << 12 | (ord($s[3]) & 0x3f) << 6 |  (ord($s[4]) & 0x3f));
  if ($o <= 0xfd) return ($strict && $l!=6) ? false : (($o & 0x01) << 30 | (ord($s[1]) & 0x3f) << 24
     | (ord($s[2]) & 0x3f) << 18 | (ord($s[3]) & 0x3f) << 12
     | (ord($s[4]) & 0x3f) << 6 |  (ord($s[5]) & 0x3f));
  return false;
}

// boundary behavior checking
$do=array(0x7f,0x7ff,0xffff,0x1fffff,0x3ffffff,0x7fffffff);
foreach ($do as $ii) for ($i=$ii;$i<=$ii+1; $i++) {
  $o=i3u($i);
  for ($j=0;$j<strlen($o);$j++) print "O[$j]=" . sprintf('%08b',ord($o[$j])) . ", ";
  print "c=$i, o=[$o].\n";
  print "Back: [$o] => [" . u3i($o) . "]\n";
}

rattones at gmail dot com ¶

4 years ago

/**
 * Convert all values of an array to utf8_encode
 * @author Marcelo Ratton
 * @version 1.0
 * 
 * @param  array  $arr   array to encode values
 * @param  bool   $keys  true to convert keys to UTF8 
 * @return array  same   array but with all values encoded to UTF8
 */
function arrayEncodeToUTF8(array $arr, bool $keys= false) : array {
  $ret= [];
  foreach ($arr as $k=>$v) {
    if ($keys) {
      $k= utf8_encode((string)$k);
    }
    if (is_array($v)) {
      // pass $keys along so nested arrays are treated the same way
      $ret[$k]= arrayEncodeToUTF8($v, $keys);
    } else {
      $ret[$k]= utf8_encode((string)$v);
    }
  }

  return $ret;
}

ronen at greyzone dot com ¶

22 years ago

The following function will utf-8 encode unicode entities &#nnn(nn); with n={0..9}

/**
* takes a string of unicode entities and converts it to a utf-8 encoded string
* each unicode entity has the form &#nnn(nn); n={0..9} and can be displayed by utf-8 supporting
* browsers.  Ascii will not be modified.
* @param $source string of unicode entities [STRING]
* @return a utf-8 encoded string [STRING]
* @access public
*/
function utf8Encode ($source) {
    $utf8Str = '';
    $entityArray = explode ("&#", $source);
    $size = count ($entityArray);
    for ($i = 0; $i < $size; $i++) {
        $subStr = $entityArray[$i];
        $nonEntity = strstr ($subStr, ';');
        if ($nonEntity !== false) {
            $unicode = intval (substr ($subStr, 0, (strpos ($subStr, ';') + 1)));
            // determine how many chars are needed to represent this unicode char
            if ($unicode < 128) {
                $utf8Substring = chr ($unicode);
            }
            else if ($unicode >= 128 && $unicode < 2048) {
                $binVal = str_pad (decbin ($unicode), 11, "0", STR_PAD_LEFT);
                $binPart1 = substr ($binVal, 0, 5);
                $binPart2 = substr ($binVal, 5);

                $char1 = chr (192 + bindec ($binPart1));
                $char2 = chr (128 + bindec ($binPart2));
                $utf8Substring = $char1 . $char2;
            }
            else if ($unicode >= 2048 && $unicode < 65536) {
                $binVal = str_pad (decbin ($unicode), 16, "0", STR_PAD_LEFT);
                $binPart1 = substr ($binVal, 0, 4);
                $binPart2 = substr ($binVal, 4, 6);
                $binPart3 = substr ($binVal, 10);

                $char1 = chr (224 + bindec ($binPart1));
                $char2 = chr (128 + bindec ($binPart2));
                $char3 = chr (128 + bindec ($binPart3));
                $utf8Substring = $char1 . $char2 . $char3;
            }
            else {
                $binVal = str_pad (decbin ($unicode), 21, "0", STR_PAD_LEFT);
                $binPart1 = substr ($binVal, 0, 3);
                $binPart2 = substr ($binVal, 3, 6);
                $binPart3 = substr ($binVal, 9, 6);
                $binPart4 = substr ($binVal, 15);

                $char1 = chr (240 + bindec ($binPart1));
                $char2 = chr (128 + bindec ($binPart2));
                $char3 = chr (128 + bindec ($binPart3));
                $char4 = chr (128 + bindec ($binPart4));
                $utf8Substring = $char1 . $char2 . $char3 . $char4;
            }

            if (strlen ($nonEntity) > 1)
                $nonEntity = substr ($nonEntity, 1); // chop the first char (';')
            else 
                $nonEntity = '';

            $utf8Str .= $utf8Substring . $nonEntity;
        }
        else {
            $utf8Str .= $subStr;
        }
    }

    return $utf8Str;
}

Ronen.
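
For comparison, the built-in html_entity_decode() can do this conversion as well. A minimal sketch, not a drop-in replacement: unlike the function above it also decodes named and hexadecimal entities (the sample string is illustrative):

```php
<?php
// Decimal entities for 'ö' (246), 'ß' (223) and '€' (8364)
$source = 'Gr&#246;&#223;e costs 5 &#8364;';

$utf8 = html_entity_decode($source, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $utf8, "\n";          // Größe costs 5 €
echo bin2hex($utf8), "\n"; // 4772c3b6c39f6520636f737473203520e282ac
```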

Karen ¶

21 years ago

Re the previous post about converting GB2312 code to Unicode code which displayed the following function:

<?
// Program by sadly (www.phpx.com)

function gb2unicode($gb)
{
   if(!trim($gb))
    return $gb;
   $filename="gb2312.txt";
   $tmp=file($filename);
   $codetable=array();
   while(list($key,$value)=each($tmp))
    $codetable[hexdec(substr($value,0,6))]=substr($value,9,4);
   $utf="";
   while($gb)
    {
      if (ord(substr($gb,0,1))>127)
     {
        $this=substr($gb,0,2);
        $gb=substr($gb,2,strlen($gb));
        $utf.="&#x".$codetable[hexdec(bin2hex($this))-0x8080].";";
      }
     else
     {
      $gb=substr($gb,1,strlen($gb));
      $utf.=substr($gb,0,1);
     }
     }
  return $utf;
}
?>

I found that a small change was needed in the code to properly handle latin characters embedded in the middle of gb2312 text, as when the text includes a URL or email address. Just reverse the two lines in the part of the statement above that handles ord vals !>127. 

Change:

$gb=substr($gb,1,strlen($gb));
$utf.=substr($gb,0,1);

to:

$utf.=substr($gb,0,1);
$gb=substr($gb,1,strlen($gb));

In the original function, the first latin character was dropped and it was not converting the first non-latin character after the latin text (everything was shifted one character too far to the right). Reversing those two lines makes it work correctly in every example I have tried.

Also, the source of the gb2312.txt file needed for this to work has changed. You can find it a couple places:

http://tcl.apache.org/sources/tcl/tools/encoding/gb2312.txt
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT
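
Where the mbstring extension is available, the mapping file may not be needed at all. A minimal sketch, assuming mbstring's EUC-CN encoding (the usual byte encoding of GB2312) matches the input; note that it produces UTF-8 directly rather than HTML entities:

```php
<?php
// 'a', then U+4E2D (中) as the GB2312 byte pair D6 D0, then 'b'
$gb = "a\xD6\xD0b";

$utf8 = mb_convert_encoding($gb, 'UTF-8', 'EUC-CN');

echo bin2hex($utf8), "\n"; // 61e4b8ad62
```

Latin characters embedded in the text pass through unchanged, which is the case the fix above addresses.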

Net Raven ¶

20 years ago

I often need to convert multi language text sent to me for use in websites and other apps into UTF8 encoded so I can insert it into source code and databases.

I knocked up a small web page with its charset set to UTF8 then set it up so I can paste from the original doc (eg word or excel) and have the page return the UTF8 encoded version.

Of course the browser will convert the unicode to UTF8 for you as part of the submit (I use IE5 or better for this) then all you have to do in the PHP is encode the UTF8 so the browser will show it in its raw form.

It's a bit bulky, but I just convert ALL characters to HTML numbered entities (brute force and ignorance does it again).

I've used this to encode everything from Hebrew to Japanese without problems 

<?php
header("Content-Type: text/plain; charset=utf-8"); 
// magic quotes and register_globals no longer exist in PHP 8; read the posted field directly
$code = isset($_POST['code']) ? $_POST['code'] : '';
?>
<html>
<head>
    <title>UTF8 ENCODER PAGE</title>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<form method=post action="?seed=<?=time()?>">
    Original Unicode<br />
    <textarea name="code" cols="80" rows="10"><?=$code?></textarea><br />
    Encoded UTF8<br />
    <textarea name="encd" cols="80" rows="10"><?php
        for ($i = 0; $i < strlen($code); $i++) {
            echo '&#'.ord(substr($code,$i,1));
        }
    ?></textarea><br />
    <input type="submit" value="encode">
</form>
</body>
</html>

Allan ¶

1 year ago

I suppose that from PHP 8.2 we need to use <?php iconv('ISO-8859-1', 'UTF-8', $string) ?> instead.

JF Sebastian ¶

19 years ago

The following Perl regular expression tests if a string is well-formed Unicode UTF-8 (Broken up after each | since long lines are not permitted here. Please join as a single line, no spaces, before use.):

^([\x00-\x7f]|
[\xc2-\xdf][\x80-\xbf]|
\xe0[\xa0-\xbf][\x80-\xbf]|
[\xe1-\xec][\x80-\xbf]{2}|
\xed[\x80-\x9f][\x80-\xbf]|
[\xee-\xef][\x80-\xbf]{2}|
\xf0[\x90-\xbf][\x80-\xbf]{2}|
[\xf1-\xf3][\x80-\xbf]{3}|
\xf4[\x80-\x8f][\x80-\xbf]{2})*$

NOTE: This strictly follows the Unicode standard 4.0, as described in chapter 3.9, table 3-6, "Well-formed UTF-8 byte sequences" ( http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G31703 ).

ISO-10646, a super-set of Unicode, uses UTF-8 (there called "UCS", see http://www.unicode.org/faq/utf_bom.html#1 ) in a relaxed variant that supports a 31-bit space encoded into up to six bytes instead of Unicode's 21 bits in up to four bytes. To check for ISO-10646 UTF-8, use the following Perl regular expression (again, broken up, see above):

^([\x00-\x7f]|
[\xc0-\xdf][\x80-\xbf]|
[\xe0-\xef][\x80-\xbf]{2}|
[\xf0-\xf7][\x80-\xbf]{3}|
[\xf8-\xfb][\x80-\xbf]{4}|
[\xfc-\xfd][\x80-\xbf]{5})*$

The following function may be used with above expressions for a quick UTF-8 test, e.g. to distinguish ISO-8859-1-data from UTF-8-data if submitted from a <form accept-charset="utf-8,iso-8859-1" method=..>.

function is_utf8($string) {
   return (preg_match('/[insert regular expression here]/', $string) === 1);
}
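
Filled in with the strict Unicode expression, the function might look like the sketch below. The non-capturing group and the D modifier (so that $ matches only at the very end) are additions made here, not part of the note; on very long inputs preg_match() can hit PCRE limits and return false, which this sketch reports as "not UTF-8":

```php
<?php
function is_utf8(string $string): bool
{
    // Well-formed UTF-8 byte sequences, joined from the pieces listed above
    $pattern = '/^(?:[\x00-\x7f]'
        . '|[\xc2-\xdf][\x80-\xbf]'
        . '|\xe0[\xa0-\xbf][\x80-\xbf]'
        . '|[\xe1-\xec][\x80-\xbf]{2}'
        . '|\xed[\x80-\x9f][\x80-\xbf]'
        . '|[\xee-\xef][\x80-\xbf]{2}'
        . '|\xf0[\x90-\xbf][\x80-\xbf]{2}'
        . '|[\xf1-\xf3][\x80-\xbf]{3}'
        . '|\xf4[\x80-\x8f][\x80-\xbf]{2})*$/D';

    return preg_match($pattern, $string) === 1;
}

var_dump(is_utf8("Zo\xC3\xAB"));   // bool(true)
var_dump(is_utf8("Zo\xEB"));       // bool(false): lone ISO-8859-1 byte
var_dump(is_utf8("\xED\xA0\x80")); // bool(false): encoded surrogate
```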


Official page: https://www.php.net/manual/en/function.utf8-encode.php
