Asked by: mrdaliri. Asked: 11/8/2012. Updated: 7/13/2023. Views: 23313
convert unicode to html entities hex
Answers:
Your string looks like UCS-4 encoding; you can try:
$first = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) {
$char = current($m);
$utf = iconv('UTF-8', 'UCS-4', $char);
return sprintf("&#x%s;", ltrim(strtoupper(bin2hex($utf)), "0"));
}, $string);
Output
string 'Fran&#xE7;ais' (length=13)
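The callback above can be traced for a single character. This is only a sketch: it uses 'UCS-4BE' instead of the answer's 'UCS-4' (an assumption on my part, so that the byte order is explicit and does not depend on the iconv implementation), and the variable names are illustrative.

```php
<?php
// Step-by-step trace of the conversion for one character.
$char = "\u{E7}";                         // "ç", UTF-8 bytes C3 A7
$ucs4 = iconv('UTF-8', 'UCS-4BE', $char); // 4 bytes: 00 00 00 E7
$hex  = bin2hex($ucs4);                   // "000000e7"
$code = ltrim(strtoupper($hex), '0');     // leading zeros stripped: "E7"
echo sprintf('&#x%s;', $code), "\n";      // &#xE7;
```

Stripping the leading zeros is what turns the fixed-width 32-bit value into the short form used in the entity.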
Comments
Fran
echo bin2hex("Fran");
...you don't need to encode it
For the hex encoding that is missing in the related question:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$entity = vsprintf('&#x%X;', unpack('N', $binary));
return $entity;
}, $input);
This is similar to @Baba's answer: it uses UTF-32BE, then unpack and vsprintf for the formatting needs.
If you prefer iconv over mb_convert_encoding, it is similar:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$binary = iconv('UTF-8', 'UTF-32BE', $utf8);
$entity = vsprintf('&#x%X;', unpack('N', $binary));
return $entity;
}, $input);
I found this string manipulation a bit easier to read than the one in "Get hexcode of html entities".
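On PHP 7.2 or later, where mbstring provides mb_ord, the conversion and unpack steps can probably be collapsed into one call. The following is a minimal sketch under that assumption; the function name to_hex_entities is made up for illustration and is not part of the answer.

```php
<?php
// Sketch: hex entities for every non-ASCII character, assuming PHP >= 7.2 with mbstring.
// preg_replace_callback() returns null if the input is not valid UTF-8, hence ?string.
function to_hex_entities(string $input): ?string
{
    return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function (array $match): string {
        // mb_ord() yields the Unicode code point of the matched character.
        return sprintf('&#x%X;', mb_ord($match[0], 'UTF-8'));
    }, $input);
}

echo to_hex_entities("Fran\u{E7}ais"), "\n"; // Fran&#xE7;ais
```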
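Comments
Firstly, when I faced this problem recently, I solved it by making sure that my code files, database connection and database tables were all UTF-8; after that, simply echoing the text works. If you must escape the output from the database, use htmlspecialchars() rather than htmlentities(), so that the UTF-8 symbols are left alone instead of being escaped.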
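A small comparison of the two functions, as a sketch with a made-up sample string:

```php
<?php
// htmlspecialchars() escapes only the HTML-significant characters;
// htmlentities() also rewrites every character that has a named entity.
$text = "<b>Fran\u{E7}ais</b>";

$special  = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";  // &lt;b&gt;Français&lt;/b&gt;
echo $entities, "\n"; // &lt;b&gt;Fran&ccedil;ais&lt;/b&gt;
```

With a UTF-8 page, the first form is enough to display the text safely.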
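I would like to document an alternative solution, because it solved a similar problem for me.
I was using PHP's utf8_encode() to escape "special" characters.
I wanted to convert them into HTML entities for display, and I wrote this code because I wanted to avoid iconv and similar functions as far as possible, since not all environments necessarily have them (do correct me if that is not the case!).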
function unicode2html($string) {
// Match a literal \uXXXX escape with four hex digits (either case).
return preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $string);
}
$foo = 'This is my test string \u03b50';
echo unicode2html($foo);
Hope this helps someone in need :-)
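The regex above handles one \uXXXX escape at a time, so a character outside the Basic Multilingual Plane, written as a surrogate pair, would come out as two separate entities. A possible extension is sketched below; the function name escaped_unicode_to_html is hypothetical, and a lone surrogate is passed through unchanged as its own entity.

```php
<?php
// Sketch: convert \uXXXX escapes to hex entities, combining surrogate pairs.
function escaped_unicode_to_html(string $string): string
{
    $pattern = '/\\\\u(d[89ab][0-9a-f]{2})\\\\u(d[c-f][0-9a-f]{2})|\\\\u([0-9a-f]{4})/i';
    return preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[3])) {
            // Single escape: the four hex digits are the code point.
            $code = hexdec($m[3]);
        } else {
            // High + low surrogate: rebuild the supplementary code point.
            $code = 0x10000 + ((hexdec($m[1]) - 0xD800) << 10) + (hexdec($m[2]) - 0xDC00);
        }
        return sprintf('&#x%X;', $code);
    }, $string);
}

echo escaped_unicode_to_html('This is my test string \u03b50'), "\n"; // This is my test string &#x3B5;0
echo escaped_unicode_to_html('\uD83D\uDE00'), "\n";                   // &#x1F600;
```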
See "How to get the character from unicode code point in PHP?" for some code that allows you to do the following:
Example usage:
echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));
echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));
echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));
echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));
echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tch&#252;&#223;'));
echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));
echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));
Output:
Get string from numeric DEC value
string(4) "ď"
string(2) "ď"
Get string from numeric HEX value
string(4) "ď"
string(2) "ď"
Get numeric value of character as DEC int
int(50319)
int(271)
Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"
Encode / decode to DEC based HTML entities
string(15) "tch&#252;&#223;"
string(7) "tchüß"
Encode / decode to HEX based HTML entities
string(15) "tch&#xFC;&#xDF;"
string(7) "tchüß"
Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"
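As far as I can tell, mb_htmlentities, mb_html_entity_decode, codepoint_encode and codepoint_decode in the example are helpers defined in the linked answer, not PHP built-ins. For the code point lookups and for decoding, built-in functions may already be enough; the sketch below assumes PHP 7.2 or later with mbstring enabled.

```php
<?php
// Built-in counterparts for the code point lookups (PHP >= 7.2, mbstring).
$char = "\u{10F}";                     // "ď"
$code = mb_ord($char, 'UTF-8');        // code point as an integer
var_dump($code);                       // int(271)
var_dump(dechex($code));               // string(3) "10f"
var_dump(mb_chr(0x010F, 'UTF-8') === $char); // bool(true)

// html_entity_decode() understands decimal and hex numeric entities alike.
$fromDec = html_entity_decode('tch&#252;&#223;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
$fromHex = html_entity_decode('tch&#xFC;&#xDF;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump($fromDec === $fromHex);       // bool(true)
```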
You can also use mb_encode_numericentity, which is supported by PHP 4.0.6+ (see the PHP documentation).
function unicode2html($value) {
return mb_encode_numericentity($value, [
// start codepoint
// | end codepoint
// | | offset
// | | | mask
0x0000, 0x001F, 0x0000, 0xFFFF,
0x0021, 0x002C, 0x0000, 0xFFFF,
0x002E, 0x002F, 0x0000, 0xFFFF,
0x003C, 0x003C, 0x0000, 0xFFFF,
0x003E, 0x003E, 0x0000, 0xFFFF,
0x0060, 0x0060, 0x0000, 0xFFFF,
0x0080, 0xFFFF, 0x0000, 0xFFFF
], 'UTF-8', true);
}
This way you can also specify which ranges of characters are converted to hexadecimal entities and which are left as they are.
Usage example:
$input = array(
'"Meno più, PIÙ o meno"',
'\'ÀÌÙÒLÈ PERCHÉ perché è sempre così non si sà\'',
'<script>alert("XSS");</script>',
'"`'
);
$output = array();
foreach ($input as $str)
$output[] = unicode2html($str);
Result:
$output = array(
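'&#x22;Meno pi&#xF9;&#x2C; PI&#xD9; o meno&#x22;',
'&#x27;&#xC0;&#xCC;&#xD9;&#xD2;L&#xC8; PERCH&#xC9; perch&#xE9; &#xE8; sempre cos&#xEC; non si s&#xE0;&#x27;',
'&#x3C;script&#x3E;alert&#x28;&#x22;XSS&#x22;&#x29;;&#x3C;&#x2F;script&#x3E;',
'&#x22;&#x60;'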
);
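The conversion map in this answer stops at 0xFFFF, so characters above the Basic Multilingual Plane are presumably left untouched. If the goal is only to turn every non-ASCII character into a hex entity, a single range may do. The sketch below is an assumption-laden variant, not the answer's code: the function name is invented, and the mask 0x1FFFFF is chosen to cover 21-bit code points.

```php
<?php
// Sketch: one convmap entry covering U+0080..U+10FFFF.
function non_ascii_to_hex_entities(string $value): string
{
    // start code point, end code point, offset, mask
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    // The fourth argument (true) requests hexadecimal entities.
    return mb_encode_numericentity($value, $convmap, 'UTF-8', true);
}

// "ç" is U+00E7 and the emoji is U+1F600; ASCII characters are left as they are.
echo non_ascii_to_hex_entities("Fran\u{E7}ais \u{1F600}"), "\n";
```

Unlike the map in the answer, this one leaves quotes and angle brackets alone, so it is not an escaping function and would need htmlspecialchars() in front of it for untrusted input.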
This is a solution like @hakre's (Nov 8, 2012 at 0:35), but it produces named HTML entities where they exist:
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$char = htmlentities($utf8, ENT_HTML5 | ENT_IGNORE);
if ($char[0]!=='&' || (strlen($char)<2)) {
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$char = vsprintf('&#x%X;', unpack('N', $binary));
} // (else $char is "&entity;", which is better)
return $char;
}, $input);
$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a";
// => $output: "Ob&oacute;z w&eogon;drowny Ko&lstrok;a"
// while both @hakre's and @Baba's code give:
// => $output: "Ob&#xF3;z w&#x119;drowny Ko&#x142;a"
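But there is always a problem with malformed UTF-8, for example:
$input = "Ob\xC3\xB3z w\xC4\x99drowny Ko\xC5\x82a - ok\xB3adka";
// means "Obóz wędrowny Koła - okładka" in html ("\xB3" is ISO-8859-2/windows-1250 "ł")
but here
// => $output: (empty)
and the same happens with @hakre's code... :(
It is hard to track down the cause, and this is the only solution I know of (maybe someone knows a simpler one?):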
function utf_entities($input) {
$output = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) {
list($utf8) = $match;
$char = htmlentities($utf8, ENT_HTML5 | ENT_IGNORE);
if ($char[0]!=='&' || (strlen($char)<2)) {
$binary = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
$char = vsprintf('&#x%X;', unpack('N', $binary));
} // (else $char is "&entity;", which is better)
return $char;
}, $input);
if (empty($output) && (!empty($input))) { // Trouble... Maybe not UTF-8 code inside UTF-8 string...
/* Processing string against not UTF-8 chars... */
$output = ''; // New - repaired
for ($i=0; $i<strlen($input); $i++) {
if (($char = $input[$i])<"\x80") {
$output .= $char;
} else { // maybe UTF-8 (0b ..110xx..) or not UTF-8 (i.e. 0b11111111 etc.)
$j = 0; // how many chars more in UTF-8
$char = ord($char);
do { // checking first UTF-8 code char bits
$char = ($char << 1) % 0x100;
$j++;
} while (($j<4 /* 6 before RFC 3629 */)&& (($char & 0b11000000) === 0b11000000));
$k = $i+1;
if ($j<4 /* 6 before RFC 3629 */ && (($char & 0b11000000) === 0b10000000)) { // maybe UTF-8...
for ($k=$i+$j; $k>$i && ((ord($input[$k]) & 0b11000000) === 0b10000000); $k--) ; // ...checking next bytes for valid UTF-8 codes
}
if ($k>$i || ($j>=4 /* 6 before RFC 3629 */) || (($char & 0b11000000) !== 0b10000000)) { // Not UTF-8
$output .= '&#x'.dechex(ord($input[$i])).';'; // "&#xXX;"
} else { // UTF=8 !
$output .= substr($input, $i, 1+$j);
$i += $j;
}
}
}
return utf_entities($output); // recursively after repairing
}
return $output;
}
For example:
echo utf_entities("o\xC5\x82a - k\xB3a"); // oła - k³a - UTF-8 + fixed
echo utf_entities("o".chr(0b11111101).chr(0b10111000).chr(0b10111000).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a");
// oñ¸¸¸¸¸a - invalid UTF-8 (6-bytes UTF-8 valid before RFC 3629), fixed
echo utf_entities("o".chr(0b11110001).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a - k\xB3a");
// o񸸸a - k³a - UTF-8 + fixed ("\xB3")
echo utf_entities("o".chr(0b11110001).chr(0b10111000).chr(0b10111000).chr(0b10111000)."a");
// o񸸸a - valid UTF-8!
echo utf_entities("o".chr(0b11110001).'a'.chr(0b10111000).chr(0b10111000)."a");
// oña¸¸a - invalid UTF-8, fixed
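A likely explanation for the empty result: with the /u modifier, PCRE refuses malformed UTF-8 and preg_replace_callback() returns null. A possibly simpler repair is to drop /u and match bytes, accepting well-formed UTF-8 sequences first and falling back to a single stray byte. This is a different technique from the function above, sketched under the assumption that mbstring (mb_ord, PHP 7.2 or later) is available; the name bytes_to_hex_entities is hypothetical, and it emits hex entities only, not named ones.

```php
<?php
// Diagnosis: malformed UTF-8 plus /u makes the call fail with null.
$bad    = "ok\xB3adka"; // \xB3 is not valid UTF-8 on its own
$result = preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($m) { return $m[0]; }, $bad);
var_dump($result);                                   // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)

// Byte-level alternative: well-formed multi-byte sequences, else one stray byte.
function bytes_to_hex_entities(string $input): string
{
    $pattern = '/
          [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
        | [\x80-\xFF]
    /x';
    return preg_replace_callback($pattern, function (array $m): string {
        // One byte means it did not fit any valid sequence: emit the raw byte value.
        $code = strlen($m[0]) === 1 ? ord($m[0]) : mb_ord($m[0], 'UTF-8');
        return sprintf('&#x%X;', $code);
    }, $input);
}

echo bytes_to_hex_entities("o\xC5\x82a - k\xB3a"), "\n"; // o&#x142;a - k&#xB3;a
```

Because the alternatives are tried in order, a stray byte is only reached when no valid sequence starts at that position, so no recursion or second pass is needed.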
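Another option, which builds on ideas from some of the other answers but does not depend on mbstring or iconv. (The entities are decimal, but this could easily be changed to hexadecimal by adding a call to dechex before the return, and of course adding an "x" to the string, if that is your requirement; it wasn't mine when I found this question.)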
/**
* Convert all non-ascii unicode (utf-8) characters in a string to their HTML entity equivalent.
*
* Only UTF-8 is supported, as we don't have access to mbstring.
*/
function unicode2html($string) {
return preg_replace_callback('/[^\x00-\x7F]/u', function($matches){
// Adapted from https://www.php.net/manual/en/function.ord.php#109812
$offset = 0;
$code = ord(substr($matches[0], $offset,1));
if ($code >= 128) { //otherwise 0xxxxxxx
if ($code < 224) $bytesnumber = 2; //110xxxxx
else if ($code < 240) $bytesnumber = 3; //1110xxxx
else if ($code < 248) $bytesnumber = 4; //11110xxx
$codetemp = $code - 192 - ($bytesnumber > 2 ? 32 : 0) - ($bytesnumber > 3 ? 16 : 0);
for ($i = 2; $i <= $bytesnumber; $i++) {
$offset ++;
$code2 = ord(substr($matches[0], $offset, 1)) - 128; //10xxxxxx
$codetemp = $codetemp*64 + $code2;
}
$code = $codetemp;
}
return "&#$code;";
}, $string);
}
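The arithmetic in the callback can be checked by hand for a two-byte character. The following worked example is only an illustration for "ç" (bytes C3 A7); the variable names are mine.

```php
<?php
// Manual UTF-8 decoding of a two-byte sequence, mirroring the callback above.
$bytes = "\xC3\xA7";           // UTF-8 encoding of "ç" (U+00E7)
$lead  = ord($bytes[0]) - 192; // strip the 110xxxxx marker: 195 - 192 = 3
$cont  = ord($bytes[1]) - 128; // strip the 10xxxxxx marker: 167 - 128 = 39
$code  = $lead * 64 + $cont;   // 3 * 64 + 39 = 231

echo "&#$code;", "\n";                              // &#231;
echo '&#x' . strtoupper(dechex($code)) . ';', "\n"; // &#xE7;
```

The last line shows the hexadecimal form: dechex() converts the integer code point, and the "x" goes between "&#" and the digits.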
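Comments
mb_convert_encoding($utf_8, 'HTML-ENTITIES', 'UTF-8'); works for UTF-8 Unicode strings in PHP, for example. If you need hex encoding, the linked answer shows you how to capture all of these characters (from a UTF-8 string); you then only need to run the hex encoding.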