I'm trying to parse a Hebrew word in PHP. It looks ok as a string, but when I try to split it out into characters it won't display correctly

Asked by: ken Asked: 10/28/2023 Last edited by: Barmarken Updated: 10/31/2023 Views: 68

Q:

Here is my simplified test code:

<!DOCTYPE html>
    <?php
        //uncommenting the next line results in the whole page displaying in "chinese -simplified"
        //header("content-type: text/html; charset=UTF-16");
        header('Content-language: he');
    ?>
<html>
<head>
    <meta http-equiv=Content-Type content="text/html; charset=UTF-16">
    <meta http-equiv="content-language" content="he-il">
</head>
<body>
<?php
        // in Production, we are grabbing the hebrew word from the database
        //$sql = "SELECT masoretic FROM codex WHERE id = 20"; // just grabs a word from the database
                                                            // it is stored using UTF16_general_ci on mySQL
        // in this test we can mock the exact same data that was copy and pasted in
        // the results were the same with the data from the db
            $masoretic = "בָּרָ֣א";

            echo $masoretic . '<br>'; // displays correctly in HEBREW = בָּרָ֣א
            // now loop through the word and process each letter
            $length = strlen($masoretic);
            // even though there are only 3 real letters, the diacritic marks count as characters, so we should get at least 7 loops
            for ($x = 0; $x <= $length; $x++) {
                $letter = substr($masoretic,0,1); // process this letter
                $masoretic = substr($masoretic, 1); // the rest of the word
                $name = '';
                $recognized = false;
                switch($letter){
                    case 'ר':
                        $recognized = true;
                        $name = 'Raysh';
                        break;
                    case 'א':
                        $recognized = true;
                        $name = 'Aleph';
                        break;
                    default:
                        $recognized = false;
                        break;
                }
                if($recognized){
                    echo ('found a ' . $name);
                    echo $letter; // for now just display it
                }else{
                        echo 'unrecognized letter:';
                        print_r($letter);
                        echo '<br>';
                }                       
            }           
    ?>
</body>

The page displays as follows:

בָּרָ֣א
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:�
unrecognized letter:

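I find it strange that the complete Hebrew word displays fine, but none of the individual letters do. I figured something funky was going on with UTF-16, so I added the headers, but in some cases that actually made things worse. (See the inline comments.)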

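Tags: php string parsing utf-16 hebrew

Comments

1 upvote Barmar 10/28/2023
You need to use the mb_XXX string functions, because Hebrew letters are multibyte.
0 upvotes mickmackusa 10/28/2023
Is there any action you'd like to take besides commenting, @Barmar? ... stackoverflow.com/a/68386500/2943403
0 upvotes mickmackusa 10/28/2023
Related: "PHP: How to split a UTF-8 string?" and "Convert a string to an array of characters - multibyte" and "Split, count, and format multibyte characters in PHP" and "PHP can't find a way to split a utf-8 string" and https://stackoverflow.com/questions/2590980/parsing-multibyte-string-in-php
0 upvotes mickmackusa 10/29/2023
You don't need default: $recognized = false; break; because $recognized is already set to false before the switch block.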
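
To illustrate the comment about multibyte letters, here is a minimal sketch comparing the byte-oriented functions used in the question with their mb_ counterparts. It assumes the word is held as UTF-8 (the `\u{...}` escapes, available since PHP 7, always produce UTF-8) and that the mbstring extension is loaded. The code points are the ones listed in the answer's output, and the variable names are illustrative only.

```php
<?php
// The word from the question, written as code point escapes so the example
// does not depend on how this file is saved. The escapes yield UTF-8 bytes.
$masoretic = "\u{05D1}\u{05B8}\u{05BC}\u{05E8}\u{05B8}\u{05A3}\u{05D0}";

// strlen() counts bytes; mb_strlen() counts code points in the given encoding.
$byteCount = strlen($masoretic);
$charCount = mb_strlen($masoretic, 'UTF-8');

// substr() cuts after a single byte, which is only half of a two-byte
// UTF-8 sequence, so a browser can only show a replacement character for it.
$firstByte = substr($masoretic, 0, 1);

// mb_substr() cuts after one whole code point.
$firstChar = mb_substr($masoretic, 0, 1, 'UTF-8');

printf("bytes: %d, code points: %d\n", $byteCount, $charCount);
printf("first byte: %s, first code point: %s\n", bin2hex($firstByte), bin2hex($firstChar));
```

Under these assumptions the byte count is 14 and the code point count is 7, which would account for the 14 replacement characters in the page output above. The extra empty line at the end comes from the loop running one time too many (`$x <= $length`).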

A:

-1 votes Sammitch 10/28/2023 #1

In UTF-16, each code point is represented by 2 or 4 bytes, so you need to use multibyte-aware string functions such as mb_str_split().

// input is in UTF-8 and gets converted to UTF-16, since everything on SO is UTF-8
$in_8  = 'בָּרָ֣א';
$in_16 = mb_convert_encoding($in_8, 'UTF-16', 'UTF-8');

foreach(mb_str_split($in_16, 1, 'UTF-16') as $glyph_16) {
    // convert back to UTF-8 for display in this example
    $glyph_8 = mb_convert_encoding($glyph_16, 'UTF-8', 'UTF-16');
    printf("%s %s\n",bin2hex($glyph_16), $glyph_8);
}

You should be able to omit the conversions in your own code; they are only there for the benefit of people like me who don't use UTF-16.

Output:

05d1 ב
05b8 ָ
05bc ּ
05e8 ר
05b8 ָ
05a3 ֣
05d0 א
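
Applied to the loop from the question, the same idea might look like the sketch below. It is an illustration, not the asker's final code. It assumes the word is held as UTF-8 rather than UTF-16, PHP 7.4+ for mb_str_split(), and the mbstring extension. `letterName()` is a hypothetical helper introduced here that covers only the two letters handled in the question's switch.

```php
<?php
// Hypothetical helper (not from the question): returns the name of a
// recognized letter, or null for anything else (including diacritics).
function letterName(string $char): ?string
{
    switch ($char) {
        case "\u{05E8}": // ר
            return 'Raysh';
        case "\u{05D0}": // א
            return 'Aleph';
        default:
            return null;
    }
}

// The word from the question as UTF-8, written with code point escapes.
$masoretic = "\u{05D1}\u{05B8}\u{05BC}\u{05E8}\u{05B8}\u{05A3}\u{05D0}";

$lines = [];
// mb_str_split() yields one whole code point per element, so the loop
// needs neither strlen() nor a manual substr() walk over the string.
foreach (mb_str_split($masoretic, 1, 'UTF-8') as $char) {
    $name = letterName($char);
    if ($name !== null) {
        $lines[] = "found a $name: $char";
    } else {
        // Print the code point number instead of a bare combining mark.
        $lines[] = sprintf('unrecognized: U+%04X', mb_ord($char, 'UTF-8'));
    }
}

echo implode("\n", $lines), "\n";
```

This produces seven entries, one per code point, with Raysh and Aleph recognized and the remaining code points reported by number. If the data really is stored as UTF-16, converting it once to UTF-8 with mb_convert_encoding(), as in the answer, would let the same loop be reused.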

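Comments

1 upvote mickmackusa 10/28/2023
The fundamental goal of closing duplicate questions is to help people find the right answer by keeping all of those answers in one place. I'll leave this question open so that you can demonstrate the good stewardship expected of a Recognized Member of the PHP Collective. This question does not need a new answer on Stack Overflow.
1 upvote mickmackusa 10/28/2023
See the private discussion in the Collective, "Recognized Members who don't put Stack Overflow first".
1 upvote Sammitch 10/31/2023
@ken You have probably solved this already, but I've updated the answer to make it relevant to UTF-16. I think I glossed over the specific encoding on my first read.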