Extract tags from TMX (XML) content using PHP

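Asked by: cheeseus Asked: 9/25/2023 Updated: 9/25/2023 Views: 27

Q:

I'm building a browser-based TMX (translation memory) editor. My content extraction script breaks when the source and/or target segments contain tags. Source/target strings that contain tags look like this: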

<tuv xml:lang="BG">
  <seg><bpt x="1" i="1" type="italic"/>Формантът -лар/-лар- (от тюрк. -lar) е слабо представен в българската словообразувателна система.<ept i="1"/></seg>
</tuv>
<tuv xml:lang="EN-GB">
  <seg><bpt x="1" i="1" type="italic"/>The -lar/-lar- formant (from the Turkic -lar) is sparsely represented in the Bulgarian word-building system.<ept i="1"/></seg>
</tuv>

If there are no tags in the source/target segments, extraction is easy:

$sourceText = $TU['tuv'][0]['seg'];
$targetText = $TU['tuv'][1]['seg'];

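But when tags are present (in initial position), I'm stuck for a solution, particularly because the tags are treated as arrays.

I don't know how to proceed: I can check whether the source/target string is an array, but I'm not sure what to do next. Ultimately, I need to print the text together with the tags so that the user can edit the text and move or delete the tags where necessary.

This is my test code: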

$uploadedFile = "user_files/sample_tmx.tmx";

$xmlStr = file_get_contents($uploadedFile);
$xmlObj = simplexml_load_string($xmlStr);
$arrXml = $util->objectsIntoArray($xmlObj);
$TUs = $arrXml['body']['tu'];

$pattern = '~<.*?>~';

foreach($TUs as $TU) {

    if(is_array($TU['tuv'][0]['seg'])) {

        var_dump($TU['tuv'][0]['seg']);

        $pattern = '~<.*?>~';
        $uncleanSource = $TU['tuv'][0]['seg'];
        $uncleanTarget = $TU['tuv'][1]['seg'];

        foreach($uncleanSource as $unclean) {
            //var_dump($unclean);
        }

        //$sourceText = preg_replace($pattern, "&lt;TAG&gt;", $uncleanSource);
        //$targetText = preg_replace($pattern, "&lt;TAG&gt;", $uncleanTarget);
    }
    else {
        $sourceText = $TU['tuv'][0]['seg'];
        $targetText = $TU['tuv'][1]['seg'];
    }

    echo "<p>".$sourceText." = ".$targetText."</p>";
}

These are the contents of the sample_tmx.tmx file:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd" >
<tmx version="1.4">
<header creationtool="TDC Analysis Package" creationtoolversion="org.gs4tr.tm3.tmx.Version" segtype="sentence" o-tmf="unknown" adminlang="EN-US" srclang="BG" datatype="unknown" creationdate="20221006T184234Z" >
</header>
<body>
<tu creationdate="20201101T133734Z" creationid="Cheeseus" changedate="20201103T151745Z" changeid="Cheeseus" usagecount="0">
    <prop type="nextMd5Checksum">e82d0ed6d711aa59310d1e8f4478537e</prop>
    <prop type="previousMd5Checksum">39b279e324e9f6cd27351287502eefcb</prop>
    <tuv xml:lang="BG">
      <seg><bpt x="1" i="1" type="italic"/>Формантът -лар/-лар- (от тюрк. -lar) е слабо представен в българската словообразувателна система.<ept i="1"/></seg>
    </tuv>
    <tuv xml:lang="EN-GB">
      <seg><bpt x="1" i="1" type="italic"/>The -lar/-lar- formant (from the Turkic -lar) is sparsely represented in the Bulgarian word-building system.<ept i="1"/></seg>
    </tuv>
  </tu>
  <tu creationdate="20080812T111221Z" creationid="Cheeseus" changedate="20190825T065920Z" changeid="Cheeseus" usagecount="0">
    <tuv xml:lang="BG">
      <seg>ПАРТНЬОРИ</seg>
    </tuv>
    <tuv xml:lang="EN-GB">
      <seg>PARTNERS</seg>
    </tuv>
  </tu>
</body>
</tmx>
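php xml parsing

Comments

0 votes Nigel Ren 9/25/2023
Does changing it to $TU['tuv'][0]['seg']->asXML() help?
0 votes cheeseus 9/25/2023
@NigelRen, no, it doesn't. It throws an error: Fatal error: Uncaught Error: Call to a member function asXML() on array
0 votes Nigel Ren 9/25/2023
So have you tried $TU['tuv'][0]['seg'][0]->asXML()?
0 votes cheeseus 9/25/2023
@NigelRen, I just did: I get the same error.
0 votes Nigel Ren 9/25/2023
It looks as though whatever objectsIntoArray does to the content, it makes the structure hard to work out. If you leave that step out, you can get at the results as shown above. (Although you would have to rewrite the code as well.)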

A:

0 votes Nigel Ren 9/25/2023 #1

The problem seems to come from objectsIntoArray(); not having its code makes it difficult to fix.
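To illustrate why an array conversion is the likely culprit, here is a minimal sketch. The code of objectsIntoArray() is not shown in the question, so a JSON round-trip stands in for it purely as an assumption, and the inline XML is a cut-down, hypothetical sample. It suggests that once a <seg> with mixed content (text plus inline tags) is flattened into an array, only the child elements survive, which would explain why is_array() is true and why asXML() is no longer available.

```php
<?php
// Hypothetical, cut-down sample of a <tuv> whose <seg> mixes text and inline tags.
$xml = simplexml_load_string(
    '<tuv><seg><bpt i="1"/>Hello<ept i="1"/></seg></tuv>'
);

// Assumption: a JSON round-trip stands in for objectsIntoArray(),
// whose real code is not shown in the question.
$asArray = json_decode(json_encode($xml), true);

// <seg> is now a plain array holding only the child elements;
// the text between <bpt/> and <ept/> is no longer present.
var_dump(is_array($asArray['seg']));
var_dump($asArray['seg']);

// The SimpleXML element itself still carries the complete content.
echo $xml->seg->asXML(), PHP_EOL;
```

If the real helper behaves differently, the details will differ, but the general point stands: working on the SimpleXML elements directly avoids the lossy conversion.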

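If you remove that call and instead use the SimpleXML elements as they are intended to be used, you can access the segment with the following (other code removed to concentrate on this problem)...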

$xmlObj = simplexml_load_file($uploadedFile);

foreach($xmlObj->body->tu as $TU) {
    if(isset($TU->tuv[0]->seg)) {
        echo $TU->tuv[0]->seg->asXML();
    }
}

Using the asXML method recreates the original content, expanding any child elements back into their markup. This will give

<seg>
                    <bpt x="1" i="1" type="italic"/>
Формантът -лар/-лар- (от тюрк. -lar) е слабо представен в българската словообразувателна система.<ept i="1"/>
                </seg>

<seg>ПАРТНЬОРИ</seg>

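You may need to do some more work to produce the result you want, but hopefully this shows how to get there.

Comments

0 votes cheeseus 9/25/2023
Thanks! This does help. One problem is that it actually prints the <seg> ... </seg> tags (visible when you inspect the HTML source). How do I skip them? The second problem is that the <bpt...> tags are hidden in the HTML in this case, and I need to print them somehow.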
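
One possible way to handle both follow-up points is sketched below; it is not from the answer above. The helper segInnerXml() is a hypothetical name introduced for illustration: it converts the SimpleXML element to a DOM node with dom_import_simplexml() and serializes only the child nodes, so the outer <seg> wrapper is left out. Passing the result through htmlspecialchars() should then make the inline tags visible in the browser instead of being swallowed as unknown HTML elements. The sample XML is a shortened, assumed stand-in for one segment of the TMX file.

```php
<?php
// Hypothetical helper: return the content of <seg> without the <seg> wrapper.
function segInnerXml(SimpleXMLElement $seg): string
{
    $node = dom_import_simplexml($seg);
    $inner = '';
    // Serialize each child (text nodes and inline tags) in document order.
    foreach ($node->childNodes as $child) {
        $inner .= $node->ownerDocument->saveXML($child);
    }
    return $inner;
}

// Shortened, assumed sample; the real file declares UTF-8 in the same way.
$xml = simplexml_load_string(
    '<?xml version="1.0" encoding="UTF-8"?>'
    . '<tuv><seg><bpt x="1" i="1" type="italic"/>Hello<ept i="1"/></seg></tuv>'
);

$inner = segInnerXml($xml->seg);

// Escape the markup so the browser shows the tags as text.
$escaped = htmlspecialchars($inner, ENT_QUOTES, 'UTF-8');

echo '<p>', $escaped, '</p>', PHP_EOL;
```

In the loop from the answer, the same call would be applied to $TU->tuv[0]->seg and $TU->tuv[1]->seg. Segments without tags go through the same path, so no is_array() check is needed.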