Asked by: cheeseus  Asked: 9/25/2023  Updated: 9/25/2023  Views: 27
Extract tags from TMX (XML) content using PHP
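Q:
I am building a browser-based TMX (translation memory) editor. My content-extraction script breaks when the source and/or target segment contains tags. Source/target strings that contain tags look like this: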
<tuv xml:lang="BG">
<seg><bpt x="1" i="1" type="italic"/>Формантът -лар/-лар- (от тюрк. -lar) е слабо представен в българската словообразувателна система.<ept i="1"/></seg>
</tuv>
<tuv xml:lang="EN-GB">
<seg><bpt x="1" i="1" type="italic"/>The -lar/-lar- formant (from the Turkic -lar) is sparsely represented in the Bulgarian word-building system.<ept i="1"/></seg>
</tuv>
If there are no tags in the source/target segments, extraction is easy:
$sourceText = $TU['tuv'][0]['seg'];
$targetText = $TU['tuv'][1]['seg'];
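But when tags are present (in initial position), I am stuck for a solution, especially because the tags are treated as arrays.
I don't know how to proceed: I can check whether the source/target string contains or is an array, but I'm not sure what to do next. Ultimately, I need to print the text together with the tags so that the user can edit the text and move or delete tags where necessary.
Here is my test code: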
$uploadedFile = "user_files/sample_tmx.tmx";
$xmlStr = file_get_contents($uploadedFile);
$xmlObj = simplexml_load_string($xmlStr);
$arrXml = $util->objectsIntoArray($xmlObj);
$TUs = $arrXml['body']['tu'];
$pattern = '~<.*?>~';
foreach($TUs as $TU) {
    if(is_array($TU['tuv'][0]['seg'])) {
        var_dump($TU['tuv'][0]['seg']);
        $pattern = '~<.*?>~';
        $uncleanSource = $TU['tuv'][0]['seg'];
        $uncleanTarget = $TU['tuv'][1]['seg'];
        foreach($uncleanSource as $unclean) {
            //var_dump($unclean);
        }
        //$sourceText = preg_replace($pattern, "<TAG>", $uncleanSource);
        //$targetText = preg_replace($pattern, "<TAG>", $uncleanTarget);
    }
    else {
        $sourceText = $TU['tuv'][0]['seg'];
        $targetText = $TU['tuv'][1]['seg'];
    }
    echo "<p>".$sourceText." = ".$targetText."</p>";
}
Here is the content of the sample_tmx.tmx file:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd" >
<tmx version="1.4">
<header creationtool="TDC Analysis Package" creationtoolversion="org.gs4tr.tm3.tmx.Version" segtype="sentence" o-tmf="unknown" adminlang="EN-US" srclang="BG" datatype="unknown" creationdate="20221006T184234Z" >
</header>
<body>
<tu creationdate="20201101T133734Z" creationid="Cheeseus" changedate="20201103T151745Z" changeid="Cheeseus" usagecount="0">
<prop type="nextMd5Checksum">e82d0ed6d711aa59310d1e8f4478537e</prop>
<prop type="previousMd5Checksum">39b279e324e9f6cd27351287502eefcb</prop>
<tuv xml:lang="BG">
<seg><bpt x="1" i="1" type="italic"/>Формантът -лар/-лар- (от тюрк. -lar) е слабо представен в българската словообразувателна система.<ept i="1"/></seg>
</tuv>
<tuv xml:lang="EN-GB">
<seg><bpt x="1" i="1" type="italic"/>The -lar/-lar- formant (from the Turkic -lar) is sparsely represented in the Bulgarian word-building system.<ept i="1"/></seg>
</tuv>
</tu>
<tu creationdate="20080812T111221Z" creationid="Cheeseus" changedate="20190825T065920Z" changeid="Cheeseus" usagecount="0">
<tuv xml:lang="BG">
<seg>ПАРТНЬОРИ</seg>
</tuv>
<tuv xml:lang="EN-GB">
<seg>PARTNERS</seg>
</tuv>
</tu>
</body>
</tmx>
A:
0 votes
Nigel Ren
9/25/2023
#1
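The problem seems to come from objectsIntoArray(); without the code for it, that is hard to fix.
If you remove that call and work with the SimpleXML elements as they are meant to be used, you can do it as follows (other code removed to focus on this problem)...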
$xmlObj = simplexml_load_file($uploadedFile);
foreach($xmlObj->body->tu as $TU) {
    if(isset($TU->tuv[0]->seg)) {
        echo $TU->tuv[0]->seg->asXML();
    }
}
The asXML() method reproduces the original content, serializing all child elements back into markup. This gives:
<seg>
<bpt x="1" i="1" type="italic"/>
Формантът -лар/-лар- (от тюрк. -lar) е слабо представен в българската словообразувателна система.<ept i="1"/>
</seg>
<seg>ПАРТНЬОРИ</seg>
You may need to do some more work to produce the result you want, but hopefully this shows how to get there.
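As a rough sketch of what that extra work might look like: the snippet below assumes the goal is to get each segment's inner content (text plus inline tags, without the surrounding <seg> wrapper) and to show the tags as visible text in the browser. The helper name innerXml() and the inline sample string are hypothetical and only for illustration. It relies on the DOM extension being available alongside SimpleXML.

```php
<?php
// Hypothetical helper (not from the original answer): returns the inner XML
// of an element, i.e. its text and inline tags without the wrapper element.
function innerXml(SimpleXMLElement $element): string
{
    $node = dom_import_simplexml($element);
    $inner = '';
    foreach ($node->childNodes as $child) {
        // Passing a node to saveXML() serializes just that node.
        $inner .= $node->ownerDocument->saveXML($child);
    }
    return $inner;
}

// Shortened inline sample standing in for sample_tmx.tmx (an assumption).
$tmx = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><body>
<tu>
<tuv xml:lang="BG"><seg><bpt x="1" i="1" type="italic"/>Формантът<ept i="1"/></seg></tuv>
<tuv xml:lang="EN-GB"><seg><bpt x="1" i="1" type="italic"/>The formant<ept i="1"/></seg></tuv>
</tu>
<tu>
<tuv xml:lang="BG"><seg>ПАРТНЬОРИ</seg></tuv>
<tuv xml:lang="EN-GB"><seg>PARTNERS</seg></tuv>
</tu>
</body></tmx>
XML;

$xmlObj = simplexml_load_string($tmx);
$rows = [];
foreach ($xmlObj->body->tu as $TU) {
    // Skip translation units that lack a source or target segment.
    if (!isset($TU->tuv[0]->seg, $TU->tuv[1]->seg)) {
        continue;
    }
    $source = innerXml($TU->tuv[0]->seg);
    $target = innerXml($TU->tuv[1]->seg);
    // Escape the markup so the browser displays the tags instead of hiding them.
    $rows[] = '<p>' . htmlspecialchars($source) . ' = ' . htmlspecialchars($target) . '</p>';
}
echo implode("\n", $rows), "\n";
```

Segments without tags go through the same path, so no is_array() style branching is needed. If the tags should later be movable or deletable in the editor, it may be easier to walk the child nodes individually instead of concatenating them into one string.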
Comments
0 votes
cheeseus
9/25/2023
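Thank you! This does help indeed. One issue is that it actually prints the <seg> ... </seg> tags (visible when you inspect the HTML source). How can I skip them? The second issue is that the <bpt...> tags are (in this case) hidden in the HTML; I need to print them somehow.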
Comments
$TU['tuv'][0]['seg']->asXML()
Fatal error: Uncaught Error: Call to a member function asXML() on array
$TU['tuv'][0]['seg'][0]->asXML()
objectsIntoArray