Asked by: DeN Asked: 2/27/2022 Last edited by: DeN Updated: 2/28/2022 Views: 212
Regular expression to find links in text
Q:
Please help me write a regular expression that finds all links (.com|.org|.ru) in a text that are not already inside an <a> tag.
Example text:
1. https://www.cyberforum.ru/newthread.php?do=newthread&f=323
2. www.cyberforum.ru
3. <a href="https://www.cyberforum.ru/newthread.php?do=newthread&f=323">www.cyberforum.ru/newthread.php?do=newthread&f=323</a>
4. <a href="www.cyberforum.ru/newthread.php?do=newthread&f=323">www.cyberforum.ru/newthread.php?do=newthread&f=323</a>
Items 1 and 2 should match the regular expression, but items 3 and 4 should not.
I tried /(?<!["'<>])(\b(https?://)?([\w.](com|org|ru)[\w.?&=/])\b)/ but it does not work correctly.
A:
0 votes
DeN
2/28/2022
#1
My solution:
/**
 * Wraps links in an <a></a> tag.
 * Skips links that are already inside an href attribute or an <a></a> tag.
 *
 * @param string $text
 * @return string
 */
private static function replaceLinks(string $text): string
{
    return preg_replace_callback(
        '/\b(https?:\/\/)?([\w.-]*(\.com|\.org|\.ru|\.local)[\w.?&=\/]*)\b/',
        function ($matches) use ($text) {
            // with PREG_OFFSET_CAPTURE, $matches[0] is [matched text, byte offset]
            [$link, $offset] = $matches[0];
            // check the previous char; skip links that are in an href or in an <a></a> tag
            $previousChar = $offset > 0 ? $text[$offset - 1] : '';
            if (!in_array($previousChar, ['"', '\'', '<', '>', ';'])) {
                return "<a target='_blank' href=\"{$link}\">{$link}</a>";
            }
            // leave the link unchanged
            return $link;
        },
        $text,
        -1,
        $count,
        PREG_OFFSET_CAPTURE
    );
}
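As a usage check, here is a sketch that runs the same pattern and previous-character test as a standalone function. The name `wrapBareLinks` and the sample strings are made up for illustration, and the `PREG_OFFSET_CAPTURE` flag for `preg_replace_callback()` needs PHP 7.4 or later.

```php
<?php
// Hypothetical standalone variant of the method above, for demonstration only.
function wrapBareLinks(string $text): string
{
    return preg_replace_callback(
        '/\b(https?:\/\/)?([\w.-]*(\.com|\.org|\.ru|\.local)[\w.?&=\/]*)\b/',
        function (array $matches) use ($text): string {
            // $matches[0] is [matched text, byte offset] because of PREG_OFFSET_CAPTURE
            [$link, $offset] = $matches[0];
            $previousChar = $offset > 0 ? $text[$offset - 1] : '';
            // a quote, angle bracket or semicolon right before the match
            // suggests the link is already part of an <a> element
            if (in_array($previousChar, ['"', "'", '<', '>', ';'], true)) {
                return $link;
            }
            return "<a target='_blank' href=\"{$link}\">{$link}</a>";
        },
        $text,
        -1,
        $count,
        PREG_OFFSET_CAPTURE
    );
}

// made-up sample inputs
$bare = 'See www.cyberforum.ru for details';
$linked = '<a href="https://www.cyberforum.ru/">www.cyberforum.ru/</a>';

echo wrapBareLinks($bare), PHP_EOL;   // the bare link gets wrapped
echo wrapBareLinks($linked), PHP_EOL; // the existing link stays as it is
```

The check looks at a single character only, so a URL in link text that follows a space, for example `<a href="x"> www.cyberforum.ru</a>`, would still be wrapped a second time.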
0 votes
ThW
2/28/2022
#2
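This is an approach that combines DOM with regular expressions. It restricts changes to the text content of element nodes inside body and avoids modifying other nodes such as comments or attributes.
What happens:
- Iterate over the text nodes, skipping those inside existing link elements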
/html/body//text()[not(ancestor::a)]
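- Use preg_split() to split the text at each matched http(s) URL
- Iterate over that list and add each part to a fragment, as a link if it is a URL or as a text node if it is not.
- Replace the original text node with the new fragment.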
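The first step can be checked on its own. This is a minimal sketch, assuming the DOM extension is available; the sample markup and the `$texts` array are made up for illustration.

```php
<?php
$document = new DOMDocument();
// made-up markup: one URL in plain text, one URL already inside a link
$document->loadHTML(
    '<html><body><p>plain http://example.tld</p>'
    . '<a href="http://example.tld/x">http://example.tld/x</a></body></html>'
);
$xpath = new DOMXPath($document);

$texts = [];
// select text nodes below body that have no <a> ancestor
foreach ($xpath->evaluate('/html/body//text()[not(ancestor::a)]') as $textNode) {
    $texts[] = $textNode->textContent;
}

print_r($texts); // only the text of the <p> element is collected
```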
$html = <<<'HTML'
<html>
<body>
Some link http://example.tld to replace.
<div>Another link http://example.tld/another to replace.</div>
<a href="http://example.tld/in-link">http://example.tld/in-link</a>
</body>
</html>
HTML;
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);
$linkPattern = '\b(?:https?:\/\/)(?:[\w.?&=\/-]*)\b';
$splitPattern = '(('.$linkPattern.'))';
$matchPattern = '(^'.$linkPattern.'$)';
// iterate over text nodes inside the body
$expression = '/html/body//text()[not(ancestor::a)]';
foreach ($xpath->evaluate($expression) as $textNode) {
    // split the text content at each URL and capture the URLs as parts too
    $parts = preg_split(
        $splitPattern,
        $textNode->textContent,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    // there should be at least two parts, otherwise no URL was found
    if (count($parts) < 2) {
        continue;
    }
    // fragments allow treating several nodes like one
    $fragment = $document->createDocumentFragment();
    foreach ($parts as $part) {
        // it's a URL
        if (preg_match($matchPattern, $part)) {
            // create the new a element
            $fragment->appendChild(
                $a = $document->createElement('a')
            );
            $a->setAttribute('href', $part);
            $a->textContent = $part;
        } else {
            // add the part as a new text node
            $fragment->appendChild($document->createTextNode($part));
        }
    }
    // replace the text node with the fragment
    $textNode->parentNode->replaceChild($fragment, $textNode);
}
echo $document->saveHTML();
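The splitting step is probably the least obvious part, so here it is in isolation. This is a small sketch that reuses `$linkPattern` and `$splitPattern` from the code above on a single made-up string. The outer parentheses of `$splitPattern` act as the regex delimiters, and the inner pair is the capture group that `PREG_SPLIT_DELIM_CAPTURE` returns.

```php
<?php
$linkPattern = '\b(?:https?:\/\/)(?:[\w.?&=\/-]*)\b';
$splitPattern = '((' . $linkPattern . '))';

// text containing a URL: the URL itself becomes one of the parts
$parts = preg_split(
    $splitPattern,
    'Some link http://example.tld to replace.',
    -1,
    PREG_SPLIT_DELIM_CAPTURE
);
print_r($parts);

// text without a URL: a single part, which the loop above skips
$noLinkParts = preg_split(
    $splitPattern,
    'Nothing to replace here.',
    -1,
    PREG_SPLIT_DELIM_CAPTURE
);
print_r($noLinkParts);
```

The first call gives three parts (the text before the URL, the URL, the text after it), which is why fewer than two parts means there is nothing to do for that text node.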
Comments
<a href="
<a href='
strpos()
\K
(?:<a.*?a>)\K|(\b(?:https?:|www\.)([\w.]+\.(?:com|org|ru)\S*))
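The last pattern appears to rely on the "match what you want to skip, then throw it away with \K" idiom. Below is a rough sketch of how that could be used with `preg_replace_callback()`. The skip part `(?:<a.*?a>)\K` is taken from the comment; the URL part is a simplified stand-in, and the function name `linkify` and the sample strings are assumptions for illustration.

```php
<?php
// First alternative: consume a whole <a>...</a> block, then \K resets the match start,
// so the reported match is empty. Second alternative: a bare URL (simplified stand-in).
$pattern = '~(?:<a.*?a>)\K|\b(?:https?://)?[\w.-]+\.(?:com|org|ru)\b[\w.?&=/-]*~';

// Hypothetical helper, not part of the original thread.
function linkify(string $text, string $pattern): string
{
    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            // empty match: an existing <a>...</a> block was skipped, nothing to replace
            if ($m[0] === '') {
                return '';
            }
            return '<a href="' . $m[0] . '">' . $m[0] . '</a>';
        },
        $text
    );
}

// made-up sample inputs
$bare = 'Visit https://www.cyberforum.ru/newthread.php?do=newthread now';
$linked = '<a href="https://www.cyberforum.ru/">www.cyberforum.ru/</a>';

echo linkify($bare, $pattern), PHP_EOL;   // the bare URL gets wrapped
echo linkify($linked, $pattern), PHP_EOL; // the existing link stays as it is
```

Because `.` does not match newlines by default, an `<a>` element that spans several lines would not be skipped by this sketch.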