Find and replace URLs in a blob of text but exclude those in link tags

Question:

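I've been trying to run through a string, find URLs, and replace them with links. Here's what I've come up with so far. It works quite well in most cases, but there are a few things I'd like to polish. It also might not be the best way to do this.

I've read many threads about this on SO, and although they helped a great deal, I still need to tie up some loose ends.

I run through the string twice. The first pass replaces BBCode tags with HTML tags; the second pass replaces plain-text URLs with links: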

$body_str = preg_replace('/\[url=(.+?)\](.+?)\[\/url\]/i', '<a href="\1" rel="nofollow">\2</a>', $body_str);

$body_str = preg_replace_callback(
    '!(?:^|[^"\'])(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?!',
    function ($matches) {
        return strpos(trim($matches[0]), 'thisone.com') == FALSE ?
        '<a href="' . ltrim($matches[0], " \t\n\r\0\x0B.,@?^=%&:/~\+#'") . '" rel="nofollow">' . ltrim($matches[0], "\t\n\r\0\x0B.,@?^=%&:/~\+#'") . '</a>' :
        '<a href="' . ltrim($matches[0], " \t\n\r\0\x0B.,@?^=%&:/~\+#'") . '">' . ltrim($matches[0], "\t\n\r\0\x0B.,@?^=%&:/~\+#'") . '</a>';
    },
    $body_str
);

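A couple of problems I've found so far: the pattern tends to pick up the character immediately before "http" (a space, comma, colon, etc.), which breaks the link. So I used preg_replace_callback to work around this and trim the unwanted characters that would break the link.

The other problem is that, to avoid breaking links by matching URLs that are already inside A tags, I currently exclude URLs preceded by a single or double quote. I'd rather exclude on href=' or href=" instead.

Any tips and suggestions will be much appreciated.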

php regex html-parsing preg-replace-callback

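Don't use regular expressions to parse HTML. Use a proper HTML parsing module. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration. As soon as the HTML changes from what you expect, your code will break. See htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested, and debugged.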
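For readers who want to try the parser route suggested in this comment, here is a minimal sketch, not a definitive implementation. It assumes PHP's bundled DOM extension is available. The helper name `linkify_html` and the deliberately simple URL pattern are illustrative assumptions, and character-encoding handling is left out. The idea is to visit only text nodes that have no `<a>` ancestor, so URLs in existing links (both the `href` and the visible text) are never touched.

```php
<?php
// Hypothetical helper: linkify bare URLs, skipping anything inside an existing <a>.
function linkify_html(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate sloppy user markup
    // Wrap the fragment so it has a single, known container element.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $root  = $doc->getElementsByTagName('body')->item(0)->firstChild;
    $xpath = new DOMXPath($doc);
    // Text nodes under the wrapper that are not inside an <a> element.
    $nodes = $xpath->query('.//text()[not(ancestor::a)]', $root);

    foreach (iterator_to_array($nodes) as $text) {
        // Odd indexes of $parts hold the captured URLs, even indexes the text between them.
        $parts = preg_split('~(https?://[^\s<>"\']+)~', $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue;                   // no URL in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $part);
                if (strpos($part, 'thisone.com') === false) {
                    $a->setAttribute('rel', 'nofollow');   // external site
                }
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($fragment, $text);
    }

    // Serialize only the children of the wrapper element.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because the DOM serializer escapes attribute values and text on output, no manual escaping is needed here. The simple pattern still treats trailing punctuation such as a full stop as part of the URL.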

Answer:

0 votes Thibault 8/9/2013 #1

First, I took the liberty of refactoring your code a bit to make it easier to read and modify:

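function urltrim($str) {
    // Strip the leading whitespace/punctuation that the pattern captures before "http".
    return ltrim($str, " \t\n\r\0\x0B.,@?^=%&:/~\+#'");
}
function addlink($str, $nofollow = true) {
    return '<a href="' . urltrim($str) . '"' . ($nofollow ? ' rel="nofollow"' : '') . '>' . urltrim($str) . '</a>';
}
function checksite($str) {
    // strpos() returns false when the needle is absent; compare strictly so a match at position 0 is not mistaken for "absent".
    return strpos(trim($str), 'thisone.com') === false ? addlink($str) : addlink($str, false);
}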

$body_str = preg_replace('/\[url=(.+?)\](.+?)\[\/url\]/i', '<a href="\1" rel="nofollow">\2</a>', $body_str);

$body_str = preg_replace_callback(
    '!(?:^|[^"\'])(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?!',
    function ($matches) {
        return checksite($matches[0]);
    },
    $body_str
);

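After that, I changed the way you handle links:

I treat a URL as a single word (= all characters until you hit a space, \n, or \t, i.e. \s)
I changed the pattern so that it also matches an href= immediately before the URL
- if it is present, I do nothing: it is already a link
- if no href= is present, I replace the URL with a link
So the urltrim function is no longer useful, because I no longer consume the character before http
And of course, I use urlencode to encode the URL and avoid HTML injection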

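function urltrim($str) {
    // Kept as a no-op: the new pattern no longer captures the character before "http".
    return $str;
}
function addlink($str, $nofollow = true) {
    // URL-encode everything, then restore the scheme separator.
    $url = preg_replace("#(https?)%3A%2F%2F#", "$1://", urlencode(urltrim($str)));
    return '<a href="' . $url . '"' . ($nofollow ? ' rel="nofollow"' : '') . '>' . urltrim($str) . '</a>';
}
function checksite($str) {
    // Strict comparison: strpos() returns false only when the needle is absent.
    return strpos(trim($str), 'thisone.com') === false ? addlink($str) : addlink($str, false);
}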

$body_str = preg_replace('/\[url=(.+?)\](.+?)\[\/url\]/i', '<a href="\1" rel="nofollow">\2</a>', $body_str);

$body_str = preg_replace_callback(
    '!(|href=)(["\']?)(https?://[^\s]+)!',
    function ($matches) {
        if ($matches[1]) {
            # If href= is present, don't do anything; return the original string
            return $matches[0];
        } else {
            # add the previous char (" or ') and the link
            return $matches[2].checksite($matches[3]);
        }
    },
    $body_str
);
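One caveat about the urlencode step: urlencode also percent-encodes `/`, `?`, `=` and `&`, so a URL with a path or query string ends up with an href such as `http://example.com%2Fa%3Fb%3D1`. The following is a possible variation, offered as a sketch rather than a definitive fix. It assumes that escaping with htmlspecialchars is enough to prevent HTML injection in this context. The names `addlink_escaped` and `linkify_text` are hypothetical, and the character class is narrowed so that a match stops at quotes and angle brackets. It also moves trailing sentence punctuation out of the link.

```php
<?php
// Hypothetical variant of addlink(): escape for HTML output instead of URL-encoding.
function addlink_escaped(string $url, bool $nofollow = true): string
{
    $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    return '<a href="' . $safe . '"' . ($nofollow ? ' rel="nofollow"' : '') . '>' . $safe . '</a>';
}

// Hypothetical wrapper around the href-skipping pattern.
function linkify_text(string $text): string
{
    return preg_replace_callback(
        '!(href=)?(["\']?)(https?://[^\s<>"\']+)!',
        function (array $m): string {
            if ($m[1] !== '') {
                return $m[0];               // already inside an href attribute: leave untouched
            }
            // Keep trailing sentence punctuation outside the link.
            $url  = rtrim($m[3], '.,;:!?)');
            $tail = substr($m[3], strlen($url));
            $nofollow = strpos($url, 'thisone.com') === false;
            // Put back the optional quote that preceded the URL.
            return $m[2] . addlink_escaped($url, $nofollow) . $tail;
        },
        $text
    );
}

echo linkify_text('Visit http://example.com/a?b=1&c=2, or <a href="http://thisone.com/x">here</a>.');
```

Like the pattern above, this only guards the href attribute. A URL that appears as the visible text of an existing anchor is still matched and would end up as a nested link, which is the kind of case an HTML parser handles more reliably.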

I hope this helps with your project. Let us know whether it did.

Bye.
