How to extract img src, title and alt from html using php? [duplicate]

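Question:

Closed 4 years ago.

I would like to create a page that lists all the images on my website together with their title and alternative text.

I have already written a little program to find and load all the HTML files, but now I am stuck on how to extract src, title and alt from this HTML: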

<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />

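I guess this should be done with some regex, but since the order of the attributes may vary and I need all of them, I don't really know how to parse this in an elegant way (I could do it the hard way, char by char, but that's painful).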

php regex parsing html-content-extraction

1 vote derobert 2/1/2012

See: How do you parse and process HTML with PHP?

0 votes Karra Max 11/22/2021

$html = '<img border="0" src="/images/image.jpg" alt="Image" width="100" height="100" />';
preg_match('@src="([^"]+)"@', $html, $match);
$src = array_pop($match);
// will return /images/image.jpg
echo $src;

Source: //paulund.co.uk/get-image-src-with-php

Answers:

68 votes Stefan Gehrig 9/26/2008 #1

Just to give a small example of how to use PHP's XML functionality for the task:

$doc=new DOMDocument();
$doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$images=$xml->xpath('//img');
foreach ($images as $img) {
    echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}

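I used the DOMDocument::loadHTML() method because it can cope with HTML syntax and does not force the input document to be XHTML. Strictly speaking, the conversion to a SimpleXMLElement is not necessary - it just makes using xpath and the xpath results simpler.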
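For comparison, here is a minimal sketch of the same lookup without the SimpleXML conversion, using DOMXPath directly on the DOMDocument. The markup is the sample from the answer; the $found array is only an illustrative way to hold the results.

```php
<?php
// Sketch: query <img> elements with DOMXPath, skipping the SimpleXML conversion.
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><img src="myimage.jpg" title="title" alt="alt"></body></html>');

$xpath = new DOMXPath($doc);
$found = [];
foreach ($xpath->query('//img') as $img) {
    // getAttribute() returns an empty string when the attribute is absent.
    $found[] = [
        'src'   => $img->getAttribute('src'),
        'title' => $img->getAttribute('title'),
        'alt'   => $img->getAttribute('alt'),
    ];
}

print_r($found);
```

The trade-off is mostly stylistic: SimpleXML gives array-style attribute access, while DOMXPath avoids building a second object tree.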

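1 vote Alex Polo 8/16/2010

Sure, this approach is very straightforward, but you may want to put the @ sign in front of the loadHTML call (@$doc->loadHTML), as it prevents warnings from being displayed.

1 vote Matt 1/14/2012

Call this function beforehand to handle errors gracefully: libxml_use_internal_errors( true );. You can also retrieve the collected errors with libxml_get_errors()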
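A short sketch of that suggestion follows. The malformed sample markup is made up for illustration, and whether a particular problem is reported depends on the installed libxml version, so the error list may be empty on some systems.

```php
<?php
// Sketch: collect libxml parse errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// The stray </p> is deliberately sloppy markup (illustrative input only).
$doc->loadHTML('<html><body><img src="a.png" alt="A"></p></body></html>');

$errors = libxml_get_errors();             // array of LibXMLError objects
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

libxml_clear_errors();                     // free the error buffer
libxml_use_internal_errors($previous);     // restore the earlier setting

// The document is still usable even if problems were reported.
$src = $doc->getElementsByTagName('img')->item(0)->getAttribute('src');
echo $src, "\n";
```

Unlike the @ operator, this keeps the diagnostics available, so they can be logged rather than silently discarded.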

10 votes DreamWerx 9/26/2008 #2

If it's XHTML, as in your example, you only need simpleXML.

<?php
$input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
$sx = simplexml_load_string($input);
var_dump($sx);
?>

Output:

object(SimpleXMLElement)#1 (1) {
  ["@attributes"]=>
  array(3) {
    ["src"]=>
    string(22) "/image/fluffybunny.jpg"
    ["title"]=>
    string(16) "Harvey the bunny"
    ["alt"]=>
    string(26) "a cute little fluffy bunny"
  }
}

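217 votes Bite code 9/27/2008 #3

EDIT: now that I know better

Using regexp to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. Better to use an HTML parser.

Solution with regexp

In that case, it's better to split the process into two parts:

get all the img tags
extract their metadata

I will assume your document is not strict xHTML, so you can't use an XML parser. E.g. with this web page's source code: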

/* preg_match_all match the regexp in all the $html string and output everything as 
an array in $result. "i" option is used to make it case insensitive */

preg_match_all('/<img[^>]+>/i',$html, $result); 

print_r($result);
Array
(
    [0] => Array
        (
            [0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
            [1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
            [2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
            [3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
            [4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />

[...]
        )

)

Then we get all the img tag attributes with a loop:
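The loop could look roughly like the sketch below. It assumes the $result array produced by preg_match_all above, iterates over $result[0] (the full matches), and applies the attribute pattern (alt|title|src)=("[^"]*") that this answer explains. The sample $html is made up for illustration and is not the page used in the answer.

```php
<?php
// Sample input for illustration only.
$html = '<p><img src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful">'
      . '<img title="Logo" src="/Content/Img/logo.png" alt="logo link to homepage" /></p>';

// Step 1: grab every <img ...> tag.
preg_match_all('/<img[^>]+>/i', $html, $result);

// Step 2: for each tag, grab alt/title/src in whatever order they appear.
$img = array();
foreach ($result[0] as $img_tag) {
    // Results are keyed by the tag text: [1] holds attribute names, [2] the quoted values.
    preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
}

print_r($img);
```

Note that the captured values still include their surrounding double quotes, and that attributes written with single quotes or no quotes are not matched by this pattern.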

Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can roll your own by using ob_start and loading / saving from a text file.
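One possible shape for such a home-made cache is sketched below. The function name cached_output(), the $ttl parameter and the temporary file path are all assumptions made for this example; the answer itself only mentions ob_start and a text file.

```php
<?php
// Sketch of a file-based output cache built on ob_start().
// cached_output(), $cacheFile and $ttl are illustrative names.
function cached_output(string $cacheFile, int $ttl, callable $render): string
{
    // Serve the saved copy while it is younger than $ttl seconds.
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile);
    }

    ob_start();                 // start capturing everything echoed by $render
    $render();
    $output = ob_get_clean();   // stop capturing and fetch the buffer

    file_put_contents($cacheFile, $output, LOCK_EX);
    return $output;
}

$cacheFile = sys_get_temp_dir() . '/image-list-' . uniqid() . '.cache';
$calls = 0;
$render = function () use (&$calls) {
    $calls++;                   // counts how often the expensive part really runs
    echo '<ul><li>/image/fluffybunny.jpg</li></ul>';
};

$first  = cached_output($cacheFile, 300, $render);
$second = cached_output($cacheFile, 300, $render);   // served from the file
unlink($cacheFile);             // clean up the demo file
```

In a real page, $render would contain the regexp work and the echo statements that build the image list.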

How does this stuff work?

First, we use preg_match_all, a function that finds every string matching the pattern and writes the matches into its third parameter.

The regexps:

<img[^>]+>

We apply it to the whole html page. It can be read as: every string that starts with "<img", contains non-">" characters and ends with a ">".

(alt|title|src)=("[^"]*")

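We apply it to each img tag in turn. It can be read as: every string starting with "alt", "title" or "src", then a "=", then a ' " ', a bunch of stuff that is not ' " ', and ending with a ' " '. The sub-strings are isolated between the ().

Finally, every time you want to deal with regexps, it is handy to have good tools to test them quickly. Check this online regexp tester.

EDIT: answer to the first comment.

It's true that I did not think about the (hopefully few) people using single quotes.

Well, if you use only ', just replace all the " with '.

If you mix both: first you should slap yourself :-), then try to use ("|') instead of " and [^ø] to replace [^"].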
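A variation on that idea, sketched here as an assumption rather than the answer's own pattern, is to capture the opening quote and require the same character to close the value with a backreference. The sample tag is made up for illustration.

```php
<?php
// Sample tag mixing both quote styles (illustrative input).
$img_tag = '<img src=\'picture.jpg\' title="Harvey the bunny" alt=\'a "cute" bunny\' />';

// (["\']) captures the opening quote; \2 requires the same character to close the value.
$pattern = '/(alt|title|src)\s*=\s*(["\'])(.*?)\2/is';
preg_match_all($pattern, $img_tag, $matches, PREG_SET_ORDER);

$attributes = array();
foreach ($matches as $m) {
    $attributes[strtolower($m[1])] = $m[3];   // $m[3] is the value without its quotes
}

print_r($attributes);
```

This still does not handle unquoted values such as alt=foo, which is one more reason the answer recommends a parser.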

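0 votes Sam 10/1/2008

The only problem is the single quote: <img src='picture.jpg'/> won't work, the regex expects a " all the time

0 votes Bite code 10/4/2008

True, my friend. I added a note about that. Thanks.

1 vote patrick 11/8/2014

I would NOT recommend scrolling down (OK, scroll to have a look): although the code looks simple and is therefore tempting to use, DOMDocument has so much overhead when you just want the attributes from a tag...

0 votes viion 2/27/2015

This solution is great if: you do not know the tags of the html you are parsing, you have 1 line of html and you need 1-2 attributes. Loading DOMDoc has a lot of memory overhead, which is wasted if you are not parsing an entire document.

1 vote mgutt 3/20/2015

This does not cover alt=foo or alt='foo'

5 votes Bakudan 9/27/2009 #4

The script must be edited like this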

foreach( $result[0] as $img_tag)

because preg_match_all returns an array of arrays
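The shape of that return value can be checked with a small sketch like the one below; the two-image $html string is an assumption made for the example.

```php
<?php
$html = '<img src="a.png" alt="A"><img src="b.png" alt="B">';   // illustrative input

preg_match_all('/<img[^>]+>/i', $html, $result);

// With the default PREG_PATTERN_ORDER flag, $result[0] holds the full matches;
// $result[1], $result[2], ... would hold capture groups (this pattern has none).
$outer = count($result);      // 1
$tags  = $result[0];          // the two <img> strings

foreach ($tags as $img_tag) {
    echo $img_tag, "\n";
}
```

Looping over $result itself would hand the inner array to the loop body instead of a tag string, which is why the index [0] is needed.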

273 votes karim 5/30/2010 #5

$url="http://example.com";

$html = file_get_contents($url);

$doc = new DOMDocument();
@$doc->loadHTML($html);

$tags = $doc->getElementsByTagName('img');

foreach ($tags as $tag) {
       echo $tag->getAttribute('src');
}

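0 votes 321zeno 11/2/2011

I'm curious whether this runs faster than preg_match

5 votes Dylan Valade 11/16/2011

I love how easy this is to read! XPath and regex would work too, but they are never as easy to read 18 months later.

1 vote patrick 11/8/2014

Although short and simple, it is a huge waste of resources... meaning that using DOMDocument to extract attributes from a tag is a lot (!!) of overhead

0 votes vaneayoung 1/23/2016

How can I limit it, for example to a maximum of 10 images?

0 votes Angry 84 11/15/2016

Resources aside, it depends on the use case. Some people end up writing 100s of regexes after learning from a simple answer.

7 votes WNRosenberg 9/29/2010 #6

I used preg_match to do it.

In my case, I had a string containing exactly one <img> tag (and no other markup) that I got from Wordpress, and I was trying to get the src attribute so I could run it through timthumb.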

// get the featured image
$image = get_the_post_thumbnail($photos[$i]->ID);

// get the src for that image
$pattern = '/src="([^"]*)"/';
preg_match($pattern, $image, $matches);
$src = $matches[1];
unset($matches);

To grab the title or the alt instead, you could simply use $pattern = '/title="([^"]*)"/'; to grab the title or $pattern = '/alt="([^"]*)"/'; to grab the alt. Sadly, my regex isn't good enough to grab all three (alt/title/src) in one pass.
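One way a single pass might be done is with three lookaheads, each of which scans the tag for one attribute regardless of order. This is a sketch under stated assumptions, not the answerer's method: all three attributes must be present, values must be double-quoted, and no value may contain ">". The sample tag is made up for illustration.

```php
<?php
// Illustrative tag with the attributes in a different order than src/title/alt.
$image = '<img width="150" alt="a cute little fluffy bunny" src="/image/fluffybunny.jpg" title="Harvey the bunny" />';

// Each (?=...) lookahead searches the rest of the tag without consuming it,
// so the capture groups are always numbered src, title, alt.
$pattern = '/<img(?=[^>]*\ssrc="([^"]*)")(?=[^>]*\stitle="([^"]*)")(?=[^>]*\salt="([^"]*)")[^>]*>/i';

$src = $title = $alt = null;
if (preg_match($pattern, $image, $matches)) {
    $src   = $matches[1];
    $title = $matches[2];
    $alt   = $matches[3];
}

echo $src, ' | ', $title, ' | ', $alt, "\n";
```

If any of the three attributes is missing, the pattern does not match at all, so the three separate patterns above remain the more forgiving option.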

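2 votes numediaweb 11/25/2014

Doesn't work if the img tag attributes are in single quotes: <img src='image.png'>

0 votes mickmackusa 5/15/2019

You are not meant to answer "for your case"; you are meant to answer the OP's exact question.

-1 votes Xavier 11/13/2010 #7

Here's THE solution, in PHP:

Just download QueryPath, and then do as follows: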

$doc= qp($myHtmlDoc);

foreach($doc->xpath('//img') as $img) {

   $src= $img->attr('src');
   $title= $img->attr('title');
   $alt= $img->attr('alt');

}

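That's it, you're done!

2 votes nwalke 4/6/2013

No. This is not THE solution.

1 vote John Daliani 11/9/2011 #8

Here's a PHP function I cobbled together from all of the above info for a similar purpose, namely adjusting image tag width and height attributes on the fly... a bit clunky, perhaps, but it seems to work dependably: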

function ReSizeImagesInHTML($HTMLContent,$MaximumWidth,$MaximumHeight) {

// find image tags
preg_match_all('/<img[^>]+>/i',$HTMLContent, $rawimagearray,PREG_SET_ORDER); 

// put image tags in a simpler array
$imagearray = array();
for ($i = 0; $i < count($rawimagearray); $i++) {
    array_push($imagearray, $rawimagearray[$i][0]);
}

// put image attributes in another array
$imageinfo = array();
foreach($imagearray as $img_tag) {

    preg_match_all('/(src|width|height)=("[^"]*")/i',$img_tag, $imageinfo[$img_tag]);
}

// combine everything into one array
$AllImageInfo = array();
foreach($imagearray as $img_tag) {

    $ImageSource = str_replace('"', '', $imageinfo[$img_tag][2][0]);
    $OrignialWidth = str_replace('"', '', $imageinfo[$img_tag][2][1]);
    $OrignialHeight = str_replace('"', '', $imageinfo[$img_tag][2][2]);

    $NewWidth = $OrignialWidth; 
    $NewHeight = $OrignialHeight;
    $AdjustDimensions = "F";

    if($OrignialWidth > $MaximumWidth) { 
        $diff = $OrignialWidth-$MaximumWidth; 
        $percnt_reduced = (($diff/$OrignialWidth)*100); 
        $NewHeight = floor($OrignialHeight-(($percnt_reduced*$OrignialHeight)/100)); 
        $NewWidth = floor($OrignialWidth-$diff); 
        $AdjustDimensions = "T";
    }

    if($OrignialHeight > $MaximumHeight) { 
        $diff = $OrignialHeight-$MaximumHeight; 
        $percnt_reduced = (($diff/$OrignialHeight)*100); 
        $NewWidth = floor($OrignialWidth-(($percnt_reduced*$OrignialWidth)/100)); 
        $NewHeight= floor($OrignialHeight-$diff); 
        $AdjustDimensions = "T";
    } 

    $thisImageInfo = array('OriginalImageTag' => $img_tag , 'ImageSource' => $ImageSource , 'OrignialWidth' => $OrignialWidth , 'OrignialHeight' => $OrignialHeight , 'NewWidth' => $NewWidth , 'NewHeight' => $NewHeight, 'AdjustDimensions' => $AdjustDimensions);
    array_push($AllImageInfo, $thisImageInfo);
}

// build array of before and after tags
$ImageBeforeAndAfter = array();
for ($i = 0; $i < count($AllImageInfo); $i++) {

    if($AllImageInfo[$i]['AdjustDimensions'] == "T") {
        $NewImageTag = str_ireplace('width="' . $AllImageInfo[$i]['OrignialWidth'] . '"', 'width="' . $AllImageInfo[$i]['NewWidth'] . '"', $AllImageInfo[$i]['OriginalImageTag']);
        $NewImageTag = str_ireplace('height="' . $AllImageInfo[$i]['OrignialHeight'] . '"', 'height="' . $AllImageInfo[$i]['NewHeight'] . '"', $NewImageTag);

        $thisImageBeforeAndAfter = array('OriginalImageTag' => $AllImageInfo[$i]['OriginalImageTag'] , 'NewImageTag' => $NewImageTag);
        array_push($ImageBeforeAndAfter, $thisImageBeforeAndAfter);
    }
}

// execute search and replace
for ($i = 0; $i < count($ImageBeforeAndAfter); $i++) {
    $HTMLContent = str_ireplace($ImageBeforeAndAfter[$i]['OriginalImageTag'],$ImageBeforeAndAfter[$i]['NewImageTag'], $HTMLContent);
}

return $HTMLContent;

}

6 votes Nauphal 1/27/2014 #9

You can use simplehtmldom. Most jQuery selectors are supported in simplehtmldom. An example is given below

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>';

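2 votes mickmackusa 5/15/2019 #10

I have read the many comments on this page that complain that using a dom parser is unnecessary overhead. Well, it may be more expensive than a mere regex call, but the OP has stated that there is no control over the order of the attributes in the img tags. This fact leads to unnecessary regex pattern convolution. Beyond that, using a dom parser provides the additional benefits of readability, maintainability, and dom-awareness (regex is not dom-aware).

I love regex and I answer lots of regex questions, but when dealing with valid HTML there is seldom a good reason to use regex over a parser.

In the demonstration below, see how easily and cleanly DOMDocument handles img tag attributes in any order with mixed quoting (and no quoting at all). Also notice that tags without a targeted attribute are not disruptive at all - an empty string is provided as the value.

Code: (Demo)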

$test = <<<HTML
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
<img src='/image/pricklycactus.jpg' title='Roger the cactus' alt='a big green prickly cactus' />
<p>This is irrelevant text.</p>
<img alt="an annoying white cockatoo" title="Polly the cockatoo" src="/image/noisycockatoo.jpg">
<img title=something src=somethingelse>
HTML;

libxml_use_internal_errors(true);  // silences/forgives complaints from the parser (remove to see what is generated)
$dom = new DOMDocument();
$dom->loadHTML($test);
foreach ($dom->getElementsByTagName('img') as $i => $img) {
    echo "IMG#{$i}:\n";
    echo "\tsrc = " , $img->getAttribute('src') , "\n";
    echo "\ttitle = " , $img->getAttribute('title') , "\n";
    echo "\talt = " , $img->getAttribute('alt') , "\n";
    echo "---\n";
}

Output:

IMG#0:
    src = /image/fluffybunny.jpg
    title = Harvey the bunny
    alt = a cute little fluffy bunny
---
IMG#1:
    src = /image/pricklycactus.jpg
    title = Roger the cactus
    alt = a big green prickly cactus
---
IMG#2:
    src = /image/noisycockatoo.jpg
    title = Polly the cockatoo
    alt = an annoying white cockatoo
---
IMG#3:
    src = somethingelse
    title = something
    alt = 
---

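Using this technique in professional code will leave you with a clean script, fewer hiccups to contend with, and fewer colleagues who wish you worked somewhere else.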
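To tie this back to the original question (one list of every image with its title and alt), the DOMDocument approach could be wrapped in a small helper, sketched below. The function name extract_images() and the returned array layout are assumptions made for this example.

```php
<?php
// extract_images() is a hypothetical helper name, not part of any library.
function extract_images(string $html): array
{
    if (trim($html) === '') {
        return [];                          // loadHTML() rejects empty input
    }

    $previous = libxml_use_internal_errors(true);   // keep parser complaints quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        $images[] = [
            'src'   => $img->getAttribute('src'),
            'title' => $img->getAttribute('title'),
            'alt'   => $img->getAttribute('alt'),   // empty string when missing
        ];
    }
    return $images;
}

$list = extract_images(
    "<img src='/image/pricklycactus.jpg' title='Roger the cactus' alt='a big green prickly cactus' />"
    . '<img title=something src=somethingelse>'
);

foreach ($list as $image) {
    echo $image['src'], ' | ', $image['title'], ' | ', $image['alt'], "\n";
}
```

Calling the helper once per loaded HTML file and merging the returned arrays would give the full listing the question asks for.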
