Grabbing specific elements inside a DIV from an external page

Asked by: toscho Asked: 9/7/2022 Updated: 9/7/2022 Views: 81

Q:

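I need to scrape the following elements from each of the class="product-grid-item" divs (the page contains several of them), but I really don't know how to do it... so I need help before I pull my hair out.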

1 - The link and the image inside the div with class="product-element-top2":

<a href="https://...this_link" class="product-image-link"> (only the link is needed)

<img width="300" height="300" src="https://...this_image_url... (only this image URL is needed)

2 - The title inside the h3 tag, as follows:

<h3 class="wd-entities-title"><a href="https://...linkhere">The title goes here (only the title)

3 - Last but not least, I need to grab this price:

<span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">€</span>20,00</bdi></span></span> (only the "€20,00")

Here is the full HTML:

<div class="product-grid-item" data-loop="1">

<div class="product-element-top">
    <a href="https://...linkhere" class="product-image-link">
        <img width="300" height="300" src="https://image-goes-here.jpg" class="attachment-woocommerce_thumbnail size-woocommerce_thumbnail">    </a>
    
    <div class="top-information wd-fill">

        <h3 class="wd-entities-title"><a href="https://...linkhere">The title goes here</a></h3>        
                
        
    <span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">€</span>20,00</bdi></span></span>

        <div class="wd-add-btn wd-add-btn-replace woodmart-add-btn">
            <a href="https://...linkhere" data-quantity="1" class="button product_type_variable add_to_cart_button add-to-cart-loop"><span>Options</span></a></div> 
    </div>

    <div class="wd-buttons wd-pos-r-t color-scheme-light woodmart-buttons">
                            <div class="wd-compare-btn product-compare-button wd-action-btn wd-style-icon wd-compare-icon">
                <a href="https://...linkhere" data-added-text="Compare Products">Buy</a>
            </div>
    <div class="quick-view wd-action-btn wd-style-icon wd-quick-view-icon wd-quick-view-btn">
                <a href="https://...linkhere" class="open-quick-view quick-view-button">quick view</a>
            </div>
                            <div class="wd-wishlist-btn wd-action-btn wd-style-icon wd-wishlist-icon woodmart-wishlist-btn">
                <a class="" href="https://linkhere/wishlist/" data-key="dcf36756534755" data-product-id="387654" data-added-text="See Wishlist">Wishlist</a>
            </div>
            </div>
                <div class="quick-shop-wrapper wd-fill wd-scroll">
                <div class="quick-shop-close wd-action-btn wd-style-text wd-cross-icon"><a href="#" rel="nofollow noopener">Close</a></div>
                <div class="quick-shop-form wd-scroll-content">
                </div>
            </div>
        </div>
</div>

One of my clumsy attempts:

$html = file_get_contents("https://url-here.goetohere");
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$finder = new DomXPath($DOM);
$classname = 'product-grid-item';
$classname = 'product-element-top2';
$classname = 'product-element-top2';
$classname = 'wd-entities-title';
$classname = 'price';
$nodes = $finder->query("//*[contains(@class, '$classname')]");
foreach ($nodes as $node) {
    echo 'here »» ' . htmlentities($node->nodeValue) . '<br>';
}
php parsing dom html-parsing


A:

3 upvotes Professor Abronsius 9/7/2022 #1

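Assuming the HTML has been fetched correctly before any DOM processing is attempted, it is fairly simple to construct some basic XPath expressions that find the indicated content.

In line with your remark that the "page contains several of them", the sample HTML below contains 2 product-grid-item divs, which you will notice in the output.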

$html='
    <div class="product-grid-item" data-loop="1">
        <div class="product-element-top">
            <a href="https://...linkhere" class="product-image-link">
                <img width="300" height="300" src="https://image-goes-here.jpg" class="attachment-woocommerce_thumbnail size-woocommerce_thumbnail">
            </a>
            <div class="top-information wd-fill">
                <h3 class="wd-entities-title">
                    <a href="https://...linkhere">The title goes here</a>
                </h3>
                <span class="price">
                    <span class="woocommerce-Price-amount amount">
                        <bdi>
                            <span class="woocommerce-Price-currencySymbol">€</span>20,00
                        </bdi>
                    </span>
                </span>
                <div class="wd-add-btn wd-add-btn-replace woodmart-add-btn">
                    <a href="https://...linkhere" data-quantity="1" class="button product_type_variable add_to_cart_button add-to-cart-loop">
                        <span>Options</span>
                    </a>
                </div> 
            </div>

            <div class="wd-buttons wd-pos-r-t color-scheme-light woodmart-buttons">
                <div class="wd-compare-btn product-compare-button wd-action-btn wd-style-icon wd-compare-icon">
                    <a href="https://...linkhere" data-added-text="Compare Products">Buy</a>
                </div>
                <div class="quick-view wd-action-btn wd-style-icon wd-quick-view-icon wd-quick-view-btn">
                    <a href="https://...linkhere" class="open-quick-view quick-view-button">quick view</a>
                </div>
                <div class="wd-wishlist-btn wd-action-btn wd-style-icon wd-wishlist-icon woodmart-wishlist-btn">
                    <a class="" href="https://linkhere/wishlist/" data-key="dcf36756534755" data-product-id="387654" data-added-text="See Wishlist">Wishlist</a>
                </div>
            </div>
            <div class="quick-shop-wrapper wd-fill wd-scroll">
                <div class="quick-shop-close wd-action-btn wd-style-text wd-cross-icon">
                    <a href="#" rel="nofollow noopener">Close</a>
                </div>
                <div class="quick-shop-form wd-scroll-content"></div>
            </div>
        </div>
    </div>
    
    <div class="product-grid-item" data-loop="1">
        <div class="product-element-top">
            <a href="https://www.example.com/banana" class="product-image-link">
                <img width="300" height="300" src="https://www.example.com/kittykat.jpg" class="attachment-woocommerce_thumbnail size-woocommerce_thumbnail">
            </a>
            <div class="top-information wd-fill">
                <h3 class="wd-entities-title">
                    <a href="https://www.example.com/womble">Oh look, another title!</a>
                </h3>
                <span class="price">
                    <span class="woocommerce-Price-amount amount">
                        <bdi>
                            <span class="woocommerce-Price-currencySymbol">€</span>540,00
                        </bdi>
                    </span>
                </span>
                <div class="wd-add-btn wd-add-btn-replace woodmart-add-btn">
                    <a href="https://www.example.com/gorilla" data-quantity="1" class="button product_type_variable add_to_cart_button add-to-cart-loop">
                        <span>Options</span>
                    </a>
                </div> 
            </div>

            <div class="wd-buttons wd-pos-r-t color-scheme-light woodmart-buttons">
                <div class="wd-compare-btn product-compare-button wd-action-btn wd-style-icon wd-compare-icon">
                    <a href="https://www.example.com/buy" data-added-text="Compare Products">Buy</a>
                </div>
                <div class="quick-view wd-action-btn wd-style-icon wd-quick-view-icon wd-quick-view-btn">
                    <a href="https://www.example.com/view" class="open-quick-view quick-view-button">quick view</a>
                </div>
                <div class="wd-wishlist-btn wd-action-btn wd-style-icon wd-wishlist-icon woodmart-wishlist-btn">
                    <a class="" href="https://www.example.com/wishlist/" data-key="dcf36756534755" data-product-id="387654" data-added-text="See Wishlist">Wishlist</a>
                </div>
            </div>
            <div class="quick-shop-wrapper wd-fill wd-scroll">
                <div class="quick-shop-close wd-action-btn wd-style-text wd-cross-icon">
                    <a href="#" rel="nofollow noopener">Close</a>
                </div>
                <div class="quick-shop-form wd-scroll-content"></div>
            </div>
        </div>
    </div>';

Processing the downloaded HTML:

# set the libxml parameters and create new DOMDocument/XPath objects.
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->loadHTML( $html );
libxml_clear_errors();

$xp=new DOMXPath( $dom );

# some basic XPath expressions
$exprs=(object)array(
    'product-link'      =>  '//a[@class="product-image-link"]',
    'product-img-src'   =>  '//a[@class="product-image-link"]/img',
    'h3-title-text'     =>  '//h3[@class="wd-entities-title"]',
    'price'             =>  '//span[@class="price"]/span/bdi'
);
# find the keys (for convenience) to be used below
$keys=array_keys( get_object_vars( $exprs ) );

# store results here
$res=array();

# loop through all patterns and issue XPath query.
foreach( $exprs as $key => $expr ){
    # add key to output and set as an array.
    $res[ $key ]=[];
    $col=$xp->query( $expr );
    
    # find the data if the query succeeds
    if( $col && $col->length > 0 ){
        foreach( $col as $node ){
            switch( $key ){
                case $keys[0]:$res[$key][]=$node->getAttribute('href');break;
                case $keys[1]:$res[$key][]=$node->getAttribute('src');break;
                case $keys[2]:$res[$key][]=trim($node->textContent);break;
                case $keys[3]:$res[$key][]=trim($node->textContent);break;
            }
        }
    }
}
# show the result or do really interesting things with the data
printf('<pre>%s</pre>',print_r($res,true));

The result is:

Array
(
    [product-link] => Array
        (
            [0] => https://...linkhere
            [1] => https://www.example.com/banana
        )

    [product-img-src] => Array
        (
            [0] => https://image-goes-here.jpg
            [1] => https://www.example.com/kittykat.jpg
        )

    [h3-title-text] => Array
        (
            [0] => The title goes here
            [1] => Oh look, another title!
        )

    [price] => Array
        (
            [0] => â¬20,00
            [1] => â¬540,00
        )

)
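
The â¬ in front of the prices above is an encoding artifact rather than part of the data: without a charset hint, DOMDocument::loadHTML() treats the input as ISO-8859-1, so the UTF-8 bytes of the € sign are misread. Below is a minimal sketch of one way around it, assuming the mbstring extension is available and the script file is saved as UTF-8. The one-product $html string is a cut-down stand-in for the real page, and $safeHtml and $prices are names made up for this example.

```php
<?php
// Cut-down stand-in for the downloaded page (assumption: the real HTML is UTF-8).
$html = '<span class="price"><span class="woocommerce-Price-amount amount"><bdi>'
      . '<span class="woocommerce-Price-currencySymbol">€</span>20,00</bdi></span></span>';

// Turn every non-ASCII character into a numeric entity, so the parser
// no longer has to guess the byte encoding of the input.
$safeHtml = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($safeHtml);
libxml_clear_errors();

$xp = new DOMXPath($dom);

// Same price expression as in the answer.
$prices = [];
foreach ($xp->query('//span[@class="price"]/span/bdi') as $node) {
    $prices[] = trim($node->textContent);
}

print_r($prices);
```

With this change the price should come back as €20,00. Prepending a charset meta tag to the HTML before loading it is another commonly used workaround.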

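Comments

2 upvotes toscho 9/7/2022
Great! But... would it be too much to ask how I can put these into corresponding variables? For example:
[product-link] etc.: [0] »» $var_prod_link_0 = 'the link 0 here'; [1] »» $var_prod_link_1 = 'the link 1 here';
[product-img-src] etc.: [0] »» $var_prod_img_0 = 'the image 0 link here'; [1] »» $var_prod_img_1 = 'the image 1 link here';
[h3-title-text] etc.: [0] »» $var_title_text_0 = 'the title 0 here'; [1] »» $var_title_text_1 = 'the title 1 here';
[price] etc.: [0] »» $var_price_0 = 'the price 0 here'; [1] »» $var_price_1 = 'the price 1 here';
0 upvotes Professor Abronsius 9/7/2022
Iterate through the $res array and generate the variables. But why create lots of dynamic variables, which will be much harder to work with when you do not know in advance the name and number of each? IMO it is far simpler to use the generated $res output for whatever further processing you have in mind.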
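
As a rough illustration of working with $res directly instead of numbered variables, the sketch below regroups the per-field lists into one record per product. The $res literal is sample data in the same shape as the answer's output, and $products and the short keys link, img, title and price are names made up for this example. It assumes every product has all four fields, so that index $i refers to the same product in each list.

```php
<?php
// Sample data in the same shape as $res from the answer.
$res = [
    'product-link'    => ['https://...linkhere', 'https://www.example.com/banana'],
    'product-img-src' => ['https://image-goes-here.jpg', 'https://www.example.com/kittykat.jpg'],
    'h3-title-text'   => ['The title goes here', 'Oh look, another title!'],
    'price'           => ['€20,00', '540,00'],
];

// Build one record per product by walking the lists in parallel.
// A missing entry becomes null instead of raising a warning.
$products = [];
foreach ($res['product-link'] as $i => $link) {
    $products[] = [
        'link'  => $link,
        'img'   => $res['product-img-src'][$i] ?? null,
        'title' => $res['h3-title-text'][$i] ?? null,
        'price' => $res['price'][$i] ?? null,
    ];
}

// Each product is now addressable as $products[0], $products[1], ...
foreach ($products as $i => $p) {
    printf("%d: %s | %s | %s | %s\n", $i, $p['title'], $p['price'], $p['link'], $p['img']);
}
```

If a product on the real page lacks one of the fields, the lists fall out of step. Querying each product-grid-item div separately, with that div passed as the context node to DOMXPath::query(), would avoid that.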