Navigation Timeout Exceeded scraping table in Puppeteer

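Q:

I'm trying to scrape the first name in a table from a website that shows a basketball team along with its players' names and stats. When I do, the navigation timeout is exceeded, meaning the value was not scraped within the allotted time, and "Error loading data" appears on my client side. What am I doing wrong?

FYI: the code includes various debugging statements that aren't necessary for it to run.

Here is my JavaScript code: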

const puppeteer = require('puppeteer');
const express = require('express');
const app = express();
app.use(express.static("public"));

app.get('/scrape', async (req, res) => {
  let browser;
  try {
    console.log('Attempting to scrape data...');
    browser = await puppeteer.launch();
    const [page] = await browser.pages();

    // Increase the timeout to 60 seconds
    await page.goto('https://highschoolsports.nj.com/school/livingston-newark-academy/girlsbasketball/season/2022-2023/stats', { timeout: 60000 });

    // Wait for navigation to complete
    await page.waitForNavigation({ timeout: 60000 });

    const firstPlayerName = await page.$eval('tbody tr:first-child .text-left a', player => player.textContent.trim());

    console.log('Scraping successful:', firstPlayerName);

    res.json({ firstPlayerName });
  } catch (err) {
    console.error('Error during scraping:', err);
    res.status(500).json({ error: 'Internal Server Error' });
  } finally {
    await browser?.close();
  }
});

app.listen(3000, () => {
  console.log('Server is running on http://localhost:3000');
});

Here is my HTML code:

<!DOCTYPE html>
<html>
<head>
  <link rel="stylesheet" href="styles.css">
</head>
<body>
  <table>
    <p class="robo-header">Robo-Scout </p>
    <p class="robo-subheader"><br> Official Algorithmic Bakstball Scout</p>
    <tr>
      <td>
        <p id="myObjValue"> Loading... </p>
        <script>
          fetch('/scrape') // Send a GET request to the server
            .then(response => {
              if (!response.ok) {
                throw new Error('Network response was not ok');
              }
              return response.json();
            })
            .then(data => {
              console.log(data); // Check what data is received
              const myObjValueElement = document.getElementById('myObjValue');
              myObjValueElement.textContent = data.firstPlayerName || 'Player name not found';
            })
            .catch(error => {
              console.error(error);
              const myObjValueElement = document.getElementById('myObjValue');
              myObjValueElement.textContent = 'Error loading data'; // Display an error message
            });
        </script>
      </td>
    </tr>
  </table>
</body>
</html>

Here is the code in the table cell that I'm trying to scrape:

                                    <td class="text-left">

    <a href="/player/maddie-bulbulia/girlsbasketball/season/2022-2023">Maddie Bulbulia</a> <small class="text-muted">Sophomore • G</small>
</td>

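I tried debugging the code by logging both the case where the value isn't extracted and the error itself, to trace why the value isn't being extracted. I also tried increasing the navigation timeout from 30 seconds to 60 seconds in case my network was slow, but nothing changed.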

Tags: javascript, node.js, json, web-scraping, puppeteer

const puppeteer = require("puppeteer"); // ^21.4.1

const url = "<Your URL>";

let browser;
(async () => {
  browser = await puppeteer.launch({headless: "new"});
  const [page] = await browser.pages();
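  // The table is in the static HTML, so the page's JavaScript isn't needed
  await page.setJavaScriptEnabled(false);
  // Abort every request except the document itself (images, CSS, scripts, ...)
  await page.setRequestInterception(true);
  page.on("request", req =>
    req.url() === url ? req.continue() : req.abort()
  );
  // Resolve once the HTML is parsed instead of waiting for the "load" event
  await page.goto(url, {waitUntil: "domcontentloaded"});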
  const firstPlayerName = await page.$eval(
    "td.text-left a",
    player => player.textContent.trim()
  );
  console.log("Scraping successful:", firstPlayerName);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

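Going a step further, you may not even need Puppeteer. In Node 18+, you can make the request with native fetch and parse the data you want out of the response with a lightweight library like Cheerio.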

const cheerio = require("cheerio"); // ^1.0.0-rc.12

const url = "<Your URL>";

fetch(url)
  .then(res => {
    if (!res.ok) {
      throw Error(res.statusText);
    }

    return res.text();
  })
  .then(html => {
    const $ = cheerio.load(html);
    const firstPlayerName = $("td.text-left a").first().text();
    console.log(firstPlayerName); // => Maddie Bulbulia
  })
  .catch(err => console.error(err));

Here are some quick benchmarks.

Unoptimized Puppeteer (only using "domcontentloaded"):

real 0m2.974s
user 0m1.004s
sys  0m0.271s

Optimized Puppeteer (using DCL, plus disabling JS and blocking resources):

real 0m1.190s
user 0m0.510s
sys  0m0.114s

Fetch/Cheerio：

real 0m0.998s
user 0m0.261s
sys  0m0.049s

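If the data you're scraping doesn't change often, consider caching the scrape results periodically so you can serve them to users instantly and more reliably.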
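One way the caching idea could look, as a minimal sketch rather than a definitive implementation: wrap whatever async scraping function you end up with in a small time-based cache. The names `cached`, `scrape`, `ttlMs` and `now` are made up for illustration and are not part of Puppeteer, Cheerio or Express.

```javascript
// Sketch: wrap any async scraping function in an in-memory, time-based cache.
// `cached`, `scrape`, `ttlMs` and `now` are hypothetical names for illustration.
function cached(scrape, ttlMs = 10 * 60 * 1000, now = Date.now) {
  let value;          // last successful result
  let expires = 0;    // timestamp after which `value` is considered stale
  let pending = null; // in-flight scrape, shared by concurrent callers

  return async function getCached() {
    if (now() < expires) return value; // still fresh: serve from memory
    if (pending) return pending;       // a scrape is already running: reuse it
    pending = scrape()
      .then(result => {
        value = result;
        expires = now() + ttlMs;
        return result;
      })
      .finally(() => {
        pending = null; // allow a retry after success or failure
      });
    return pending;
  };
}

// Usage idea (assumes an Express `app` and some async scrapeFirstPlayerName()):
// const getFirstPlayerName = cached(scrapeFirstPlayerName);
// app.get("/scrape", async (req, res) => {
//   res.json({firstPlayerName: await getFirstPlayerName()});
// });
```

With this in place, only the first request after the cache expires pays for the scrape; everything else is answered from memory. The 10 minute default is an arbitrary assumption, so pick a lifetime that matches how often the stats page actually changes.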

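1. You're using both await page.goto and await page.waitForNavigation, and the two conflict: page.goto already waits for the navigation to finish, so the page.waitForNavigation call after it waits for a second navigation that never happens and eventually times out. Remove the await page.waitForNavigation line.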
```
 await page.goto('https://highschoolsports.nj.com/school/livingston-newark-academy/girlsbasketball/season/2022-2023/stats', { timeout: 60000 });
```

2. If the page takes longer to load because of a slow network or other factors, you may need to increase the timeout of the $eval function.

 const firstPlayerName = await page.$eval('tbody tr:first-child .text-left a', { timeout: 60000 }, player => player.textContent.trim());

3. Before using $eval, make sure the element you're trying to select actually exists on the page. Use page.waitForSelector to wait for the element to appear.

 await page.waitForSelector('tbody tr:first-child .text-left a');
 const firstPlayerName = await page.$eval('tbody tr:first-child .text-left a', player => player.textContent.trim());

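4. If the website is slow or you're running into network problems, increasing the timeout may not be enough. You can try adding a wait before making the request to make sure the page has fully loaded.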

await page.waitForTimeout(5000); 
await page.goto('https://highschoolsports.nj.com/school/livingston-newark-academy/girlsbasketball/season/2022-2023/stats', { timeout: 60000 });

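5. Also, you may want to check the console output in the Puppeteer browser to see whether there are any errors or messages that give more insight into the problem. You can disable headless mode when launching Puppeteer ({ headless: false }) to visually inspect what happens on the page during scraping.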
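For the console-output part, here is a rough sketch of forwarding browser-side messages to your Node terminal. It assumes `page` is a Puppeteer Page and relies on its "console", "pageerror" and "requestfailed" events; `attachPageLogging` and its `log` parameter are hypothetical names used only for this example.

```javascript
// Sketch: forward browser-side messages and errors to the Node terminal.
// `attachPageLogging` is a hypothetical helper; `page` is assumed to be a Puppeteer Page.
function attachPageLogging(page, log = console.log) {
  // "console" fires for console.log/warn/error calls made inside the page
  page.on("console", msg => log(`[page ${msg.type()}] ${msg.text()}`));
  // "pageerror" fires for uncaught exceptions thrown inside the page
  page.on("pageerror", err => log(`[page error] ${err.message}`));
  // "requestfailed" fires for requests that fail or are aborted
  page.on("requestfailed", req =>
    log(`[request failed] ${req.url()} ${req.failure()?.errorText ?? ""}`)
  );
}

// Usage idea (assumes puppeteer is installed):
// const browser = await puppeteer.launch({headless: false});
// const [page] = await browser.pages();
// attachPageLogging(page);
// await page.goto(url);
```

Attach the listeners before calling goto so that messages emitted while the page loads are not missed.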

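Point 1 is correct, but "domcontentloaded" is also needed in goto(). Point 2 is invalid: the evaluate family of functions has no timeout option. Point 3 is good advice in general but doesn't apply to OP's case, since the data is baked into the HTML. Point 4, increasing timeouts, is usually unnecessary and tends to suppress errors. Generally, if something doesn't work within 30 seconds or a minute, it never will. Point 5 is a good suggestion and has its own canonical thread.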
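Putting the corrections from this comment together, the scraping step from the question might end up looking something like the sketch below. `scrapeFirstPlayerName` is a hypothetical helper name, and `page` is assumed to be a Puppeteer Page created the same way as in the question. The selector is the one from the question.

```javascript
// Sketch of the question's scraping step with the corrections applied.
// `scrapeFirstPlayerName` is a hypothetical helper; `page` is assumed to be a Puppeteer Page.
async function scrapeFirstPlayerName(page, url) {
  // goto() already waits for the navigation; "domcontentloaded" resolves
  // once the HTML is parsed, which is enough because the table is in the HTML
  await page.goto(url, {waitUntil: "domcontentloaded"});

  // No waitForNavigation() here: no second navigation ever happens.
  // $eval takes no timeout option, only the selector and the callback.
  return page.$eval(
    "tbody tr:first-child .text-left a",
    player => player.textContent.trim()
  );
}

// Usage idea inside the /scrape route (assumes puppeteer and express as in the question):
// const [page] = await browser.pages();
// const firstPlayerName = await scrapeFirstPlayerName(page, url);
// res.json({firstPlayerName});
```

Keeping the page as a parameter is only a design choice for this sketch. It keeps browser setup and teardown in the route's try/finally, where the question already has them.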
