Readable HTML getting truncated

Asked by: Sunil  Asked: 12/23/2022  Last edited by: RubénSunil  Updated: 1/20/2023  Views: 72

Q:

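I am trying to extract the readable version of this page: https://app.termly.io/document/privacy-policy/93a0d7a9-a628-44b5-9748-4f853bed4112

However, I noticed that the readable content is truncated. I am using Mozilla Readability.

  1. Is there a specific reason it is being truncated?
  2. Can DOMPurify do this job?

The code is below: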

const { Readability } = require("@mozilla/readability");
const { JSDOM } = require("jsdom");
const { chromium } = require("playwright");
const fs = require("fs");

(async function () {
    let chromeBrowser;
    try {
        chromeBrowser = await chromium.launch({ headless: true });
        const context = await chromeBrowser.newContext({
            ignoreHTTPSErrors: true,
            userAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
        });
        const URL = "https://app.termly.io/document/privacy-policy/93a0d7a9-a628-44b5-9748-4f853bed4112";
        const page = await context.newPage();
        // Wait until network activity settles so the rendered HTML is captured.
        await page.goto(URL, { waitUntil: "networkidle", timeout: 60000 });
        const content = await page.content();
        console.log(content);
        // Re-parse the rendered HTML with jsdom so Readability can walk the DOM.
        const dom = new JSDOM(content, { url: URL });
        const article = new Readability(dom.window.document).parse();
        console.log(article);
        // parse() returns null when no article could be extracted.
        if (article) {
            fs.writeFileSync("./index.html", article.content);
        }
    } catch (error) {
        console.log(error);
    } finally {
        // Close the browser so the Node process can exit.
        if (chromeBrowser) await chromeBrowser.close();
    }
})();
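
To put a number on "truncated", one option is to compare how much text the rendered page holds with how much ends up in the extracted article. The snippet below is only a minimal sketch of that idea: `normalizeText` and `retainedRatio` are hypothetical helper names that do not come from Readability, jsdom, or Playwright, and the ratio is a rough diagnostic that says nothing about why content was dropped.

```javascript
// Hypothetical helper: collapse runs of whitespace so lengths are comparable.
function normalizeText(text) {
  return (text || "").replace(/\s+/g, " ").trim();
}

// Hypothetical helper: fraction of the page's text that the extractor kept.
// Returns 0 when the page text is empty, to avoid dividing by zero.
function retainedRatio(pageText, extractedText) {
  const pageLength = normalizeText(pageText).length;
  if (pageLength === 0) {
    return 0;
  }
  return normalizeText(extractedText).length / pageLength;
}

// Example with made-up strings standing in for real page text.
const ratio = retainedRatio("Section 1. Section 2. Section 3.", "Section 1.");
console.log(`retained ${(ratio * 100).toFixed(1)}% of the page text`);
```

In the script above, the page text could be read from `dom.window.document.body.textContent` and the extracted text from `article.textContent`. Because `parse()` modifies the document it is given, the page text would need to be captured before calling it.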
javascript playwright readability dompurify

A: No answers yet