Asked by: Sunil Asked: 12/23/2022 Last edited by: RubénSunil Updated: 1/20/2023 Views: 72
Readable HTML getting Truncated
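Question:
I am trying to extract the readable version of this page - [https://app.termly.io/document/privacy-policy/93a0d7a9-a628-44b5-9748-4f853bed4112][1]
However, I noticed that the readable content is truncated. I am using Mozilla Readability.
- Is there a specific reason it is being truncated?
- Would DOMPurify be able to do this job instead?
The code is as follows: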
const { Readability } = require("@mozilla/readability");
const { JSDOM } = require("jsdom");
const { chromium } = require("playwright");
const fs = require("fs");

(async function () {
  let chromeBrowser;
  try {
    chromeBrowser = await chromium.launch({ headless: true });
    const context = await chromeBrowser.newContext({
      ignoreHTTPSErrors: true,
      userAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    });
    const URL = "https://app.termly.io/document/privacy-policy/93a0d7a9-a628-44b5-9748-4f853bed4112";
    const page = await context.newPage();
    await page.goto(URL, { waitUntil: "networkidle", timeout: 60000 });
    // Serialized HTML of the page after rendering
    const content = await page.content();
    console.log(content);
    const dom = new JSDOM(content, { url: URL });
    // parse() returns null when Readability finds no article content
    const article = new Readability(dom.window.document).parse();
    console.log(article);
    if (article) {
      fs.writeFileSync("./index.html", article.content);
    }
  } catch (error) {
    console.log(error);
  } finally {
    // Close the browser so the Node process can exit
    if (chromeBrowser) await chromeBrowser.close();
  }
})();
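One way to narrow down where the truncation happens might be to compare the text of the rendered page with the text Readability returns. The sketch below is only a diagnostic aid, not a fix: `normalize` and `extractionCoverage` are hypothetical helpers (they are not part of Readability, jsdom, or Playwright), and the two strings are stand-ins. In the script above they would presumably be `dom.window.document.body.textContent`, read before calling `parse()` because `parse()` modifies the document, and `article.textContent`.

```javascript
// Hypothetical helpers for illustration only; not part of any library.

// Collapse runs of whitespace so the two texts are comparable.
function normalize(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Rough estimate of how much of the rendered text survived extraction.
function extractionCoverage(fullText, extractedText) {
  const full = normalize(fullText);
  const extracted = normalize(extractedText);
  if (full.length === 0) {
    return { ratio: 0, tail: "" };
  }
  return {
    // Length of the extracted text relative to the rendered text, capped at 1.
    ratio: Math.min(1, extracted.length / full.length),
    // Last 80 characters of the extracted text, to see where it stops.
    tail: extracted.slice(-80),
  };
}

// Stand-in strings (assumed data, not taken from the real page).
const rendered =
  "Section 1. What we collect.  Section 2. How we use it. Section 3. Contact us.";
const readable = "Section 1. What we collect. Section 2.";

console.log(extractionCoverage(rendered, readable));
```

A low ratio together with a `tail` that ends mid-document would suggest the content is being lost in the Readability step; if the rendered text itself is already short, the loss more likely happens earlier, while the page is being loaded.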
Answers: No answers yet