Asked by: Antonio Gomez Alvarado · Asked: 3/13/2018 · Updated: 10/5/2023 · Views: 68665
How to download file with puppeteer using headless: true?
Q:
I've been running the following code in order to download a csv file from the website http://niftyindices.com/resources/holiday-calendar:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({headless: true});
const page = await browser.newPage();
await page.goto('http://niftyindices.com/resources/holiday-calendar');
await page._client.send('Page.setDownloadBehavior', {behavior: 'allow',
downloadPath: '/tmp'})
await page.click('#exportholidaycalender');
await page.waitFor(5000);
await browser.close();
})();
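With headless: false it works: it downloads the file into /Users/user/Downloads. With headless: true it does NOT work.
I'm running this on macOS Sierra (MacBook Pro) using puppeteer version 1.1.1, which pulls Chromium version 66.0.3347.0 into the .local-chromium/ directory. I used npm init and npm i --save puppeteer to set it up.
Any idea what's wrong?
Thanks in advance for your time and help,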
A:
This page downloads a csv by creating a comma-delimited string and forcing the browser to download it by setting the data type, like so:
let uri = "data:text/csv;charset=utf-8," + encodeURIComponent(content);
window.open(uri, "Some CSV");
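On Chrome this opens a new tab.
You can tap into this event and physically download the contents into a file. Not sure if this is the best way, but it works well.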
const browser = await puppeteer.launch({
headless: true
});
browser.on('targetcreated', async (target) => {
let s = target.url();
//the test opens an about:blank to start - ignore this
if (s == 'about:blank') {
return;
}
//unencode the characters after removing the content type
s = s.replace("data:text/csv;charset=utf-8,", "");
//clean up string by unencoding the %xx
...
fs.writeFile("/tmp/download.csv", s, function(err) {
if(err) {
console.log(err);
return;
}
console.log("The file was saved!");
});
});
const page = await browser.newPage();
.. open link ...
.. click on download link ..
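The decoding step elided in the snippet above could look roughly like the following. This is only a sketch: csvFromDataUri is a hypothetical helper name, and it assumes the data URL is percent-encoded (as produced by encodeURIComponent), not base64-encoded.

```javascript
// Hypothetical helper: extracts the payload of a percent-encoded data: URL.
// Returns null for anything that is not a data: URL (for example about:blank).
function csvFromDataUri(uri) {
  if (!uri.startsWith('data:')) return null;
  const comma = uri.indexOf(',');
  if (comma === -1) return null;
  // Everything after the first comma is the encoded content.
  return decodeURIComponent(uri.slice(comma + 1));
}

// Usage inside the targetcreated handler (target comes from Puppeteer):
// const csv = csvFromDataUri(target.url());
// if (csv !== null) fs.writeFileSync('/tmp/download.csv', csv);
```

Splitting at the first comma instead of stripping a fixed prefix means the helper does not depend on the exact media type in the URL.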
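I needed to download a file from behind a login, which was being handled by Puppeteer. targetcreated was not being triggered. In the end I downloaded it with request, after copying the cookies over from the Puppeteer instance.
In this case I'm streaming the file through, but you could just as easily save it.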
res.writeHead(200, {
"Content-Type": 'application/octet-stream',
"Content-Disposition": `attachment; filename=secretfile.jpg`
});
let cookies = await page.cookies();
let jar = request.jar();
for (let cookie of cookies) {
jar.setCookie(`${cookie.name}=${cookie.value}`, "http://secretsite.com");
}
try {
var response = await request({ url: "http://secretsite.com/secretfile.jpg", jar }).pipe(res);
} catch(err) {
console.trace(err);
return res.send({ status: "error", message: err });
}
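The request package has since been deprecated, so the same idea may be easier to reuse with another HTTP client by sending the cookies as a plain Cookie header. A minimal sketch, assuming the cookie objects have the name and value fields returned by page.cookies(); cookieHeader is a hypothetical helper name.

```javascript
// Hypothetical helper: builds a Cookie request header value from cookie objects
// that carry `name` and `value` properties.
function cookieHeader(cookies) {
  return cookies.map((cookie) => `${cookie.name}=${cookie.value}`).join('; ');
}

// Usage (URL is the placeholder from the answer above; needs a client such as
// the global fetch in recent Node versions):
// const cookies = await page.cookies();
// const response = await fetch('http://secretsite.com/secretfile.jpg', {
//   headers: { cookie: cookieHeader(cookies) },
// });
```

This sends every cookie regardless of domain or path, which is fine for a single known site but would need filtering in a more general tool.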
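I spent hours poring through this thread and Stack Overflow yesterday, trying to figure out how to get Puppeteer to download a csv file by clicking a download link in headless mode in an authenticated session. The accepted answer here didn't work in my case because the download does not trigger targetcreated, and the next answer, for whatever reason, did not retain the authenticated session. This article saved the day. In short: fetch. Hope this helps someone else out.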
const res = await this.page.evaluate(() =>
{
return fetch('https://example.com/path/to/file.csv', {
method: 'GET',
credentials: 'include'
}).then(r => r.text());
});
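Reading the body with r.text() suits csv, but it can corrupt binary files. One possible variation is to return the bytes as base64 and decode them in Node, since page.evaluate can only pass serializable values back. The sketch below assumes an existing Puppeteer page; fetchAsBase64 and saveBase64 are hypothetical names, and the URL and path in the usage comment are placeholders.

```javascript
const fs = require('fs');

// Runs inside the page (pass it to page.evaluate): fetches with the page's
// cookies and returns the response body as a base64 string.
async function fetchAsBase64(url) {
  const r = await fetch(url, { method: 'GET', credentials: 'include' });
  if (!r.ok) throw new Error(`HTTP ${r.status}`);
  const bytes = new Uint8Array(await r.arrayBuffer());
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// Runs in Node: decodes the base64 string and writes the raw bytes to disk.
function saveBase64(base64, filePath) {
  fs.writeFileSync(filePath, Buffer.from(base64, 'base64'));
}

// Usage:
// const b64 = await page.evaluate(fetchAsBase64, 'https://example.com/path/to/file.csv');
// saveBase64(b64, '/tmp/file.csv');
```

The whole file passes through memory as a string, so this is better suited to small and medium downloads.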
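The problem is that the browser closes before the download has finished.
You can get the file size and the file name from the response, and then use a watch script to check the size of the downloaded file, so that you know when to close the browser.
Here is an example: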
const filename = "set this with some regex in response";
const dir = "watch folder or file";
// Register the listener before clicking so the download response is not missed
page.on('response', response => {
const headers = response.headers();
// If response has a file on it
if (headers['content-disposition'] === `attachment;filename=${filename}`) {
// Get the size
console.log('Size from header: ', headers['content-length']);
// Watch event on download folder or file
fs.watchFile(dir, function (curr, prev) {
// If current size eq to size from response then stop watching and close
if (parseInt(curr.size) === parseInt(headers['content-length'])) {
fs.unwatchFile(dir);
browser.close();
}
});
}
});
// Trigger the download
await page.click('#DownloadFile');
The way the response is matched could be improved, but I hope you'll find this useful.
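For the "set this with some regex in response" part, one possible sketch is a small parser for the Content-Disposition header. filenameFromContentDisposition is a hypothetical helper; it handles plain and quoted filename= values only and makes no attempt at the extended filename*= form.

```javascript
// Hypothetical helper: pulls the filename out of a Content-Disposition value
// such as `attachment;filename=report.csv` or `attachment; filename="my report.csv"`.
// Returns null when no filename parameter is present.
function filenameFromContentDisposition(header) {
  const match = /filename\s*=\s*("([^"]*)"|[^;\s]+)/i.exec(header || '');
  if (!match) return null;
  // match[2] is set only for the quoted form.
  return match[2] !== undefined ? match[2] : match[1];
}

// Usage inside a response listener:
// const name = filenameFromContentDisposition(response.headers()['content-disposition']);
```

Matching on the parsed name avoids depending on the exact spacing and quoting of the header string.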
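I have another solution to this problem, since none of the answers here worked for me.
I needed to log into a website and download some .csv reports. Headed was fine; headless failed no matter what I tried. Looking at the network errors, the download was aborted, but I couldn't (quickly) determine why.
So I intercepted the requests and used node-fetch to make the request outside of puppeteer. This required copying the fetch options, body and headers, and adding the access cookie.
Good luck.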
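The answer gives no code, so the following is only a guess at the shape of the copying step. It assumes request is a Puppeteer HTTPRequest (as passed to page.on('request', ...)) and that the cookie header string has already been assembled; buildFetchOptions is a hypothetical name.

```javascript
// Hypothetical helper: copies method, headers and body from an intercepted
// request into an options object usable with fetch or node-fetch, and adds
// the cookie header supplied by the caller.
function buildFetchOptions(request, cookieHeaderValue) {
  const options = {
    method: request.method(),
    headers: { ...request.headers(), cookie: cookieHeaderValue },
  };
  const body = request.postData();
  // GET requests have no body; only set it when one was sent.
  if (body !== undefined) options.body = body;
  return options;
}

// Usage:
// const res = await fetch(request.url(), buildFetchOptions(request, 'session=...'));
```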
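I found a way to wait for the browser's own download capability to fetch the file. The idea is to wait for a response that matches a predicate. In my case the URL ends with "/data".
I just didn't like loading the file contents into a buffer.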
await page._client.send('Page.setDownloadBehavior', {
behavior: 'allow',
downloadPath: download_path,
});
await frame.focus(report_download_selector);
await Promise.all([
page.waitForResponse(r => r.url().endsWith('/data')),
page.keyboard.press('Enter'),
]);
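setDownloadBehavior works fine in headless: true mode and the file is eventually downloaded, but an exception is thrown when it finishes, so in my case a simple wrapper helps to forget about this issue and just get the job done: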
const fs = require('fs');
function DownloadMgr(page, downloaddPath) {
if(!fs.existsSync(downloaddPath)){
fs.mkdirSync(downloaddPath);
}
var init = page.target().createCDPSession().then((client) => {
return client.send('Page.setDownloadBehavior', {behavior: 'allow', downloadPath: downloaddPath})
});
this.download = async function(url) {
await init;
try{
await page.goto(url);
}catch(e){}
return Promise.resolve();
}
}
var path = require('path');
var DownloadMgr = require('./classes/DownloadMgr');
var downloadMgr = new DownloadMgr(page, path.resolve('./tmp'));
await downloadMgr.download('http://file.csv');
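One way I found is to use the addScriptTag method. It works whether headless is False or True.
Using this, any kind of web page can be downloaded. Now suppose the page has opened a link such as: https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4
The page, meaning the mp4 file, will be downloaded using the following script: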
await page.addScriptTag({'content':'''
function fileName(){
link = document.location.href
return link.substring(link.lastIndexOf('/')+1);
}
async function save() {
bl = await fetch(document.location.href).then(r => r.blob());
var a = document.createElement("a");
a.href = URL.createObjectURL(bl);
a.download = fileName();
a.hidden = true;
document.body.appendChild(a);
a.innerHTML = "download";
a.click();
}
save()
'''
})
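I had a more difficult variation of this, using Puppeteer Sharp. I needed both headers and cookies set before the download would begin.
In essence, before the button click, I had to process multiple responses and handle the single response that carried the download. Once I had that particular response, I had to attach headers and cookies so that the remote server would send the downloadable data in the response.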
await using (var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true, Product = Product.Chrome }))
await using (var page = await browser.NewPageAsync())
{
...
// Handle multiple responses and process the Download
page.Response += async (sender, responseCreatedEventArgs) =>
{
if (!responseCreatedEventArgs.Response.Headers.ContainsKey("Content-Type"))
return;
// Handle the response with the Excel download
var contentType = responseCreatedEventArgs.Response.Headers["Content-Type"];
if (contentType.Contains("application/vnd.ms-excel"))
{
string getUrl = responseCreatedEventArgs.Response.Url;
// Add the cookies to a container for the upcoming Download GET request
var pageCookies = await page.GetCookiesAsync();
var cookieContainer = BuildCookieContainer(pageCookies);
await DownloadFileRequiringHeadersAndCookies(getUrl, fullPath, cookieContainer, cancellationToken);
}
};
await page.ClickAsync("button[id^='next']");
// NEED THIS TIMEOUT TO KEEP THE BROWSER OPEN WHILE THE FILE IS DOWNLOADING!
await page.WaitForTimeoutAsync(1000 * configs.DownloadDurationEstimateInSeconds);
}
Populate the cookie container like this:
private CookieContainer BuildCookieContainer(IEnumerable<CookieParam> cookies)
{
var cookieContainer = new CookieContainer();
foreach (var cookie in cookies)
{
cookieContainer.Add(new Cookie(cookie.Name, cookie.Value, cookie.Path, cookie.Domain));
}
return cookieContainer;
}
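The details of DownloadFileRequiringHeadersAndCookies are here. If your file download needs are simpler, you can probably use the other methods mentioned in this thread, or in the linked thread.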
I had the same problem, but it can be solved more simply than you might think :)
I just fixed your code:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({headless: true});
const page = await browser.newPage();
await page.goto('http://niftyindices.com/resources/holiday-calendar');
const client = await page.target().createCDPSession();
await client.send('Page.setDownloadBehavior', {behavior: 'allow', downloadPath: '/tmp'})
await page.click('#exportholidaycalender');
await page.waitFor(5000);
await browser.close();
})();
Details:
Create the client object: const client = await page.target().createCDPSession();
Change page._client -> client
I hope this way is the simplest for you.
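The fixed five-second wait in the code above can be too short for large files, and newer Puppeteer releases may no longer provide page.waitFor. One alternative, sketched here under the assumption that downloadPath points at a dedicated, initially empty directory, is to poll that directory until nothing is left in progress. Chrome keeps unfinished downloads under a .crdownload suffix; hasPendingDownloads and waitForDownloads are hypothetical helper names.

```javascript
const fs = require('fs');

// True while Chrome still has an unfinished download in the directory.
function hasPendingDownloads(dir) {
  return fs.readdirSync(dir).some((name) => name.endsWith('.crdownload'));
}

// Polls until the directory contains at least one file and none of them is
// still in progress. Resolves to false if the timeout elapses first.
async function waitForDownloads(dir, timeoutMs = 30000, intervalMs = 250) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (fs.readdirSync(dir).length > 0 && !hasPendingDownloads(dir)) return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}

// Usage, in place of the fixed wait (the directory name is a placeholder):
// await page.click('#exportholidaycalender');
// await waitForDownloads('/tmp/puppeteer-downloads');
// await browser.close();
```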
As mentioned earlier, try this code:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({headless: true});
const page = await browser.newPage();
await page.goto('http://niftyindices.com/resources/holiday-calendar');
const client = await page.target().createCDPSession();
await client.send('Page.setDownloadBehavior', {behavior: 'allow', downloadPath: '/tmp'})
await page.click('#exportholidaycalender');
await page.waitFor(5000);
await browser.close();
})();
Comments
Log line reported with --enable-logging on the browser:
[0313/104723.451228:VERBOSE1:navigator_impl.cc(200)] Failed Provisional Load: data:application/csv;charset=utf-8,%22SR.%20NO.... error_description: , showing_repost_interstitial: 0, frame_id: 4