使用批处理脚本 curl 获取 UTF-8 JSON 文件而不是 UTF-16 LE BOM

Getting UTF-8 JSON file instead of UTF-16 LE BOM using batch script curl

提问人:cyborg1234 提问时间:7/28/2023 更新时间:10/13/2023 访问量:153

问:

我有一个批处理脚本,它使用以下命令执行 curl 命令:

powershell -命令 “curl.exe -k -s -x get -h 'Version: 16.0' -H 'Accept: application/json' 'https://IntendedURL' > result.json”。

然后,我将result.json用作另一个程序的输入文件。

我面临的问题是运行批处理脚本会导致格式为 UTF-16 LE BOM 的result.json文件。使用 UTF-16 LE BOM 格式,甚至无法使用普通浏览器直接打开和观察 .json 文件。它有异常“SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data”。

我做了一些搜索,得到的答案是 curl 不应该导致保存 BOM 编码。尽管如此,我最终还是得到了 result.json 是 BOM 编码。 如果可以对批处理脚本执行某些操作以使其在没有 BOM 的情况下输出,我们将不胜感激。

对于我问题的第二部分,我可以手动复制result.json的内容并将其保存在新的文本文档中,并特别选择 UTF-8 格式。我也四处搜索,找到了一些解决方案,如下所示:批处理脚本从文件中删除 BOM(),但删除 BOM 的解决方案脚本似乎相当复杂。

因此,任何帮助删除 BOM 编码而无需手动复制内容并将其保存为特定所需的格式也是值得赞赏的。

此致敬意 机器人1234

PowerShell curl UTF-8 UTF-16 字节顺序标记

评论

1赞 jdweng 7/28/2023
阅读以下内容:stackoverflow.com/questions/56724398/...
0赞 mklement0 7/29/2023
@jdweng,链接的帖子是关于 Python 的字符编码选择,这里不适用。在这里,Power Shell的字符编码行为是问题所在。
0赞 jdweng 7/29/2023
@mklement0 :问题出在 Python 上,链接中将 stdeout 保存到 utf-8 文件的解决方案似乎是合理的。
0赞 mklement0 7/29/2023
@jdweng,如果您尝试使用 PowerShell 保存到 UTF-8 文件(这就是本问题的内容),您最终会遇到 (a) 字符编码不匹配和误解的非 ASCII 字符,以及 (b) 在 Windows PowerShell 中使用带有 BOM 的 UTF-8 文件,这两者都必须避免。链接帖子中的任何内容都对这些问题没有任何帮助。

答:

0赞 mklement0 7/29/2023 #1

注意:

  • 请参阅底部,了解通过直接从批处理文件调用来避免该问题的方法。curl.exe

在 PowerShell 中,(实际上)是 Out-File cmdlet 的别名,其在 Windows PowerShell 中的默认值为 UTF-16LE(在 PowerShell (Core) 7+ 中,它现在是所有 cmdlet 中更明智的 (无 BOM) UTF-8)。>

有两个选项(剧透警报:选择第二个):

  • 原则上,您可以显式使用(或者最好与文本输入一起使用 Set-Content),在这种情况下,您可以使用其参数来控制输出字符编码。然而:Out-File-Encoding

    • Windows PowerShell 中,编码 UTF8 总是使用 BOM 创建 UTF-8 文件。除了直接使用 .NET API 之外,Windows PowerShell 中创建无 BOM 的 UTF-8 文件的唯一方法是使用 New-Item - 请参阅此答案

    • 即使清除该障碍通常也是不够的,以下内容也会影响 PowerShell (Core) 至 v7.3.x(但在 v7.4+ 中不再是问题 - 请参阅此答案):

      • When PowerShell captures output from external programs it currently invariably decodes them into .NET strings first - even when redirecting to a file - using the character encoding stored in , and then re-encodes on output, based on the specified or default character encoding of the cmdlet used.[Console]::OutputEncoding

      • If that encoding doesn't match the character encoding of what outputs, the output will be misinterpreted - and given that defaults to the legacy system locale's OEM code page, misinterpretation will occur if the output is UTF-8-encoded, so - unless you've explicitly configured your system to use UTF-8 system-wide (see this answer) - you'll need to (temporarily) set [Console]::OutputEncoding = [Text.UTF8Encoding]::new() before the curl.exe call.curl.exe[Console]::OutputEncoding

  • Preferably, let curl.exe itself write the output file, i.e. replace with : this will save the output as-is to the target file.> result.json-o result.json


Taking a step back:

  • You can make your curl.exe call directly from your batch file, which avoids the character-encoding problems.

  • cmd.exe's operator, unlike PowerShell's is a raw byte conduit, so no data corruption should occur - that said, you're free to use instead.>-o result.json

  • When calling from (a batch file), only quoting is supported, so the strings have been transformed into ones below.cmd.exe"..."'...'"..."

:: Directly from your batch file, using "..." quoting:
curl.exe -k -S -X GET -H "Version: 16.0" -H "Accept: application/json" "https://IntendedURL" > result.json

评论

0赞 cyborg1234 8/1/2023
Hi @mklement0, I have tried the method of changing ">" to "-o" and the method of calling curl.exe directly without the "powershell -Command". Both methods result in result.json of the format UTF-8 which seems to be what I want to achieve. However, when I open this UTF-8 result.json file in a normal browser, the exception of "SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data" still shows. And when I manually save the content of the file into new UTF-8 text file, the exception occurs too, which differs from when I saved the content from UTF-16 LE BOM manually.
0赞 mklement0 8/1/2023
@cyborg1234, so you now have a UTF-8 file without a BOM, but the file doesn't contain valid JSON, due to characters at the beginning? If you inspect the file visually in a text editor, does it look like valid JSON? Is able to parse it? Here's an example of content that would break JSON parsing: ConvertFrom-Json'x' | ConvertFrom-Json