What is the problem with my Python3 script's parsing of non-ASCII UTF-8 arguments under Linux?

Q:

Consider this trivial Python3 script, which simply prints the contents of sys.argv as-is, and also encoded as UTF-8:

import locale
import sys

print("filesystem encoding is:      ", sys.getfilesystemencoding())
print("local preferred encoding is: ", locale.getpreferredencoding())

print("sys.argv is:")
print(sys.argv)

for a in sys.argv:
   print("Next arg is: ", a)
   print("UTF-8 encoding of arg is: ", a.encode())

If I run this script on a Mac (OSX/Ventura 13.3.1/Intel) via Python 3.11.4, with an argument that contains a non-ASCII UTF-8 character, I get the expected result:

$ python /tmp/supportfiles/test.py Joe’s
filesystem encoding is:       utf-8
local preferred encoding is:  UTF-8
sys.argv is:
['./test.py', 'Joe’s']
Next arg is:  ./test.py
UTF-8 encoding of arg is:  b'./test.py'
Next arg is:  Joe’s
UTF-8 encoding of arg is:  b'Joe\xe2\x80\x99s'

However, if I run the same command with the same argument under Linux (Ubuntu 3.19.0, Linux, Python 3.7.0), things go wrong and the script throws a UnicodeEncodeError exception:

filesystem encoding is:       utf-8         
local preferred encoding is:  UTF-8         
sys.argv is:        
['./test.py', 'Joe\udce2\udc80\udc99s']
Next arg is:  ./test.py
UTF-8 encoding of arg is:  b'./test.py'
Next arg is:  Joe’s         
Traceback (most recent call last):          
  File "./test.py", line 13, in <module>
    print("UTF-8 encoding of arg is: ", a.encode())         
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 3-5: surrogates not allowed

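My question is: is this a bug in Python, a bug in my Linux box's localization environment, or am I doing something wrong?

Also, is there anything I can do to get my script to correctly handle command-line arguments containing non-ASCII characters on all OSes?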

python linux macos unicode utf-8

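@jeremy-friesner, assuming Python 3.7 from here on: what does the locale command report as the system locale? Is the PYTHONUTF8 environment variable set? Is utf8_mode enabled? Check sys.flags.utf8_mode. If utf8_mode is set, sys.getfilesystemencoding will report UTF-8 regardless of the actual encoding. sys.argv is decoded with the "surrogateescape" error handler, which would explain the presence of low surrogates. Is PYTHONIOENCODING also set, and if so, what value does it have?

1 upvote, Andj 7/6/2023

It appears to be a distribution that uses musl libc. My favorite quote from the musl libc wiki: "Locale support is very limited and barely functional." I suspect you are using the C locale.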
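The checks requested in the comment above can be gathered in one place. This is only a sketch: the function name describe_encoding_environment and the choice of environment variables to report are my own, not something from the thread.

```python
import locale
import os
import sys


def describe_encoding_environment():
    """Collect the interpreter and locale settings that affect argv decoding.

    The function name and the set of variables reported are illustrative choices.
    """
    env_names = ("LANG", "LC_ALL", "LC_CTYPE", "PYTHONUTF8", "PYTHONIOENCODING")
    return {
        # 1 if UTF-8 mode is on (PYTHONUTF8=1, -X utf8, or enabled automatically).
        "sys.flags.utf8_mode": sys.flags.utf8_mode,
        "sys.getfilesystemencoding()": sys.getfilesystemencoding(),
        # Typically 'surrogateescape' on POSIX systems.
        "sys.getfilesystemencodeerrors()": sys.getfilesystemencodeerrors(),
        "locale.getpreferredencoding(False)": locale.getpreferredencoding(False),
        # Calling setlocale() without a second argument only queries the setting.
        "locale.setlocale(LC_CTYPE)": locale.setlocale(locale.LC_CTYPE),
        "env": {name: os.environ.get(name) for name in env_names},
    }


if __name__ == "__main__":
    for key, value in describe_encoding_environment().items():
        print(f"{key}: {value!r}")
```

Running this on both the Mac and the Linux box and comparing the output should show which of these settings differ between the two.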
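To illustrate how the surrogates in the question's output can arise, here is a small sketch that decodes the raw UTF-8 bytes of the argument as ASCII with the surrogateescape handler. It is an assumption that this is what happened on the asker's box; the helper name decode_like_c_locale is made up for the example.

```python
def decode_like_c_locale(raw: bytes) -> str:
    """Decode bytes as ASCII, smuggling every non-ASCII byte through as a
    lone low surrogate (U+DC80..U+DCFF). Hypothetical helper for illustration."""
    return raw.decode("ascii", errors="surrogateescape")


raw_arg = "Joe\u2019s".encode("utf-8")      # b'Joe\xe2\x80\x99s'
mangled = decode_like_c_locale(raw_arg)

# ascii() is used so the lone surrogates can be printed safely.
print(ascii(mangled))                        # 'Joe\udce2\udc80\udc99s'

try:
    mangled.encode("utf-8")                  # strict encoding rejects surrogates
except UnicodeEncodeError as exc:
    print("strict encode fails:", exc.reason)

# The original bytes are still recoverable with the same error handler.
print(mangled.encode("ascii", errors="surrogateescape") == raw_arg)
```

The three escaped code points match the sys.argv value shown in the question, and the strict encode failure matches the "surrogates not allowed" error in the traceback.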

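A:

0 upvotes, Jeremy Friesner 7/6/2023 #1

It appears that the answer to my question is basically that my embedded Linux environment does not handle internationalization properly.

For posterity's sake, here is the work-around code I inserted into my Python script to "fix" the malformed sys.argv strings into proper UTF-8 that Python's encoder will accept: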

def IsSurrogateMarker(ordVal):
   highByte = (ordVal & 0xFF00) >> 8
   return ((highByte >= 0xD8) and (highByte <= 0xDF))

# Nasty hack work-around for embedded Linux OS
# mis-encoding UTF-8 argv strings with surrogate bytes
def FixBuggyString(s):
   pendingBytes = []
   r = ''
   for c in s:
      ordVal = ord(c)
      if (IsSurrogateMarker(ordVal)):
         pendingBytes.append(ordVal & 0xFF)  # strip the bogus surrogate-marker
      elif (len(pendingBytes) > 0):
         r += bytearray(pendingBytes).decode()
         pendingBytes = []
         r += c
      else:
         r += c
   if (len(pendingBytes) > 0):
      r = r + bytearray(pendingBytes).decode()
   return r

if __name__ == "__main__":
    # Work-around for buggy UTF-8 in argv strings
    # (FixBuggyString is a module-level function, so there is no "self" here)
    for i in range(len(sys.argv)):
        sys.argv[i] = FixBuggyString(sys.argv[i])

    [... remainder of program goes here...]
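A shorter alternative to the manual byte-stripping loop above is to let Python's own surrogateescape handler undo the escaping. This is a sketch rather than the method the answer used; the function name repair_surrogate_escapes is hypothetical, and it assumes the original argument bytes were valid UTF-8.

```python
import sys


def repair_surrogate_escapes(s: str) -> str:
    """Turn surrogate-escaped bytes back into the characters they encode.

    Encoding with 'surrogateescape' converts each lone surrogate in
    U+DC80..U+DCFF back to its original byte; decoding the result as strict
    UTF-8 then yields the intended text. If the string cannot be round-tripped
    (other surrogates, or bytes that are not valid UTF-8), it is returned as-is.
    """
    try:
        return s.encode("utf-8", errors="surrogateescape").decode("utf-8")
    except UnicodeError:
        return s


if __name__ == "__main__":
    sys.argv = [repair_surrogate_escapes(arg) for arg in sys.argv]
    for arg in sys.argv:
        print(ascii(arg))
```

Strings without escaped bytes pass through unchanged, so the same code should be safe to run on systems where sys.argv is already decoded correctly.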
