Asked by: Jeremy Friesner  Asked: 7/2/2023  Last edited by: Jeremy Friesner  Updated: 7/6/2023  Views: 83
What is the problem with my Python3 script's parsing of non-ASCII UTF-8 arguments under Linux?
Q:
Consider this trivial Python3 script, which just prints out the contents of sys.argv, both as-is and encoded as UTF-8:
import locale
import sys
print("filesystem encoding is: ", sys.getfilesystemencoding())
print("local preferred encoding is: ", locale.getpreferredencoding())
print("sys.argv is:")
print(sys.argv)
for a in sys.argv:
    print("Next arg is: ", a)
    print("UTF-8 encoding of arg is: ", a.encode())
If I run this script on my Mac (OSX/Ventura 13.3.1/Intel) via Python 3.11.4, with an argument that contains a non-ASCII UTF-8 character, I get the expected result:
$ python /tmp/supportfiles/test.py Joe’s
filesystem encoding is: utf-8
local preferred encoding is: UTF-8
sys.argv is:
['./test.py', 'Joe’s']
Next arg is: ./test.py
UTF-8 encoding of arg is: b'./test.py'
Next arg is: Joe’s
UTF-8 encoding of arg is: b'Joe\xe2\x80\x99s'
However, if I run the same command with the same argument under Linux (Ubuntu 3.19.0, Linux, Python 3.7.0), things go wrong and the script throws a UnicodeEncodeError exception:
filesystem encoding is: utf-8
local preferred encoding is: UTF-8
sys.argv is:
['./test.py', 'Joe\udce2\udc80\udc99s']
Next arg is: ./test.py
UTF-8 encoding of arg is: b'./test.py'
Next arg is: Joe’s
Traceback (most recent call last):
  File "./test.py", line 13, in <module>
    print("UTF-8 encoding of arg is: ", a.encode())
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 3-5: surrogates not allowed
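My question is: is this a bug in Python, a bug in my Linux box's localization environment, or am I doing something wrong?
Also, is there anything I can do to make my script correctly handle command-line arguments containing non-ASCII characters on all OSes?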
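The escaped sequence `\udce2\udc80\udc99` in the Linux output has the shape that Python's `surrogateescape` error handler produces when the bytes E2 80 99 (the UTF-8 encoding of ’) cannot be decoded in whatever encoding is applied to argv. The sketch below reproduces that shape in isolation. Decoding as ASCII is an assumption made purely for illustration, not a confirmed diagnosis of this particular system.

```python
# Raw bytes the shell passes for the argument Joe’s (UTF-8 encoded).
raw = b"Joe\xe2\x80\x99s"

# Assumption for illustration: the bytes are decoded with an ASCII codec
# plus the surrogateescape handler, so each undecodable byte 0xNN
# becomes the lone surrogate U+DCNN.
mangled = raw.decode("ascii", "surrogateescape")
print(ascii(mangled))  # 'Joe\udce2\udc80\udc99s'

# A strict UTF-8 encode rejects lone surrogates, as in the traceback.
try:
    mangled.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# Encoding with surrogateescape restores the original bytes, which then
# decode cleanly as UTF-8.
recovered = mangled.encode("utf-8", "surrogateescape").decode("utf-8")
print(ascii(recovered))  # 'Joe\u2019s'
```

Note that `print()` of the mangled string can still display Joe’s on a terminal, because the output stream may write the escaped bytes back out unchanged; only the explicit `a.encode()` call is strict.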
A:
0 votes
Jeremy Friesner
7/6/2023
#1
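It appears that the answer to my question is basically that my embedded Linux environment doesn't handle internationalization correctly.
For posterity's sake, here is the work-around code I inserted into my Python script to "fix" the malformed sys.argv strings into proper UTF-8 that Python's encoders will accept: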
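One way to check whether the environment is at fault is to print the settings Python consults when it decodes argv and compare them between the two machines. This is a minimal sketch: the helper name `describe_text_environment` is hypothetical, and the list of environment variables is an assumption about what is worth inspecting.

```python
import locale
import os
import sys


def describe_text_environment():
    """Collect settings that influence how Python decodes argv (hypothetical helper)."""
    info = {
        "filesystem_encoding": sys.getfilesystemencoding(),
        "filesystem_errors": sys.getfilesystemencodeerrors(),
        "preferred_encoding": locale.getpreferredencoding(),
        "utf8_mode": sys.flags.utf8_mode,  # available from Python 3.7
    }
    # Environment variables commonly involved in locale selection.
    for name in ("LC_ALL", "LC_CTYPE", "LANG", "PYTHONUTF8", "PYTHONIOENCODING"):
        info[name] = os.environ.get(name)
    return info


if __name__ == "__main__":
    for key, value in describe_text_environment().items():
        print(f"{key}: {value!r}")
```

If the locale variables are unset or set to C/POSIX on the embedded box, that would be consistent with the surrogate-escaped arguments, though it would not prove the cause on its own.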
import sys

def IsSurrogateMarker(ordVal):
    # True if the code point lies in the surrogate range U+D800..U+DFFF.
    # (A plain range check also avoids false positives above U+FFFF.)
    return 0xD800 <= ordVal <= 0xDFFF

# Nasty hack work-around for embedded Linux OS
# mis-encoding UTF-8 argv strings with surrogate bytes
def FixBuggyString(s):
    pendingBytes = []
    r = ''
    for c in s:
        ordVal = ord(c)
        if (IsSurrogateMarker(ordVal)):
            pendingBytes.append(ordVal & 0xFF)  # strip the bogus surrogate-marker
        elif (len(pendingBytes) > 0):
            # A normal character ends the run; decode the collected bytes as UTF-8
            r += bytearray(pendingBytes).decode()
            pendingBytes = []
            r += c
        else:
            r += c
    if (len(pendingBytes) > 0):
        r = r + bytearray(pendingBytes).decode()
    return r

if __name__ == "__main__":
    # Work-around for buggy UTF-8 in argv strings
    for i in range(0, len(sys.argv)):
        # Module-level function, so it is called without "self."
        sys.argv[i] = FixBuggyString(sys.argv[i])
    # [... remainder of program goes here...]
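The same repair can probably be expressed more compactly with the standard `surrogateescape` error handler, which maps the lone surrogates U+DC80..U+DCFF back to the bytes they stand for. This is a sketch of an alternative, not the code from the answer above; the function name `repair_surrogate_escapes` is hypothetical, and unlike the hand-written loop it leaves strings containing other surrogates (U+D800..U+DC7F, U+DD00..U+DFFF) unchanged.

```python
import sys


def repair_surrogate_escapes(s):
    """Re-decode a str whose undecodable bytes were escaped as U+DC80..U+DCFF."""
    try:
        return s.encode("utf-8", "surrogateescape").decode("utf-8")
    except UnicodeError:
        # Either a surrogate outside the escape range or bytes that are
        # not valid UTF-8: leave the string unchanged rather than guess.
        return s


if __name__ == "__main__":
    sys.argv = [repair_surrogate_escapes(arg) for arg in sys.argv]
```

Strings that are already well-formed pass through untouched, so the list comprehension is safe to run on every platform.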