Asked by: Tomas Sedovic  Asked: 3/3/2009  Last edited by: Tomas Sedovic  Updated: 8/3/2023  Views: 4880633
Convert bytes to a string in Python 3
Q:
I captured the standard output of an external program into a bytes object:
>>> from subprocess import *
>>> stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2\n'
I want to convert it to a normal Python string, so that I can print it like this:
>>> print(stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2
How do I convert the bytes object to a str with Python 3?
A:
Decode the bytes object to produce a string:
>>> b"abcde".decode("utf-8")
'abcde'
The above example assumes that the bytes object is in UTF-8, because it is a common encoding. However, you should use the encoding your data is actually in!
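To illustrate why the encoding matters, here is a minimal sketch (the sample text and variable names are made up for illustration) that decodes the same bytes with the matching codec and with a mismatched one:

```python
# Hypothetical sample data: the word "café" encoded as UTF-8.
data = "café".encode("utf-8")

right = data.decode("utf-8")    # matches how the bytes were produced
wrong = data.decode("cp1252")   # also succeeds, but yields the wrong characters

print(repr(right))
print(repr(wrong))
```

Both calls return a str without raising, so a wrong guess is not always reported as an error.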
Decode the byte string and turn it into a character (Unicode) string.
Python 3:
encoding = 'utf-8'
b'hello'.decode(encoding)
or
str(b'hello', encoding)
Python 2:
encoding = 'utf-8'
'hello'.decode(encoding)
or
unicode('hello', encoding)
I think you actually want this:
>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')
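Aaron's answer was correct, except that you need to know which encoding to use. I believe that Windows uses 'windows-1252'. It only matters if you have some unusual (non-ASCII) characters in your content, but then it makes a difference.
By the way, the fact that it does matter is the reason Python moved to using two different types for binary and text data: it can't convert magically between them, because it doesn't know the encoding unless you tell it! The only way you would know is to read the Windows documentation (or read it here).
This joins a list of bytes into a string: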
>>> bytes_data = [112, 52, 52]
>>> "".join(map(chr, bytes_data))
'p44'
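Mapping chr over the values treats each byte as the Unicode code point with the same number, which is also what the latin-1 codec does. A small sketch with made-up values, assuming that interpretation is the one you want:

```python
byte_values = [112, 52, 52, 233]   # hypothetical input; 233 is outside ASCII

joined = "".join(map(chr, byte_values))
decoded = bytes(byte_values).decode("latin-1")

print(repr(joined), repr(decoded))
```

If the values are really UTF-8 or some other encoding, bytes(byte_values).decode(...) with that encoding is the variant to use.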
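From sys — System-specific parameters and functions:
To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').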
Set universal_newlines to True, i.e.
command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]
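A minimal sketch of the difference this flag makes. The child command is a stand-in (a Python one-liner instead of ls -l) so that it runs anywhere; that substitution is my assumption, not part of the answer:

```python
import subprocess
import sys

# Hypothetical child command standing in for `ls -l`.
cmd = [sys.executable, "-c", "print('file1'); print('file2')"]

# Without the flag, communicate() returns bytes.
raw = subprocess.Popen(cmd, stdout=subprocess.PIPE).communicate()[0]

# With the flag, the output is decoded and newlines are normalized to "\n".
text = subprocess.Popen(
    cmd, stdout=subprocess.PIPE, universal_newlines=True
).communicate()[0]

print(type(raw), type(text))
```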
If you don't know the encoding, then to read binary input into a string in a Python 3 and Python 2 compatible way, use the ancient MS-DOS CP437 encoding:
import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))
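Because the encoding is unknown, expect non-English symbols to be translated to cp437 characters (English characters are not translated, because they match in most single-byte encodings and in UTF-8).
Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this: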
>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte
The same applies to latin-1, which was popular (the default?) for Python 2. See the missing points in Codepage Layout: that is where Python chokes with the infamous ordinal not in range.
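UPDATE 20150604: There are rumors that Python 3 has the surrogateescape error strategy for encoding stuff into binary data without data loss and crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.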
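A minimal sketch of such a [binary] -> [str] -> [binary] round trip on Python 3, checking reliability only (not performance). The input is made up to cover every possible byte value:

```python
raw = bytes(range(256))   # hypothetical input containing every byte value

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns those surrogates back into the bytes.
restored = text.encode("utf-8", errors="surrogateescape")

print(restored == raw)
```

Note that the intermediate str contains lone surrogates, so encoding it with the default strict handler (for example when printing) will fail.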
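UPDATE 20170116: Thanks to a comment by Nearoo, there is also the possibility to slash-escape all unknown bytes with the backslashreplace error handler. That works only for Python 3, so even with this workaround you will still get inconsistent output from different Python versions: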
import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))
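See Python's Unicode Support for details.
UPDATE 20170119: I decided to implement a slash-escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.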
# --- preparation

import codecs

def slashescape(err):
    """ codecs error handler. err is UnicodeDecodeError instance. return
    a tuple with a replacement for the undecodable part of the input
    and a position where decoding should continue"""
    #print err, dir(err), err.start, err.end, err.object[:err.start]
    thebyte = err.object[err.start:err.end]
    repl = u'\\x'+hex(ord(thebyte))[2:]
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))
While @Aaron Maenpaa's answer just works, a user recently asked:
Is there any simpler way? 'fhand.read().decode("ASCII")' [...] It's so long!
You can use:
command_stdout.decode()
decode() has a standard argument:
codecs.decode(obj, encoding='utf-8', errors='strict')
In Python 3, the default encoding is "utf-8", so you can directly use:
b'hello'.decode()
which is equivalent to
b'hello'.decode(encoding="utf-8")
On the other hand, in Python 2, the encoding defaults to the default string encoding. Thus, you should use:
b'hello'.decode(encoding)
where encoding is the encoding you want.
Note: support for keyword arguments was added in Python 2.7.
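A quick sketch to confirm the Python 3 defaults described above; the sample byte string is arbitrary:

```python
import codecs
import sys

data = "µ".encode("utf-8")   # b'\xc2\xb5'

implicit = data.decode()                                   # defaults apply
explicit = data.decode(encoding="utf-8", errors="strict")  # same thing, spelled out
via_codecs = codecs.decode(data)                           # module-level equivalent

print(sys.getdefaultencoding(), implicit)
```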
To interpret a byte sequence as text, you have to know the corresponding character encoding:
unicode_text = bytestring.decode(character_encoding)
Example:
>>> b'\xc2\xb5'.decode('utf-8')
'µ'
The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/' and zero b'\0':
>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()
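Trying to decode such byte soup using the utf-8 encoding raises UnicodeDecodeError.
It can be worse. The decoding may fail silently and produce mojibake if you use a wrong incompatible encoding: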
>>> '—'.encode('utf-8').decode('cp1252')
'â€”'
The data is corrupted, but your program remains unaware that a failure has occurred.
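A minimal sketch of the silent failure, and of undoing it when you happen to know which wrong codec was applied. That knowledge is an assumption; in general the damage may not be reversible:

```python
original = "\u2014"   # the dash character from the example above

# Decoding UTF-8 bytes as cp1252 succeeds but produces three wrong characters.
mojibake = original.encode("utf-8").decode("cp1252")

# Reversing the exact same mistake recovers the text in this particular case.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(len(original), len(mojibake), repaired == original)
```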
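In general, what character encoding to use is not embedded in the byte sequence itself. You have to communicate this info out-of-band. Some outcomes are more likely than others, and therefore the chardet module exists, which can guess the character encoding. A single Python script may use multiple character encodings in different places.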
ls output can be converted to a Python string using the os.fsdecode() function, which succeeds even for undecodable filenames (it uses sys.getfilesystemencoding() and the surrogateescape error handler on Unix):
import os
import subprocess
output = os.fsdecode(subprocess.check_output('ls'))
To get the original bytes, you could use os.fsencode().
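A small sketch of that round trip, assuming a POSIX system (on Windows the filesystem codec uses a different error handler, so this particular input may not decode there). The file name and helper function are hypothetical:

```python
import os

def roundtrip_filename(raw_name):
    """Decode a raw file name to str and encode it back to bytes."""
    text_name = os.fsdecode(raw_name)
    return text_name, os.fsencode(text_name)

# b'\xff' is never valid UTF-8, so this name is "undecodable" in the usual sense.
text_name, restored = roundtrip_filename(b"report-\xff.txt")

print(repr(text_name), restored)
```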
If you pass the universal_newlines=True parameter, then subprocess uses locale.getpreferredencoding(False) to decode bytes, e.g., it can be cp1252 on Windows.
To decode the byte stream on the fly, io.TextIOWrapper() could be used.
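A minimal sketch of on-the-fly decoding with io.TextIOWrapper(). An in-memory io.BytesIO stands in for a real binary stream such as a process's stdout pipe; that substitution and the sample text are assumptions for illustration:

```python
import io

# Hypothetical binary stream; in practice this could be proc.stdout.
raw_stream = io.BytesIO("line one\nl\u00edne two\n".encode("utf-8"))

# The wrapper decodes incrementally as lines are read.
text_stream = io.TextIOWrapper(raw_stream, encoding="utf-8")
lines = [line.rstrip("\n") for line in text_stream]

print(lines)
```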
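Different commands may use different character encodings for their output, e.g., the dir internal command (cmd) may use cp437. To decode its output, you could pass the encoding explicitly (Python 3.6+):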
output = subprocess.check_output('dir', shell=True, encoding='cp437')
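The same keyword works for any command whose output encoding you know. A portable sketch, where the child process is a hypothetical Python one-liner that writes UTF-8 bytes directly to its stdout buffer:

```python
import subprocess
import sys

# Hypothetical child: emits the UTF-8 bytes for "café" plus a newline.
child_code = "import sys; sys.stdout.buffer.write('caf\\xe9\\n'.encode('utf-8'))"

output = subprocess.check_output(
    [sys.executable, "-c", child_code],
    encoding="utf-8",   # must match what the child actually writes
)

print(repr(output))
```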
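The filenames may differ from os.listdir() (which uses the Windows Unicode API); e.g., '\xb6' can be substituted with '\x14', because Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see "Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string".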
For Python 3, this is a much safer and more Pythonic approach to convert from bytes to a string:
def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):  # Check if it's in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2\n')
Output:
total 0
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2
If you get the following error when calling decode():
AttributeError: 'str' object has no attribute 'decode'
you can also specify the encoding type straight in a cast:
>>> my_byte_str
b'Hello World'
>>> str(my_byte_str, 'utf-8')
'Hello World'
When working with data from Windows systems (with \r\n line endings), my answer is
String = Bytes.decode("utf-8").replace("\r\n", "\n")
Why? Try this with a multiline Input.txt:
Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)
All your line endings will be doubled (to \r\r\n), leading to extra empty lines. Python's text-read functions usually normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python does not have a chance to do that. Thus,
Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)
will replicate your original file.
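As a sketch of the normalization described above, the manual replace can be compared with what Python's text layer does when reading. The sample bytes are made up, and io.BytesIO stands in for a file opened in binary mode:

```python
import io

raw = b"first\r\nsecond\r\n"   # hypothetical bytes received from a Windows system

# Manual normalization, as in the answer.
manual = raw.decode("utf-8").replace("\r\n", "\n")

# newline=None enables universal-newline translation while reading.
wrapped = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=None).read()

print(repr(manual), repr(wrapped))
```

One difference worth knowing: universal-newline mode also converts a lone \r, which the plain replace call leaves alone.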
Since this question is actually asking about subprocess output, you have more direct approaches available. The most modern would be using subprocess.check_output and passing text=True (Python 3.7+) to automatically decode stdout using the system default coding:
text = subprocess.check_output(["ls", "-l"], text=True)
For Python 3.6, Popen accepts an encoding keyword:
>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt
The general answer to the question in the title, if you're not dealing with subprocess output, is to decode bytes to text:
>>> b'abcde'.decode()
'abcde'
With no argument, sys.getdefaultencoding() will be used. If your data is not sys.getdefaultencoding(), then you must specify the encoding explicitly in the decode call:
>>> b'caf\xe9'.decode('cp1250')
'café'
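def toString(string):
    try:
        return string.decode("utf-8")
    except (AttributeError, ValueError):
        # str has no decode() in Python 3; invalid UTF-8 raises a ValueError subclass
        return string

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))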
If you want to convert any bytes, not just string converted to bytes:
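import base64
import json

with open("bytesfile", "rb") as infile:
    # b85encode returns bytes made of ASCII characters, so decode them to get a str
    text = base64.b85encode(infile.read()).decode("ascii")

with open("bytesfile", "rb") as infile:
    text2 = json.dumps(list(infile.read()))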
This is not very efficient, however. It will turn a 2 MB picture into 9 MB.
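To get a feel for the overhead of the two approaches, here is a rough sketch on a small made-up payload (the real numbers depend on your data) that also checks both conversions are reversible:

```python
import base64
import json

# Hypothetical payload standing in for arbitrary binary data such as an image.
payload = bytes(range(256)) * 4

b85_text = base64.b85encode(payload).decode("ascii")
json_text = json.dumps(list(payload))

# Both representations can be turned back into the original bytes.
restored_from_b85 = base64.b85decode(b85_text)
restored_from_json = bytes(json.loads(json_text))

print(len(payload), len(b85_text), len(json_text))
```

Base85 uses five characters for every four bytes, while the JSON list spends several characters per byte.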
For your specific case of "run a shell command and get its output as text instead of bytes", on Python 3.7, you should use subprocess.run and pass in text=True (as well as capture_output=True to capture the output)
command_result = subprocess.run(["ls", "-l"], capture_output=True, text=True)
command_result.stdout # is a `str` containing your program's stdout
text used to be called universal_newlines, and was changed (well, aliased) in Python 3.7. If you want to support Python versions before 3.7, pass in universal_newlines=True instead of text=True
Try this:
bytes.fromhex('c3a9').decode('utf-8')
Try this one; the function drops any bytes that are not valid in the chosen encoding (UTF-8 by default) and returns a clean string. It is tested on Python 3.6 and above.
def bin2str(text, encoding='utf-8'):
    """Converts a binary to Unicode string by removing all non Unicode char
    text: binary string to work on
    encoding: output encoding *utf-8"""

    return text.decode(encoding, 'ignore')
Here, the function takes the binary and decodes it (converts binary data to characters using the given character set). The ignore argument skips any bytes that are not valid in that character set, and the function finally returns your desired string value.
If you are not sure about the encoding, use sys.getdefaultencoding() to get the default encoding of your device.
If you have had this error:
utf-8 codec can't decode byte 0x8a,
then it is better to use the following code to convert bytes to a string:
bytes = b"abcdefg"
string = bytes.decode("utf-8", "ignore")
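For comparison, a short sketch of how the built-in error handlers treat the same invalid byte. The input is made up, with 0x8a taken from the error message above:

```python
data = b"caf\x8a ok"   # hypothetical input with one byte that is invalid UTF-8

ignored = data.decode("utf-8", "ignore")             # drops the bad byte
replaced = data.decode("utf-8", "replace")           # inserts U+FFFD
escaped = data.decode("utf-8", "backslashreplace")   # keeps it as a \x escape

print(repr(ignored), repr(replaced), repr(escaped))
```

Keep in mind that ignore loses data silently; replace and backslashreplace at least leave a visible trace of the problem.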
We can decode the bytes object to produce a string using bytes.decode(encoding='utf-8', errors='strict').
For documentation see bytes.decode.
Python 3 example:
byte_value = b"abcde"
print("Initial value = {}".format(byte_value))
print("Initial value type = {}".format(type(byte_value)))
string_value = byte_value.decode("utf-8")
# utf-8 is used here because it is a very common encoding, but you need to use the encoding your data is actually in.
print("------------")
print("Converted value = {}".format(string_value))
print("Converted value type = {}".format(type(string_value)))
Output:
Initial value = b'abcde'
Initial value type = <class 'bytes'>
------------
Converted value = abcde
Converted value type = <class 'str'>
Note: In Python 3, by default the encoding type is UTF-8. So, <byte_string>.decode("utf-8") can be also written as <byte_string>.decode()
Bytes
m=b'This is bytes'
Converting to string
Method 1
m.decode("utf-8")
or
m.decode()
Method 2
import codecs
codecs.decode(m,encoding="utf-8")
or
import codecs
codecs.decode(m)
Method 3
str(m,encoding="utf-8")
or
str(m)[2:-1]
Result
'This is bytes'
A potential answer:
#input string
istring = b'pomegranite'
# output string
ostring = str(istring)
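Be aware of what str() returns when no encoding is given. A small sketch comparing the two call forms on the same input:

```python
istring = b'pomegranite'

as_repr = str(istring)            # no encoding: the repr, including b and the quotes
as_text = str(istring, 'utf-8')   # with encoding: the decoded text

print(as_repr)
print(as_text)
```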
One of the best ways to convert to string without caring about any encoding type is as follows -
import json
b_string = b'test string'
string = b_string.decode(
json.detect_encoding(b_string) # detect_encoding - used to detect encoding
)
print(string)
Here, we used the json.detect_encoding method to detect the encoding.
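A hedged sketch of what this helper can and cannot tell apart. As far as I can tell it only distinguishes the UTF-8, UTF-16 and UTF-32 families (the encodings JSON allows) and falls back to 'utf-8' otherwise, so legacy single-byte encodings are not detected. The sample strings are made up:

```python
import json

utf16_bytes = "test string".encode("utf-16")   # starts with a byte order mark
cp1252_bytes = "caf\u00e9".encode("cp1252")    # single-byte encoding, no BOM

utf16_guess = json.detect_encoding(utf16_bytes)
cp1252_guess = json.detect_encoding(cp1252_bytes)

decoded_utf16 = utf16_bytes.decode(utf16_guess)

# The cp1252 data gets the fallback guess, which then fails to decode.
try:
    cp1252_bytes.decode(cp1252_guess)
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

print(utf16_guess, cp1252_guess, cp1252_ok)
```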