Convert bytes to a string in Python 3

Asked by: Tomas Sedovic Asked: 3/3/2009 Last edited by: Tomas Sedovic Updated: 8/3/2023 Views: 4880633

Q:

I captured the standard output of an external program into a bytes object:

>>> from subprocess import *
>>> stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

I want to convert that to a normal Python string, so that I can print it like this:

>>> print(stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

How do I convert the bytes object to a str with Python 3?

See also: Best way to convert string to bytes in Python 3?

string python-3.x

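Comments

156 votes Charlie Parker 3/15/2019
Why doesn't str(text_bytes) work? This seems bizarre to me.
73 votes Craig Anderson 4/1/2019
@CharlieParker Because str(text_bytes) can't specify the encoding. Depending on what's in text_bytes, text_bytes.decode('cp1250') might result in a very different string from text_bytes.decode('utf-8').
17 votes Charlie Parker 4/23/2019
So the str function does not convert to a real string anymore. One has to state the encoding explicitly, for some reason I am too lazy to read through. Just convert it to utf-8 and see if your code works, e.g. var = var.decode('utf-8')
16 votes jfs 4/12/2020
@CraigAnderson: unicode_text = str(bytestring, character_encoding) works as expected on Python 3. However, unicode_text = bytestring.decode(character_encoding) is preferable, to avoid confusion with plain str(bytes_obj), which produces a text representation of bytes_obj instead of decoding it to text: str(b'\xb6', 'cp1252') == b'\xb6'.decode('cp1252') == '¶' and str(b'\xb6') == "b'\\xb6'" == repr(b'\xb6') != '¶'
3 votes David Gilbertson 9/13/2022
Also, you can pass text=True to subprocess.run() or .Popen() and then you'll get a string back, with no need to convert bytes. Or specify encoding="utf-8" to either function.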

A:

5644 votes Aaron Maenpaa 3/3/2009 #1

Decode the bytes object to produce a string:

>>> b"abcde".decode("utf-8") 
'abcde'

The above example assumes that the bytes object is in UTF-8, because it is a common encoding. However, you should use the encoding your data is actually in!
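
To illustrate why the encoding matters, here is a minimal sketch that decodes the same bytes with three codecs. The sample data and the variable names are assumptions made for illustration; they are not part of the answer above.

```python
# The same two bytes decode to different text depending on the codec.
data = "é".encode("utf-8")          # b'\xc3\xa9'

as_utf8 = data.decode("utf-8")      # the encoding the data is actually in
as_cp1252 = data.decode("cp1252")   # wrong codec: succeeds, but yields mojibake
as_latin1 = data.decode("latin-1")  # wrong codec: also succeeds silently

# ascii() keeps the output printable on any terminal encoding.
print(ascii(as_utf8), ascii(as_cp1252), ascii(as_latin1))
```

Only the first call recovers the original character; the other two return a two-character string without raising, which is why a wrong guess can go unnoticed.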

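Comments

1 vote mcherm 7/19/2011
Yes, but given that this is the output from a Windows command, shouldn't it be using ".decode('windows-1252')" instead?
99 votes nikow 1/3/2012
Using "windows-1252" is not reliable either (e.g., for other language versions of Windows). Wouldn't it be best to use sys.stdout.encoding?
23 votes Wookie88 4/16/2013
Maybe this will help somebody further: sometimes you use a byte array for, e.g., TCP communication. If you want to convert the byte array to a string and cut off trailing '\x00' characters, the answer above is not enough. Use b'example\x00\x00'.decode('utf-8').strip('\x00') then.
1 vote Gabriel Staples 3/25/2021
Official documentation: for all bytes and bytearray operations (methods which can be called on these objects), see here: docs.python.org/3/library/stdtypes.html#bytes-methods. For bytes.decode() specifically, see here: docs.python.org/3/library/stdtypes.html#bytes.decode
416 votes dF. 3/3/2009 #2

Decode the byte string and turn it into a character (Unicode) string.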


Python 3:

encoding = 'utf-8'
b'hello'.decode(encoding)

or

str(b'hello', encoding)

Python 2:

encoding = 'utf-8'
'hello'.decode(encoding)

or

unicode('hello', encoding)
49 votes mcherm 7/19/2011 #3

I think you actually want this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

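Aaron's answer was correct, except that you need to know which encoding to use. I believe that Windows uses 'windows-1252'. It will only matter if you have some unusual (non-ASCII) characters in your content, but then it will make a difference.

By the way, the fact that it does matter is the reason that Python moved to using two different types for binary and text data: it can't convert magically between them, because it doesn't know the encoding unless you tell it! The only way you would know is to read the Windows documentation (or read it here).

264 votes Sisso 8/22/2012 #4

This joins a list of bytes into a string: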

>>> bytes_data = [112, 52, 52]
>>> "".join(map(chr, bytes_data))
'p44'
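
As the comments on this answer point out, joining chr() values amounts to a Latin-1 decode. A minimal sketch of the equivalent single call, assuming every integer in the list is in range(256):

```python
bytes_data = [112, 52, 52]

# Character-by-character approach from the answer above.
joined = "".join(map(chr, bytes_data))

# Equivalent: build a bytes object and decode it in one step.
decoded = bytes(bytes_data).decode("latin-1")

print(joined, decoded)
```

If the integers are really UTF-8 or some other encoding, pass that codec name instead of "latin-1".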

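Comments

8 votes Martijn Pieters 9/2/2014
@leetNightshade: yet it is terribly inefficient. If you have a byte array, you only need to decode.
11 votes jfs 11/16/2016
@Sasszem: this method is a perverted way to express a.decode('latin-1') where a = bytearray([112, 52, 52]) ("There Ain't No Such Thing as Plain Text". If you've managed to convert bytes into a text string, then you used some encoding: latin-1 in this case)
6 votes Martijn Pieters 7/3/2018
@leetNightshade: for completeness' sake: bytes(list_of_integers).decode('ascii') is about 1/3 faster than ''.join(map(chr, list_of_integers)) on Python 3.6.
4 votes Zhichang Yu 1/11/2014 #5

From the sys documentation (System-specific parameters and functions):

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').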

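Comments

4 votes Martijn Pieters 9/2/2014
The pipe to the subprocess is already a binary buffer. Your answer fails to address how to get a string value from the resulting bytes value.
39 votes ContextSwitch 1/21/2014 #6

Set universal_newlines to True, i.e.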

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]

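Comments

3 votes 1/14/2019
On 3.7 you can (and should) use text=True instead of universal_newlines=True.
129 votes anatoly techtonik 12/17/2014 #7

If you don't know the encoding, then to read binary input into a string in a way that is compatible with both Python 3 and Python 2, use the ancient MS-DOS CP437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)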

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

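Because the encoding is unknown, expect non-English symbols to be translated to cp437 characters (English characters are not translated, because they match in most single-byte encodings and UTF-8).

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this: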

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid
start byte

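The same applies to latin-1, which was popular (the default?) for Python 2. See the missing points in Codepage Layout - it is where Python chokes with the infamous ordinal not in range.

UPDATE 20150604: There are rumors that Python 3 has the surrogateescape error strategy for encoding stuff into binary data without data loss and crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.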
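
A minimal sketch of the [binary] -> [str] -> [binary] round trip with the surrogateescape error handler, using the same sample bytes as the traceback above. It only demonstrates losslessness on one input and says nothing about performance.

```python
raw = b"\x00\x01\xffsd"

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns those surrogates back into the bytes.
restored = text.encode("utf-8", errors="surrogateescape")

print(ascii(text), restored == raw)
```

The resulting str contains lone surrogates, so it may fail to print or to encode with the default strict handler; it is mainly useful for carrying bytes through text-only APIs.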

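UPDATE 20170116: Thanks to a comment by Nearoo - it is also possible to slash-escape all unknown bytes with the backslashreplace error handler. That works only for Python 3, so even with this workaround you will still get inconsistent output from different Python versions:

import sys

PY3K = sys.version_info >= (3, 0)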

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

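See Python's Unicode Support for details.

UPDATE 20170119: I decided to implement a slash-escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.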

# --- preparation

import codecs

def slashescape(err):
    """ codecs error handler. err is a UnicodeDecodeError instance. return
    a tuple with a replacement for the undecodable part of the input
    and a position where decoding should continue"""
    # the failing slice may span more than one byte, so escape each one
    thebytes = bytearray(err.object[err.start:err.end])
    repl = u''.join(u'\\x%02x' % b for b in thebytes)
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))

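Comments

2 votes Karl Knechtel 7/1/2022
This answer is incorrect. Latin-1, i.e. the ISO-8859-1 encoding, is perfectly capable of handling arbitrary binary data - bytes(range(256)).decode('latin-1') runs without error on modern Python versions, and I can't think of a reason why it would fail. The entire point of Latin-1 is that it maps each byte to the first 256 code points in Unicode - or rather, the ordering of Unicode was chosen, since the first version in 1991, so that the first 256 code points would match Latin-1. You may have problems printing the string, but that is completely orthogonal.
29 votes serv-inc 11/13/2015 #8

While @Aaron Maenpaa's answer just works, a user recently asked:

Is there any simpler way? 'fhand.read().decode("ASCII")' [...] It's so long!

You can use:

command_stdout.decode()

decode() has a default argument: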

codecs.decode(obj, encoding='utf-8', errors='strict')

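125 votes lmiguelvargasf 6/29/2016 #9

In Python 3, the default encoding is "utf-8", so you can directly use:

b'hello'.decode()

This is equivalent to

b'hello'.decode(encoding="utf-8")

On the other hand, in Python 2, the encoding defaults to the default string encoding. Thus, you should use:

b'hello'.decode(encoding)

where encoding is the encoding you want.

Note: support for keyword arguments was added in Python 2.7.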

37 votes jfs 11/16/2016 #10

To interpret a byte sequence as text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

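The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/' and zero b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

Trying to decode such byte soup using the utf-8 encoding raises UnicodeDecodeError.

It can be worse. The decoding may fail silently and produce mojibake if you use a wrong, incompatible encoding:

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

The data is corrupted, but your program remains unaware that a failure has occurred.

In general, the character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. Some outcomes are more likely than others, which is why the chardet module exists: it can guess the character encoding. A single Python script may use multiple character encodings in different places.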
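
When the encoding cannot be communicated out-of-band, one crude stand-in for a real detector is to try a short list of candidates in order. The sketch below is not what chardet does; decode_with_fallbacks is a hypothetical helper, and the candidate list is an assumption you would need to tune for your data.

```python
def decode_with_fallbacks(data, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the candidate order is an assumption, and a
    clean decode does not prove that the guess was right.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so it never fails.
    return data.decode("latin-1"), "latin-1"


utf8_result = decode_with_fallbacks("µ".encode("utf-8"))
legacy_result = decode_with_fallbacks(b"caf\xe9")
print(ascii(utf8_result), ascii(legacy_result))
```

Strict codecs such as UTF-8 should come first, because permissive single-byte codecs accept almost any input and would mask the correct answer.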


ls output can be converted to a Python string using the os.fsdecode() function, which succeeds even for undecodable filenames (it uses sys.getfilesystemencoding() and the surrogateescape error handler on Unix):

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes, you could use os.fsencode().
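
A minimal sketch of that round trip, assuming a Unix system where the filesystem error handler is surrogateescape. The file name is made up for illustration; on Windows the same bytes may raise instead.

```python
import os

raw_name = b"caf\xe9.txt"           # not valid UTF-8

name = os.fsdecode(raw_name)        # str, possibly containing lone surrogates
round_tripped = os.fsencode(name)   # back to the original bytes

print(ascii(name), round_tripped == raw_name)
```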

If you pass the universal_newlines=True parameter, then subprocess uses locale.getpreferredencoding(False) to decode bytes, e.g., it can be cp1252 on Windows.

To decode the byte stream on-the-fly, io.TextIOWrapper() could be used.
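
A minimal sketch of on-the-fly decoding with io.TextIOWrapper. Here io.BytesIO stands in for a real binary stream such as a subprocess pipe, and the sample text is an assumption for illustration.

```python
import io

# Stand-in for a binary stream such as process.stdout.
byte_stream = io.BytesIO("naïve\ncafé\n".encode("utf-8"))

# TextIOWrapper decodes incrementally as lines are read from the binary stream.
text_stream = io.TextIOWrapper(byte_stream, encoding="utf-8")
lines = [line.rstrip("\n") for line in text_stream]

print([ascii(line) for line in lines])
```

Because decoding happens per read, a multi-byte character split across two underlying reads is still decoded correctly, which a per-chunk bytes.decode() call would not guarantee.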

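Different commands may use different character encodings for their output, e.g., the dir internal command (cmd) may use cp437. To decode its output, you could pass the encoding explicitly (Python 3.6+):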

output = subprocess.check_output('dir', shell=True, encoding='cp437')

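The filenames may differ from os.listdir() (which uses the Windows Unicode API), e.g., '\xb6' can be substituted with '\x14': Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string.

9 votes Taufiq Rahman 1/18/2017 #11

For Python 3, this is a much safer and more Pythonic approach to convert from bytes to a string: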

def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes): # Check if it's in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n')

Output:

total 0
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2
21 votes Broper 11/22/2017 #12

If trying decode() gives you the following:

AttributeError: 'str' object has no attribute 'decode'

You can also specify the encoding type directly in a cast:

>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'
8 votes bers 3/16/2018 #13

When working with data from Windows systems (with \r\n line endings), my answer is

String = Bytes.decode("utf-8").replace("\r\n", "\n")

Why? Try this with a multiline Input.txt:

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)

All your line endings will be doubled (to \r\r\n), leading to extra empty lines. Python's text-read functions usually normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python does not have a chance to do that. Thus,

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)

will replicate your original file.
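
As an alternative sketch, the same normalization can be delegated to Python's text layer by wrapping the bytes in io.TextIOWrapper with newline=None. The sample bytes are an assumption, and note that this mode also translates a lone "\r", which the replace() call above does not.

```python
import io

raw = b"first\r\nsecond\r\n"

# Manual approach from the answer above.
manual = raw.decode("utf-8").replace("\r\n", "\n")

# newline=None enables universal newlines: "\r\n" and "\r" are read as "\n".
wrapped = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=None).read()

print(repr(manual), manual == wrapped)
```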

45 votes wim 6/1/2018 #14

Since this question is actually asking about subprocess output, you have more direct approaches available. The most modern would be using subprocess.check_output and passing text=True (Python 3.7+) to automatically decode stdout using the system default encoding:

text = subprocess.check_output(["ls", "-l"], text=True)

For Python 3.6, Popen accepts an encoding keyword:

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt

The general answer to the question in the title, if you're not dealing with subprocess output, is to decode bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, sys.getdefaultencoding() will be used. If your data is not in sys.getdefaultencoding(), then you must specify the encoding explicitly in the decode call:

>>> b'caf\xe9'.decode('cp1250')
'café'
2 votes Leonardo Filipe 6/4/2018 #15

def toString(string):
    try:
        # bytes have .decode(); str does not, so fall back to the input
        return string.decode("utf-8")
    except (AttributeError, ValueError):
        return string

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))

Comments

1 vote Dev-iL 6/4/2018
While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. Remember that you are answering the question for readers in the future, not just the person asking now! Please edit your answer to add an explanation, and give an indication of what limitations and assumptions apply. It also doesn't hurt to mention why this answer is more appropriate than others.
0 votes NeilG 1/11/2023
Hi @Dev-iL, if you are a moderator, can you tell me if it's possible for moderators to delete pointless empty incoherent answers like this one: stackoverflow.com/a/68310461/134044
1 vote Dev-iL 1/11/2023
@NeilG I'm not a moderator (note that I have no diamond next to my nickname). If you think a post is low quality, you should report it, and if the community agrees with you - it will be deleted.
2 votes HCLivess 6/1/2019 #16

If you want to convert any bytes, not just a string converted to bytes:

import base64
import json

with open("bytesfile", "rb") as infile:
    # b85encode returns bytes, so decode the ASCII-only result to get a str
    text = base64.b85encode(infile.read()).decode("ascii")

with open("bytesfile", "rb") as infile:
    text2 = json.dumps(list(infile.read()))

This is not very efficient, however. It will turn a 2 MB picture into 9 MB.

5 votes user3064538 8/7/2019 #17

For your specific case of "run a shell command and get its output as text instead of bytes", on Python 3.7, you should use subprocess.run and pass in text=True (as well as capture_output=True to capture the output):

command_result = subprocess.run(["ls", "-l"], capture_output=True, text=True)
command_result.stdout  # is a `str` containing your program's stdout

text used to be called universal_newlines, and was changed (well, aliased) in Python 3.7. If you want to support Python versions before 3.7, pass in universal_newlines=True instead of text=True.

4 votes Victor Choy 1/19/2020 #18

Try this:

bytes.fromhex('c3a9').decode('utf-8')
1 vote Ratul Hasan 5/19/2021 #19

Try using this one; this function will ignore all bytes that are not valid in the chosen character set (UTF-8 by default) and return a clean string. It is tested for Python 3.6 and above.

def bin2str(text, encoding = 'utf-8'):
    """Converts a binary to Unicode string by removing all non Unicode char
    text: binary string to work on
    encoding: output encoding *utf-8"""

    return text.decode(encoding, 'ignore')

Here, the function takes the binary data and decodes it (converting binary data to characters using a Python predefined character set); the ignore argument drops all data that is not valid in that character set, and the function finally returns your desired string value.

If you are not sure about the encoding, use sys.getdefaultencoding() to get the default encoding of your device.
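
Note that errors='ignore' silently discards data. A minimal sketch comparing it with two other built-in error handlers on the same input; the sample bytes are an assumption for illustration.

```python
raw = b"caf\xe9 \x8a ok"   # not valid UTF-8

ignored = raw.decode("utf-8", errors="ignore")            # drops bad bytes
replaced = raw.decode("utf-8", errors="replace")          # inserts U+FFFD
escaped = raw.decode("utf-8", errors="backslashreplace")  # keeps them visible

print(ascii(ignored))
print(ascii(replaced))
print(ascii(escaped))
```

With 'ignore' there is no trace left that anything was lost, so 'replace' or 'backslashreplace' may be preferable when the output will be inspected later.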

22 votes Yasser M 10/21/2021 #20

If you have had this error:

utf-8 codec can't decode byte 0x8a,

then it is better to use the following code to convert bytes to a string:

bytes = b"abcdefg"
string = bytes.decode("utf-8", "ignore") 
17 votes Shubhank Gupta 2/23/2022 #21

We can decode the bytes object to produce a string using bytes.decode(encoding='utf-8', errors='strict'). For documentation see bytes.decode.

Python 3 example:

byte_value = b"abcde"
print("Initial value = {}".format(byte_value))
print("Initial value type = {}".format(type(byte_value)))
string_value = byte_value.decode("utf-8")
# utf-8 is used here because it is a very common encoding, but you need to use the encoding your data is actually in.
print("------------")
print("Converted value = {}".format(string_value))
print("Converted value type = {}".format(type(string_value)))

Output:

Initial value = b'abcde'
Initial value type = <class 'bytes'>
------------
Converted value = abcde
Converted value type = <class 'str'>

Note: In Python 3, the default encoding is UTF-8, so <byte_string>.decode("utf-8") can also be written as <byte_string>.decode()

21 votes Supergamer 6/21/2022 #22

Bytes

m=b'This is bytes'

Converting to string

Method 1

m.decode("utf-8")

m.decode()

Method 2

import codecs
codecs.decode(m,encoding="utf-8")

import codecs
codecs.decode(m)

Method 3

str(m,encoding="utf-8")

str(m)[2:-1]

Result

'This is bytes'
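
A caveat on the last variant, str(m)[2:-1]: it slices the repr of the bytes rather than decoding them, so it only matches the other methods for plain printable ASCII. A minimal sketch with an assumed sample value:

```python
m = "café\n".encode("utf-8")

decoded = m.decode("utf-8")   # real decoding
sliced = str(m)[2:-1]         # repr of the bytes with the b'...' wrapper cut off

print(ascii(decoded))
print(sliced)
```

The sliced string contains literal backslash escapes such as \xc3 and \n instead of the characters they stand for.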
-4 votes Toothpick Anemone 4/23/2023 #23

A potential answer:

#input string
istring = b'pomegranite'

# output string
ostring = str(istring)

Comments

3 votes jonrsharpe 4/23/2023
Without an encoding that gives "b'pomegranite'". Hard to see how that adds to the 30+ existing answers. Also it's pomegranate.
2 votes Suyog Shimpi 6/18/2023 #24

One of the best ways to convert to string without caring about any encoding type is as follows -

import json


b_string = b'test string'
string = b_string.decode(
    json.detect_encoding(b_string)  # detect_encoding - used to detect encoding
)
print(string)

Here, we used the json.detect_encoding method to detect the encoding.
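
One caveat, sketched below: as far as I can tell, json.detect_encoding is an undocumented helper that only distinguishes UTF-8, UTF-16 and UTF-32 (which is all JSON needs), so it is not a general-purpose detector. The sample inputs are assumptions for illustration.

```python
import json

utf16 = "test".encode("utf-16-le")
latin1 = "café".encode("latin-1")

print(json.detect_encoding(utf16))    # recognised from the zero-byte pattern
print(json.detect_encoding(latin1))   # anything unrecognised is reported as UTF-8

try:
    latin1.decode(json.detect_encoding(latin1))
except UnicodeDecodeError as exc:
    failed = True                     # the reported codec cannot decode the data
    print("decode failed:", exc.reason)
else:
    failed = False
```

For legacy single-byte encodings you still need to know the encoding or use a dedicated detection library.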