Asked by: Electron X  Asked: 10/29/2023  Updated: 11/6/2023  Views: 130
Python replace unprintable characters except linebreak
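Q:
I'm trying to write a function that replaces unprintable characters with spaces. It works well, but it also replaces the line break \n with a space, and I don't know why.
Test code: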
import re

def replace_unknown_characters_with_space(input_string):
    # Replace non-printable characters (including escape sequences) with spaces
    # According to ChatGPT, \n should not be in this range
    cleaned_string = re.sub(r'[^\x20-\x7E]', ' ', input_string)
    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\x0Ais\x2028a\x2029test."
    print("Original String:")
    print(test_string)
    cleaned_string = replace_unknown_characters_with_space(test_string)
    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()
Output:
Original String:
This is a test string with some unprintable characters:
Hello
Thisd
is 28a 29test.
Cleaned String:
This is a test string with some unprintable characters: Hello World This is 28a 29test.
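As you can see, the line break before Hello World was replaced with a space, which is not intended. I tried to get help from ChatGPT, but its regex solutions don't work.
My last resort is a for loop that filters characters with Python's built-in isprintable() method, but that would be much slower than a regex.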
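For reference, the isprintable() fallback mentioned above might look like the sketch below. The function name is hypothetical. Note that str.isprintable() returns False for "\n", so the newline has to be special-cased, and that printable non-ASCII characters are kept, which differs from the ASCII-only regex.

```python
def replace_unprintable_keep_newline(text: str) -> str:
    # str.isprintable() is False for the newline, so it is allowed through explicitly.
    return "".join(ch if ch == "\n" or ch.isprintable() else " " for ch in text)


sample = "Hello\x85World\rThis\nis a test."
print(repr(replace_unprintable_keep_newline(sample)))  # 'Hello World This\nis a test.'
```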
A:
Do it the other way around: match the control characters directly and skip \x0A
import re

def replace_unknown_characters_with_space(input_string):
    # Match the ASCII control characters directly instead of negating the
    # printable range, and skip \x0A (\n) so that line breaks survive.
    # Characters above \x1F, such as \x7F or \x85, are left untouched.
    cleaned_string = re.sub(r'[\x00-\x09\x0B-\x1F]', ' ', input_string)
    return cleaned_string
You don't need re for this. You can use built-in functionality.
Just build a translation table and call str.translate()
For example:
TEST_STRING = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\x0Ais\x2028a\x2029test."
TDICT = {c: " " for c in range(32) if c != 10}
print(TEST_STRING.translate(TDICT))
Output:
This is a test string with some unprintable characters:
Hello
World This
is 28a 29test.
Note:
Once you have worked out the correct regular expression, re is much faster than str.translate
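The speed difference is easy to measure rather than assume. The following is a rough benchmark sketch. The sample data, the helper names, and the regex (ASCII control characters except newline) are assumptions chosen to mirror the translation table above, and the timings will vary by machine and Python version.

```python
import re
import timeit

# Hypothetical sample data for the benchmark; any sufficiently long string will do.
SAMPLE = "Hello\x00World\x0DThis\x0Ais\x1Fa test.\n" * 10_000

# Both variants replace ASCII control characters 0-31, except newline (10), with a space.
TABLE = {c: " " for c in range(32) if c != 10}
PATTERN = re.compile(r"[\x00-\x09\x0B-\x1F]")


def with_translate(text: str) -> str:
    return text.translate(TABLE)


def with_regex(text: str) -> str:
    return PATTERN.sub(" ", text)


if __name__ == "__main__":
    # The two approaches should agree before their timings are compared.
    assert with_translate(SAMPLE) == with_regex(SAMPLE)
    for func in (with_translate, with_regex):
        seconds = timeit.timeit(lambda: func(SAMPLE), number=20)
        print(f"{func.__name__}: {seconds:.3f}s")
```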
Comment:
dict.fromkeys((c for c in range(32) if c != 10), " ")
This question seems to have several parts, so let's tackle them independently.
Why is "\n" affected?
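"\n" has a special role in regular expressions, which were designed to operate on lines, with "\n" marking the end of a line. In your pattern, though, the direct cause is simpler: "\n" is \x0A, which lies outside \x20-\x7E, so the negated class [^\x20-\x7E] matches it and it gets replaced.
As you found, you can include "\n" in the class of characters to keep. The re.DOTALL flag also tells the RE engine not to treat "\n" specially, but it only changes what "." matches, so it is optional for a pattern like this one.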
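A quick way to check how the character class and re.DOTALL behave is sketched below. The constant name NON_PRINTABLE_ASCII is only for illustration.

```python
import re

NON_PRINTABLE_ASCII = r"[^\x20-\x7E]"

# "\n" is 0x0A, which lies outside \x20-\x7E, so the negated class matches it.
print(re.findall(NON_PRINTABLE_ASCII, "a\nb"))                              # ['\n']

# re.DOTALL only changes what "." matches; it does not alter a character class.
print(re.sub(NON_PRINTABLE_ASCII, " ", "a\nb", flags=re.DOTALL) == "a b")   # True
print(re.findall(r".", "a\nb"))                                             # ['a', 'b']
print(re.findall(r".", "a\nb", flags=re.DOTALL))                            # ['a', '\n', 'b']

# Adding \n to the set of characters to keep is what preserves the line break.
print(repr(re.sub(r"[^\x20-\x7E\n]", " ", "a\nb\rc")))                      # 'a\nb c'
```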
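What does "printable" mean?
"Printable" means more than what ChatGPT suggested, which appears to be a translation into Python of the POSIX ASCII class [:print:] (I would recommend [:graph:] instead). Judging from your test, you are probably more interested in removing the "funny characters" that would affect the output when printed.
Your test includes characters that ChatGPT probably mistranslated from Unicode whitespace (\s is a better choice for matching such characters in a regex). Python does not use \x for code points above 255; it uses \u with the same hex digits, so the \x2028 and \x2029 in your string probably came from PCRE syntax.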
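The difference between the two escapes can be seen directly in Python. This is a small illustration using the tail of the test string; the variable names are only for the example.

```python
import re

# "\x" consumes exactly two hex digits, so "\x2028" is "\x20" (a space) followed by "28".
as_written = "is\x2028a\x2029test."
print(repr(as_written))        # 'is 28a 29test.'
print(len("\x2028"))           # 3

# "\u" consumes four hex digits, giving the line and paragraph separators.
intended = "is\u2028a\u2029test."
print(len("\u2028"))           # 1
print(intended.splitlines())   # ['is', 'a', 'test.']

# With str patterns, \s matches both separators.
print(re.findall(r"\s", intended) == ["\u2028", "\u2029"])  # True
```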
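Since you include Unicode whitespace, and Python strings are Unicode, it seems logical to also filter out "funny" Unicode characters such as the BIDI control class, which would have an effect similar to "\r" if you plan to print the string later.
If you consider any non-ASCII character "funny", the solution will need to change as well.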
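One possible way to express "funny characters" without listing them by hand is to go by Unicode general category. The sketch below is an assumption about what counts as funny, not a definitive rule: it treats control (Cc), format (Cf, which includes the BIDI controls), line separator (Zl), and paragraph separator (Zp) characters as unwanted and keeps "\n". The function and constant names are hypothetical.

```python
import unicodedata

# Assumed definition of "funny": control, format (incl. BIDI controls),
# line separator and paragraph separator characters.
FUNNY_CATEGORIES = {"Cc", "Cf", "Zl", "Zp"}


def replace_funny_characters(text: str, keep: str = "\n") -> str:
    """Replace characters in FUNNY_CATEGORIES with a space, except those in keep."""
    return "".join(
        " " if ch not in keep and unicodedata.category(ch) in FUNNY_CATEGORIES else ch
        for ch in text
    )


sample = "año\x85go \u2067backwards\u2069\r\nwide \uff11\u2028end"
print(repr(replace_funny_characters(sample)))
```

Letters such as ñ and wide digits such as \uff11 are left alone here; if every non-ASCII character should count as funny, the category test would need an additional code point check.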
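The following version of your example (with corrected test text and some extensions) could be considered "correct", but I suspect it will need further changes as you refine your requirements.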
import re

def replace_unknown_characters_with_space(input_string):
    # Replace non-printable ASCII characters (including escape sequences) with spaces
    cleaned_string = re.sub(r'[^][\w\n!"\#$%&\'()*+,./:;<=>?@\\\^_`{|}~-]', ' ', input_string, flags=re.DOTALL)
    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0AThis\x0Dis\u2028a\u2029test. including some punctuation like `({~})' and even \\, and \" + words like <año> or numbers like \u1bb1\nText cant be \033[1m[bold]\033[0m or go \u2067backwards\u2069, but can also contain wide numbers like \uff11 or 0"
    print("Original String:")
    print(test_string)
    cleaned_string = replace_unknown_characters_with_space(test_string)
    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()
A modified regex inspired by Carlo Arenas's answer.
Code:
import re

def replace_unknown_characters_with_space(input_string):
    # Replace all non printable ascii, excluding \n from the expression
    cleaned_string = re.sub(r'[^\x20-\x7E\n]', ' ', input_string, flags=re.DOTALL)
    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\nis\x2028a\x2029test."
    print("Original String:")
    print(test_string)
    cleaned_string = replace_unknown_characters_with_space(test_string)
    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()
Output:
Original String:
This is a test string with some unprintable characters:
Hello
Thisd
is 28a 29test.
Cleaned String:
This is a test string with some unprintable characters:
Hello World This
is 28a 29test.
\n is no longer replaced.