Python: replace unprintable characters except line breaks

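Question:

I am trying to write a function that replaces unprintable characters with a space. It works well, but it also replaces the line break \n with a space, and I cannot figure out why.

Test code: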

import re

def replace_unknown_characters_with_space(input_string):
    # Replace non-printable characters (including escape sequences) with spaces
    # According to ChatGPT, \n should not be in this range
    cleaned_string = re.sub(r'[^\x20-\x7E]', ' ', input_string)

    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\x0Ais\x2028a\x2029test."
    
    print("Original String:")
    print(test_string)
    
    cleaned_string = replace_unknown_characters_with_space(test_string)
    
    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()

Output:

Original String:
This is a test string with some unprintable characters:
Hello
Thisd
is 28a 29test.

Cleaned String:
This is a test string with some unprintable characters: Hello World This is 28a 29test.

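As you can see, the line break before Hello World was replaced with a space, which is not intended. I tried getting help from ChatGPT, but its regex solutions did not work.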

My last resort is a for loop that filters characters with Python's built-in isprintable() method, but that would be much slower than a regex.
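For comparison, the isprintable() fallback mentioned above could be sketched roughly as follows. This is an illustration only, not code from the question: the name replace_non_printable is made up here, and because str.isprintable() accepts any printable Unicode character (including non-ASCII letters), it is not equivalent to an ASCII-only regex.

```python
def replace_non_printable(text: str) -> str:
    # Hypothetical helper: keep "\n" and anything str.isprintable() accepts,
    # replace every other character with a single space.
    return "".join(ch if ch == "\n" or ch.isprintable() else " " for ch in text)


if __name__ == "__main__":
    sample = "Hello\x85World\tfoo\nbar"
    print(repr(replace_non_printable(sample)))
```

Whether this is actually slower than the regex for a given input is worth measuring (for example with timeit) rather than assuming.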

python-3.x ascii python-re non-printing characters

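@Andj Yes, that gives the result I want; restricting to ASCII is fine. Specifically, my goal is to remove funny characters. My code processes raw text, and when it meets tabs, unusual spaces from other languages, or unprintable characters, I need to convert them to an ASCII space before further processing.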

1 vote, Andj, 11/2/2023

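It will also remove punctuation, since a lot of English punctuation lives outside the Basic Latin range, and it will strip letters outside Basic Latin from English loanwords. In the end it comes down to where the original text comes from, which domains it covers, and which countries it originates from.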
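A small sketch of the effect described in this comment, assuming the ASCII-only class from the question with \n kept. The helper name ascii_only and the sample strings are made up for illustration.

```python
import re


def ascii_only(text: str) -> str:
    # Hypothetical helper: everything outside printable ASCII (except "\n")
    # becomes a space, including typographic quotes and accented letters.
    return re.sub(r'[^\x20-\x7E\n]', ' ', text)


if __name__ == "__main__":
    # The curly apostrophe and the accented "e" both fall outside \x20-\x7E.
    print(repr(ascii_only("café’s")))
    print(repr(ascii_only("“menu”")))
```

If the loanwords and typographic punctuation matter for later processing, an ASCII-only filter like this is probably too aggressive for that text.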

Answers:

-1 votes, Diego Torres Milano, 10/29/2023, #1

Do it the other way around and skip \x0A:

import re

def replace_unknown_characters_with_space(input_string):
    # Match the ASCII control characters directly instead of negating the
    # printable range; \x0A (newline) is left out of the class so it survives.
    # Note: characters above \x7E (e.g. \x85) are not touched by this pattern.
    cleaned_string = re.sub(r'[\x00-\x09\x0B-\x1F]', ' ', input_string)

    return cleaned_string

import re

def replace_unknown_characters_with_space(input_string):
    # Replace everything except word characters, newline and ASCII punctuation with spaces.
    # Note: for str patterns \w is Unicode-aware, so letters like ñ and non-ASCII digits are kept.
    cleaned_string = re.sub(r'[^][\w\n!"\#$%&\'()*+,./:;<=>?@\\\^_`{|}~-]', ' ', input_string)

    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0AThis\x0Dis\u2028a\u2029test. including some punctuation like `({~})' and even \\, and \" + words like <año> or numbers like \u1bb1\nText cant be \033[1m[bold]\033[0m or go \u2067backwards\u2069, but can also contain wide numbers like \uff11 or ０"

    print("Original String:")
    print(test_string)

    cleaned_string = replace_unknown_characters_with_space(test_string)

    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()
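One thing worth checking with the \w-based pattern above: for str patterns, \w matches Unicode word characters by default, so non-ASCII letters and digits survive. A minimal sketch of the difference, assuming the re.ASCII flag is acceptable for the use case; the helper name clean and the sample string are made up for illustration.

```python
import re

# Same character class as in the answer above.
PATTERN = r'[^][\w\n!"\#$%&\'()*+,./:;<=>?@\\\^_`{|}~-]'


def clean(text: str, flags: int = 0) -> str:
    # Hypothetical wrapper so both variants can be compared side by side.
    return re.sub(PATTERN, ' ', text, flags=flags)


if __name__ == "__main__":
    sample = "año \uff11"
    print(repr(clean(sample)))            # Unicode-aware \w keeps ñ and the wide digit
    print(repr(clean(sample, re.ASCII)))  # with re.ASCII, \w is limited to [a-zA-Z0-9_]
```

With re.ASCII the pattern behaves like an ASCII-only filter; without it, the result depends on which Unicode characters count as word characters.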

import re

def replace_unknown_characters_with_space(input_string):
    # Replace everything outside printable ASCII (\x20-\x7E) with a space,
    # adding \n to the negated class so line breaks are kept
    cleaned_string = re.sub(r'[^\x20-\x7E\n]', ' ', input_string)

    return cleaned_string

def main():
    test_string = "This is a test string with some unprintable characters:\nHello\x85World\x0DThis\nis\x2028a\x2029test."
    
    print("Original String:")
    print(test_string)
    
    cleaned_string = replace_unknown_characters_with_space(test_string)
    
    print("\nCleaned String:")
    print(cleaned_string)

if __name__ == "__main__":
    main()

Output:

Original String:
This is a test string with some unprintable characters:
Hello
Thisd
is 28a 29test.

Cleaned String:
This is a test string with some unprintable characters:
Hello World This
is 28a 29test.

\n is no longer replaced.
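The reason the original pattern replaced the line break is that \n is code point 0x0A, which lies below the \x20 lower bound of the class, so the negated class matches it. A minimal sketch to confirm this; the constant names and the sample string are made up for illustration.

```python
import re

ORIGINAL = r'[^\x20-\x7E]'    # pattern from the question
FIXED = r'[^\x20-\x7E\n]'     # same class with \n added

sample = "line one\nline\ttwo"

print(hex(ord("\n")))                       # 0xa, i.e. below 0x20
print(repr(re.sub(ORIGINAL, ' ', sample)))  # newline and tab both become spaces
print(repr(re.sub(FIXED, ' ', sample)))     # only the tab becomes a space
```

The same approach extends to other characters worth keeping, for example adding \t to the class if tabs should survive.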
