Stripping non-printable characters from a string in Python

In Perl, I used to run

$s =~ s/[^[:print:]]//g;

to get rid of non-printable characters.

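In Python there are no POSIX regex classes, so I can't write [:print:] and have it mean what I want. I also know of no way in Python to detect whether a character is printable.

What would you do?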

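EDIT: It has to support Unicode characters as well. The string.printable approach will happily strip them out of the output, and curses.ascii.isprint returns false for any non-ASCII Unicode character.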

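Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See the Unicode Character Database for descriptions of the categories.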

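import unicodedata, re

# Python 2 code: unichr and xrange do not exist in Python 3.
# Collect every control character (Unicode general category Cc).
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0,32) + range(127,160)))

# Escape the characters so they are taken literally inside the class.
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)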
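The code above targets Python 2. A minimal Python 3 sketch of the same idea might look like the following; it assumes that only the Cc (control) category should be removed, exactly as in the original, and swaps unichr for chr and list concatenation for the Python 2 range addition.

```python
import re
import unicodedata

# Cc (control) characters are U+0000-U+001F and U+007F-U+009F.
control_chars = ''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))

# Escape the characters so they are taken literally inside the class.
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    """Return s with every control character removed."""
    return control_char_re.sub('', s)

print(remove_control_chars('Hello\x00 world\x7f'))  # Hello world
```

Note that tabs and newlines are control characters too, so this strips them as well; drop them from control_chars if they should survive.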

You could try setting up a filter using the unicodedata.category() function:

import unicodedata

# Categories to keep: only upper- and lowercase letters here; extend as needed (e.g. 'Nd', 'Zs').
printable = {'Lu', 'Ll'}

def filter_non_printable(s):
    return ''.join(c for c in s if unicodedata.category(c) in printable)

See Table 4-9 on page 175 of the Unicode database character properties for the available categories.
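Listing every category to keep can get long. One possible variation, sketched here as an assumption rather than as part of the answer above, is to reject by major category instead: drop every character whose category starts with 'C' (control, format, surrogate, private use, unassigned) and keep the rest. The function name strip_other_category is made up for this illustration.

```python
import unicodedata

def strip_other_category(text):
    """Drop characters whose general category starts with 'C'.

    That covers Cc (control), Cf (format), Cs (surrogate),
    Co (private use) and Cn (unassigned).
    """
    return ''.join(
        c for c in text
        if not unicodedata.category(c).startswith('C')
    )

print(strip_other_category('Caf\u00e9 1, 2\x00'))  # Café 1, 2
```

Letters, digits, punctuation and spaces all pass through, while newlines and tabs (category Cc) are removed.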

In Python 3,

import string

def filter_nonprintable(text):
    # Get the difference of all ASCII characters from the set of printable characters
    nonprintable = set(chr(i) for i in range(128)).difference(string.printable)
    # Use translate to remove all non-printable characters
    return text.translate({ord(character): None for character in nonprintable})

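See this StackOverflow post on removing punctuation for how .translate() compares to regex and .replace().

This function uses a generator expression with str.join, so it runs in linear time instead of O(n^2):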

from curses.ascii import isprint

def printable(text):
    # isprint only accepts printable ASCII, so non-ASCII characters are dropped too
    return ''.join(char for char in text if isprint(char))

In Python there's no POSIX regex classes

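There are when using the regex library: https://pypi.org/project/regex/

It is well maintained and supports Unicode regex, POSIX regex and more. The usage (method signatures) is very similar to Python's re.

From the documentation: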

[[:alpha:]]; [[:^alpha:]]

POSIX character classes are supported. These
are normally treated as an alternative form of \p{...}.

(I have no affiliation, I'm just a user.)

The one below performs faster than the others above. Take a look:

import string

''.join([x if x in string.printable else '' for x in Str])
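One caveat with this one-liner, as the question's edit points out, is that string.printable contains only ASCII characters, so non-ASCII text is stripped as well. The small sketch below makes that visible; the function name ascii_printable_only is made up for the illustration, and the set is an optional tweak for faster membership tests.

```python
import string

def ascii_printable_only(text):
    """Keep only the characters listed in string.printable (ASCII only)."""
    # A set gives constant-time membership checks.
    allowed = set(string.printable)
    return ''.join(c for c in text if c in allowed)

print(ascii_printable_only('\x00Hello'))  # Hello
print(ascii_printable_only('Caf\u00e9'))  # Caf
```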

The following will work with Unicode input and is rather fast...

import sys

# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
    i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}

def make_printable(s):
    """Remove non-printable characters from a string."""

    # the translate method on str removes characters
    # that map to None from the string
    return s.translate(NOPRINT_TRANS_TABLE)

assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''

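My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.

Yet another option in Python 3: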
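To check the timing claim on your own machine, a rough harness along these lines could be used. It is only a sketch: the sample string, the repeat count and the two function names are assumptions made for the illustration, and the numbers will vary with input and Python version.

```python
import sys
import timeit

# Every code point that str.isprintable() rejects maps to None.
table = {i: None for i in range(sys.maxunicode + 1) if not chr(i).isprintable()}

def with_translate(s):
    """Remove non-printable characters via str.translate."""
    return s.translate(table)

def with_join(s):
    """Remove non-printable characters by iterating and joining."""
    return ''.join(c for c in s if c.isprintable())

# Hypothetical test input mixing printable and non-printable characters.
sample = 'Caf\u00e9 \x00\x11Hello\x7f ' * 1000

# Both approaches should agree before their speed is compared.
assert with_translate(sample) == with_join(sample)

for fn in (with_translate, with_join):
    seconds = timeit.timeit(lambda: fn(sample), number=100)
    print(fn.__name__, seconds)
```

Building the table is a one-time cost, so it is kept outside the timed calls.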