"Bad words" filter

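Not very technical, but... I have to implement a bad-words filter in a new site we are developing. So I need a "good" bad-words list to feed my database with... any hints or directions? Looking around on Google I found this one, and it's a start, but nothing more.

Yes, I know this kind of filter is easily evaded... but the client is the client! :-)

The site will have to filter out both English and Italian words, but for Italian I can ask my colleagues to help me with a community-built list of "parolacce" :-) An email will do.

Thanks for your help.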

Beware of the clbuttic mistake.

"Apple made the clbuttic mistake of forcing out their visionary - I mean, look at what NeXT has been up to!"

Hmm. "clbuttic".

Google "clbuttic" - thousands of hits!

There's someone who calls his car 'clbuttic'.

There are "Clbuttic Steam Engine" message boards.

Webster's dictionary - no help.

Hmm. What can this be?

HINT: People who make buttumptions about their regex scripts will be embarbutted when they repeat this mbuttive mistake.

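I didn't see any language specified, but you can use this for PHP. It generates a RegEx for each word you enter, so that even intentional misspellings (e.g. @ss, i3itch) will be caught.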


<?php

/**
 * @author unkwntech@unkwndesign.com
 **/

// Regex fragment for each letter and its common look-alikes.
// No spaces inside the classes (a space there would match a literal space),
// and multi-character substitutes (I3, ph, vv) use alternation.
$replace = [
    'a' => '[aA@]',              'b' => '(?:[bB]|[Ili]3)',
    'c' => '[cC(kK]',            'd' => '[dD]',
    'e' => '[eE3]',              'f' => '(?:[fF]|[pP][hH])',
    'g' => '[gG6]',              'h' => '[hH]',
    'i' => '[iIl!1]',            'j' => '[jJ]',
    'k' => '[cC(kK]',            'l' => '[lL1!i]',
    'm' => '[mM]',               'n' => '[nN]',
    'o' => '[oO0]',              'p' => '[pP]',
    'q' => '[qQ9]',              'r' => '[rR]',
    's' => '[sS$5]',             't' => '[tT7]',
    'u' => '[uUvV]',             'v' => '[vVuU]',
    'w' => '(?:[wW]|vv|VV)',     'x' => '[xX]',
    'y' => '[yY]',               'z' => '[zZ2]',
];

if (($_GET['act'] ?? '') === 'do') {
    $word = str_split(strtolower($_POST['word'] ?? ''));
    foreach ($word as $i => $char) {
        // Letters expand to their look-alike fragment; digits, spaces and
        // punctuation are escaped so they match literally.
        $word[$i] = $replace[$char] ?? preg_quote($char, '/');
    }
    //$word = "/" . implode('', $word) . "/";
    echo htmlspecialchars(implode('', $word));
}

if (($_GET['act'] ?? '') === 'list') {
    // The mysql_* functions were removed in PHP 7; mysqli replaces them.
    $link = mysqli_connect('localhost', 'username', 'password', 'peoples');
    $result = mysqli_query($link, 'SELECT word FROM filters');
    while ($row = mysqli_fetch_assoc($result)) {
        echo htmlspecialchars($row['word']) . "<br />";
    }
}
?>
<html>
<head>
<title>RegEx Generator</title>
</head>
<body>
<form action='badword.php?act=do' method='post'>
Word: <input type='text' name='word' /><br />
<input type='submit' value='Generate' />
</form>
<a href='badword.php?act=list'>List Words</a>
</body>
</html>
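
To show how a generated pattern might then be applied, here is a minimal sketch that wraps the same idea in a function and tests it against obfuscated input. The function name `leetPattern` and the shortened substitution table are my own assumptions for illustration, not part of the script above. The lookarounds at both ends are also my addition: they keep the pattern from matching inside a longer word, and unlike `\b` they still work when the match starts with a symbol such as `@`.

```php
<?php
// Hypothetical helper: builds a case-insensitive regex that matches $word
// and simple look-alike spellings of it.
function leetPattern(string $word): string
{
    // Illustrative subset of a substitution table (assumption, not exhaustive).
    $subs = [
        'a' => '[a@]',
        'b' => '(?:b|[il]3)',
        'i' => '[il!1]',
        's' => '[s$5]',
        't' => '[t7]',
    ];

    $body = '';
    foreach (str_split(strtolower($word)) as $char) {
        // Characters without an entry are escaped and matched literally.
        $body .= $subs[$char] ?? preg_quote($char, '/');
    }

    // Reject matches that are glued to a letter or digit on either side.
    return '/(?<![a-z0-9])' . $body . '(?![a-z0-9])/i';
}

var_dump(preg_match(leetPattern('ass'), 'what an @$$'));   // int(1)
var_dump(preg_match(leetPattern('ass'), 'a classic car')); // int(0)
```

A table like this only catches the substitutions you thought of, so it narrows the gap rather than closing it.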

Shutterstock has a GitHub repository containing a list of bad words used for filtering.

You can check it out here: https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
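
As far as I can tell, that repository keeps one term per line in plain-text files named by language code (for example `en` and `it`). Assuming that layout, a minimal sketch of loading such a file and checking a message against it could look like this. The function names are hypothetical, and the example writes a small stand-in file with harmless placeholder words so that it runs without a download.

```php
<?php
// Hypothetical helper: reads a one-term-per-line file into a lookup table.
function loadWordList(string $path): array
{
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        throw new RuntimeException("Cannot read word list: $path");
    }
    // Trim stray whitespace (e.g. CR from CRLF files) and lower-case each term.
    $terms = array_map(fn($line) => strtolower(trim($line)), $lines);
    // Terms become array keys so each lookup is a hash access.
    return array_fill_keys($terms, true);
}

// Hypothetical helper: returns the listed terms that appear as whole tokens.
function findListedWords(string $text, array $list): array
{
    // Split on anything that is not a letter, digit or apostrophe.
    $tokens = preg_split("/[^\\p{L}\\p{N}']+/u", strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    return array_values(array_filter($tokens, fn($t) => isset($list[$t])));
}

// Stand-in file; in practice point $path at a downloaded copy of the list.
$path = tempnam(sys_get_temp_dir(), 'words');
file_put_contents($path, "darn\nheck\n");
$list = loadWordList($path);
unlink($path);

print_r(findListedWords('Heck, that was a darn good checkmate', $list));
// Finds "heck" and "darn"; "checkmate" is left alone.
```

Because this compares whole tokens, it will not flag words that merely contain a listed term, but it also misses multi-word entries and obfuscated spellings.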

If anyone needs an API, Google currently provides a bad-word indicator.


http://www.wdyl.com/profanity?q=naughtyword

{
response:"false"
}

Update: Google has since removed this service.

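+1 on the clbuttic mistake. I think it is very important for a "bad words" filter to check for leading and trailing spaces (e.g. " ass ") rather than just the exact string, so that we don't end up with mangled words like clbuttic.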
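
A small sketch of the difference, assuming a simple replace-with-mask filter (the function names are mine). Instead of literal spaces it uses the regex word boundary `\b`, which also handles a word at the start or end of the text or next to punctuation.

```php
<?php
// Naive replacement: rewrites every occurrence, even inside other words.
function censorNaive(string $text, string $word, string $mask): string
{
    return str_ireplace($word, $mask, $text);
}

// Whole-word replacement: \b only matches at the edge of a word.
function censorWholeWord(string $text, string $word, string $mask): string
{
    $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
    return preg_replace($pattern, $mask, $text);
}

echo censorNaive('a classic assumption', 'ass', 'butt'), "\n";     // a clbuttic buttumption
echo censorWholeWord('a classic assumption', 'ass', 'butt'), "\n"; // a classic assumption
echo censorWholeWord('kick his ass', 'ass', 'butt'), "\n";         // kick his butt
```

Note that `preg_replace` treats `$1` and `\1` in the mask as backreferences, so a mask containing those characters would need escaping.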

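I would say: just remove posts as you become aware of them, and block users who are too explicit in their posts. You can say very offensive things without using any swear words. If you block the word "ass" (aka donkey), people will just type a$$ or /\55, or whatever else they need to type to get past the filter.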

Wikipedia's ClueBot has a bad-word filter; have a read of its source code.

http://en.wikipedia.org/wiki/User:ClueBot/Source#Score_list

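You could always convince the client to hold a session with the users who simply keep posting expletives, and provide an easy way to add those words to the system. It is a lot of work, but it would probably be more representative of the community.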

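While researching this topic, I determined that more is needed than just a list with arbitrary replacements. I built a web service that lets you decide how "clean" you want the result to be. It also makes an effort to identify false positives, that is, a word that may be bad in one context but not in another.
Take a look at http://filterlanguage.com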