关于Java：最快的高斯模糊实现

Fastest Gaussian blur implementation

如何实现最快的高斯模糊算法？

我将用Java实现它，因此排除了GPU解决方案。我的应用程序planetGenesis是跨平台的，所以我不需要JNI。

您应该使用一个事实，即高斯核是可分离的。 e。您可以将2D卷积表示为两个1D卷积的组合。

如果滤波器很大，那么使用空间域中的卷积等效于频率(傅里叶)域中的乘法这一事实也是有意义的。这意味着您可以对图像和滤镜进行傅立叶变换，将(复杂)结果相乘，然后进行傅立叶逆变换。 FFT(快速傅立叶变换)的复杂度为O(n log n)，而卷积的复杂度为O(n ^ 2)。另外，如果您需要使用同一滤镜模糊许多图像，则只需对滤镜进行一次FFT。

如果您决定继续使用FFT，则FFTW库是一个不错的选择。

数学笑话很可能知道这一点，但对于其他人来说。

由于高斯的数学性质很好，您可以通过首先在图像的每一行上运行一维高斯模糊，然后在每一列上运行一维模糊，来快速模糊2D图像。

我找到了Quasimondo：孵化器：处理：快速高斯模糊。此方法包含许多近似值，例如使用整数和查找表，而不是浮点数和浮点除法。我不知道现代Java代码中有多少加速。

矩形上的快速阴影具有使用B样条的近似算法。

C＃中的快速高斯模糊算法声称具有一些很酷的优化。

而且，David Everly的Fast Gaussian Blur(PDF)具有用于Gaussian模糊处理的快速方法。

我将尝试各种方法，对其进行基准测试，然后将结果发布在此处。

出于我的目的，我从Internet复制并实现了基本(独立处理X-Y轴)方法和David Everly的Fast Gaussian Blur方法。它们的参数不同，所以我无法直接比较它们。但是，对于较大的模糊半径，后者经过的迭代次数要少得多。同样，后者是一种近似算法。

最终解决方案

我对如此多的信息和实现感到非常困惑，我不知道应该信任哪一个。弄清楚之后，我决定写自己的文章。我希望它可以节省您的时间。

最快的高斯模糊(线性时间)

它包含源代码，(希望如此)源代码简短，干净并且可以轻易地重写为任何其他语言。请投票，以便其他人可以看到。

您可能希望框模糊，这要快得多。请参阅此链接，获取出色的教程以及一些复制和粘贴C代码的信息。

对于较大的模糊半径，请尝试应用三次框模糊。这将很好地近似高斯模糊，并且比真正的高斯模糊快得多。

我会考虑为此使用CUDA或其他一些GPU编程工具包，特别是如果您想使用更大的内核。否则，总会在组装中手动调整循环。

步骤1：SIMD一维高斯模糊
步骤2：移调
步骤3：重复步骤1

最好在小块上完成，因为全图像转置很慢，而小块转置可以使用一连串的PUNPCK(PUNPCKHBW，PUNPCKHDQ，PUNPCKHWD，PUNPCKLBW，PUNPCKLDQ，PUNPCKLWD)来完成。

我在研究过程中为这个问题而苦苦挣扎，并尝试了一种有趣且有趣的方法来实现快速高斯模糊。首先，如前所述，最好将模糊分为两个一维模糊，但是根据实际用于像素值计算的硬件，您实际上可以预先计算所有可能的值并将它们存储在查找表中。

换句话说，预先计算Gaussian coefficient * input pixel value的每个组合。当然，您将需要谨慎对待系数，但是我只想添加此解决方案。如果您订阅了IEEE，则??可以使用查找表进行实时特征提取，以获取快速图像模糊中的更多内容。

最终，我最终还是使用了CUDA :)

我已经将Ivan Kuckir的快速高斯模糊实现转换为Java，它使用了三遍线性框模糊处理。如他在自己的博客中所述，结果处理为O(n)。如果您想了解更多有关为什么3个时间框模糊近似于高斯模糊(3％)我的朋友的信息，您可以查看框模糊和高斯模糊。

这是java实现。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127

@Override
public BufferedImage ProcessImage(BufferedImage image) {
int width = image.getWidth();
int height = image.getHeight();

int[] pixels = image.getRGB(0, 0, width, height, null, 0, width);
int[] changedPixels = new int[pixels.length];

FastGaussianBlur(pixels, changedPixels, width, height, 12);

BufferedImage newImage = new BufferedImage(width, height, image.getType());
newImage.setRGB(0, 0, width, height, changedPixels, 0, width);

return newImage;
}

private void FastGaussianBlur(int[] source, int[] output, int width, int height, int radius) {
ArrayList<Integer> gaussianBoxes = CreateGausianBoxes(radius, 3);
BoxBlur(source, output, width, height, (gaussianBoxes.get(0) - 1) / 2);
BoxBlur(output, source, width, height, (gaussianBoxes.get(1) - 1) / 2);
BoxBlur(source, output, width, height, (gaussianBoxes.get(2) - 1) / 2);
}

private ArrayList<Integer> CreateGausianBoxes(double sigma, int n) {
double idealFilterWidth = Math.sqrt((12 * sigma * sigma / n) + 1);

int filterWidth = (int) Math.floor(idealFilterWidth);

if (filterWidth % 2 == 0) {
filterWidth--;
}

int filterWidthU = filterWidth + 2;

double mIdeal = (12 * sigma * sigma - n * filterWidth * filterWidth - 4 * n * filterWidth - 3 * n) / (-4 * filterWidth - 4);
double m = Math.round(mIdeal);

ArrayList<Integer> result = new ArrayList<>();

for (int i = 0; i < n; i++) {
result.add(i < m ? filterWidth : filterWidthU);
}

return result;
}

private void BoxBlur(int[] source, int[] output, int width, int height, int radius) {
System.arraycopy(source, 0, output, 0, source.length);
BoxBlurHorizantal(output, source, width, height, radius);
BoxBlurVertical(source, output, width, height, radius);
}

private void BoxBlurHorizontal(int[] sourcePixels, int[] outputPixels, int width, int height, int radius) {
int resultingColorPixel;
float iarr = 1f / (radius + radius);
for (int i = 0; i < height; i++) {
int outputIndex = i * width;
int li = outputIndex;
int sourceIndex = outputIndex + radius;

int fv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex]);
int lv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex + width - 1]);
float val = (radius) * fv;

for (int j = 0; j < radius; j++) {
val += Byte.toUnsignedInt((byte) (sourcePixels[outputIndex + j]));
}

for (int j = 0; j < radius; j++) {
val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex++]) - fv;
resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
}

for (int j = (radius + 1); j < (width - radius); j++) {
val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex++]) - Byte.toUnsignedInt((byte) sourcePixels[li++]);
resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
}

for (int j = (width - radius); j < width; j++) {
val += lv - Byte.toUnsignedInt((byte) sourcePixels[li++]);
resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
}
}
}

private void BoxBlurVertical(int[] sourcePixels, int[] outputPixels, int width, int height, int radius) {
int resultingColorPixel;
float iarr = 1f / (radius + radius + 1);
for (int i = 0; i < width; i++) {
int outputIndex = i;
int li = outputIndex;
int sourceIndex = outputIndex + radius * width;

int fv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex]);
int lv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex + width * (height - 1)]);
float val = (radius + 1) * fv;

for (int j = 0; j < radius; j++) {
val += Byte.toUnsignedInt((byte) sourcePixels[outputIndex + j * width]);
}
for (int j = 0; j <= radius; j++) {
val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex]) - fv;
resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
sourceIndex += width;
outputIndex += width;
}
for (int j = radius + 1; j < (height - radius); j++) {
val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex]) - Byte.toUnsignedInt((byte) sourcePixels[li]);
resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
li += width;
sourceIndex += width;
outputIndex += width;
}
for (int j = (height - radius); j < height; j++) {
val += lv - Byte.toUnsignedInt((byte) sourcePixels[li]);
resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
li += width;
outputIndex += width;
}
}
}

在1D中：

反复使用几乎所有内核进行模糊处理都倾向于使用高斯内核。这就是高斯分布的奇妙之处，也是统计学家喜欢它的原因。因此，请选择易于模糊的内容并将其应用几次。

例如，使用盒形内核很容易模糊。首先计算一个累计和：

1	y(i) = y(i-1) + x(i)

然后：

1	blurred(i) = y(i+radius) - y(i-radius)

重复几次。

或者您可能会使用各种IIR滤波器来回移动几次，这些速度同样快。

在2D或更高版本中：

正如DarenW所说，每个维度都一个接一个地模糊。

来自CWP的Dave Hale有一个minejtk软件包，其中包括递归高斯滤波器(Deriche方法和Van Vliet方法)。可以在https://github.com/dhale/jtk/blob/0350c23f91256181d415ea7369dbd62855ac4460/core/src/main/java/edu/mines/jtk/dsp/RecursiveGaussianFilter.java找到Java子例程

对于高斯模糊(以及对于高斯的导数)，Deriche的方法似乎是一种非常好的方法。

有几种快速的2d数据高斯模糊方法。您应该知道的。

这是可分离的滤波器，因此只需要两个1d卷积。

对于大内核，您可以处理缩减的图像副本，而不是进行高端缩放。

可以通过多个盒式滤波器(也可分离)来实现良好的近似(可以调整迭代次数和内核大小)

现有的O(n)复杂度算法(对于任何内核大小)都可以通过IIR滤波器进行精确的高斯近似。

您的选择取决于所需的速度，精度和实现复杂性。

尝试按照此处的方法使用Box Blur：
使用扩展框模糊逼近高斯模糊

这是最好的近似值。

使用积分图像可以使速度更快。
如果这样做，请分享您的解决方案。

用现在(截至2016年)已实现的新库回答这个老问题，因为Java的GPU技术有了许多新的进步。

正如其他一些答案所建议的那样，CUDA是一种替代方法。但是Java现在已经支持CUDA。