如何判断一个元素在亿级数据中是否存在？ (2)

日期：2021-06-17 栏目：程序人生浏览：次

基于以上的 Hash 冲突的前提，所以 Bloom Filter 有一定的误报率，这个误报率和 Hash 算法的次数 H，以及数组长度 L 都是有关的。

自己实现一个布隆过滤

算法其实很简单不难理解，于是利用 Java 实现了一个简单的雏形。

public class BloomFilters { /** * 数组长度 */ private int arraySize; /** * 数组 */ private int[] array; public BloomFilters(int arraySize) { this.arraySize = arraySize; array = new int[arraySize]; } /** * 写入数据 * @param key */ public void add(String key) { int first = hashcode_1(key); int second = hashcode_2(key); int third = hashcode_3(key); array[first % arraySize] = 1; array[second % arraySize] = 1; array[third % arraySize] = 1; } /** * 判断数据是否存在 * @param key * @return */ public boolean check(String key) { int first = hashcode_1(key); int second = hashcode_2(key); int third = hashcode_3(key); int firstIndex = array[first % arraySize]; if (firstIndex == 0) { return false; } int secondIndex = array[second % arraySize]; if (secondIndex == 0) { return false; } int thirdIndex = array[third % arraySize]; if (thirdIndex == 0) { return false; } return true; } /** * hash 算法1 * @param key * @return */ private int hashcode_1(String key) { int hash = 0; int i; for (i = 0; i < key.length(); ++i) { hash = 33 * hash + key.charAt(i); } return Math.abs(hash); } /** * hash 算法2 * @param data * @return */ private int hashcode_2(String data) { final int p = 16777619; int hash = (int) 2166136261L; for (int i = 0; i < data.length(); i++) { hash = (hash ^ data.charAt(i)) * p; } hash += hash << 13; hash ^= hash >> 7; hash += hash << 3; hash ^= hash >> 17; hash += hash << 5; return Math.abs(hash); } /** * hash 算法3 * @param key * @return */ private int hashcode_3(String key) { int hash, i; for (hash = 0, i = 0; i < key.length(); ++i) { hash += key.charAt(i); hash += (hash << 10); hash ^= (hash >> 6); } hash += (hash << 3); hash ^= (hash >> 11); hash += (hash << 15); return Math.abs(hash); } }

首先初始化了一个 int 数组。

写入数据的时候进行三次 hash 运算，同时把对应的位置置为 1。

查询时同样的三次 hash 运算，取到对应的值，一旦值为 0 ，则认为数据不存在。

实现逻辑其实就和上文描述的一样。

下面来测试一下，同样的参数：

-Xms64m -Xmx64m -XX:+PrintHeapAtGC @Test public void bloomFilterTest(){ long star = System.currentTimeMillis(); BloomFilters bloomFilters = new BloomFilters(10000000) ; for (int i = 0; i < 10000000; i++) { bloomFilters.add(i + "") ; } Assert.assertTrue(bloomFilters.check(1+"")); Assert.assertTrue(bloomFilters.check(2+"")); Assert.assertTrue(bloomFilters.check(3+"")); Assert.assertTrue(bloomFilters.check(999999+"")); Assert.assertFalse(bloomFilters.check(400230340+"")); long end = System.currentTimeMillis(); System.out.println("执行时间：" + (end - star)); }

执行结果如下：

如何判断一个元素在亿级数据中是否存在？

只花了 3 秒钟就写入了 1000W 的数据同时做出来准确的判断。

如何判断一个元素在亿级数据中是否存在？

当让我把数组长度缩小到了 100W 时就出现了一个误报，400230340 这个数明明没在集合里，却返回了存在。

这也体现了 Bloom Filter 的误报率。

我们提高数组长度以及 hash 计算次数可以降低误报率，但相应的 CPU、内存的消耗就会提高；这就需要根据业务需要自行权衡。

Guava 实现

如何判断一个元素在亿级数据中是否存在？

刚才的方式虽然实现了功能，也满足了大量数据。但其实观察 GC 日志非常频繁，同时老年代也使用了 90%，接近崩溃的边缘。

总的来说就是内存利用率做的不好。

其实 Google Guava 库中也实现了该算法，下面来看看业界权威的实现。

-Xms64m -Xmx64m -XX:+PrintHeapAtGC @Test public void guavaTest() { long star = System.currentTimeMillis(); BloomFilter<Integer> filter = BloomFilter.create( Funnels.integerFunnel(), 10000000, 0.01); for (int i = 0; i < 10000000; i++) { filter.put(i); } Assert.assertTrue(filter.mightContain(1)); Assert.assertTrue(filter.mightContain(2)); Assert.assertTrue(filter.mightContain(3)); Assert.assertFalse(filter.mightContain(10000000)); long end = System.currentTimeMillis(); System.out.println("执行时间：" + (end - star)); }

也是同样写入了 1000W 的数据，执行没有问题。

如何判断一个元素在亿级数据中是否存在？