.NET下文本相似度算法余弦定理和SimHash浅析及应用(4)

日期：2020-06-12 栏目：程序人生浏览：次

private int CountWords(string word, string[] words)
        {
            int itemIdx=Array.BinarySearch(words, word);

            if (itemIdx > 0)
                while (itemIdx > 0 && words[itemIdx].Equals(word))
                    itemIdx--;
            int count=0;
            while (itemIdx < words.Length && itemIdx >= 0)
            {
                if (words[itemIdx].Equals(word)) count++;
                itemIdx++;
                if (itemIdx < words.Length)
                    if (!words[itemIdx].Equals(word)) break;
            }
            return count;
        }
}

缺点：

由于有可能一个文章的特征向量词特别多导致整个向量维度很高，使得计算的代价太大不适合大数据量的计算。

SimHash原理：

算法的主要思想是降维，将高维的特征向量映射成一个f-bit的指纹(fingerprint)，通过比较两篇文章的f-bit指纹的Hamming Distance来确定文章是否重复或者高度近似。由于每篇文章我们都可以事先计算好Hamming Distance来保存，到时候直接通过Hamming Distance来计算，所以速度非常快适合大数据计算。

Google就是基于此算法实现网页文件查重的。我们假设有以下三段文本：

1，the cat sat on the mat

2，the cat sat on a mat

3，we all scream for ice cream

如何实现这种hash算法呢？以上述三个文本为例，整个过程可以分为以下六步：
1、选择simhash的位数，请综合考虑存储成本以及数据集的大小，比如说32位
2、将simhash的各位初始化为0
3、提取原始文本中的特征，一般采用各种分词的方式。比如对于"the cat sat on the mat"，采用两两分词的方式得到如下结果：{"th", "he", "e ", " c", "ca", "at", "t ", " s", "sa", " o", "on", "n ", " t", " m", "ma"}
4、使用传统的32位hash函数计算各个word的hashcode，比如："th".hash = -502157718
，"he".hash = -369049682，……
5、对各word的hashcode的每一位，如果该位为1，则simhash相应位的值加1；否则减1
6、对最后得到的32位的simhash，如果该位大于1，则设为1；否则设为0

转载注明出处：https://www.heiqu.com/wjzzzx.html

.NET下文本相似度算法余弦定理和SimHash浅析及应用(4)

相关推荐