A Brief Analysis and Application of Text Similarity Algorithms in .NET: Cosine Similarity and SimHash

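Principle: first we segment the two texts into words and list every word that occurs; next we count how often each word appears in each text; finally we turn those counts into vectors. Comparing the two texts then comes down to measuring how similar two vectors are.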
 
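We can illustrate this with a simple example:

Text 1: 我/爱/北京/天安门/ (I / love / Beijing / Tiananmen). After word segmentation and term-frequency counting we get the (pseudo) vector [1,1,1,1]

Text 2: 我们/都爱/北京/天安门/ (we / all love / Beijing / Tiananmen). After word segmentation and term-frequency counting we get the (pseudo) vector [1,0,1,2]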
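To make the counting step concrete, here is a minimal C# sketch that builds term-frequency vectors over a shared vocabulary. The names TermFrequencyDemo and BuildVectors are made up for this illustration and are not part of the article's code, and the input is assumed to be already segmented into words. Counting over the full shared vocabulary of the two example texts gives six-dimensional vectors, whereas the vectors quoted above are simplified pseudo-vectors.

```csharp
using System;
using System.Collections.Generic;

public static class TermFrequencyDemo
{
    // Builds one term-frequency vector per document over the shared vocabulary.
    // Assumption: each document has already been segmented into words.
    public static int[][] BuildVectors(string[][] tokenizedDocs, out List<string> vocabulary)
    {
        // The vocabulary holds every distinct word, in order of first appearance.
        vocabulary = new List<string>();
        var index = new Dictionary<string, int>();
        foreach (string[] doc in tokenizedDocs)
        {
            foreach (string word in doc)
            {
                if (!index.ContainsKey(word))
                {
                    index[word] = vocabulary.Count;
                    vocabulary.Add(word);
                }
            }
        }

        // Count how often each vocabulary word occurs in each document.
        var vectors = new int[tokenizedDocs.Length][];
        for (int d = 0; d < tokenizedDocs.Length; d++)
        {
            vectors[d] = new int[vocabulary.Count];
            foreach (string word in tokenizedDocs[d])
                vectors[d][index[word]]++;
        }
        return vectors;
    }

    public static void Main()
    {
        var docs = new[]
        {
            new[] { "我", "爱", "北京", "天安门" },
            new[] { "我们", "都爱", "北京", "天安门" }
        };
        int[][] v = BuildVectors(docs, out List<string> vocab);
        Console.WriteLine(string.Join("/", vocab)); // 我/爱/北京/天安门/我们/都爱
        Console.WriteLine(string.Join(",", v[0]));  // 1,1,1,1,0,0
        Console.WriteLine(string.Join(",", v[1]));  // 0,0,1,1,1,1
    }
}
```

The dictionary lookup keeps vocabulary building linear in the number of words, which matters once the documents grow beyond toy size.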
 
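We can picture the two vectors as line segments in space, both starting at the origin ([0, 0, ...]) and pointing in different directions. The two segments form an angle. If the angle is 0 degrees, they point the same way and overlap; if it is 90 degrees, they form a right angle and the directions have nothing in common; if it is 180 degrees, they point in exactly opposite directions. So we can judge how similar two vectors are by the size of the angle between them: the smaller the angle, the more similar they are. In practice we compare the cosine of the angle rather than the angle itself, so a value closer to 1 means more similar texts.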
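The angle is usually measured through its cosine. As a worked example, here is the standard cosine formula applied to the two pseudo-vectors from the example, A = [1,1,1,1] and B = [1,0,1,2] (the symbols A, B, and θ are introduced here for illustration):

```latex
\[
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
\[
A \cdot B = 1\cdot 1 + 1\cdot 0 + 1\cdot 1 + 1\cdot 2 = 4, \qquad
\lVert A \rVert = \sqrt{1+1+1+1} = 2, \qquad
\lVert B \rVert = \sqrt{1+0+1+4} = \sqrt{6}
\]
\[
\cos\theta = \frac{4}{2\sqrt{6}} \approx 0.816, \qquad \theta \approx 35.3^{\circ}
\]
```

Because term frequencies are never negative, the cosine of two term-frequency vectors always falls between 0 and 1.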
 
Core algorithm in C#:

    using System;
    using System.Collections;

    public class TFIDFMeasure
    {
        private string[] _docs;          // the input documents
        private string[][] _ngramDoc;
        private int _numDocs=0;          // number of input documents
        private int _numTerms=0;
        private ArrayList _terms;
        private int[][] _termFreq;
        private float[][] _termWeight;
        private int[] _maxTermFreq;
        private int[] _docFreq;
 
        public class TermVector
        {
            // Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal.
            public static float ComputeCosineSimilarity(float[] vector1, float[] vector2)
            {
                if (vector1.Length != vector2.Length)
                    throw new ArgumentException("Vectors must have the same length.");

                float denom = VectorLength(vector1) * VectorLength(vector2);

                // A zero-length vector has no direction, so report no similarity.
                if (denom == 0F)
                    return 0F;

                return InnerProduct(vector1, vector2) / denom;
            }

            // Dot product: the sum of the element-wise products.
            public static float InnerProduct(float[] vector1, float[] vector2)
            {
                if (vector1.Length != vector2.Length)
                    throw new ArgumentException("Vectors must have the same length.");

                float result = 0F;
                for (int i = 0; i < vector1.Length; i++)
                    result += vector1[i] * vector2[i];

                return result;
            }

            // Euclidean length (L2 norm) of the vector.
            public static float VectorLength(float[] vector)
            {
                float sum = 0.0F;
                for (int i = 0; i < vector.Length; i++)
                    sum += vector[i] * vector[i];

                return (float)Math.Sqrt(sum);
            }
        }
 
        private IDictionary _wordsIndex = new Hashtable();

        public TFIDFMeasure(string[] documents)
        {
            _docs = documents;
            _numDocs = documents.Length;
            MyInit();
        }

        // Empty in the original listing.
        private void GeneratNgramText()
        {

        }

        // Collects every distinct word across all documents, in order of first appearance.
        // Tokeniser is not shown in this listing.
        private ArrayList GenerateTerms(string[] docs)
        {
            ArrayList uniques = new ArrayList();
            _ngramDoc = new string[_numDocs][];
            for (int i = 0; i < docs.Length; i++)
            {
                Tokeniser tokenizer = new Tokeniser();
                string[] words = tokenizer.Partition(docs[i]);

                for (int j = 0; j < words.Length; j++)
                    if (!uniques.Contains(words[j]))
                        uniques.Add(words[j]);
            }
            return uniques;
        }
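As a quick sanity check of the cosine computation, the following standalone sketch applies it to the two pseudo-vectors from the example. It is not the article's TFIDFMeasure class: CosineDemo and Cosine are hypothetical names chosen for this illustration, and the sketch uses double rather than float to reduce rounding error.

```csharp
using System;
using System.Globalization;

public static class CosineDemo
{
    // Cosine similarity in double precision.
    // Returns 0 when either vector is all zeros, mirroring the listing above.
    public static double Cosine(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];     // running dot product
            normA += a[i] * a[i];   // squared length of a
            normB += b[i] * b[i];   // squared length of b
        }

        double denom = Math.Sqrt(normA) * Math.Sqrt(normB);
        return denom == 0 ? 0 : dot / denom;
    }

    public static void Main()
    {
        double[] text1 = { 1, 1, 1, 1 };
        double[] text2 = { 1, 0, 1, 2 };

        // Invariant culture keeps the decimal separator a dot on every machine.
        Console.WriteLine(Cosine(text1, text2).ToString("F4", CultureInfo.InvariantCulture)); // 0.8165
        Console.WriteLine(Cosine(text1, text1).ToString("F4", CultureInfo.InvariantCulture)); // 1.0000
    }
}
```

A vector compared with itself scores 1, and the two example texts score about 0.82, which matches the intuition that they overlap heavily but are not identical.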

When reposting, please credit the source: https://www.heiqu.com/wjzzzx.html