Implementing Relevance Scoring for Full-Text Search in JavaScript (2)

Date: 2020-06-10 Category: Programmer Life

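This is "term frequency saturation." The original TF-IDF algorithm has no notion of saturation, so a document that mentions "baseball" 80 times scores twice as high as one that mentions it 40 times. Sometimes that is what we want, and sometimes it is not.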

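In addition, Okapi BM25 has a k1 parameter that controls how quickly saturation sets in. k1 usually falls between 1.2 and 2.0. The lower the value, the faster the saturation (meaning the two documents above end up with essentially the same score, because both contain a large number of mentions of "baseball").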
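The effect of k1 can be illustrated with a minimal sketch. It assumes the saturating term-frequency component of the standard BM25 formula, uses a raw occurrence count for tf, and leaves out the document-length term; the helper name saturate is hypothetical and does not come from the article's code.

```javascript
// Saturating term-frequency component of BM25 (document-length term omitted).
// `saturate` is a hypothetical helper name, not part of the article's code.
function saturate(tf, k1) {
  return (tf * (k1 + 1)) / (tf + k1);
}

// The raw term count doubles from 40 to 80 occurrences...
var rawRatio = 80 / 40;

// ...but the saturated values barely move, and are bounded above by k1 + 1.
var lowK1 = [saturate(40, 1.2), saturate(80, 1.2)];
var highK1 = [saturate(40, 2.0), saturate(80, 2.0)];

console.log(rawRatio);
console.log(lowK1, highK1);
```

With either k1 value the two documents score within a few percent of each other, and the lower k1 sits closer to its ceiling of k1 + 1 at the same count.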

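Field-length normalization normalizes a document's length against the average length of all documents. This is useful for single-field collections (such as ours), because it puts documents of different lengths on the same footing. It matters even more for two-field collections (such as "title" and "body"), where it likewise puts the title and body fields on the same footing. Field-length normalization is represented by b, which ranges from 0 to 1: 1 means full normalization and 0 means no normalization.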
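To make the role of b concrete, here is a small sketch of the length-normalization factor as it appears in the denominator of the standard BM25 formula. The helper names lengthNorm and termScore are hypothetical, tf is assumed to be a raw term count, and the IDF weight is left out.

```javascript
// Length-normalization factor from the BM25 denominator.
// `lengthNorm` is a hypothetical helper name used only for illustration.
function lengthNorm(docLength, avgDocLength, b) {
  return 1 - b + b * (docLength / avgDocLength);
}

// Per-term BM25 score without the IDF weight, assuming a raw term count `tf`.
function termScore(tf, docLength, avgDocLength, k1, b) {
  return (tf * (k1 + 1)) / (tf + k1 * lengthNorm(docLength, avgDocLength, b));
}

// Same term count, but one document is four times longer than the other.
var shortDoc = termScore(5, 100, 200, 1.2, 0.75);
var longDoc = termScore(5, 400, 200, 1.2, 0.75);

// With b = 0 the length is ignored and both documents score the same.
var shortNoNorm = termScore(5, 100, 200, 1.2, 0);
var longNoNorm = termScore(5, 400, 200, 1.2, 0);

console.log(shortDoc, longDoc, shortNoNorm, longNoNorm);
```

With b = 0.75 the shorter document scores higher for the same number of matches; with b = 0 the two scores are identical.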

The Algorithm

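You can find the formula for the Okapi algorithm on the Okapi BM25 Wikipedia page. Now that we know what each term in the formula stands for, it should be easy to understand, so we will skip the formula and go straight to the code: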

BM25.Tokenize = function(text) {
    text = text
        .toLowerCase()
        .replace(/\W/g, ' ')
        .replace(/\s+/g, ' ')
        .trim()
        .split(' ')
        .map(function(a) { return stemmer(a); });

    // Filter out stopStems
    var out = [];
    for (var i = 0, len = text.length; i < len; i++) {
        if (stopStems.indexOf(text[i]) === -1) {
            out.push(text[i]);
        }
    }

    return out;
};

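We define a simple static method, Tokenize(), whose job is to parse a string into an array of tokens. Along the way we lowercase all tokens (to reduce entropy). We run the Porter Stemmer algorithm to reduce entropy further and to improve matching ("walking" and "walk" match as the same term). We also filter out stop words (very common words) to reduce entropy a little more. I have written about these concepts in more depth elsewhere, so please bear with me if I gloss over this section.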
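The pipeline can be tried out with the following self-contained sketch. It assumes a crude stand-in for the stemmer and a tiny stop list: toyStemmer and toyStopStems are hypothetical and are not the Porter stemmer or the stop list the article relies on. Unlike the method above, the sketch also drops empty tokens.

```javascript
// Crude stand-in for a real Porter stemmer: strips a few common suffixes.
// This is NOT the Porter algorithm; it exists only to make the sketch runnable.
function toyStemmer(word) {
  return word.replace(/(ing|ed|s)$/, '');
}

// Hypothetical stop list, already in stemmed form.
var toyStopStems = ['the', 'a', 'in', 'and', 'of'];

function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/\W/g, ' ')   // punctuation and other non-word characters become spaces
    .replace(/\s+/g, ' ')  // collapse runs of whitespace
    .trim()
    .split(' ')
    .map(toyStemmer)
    .filter(function (token) {
      // Drop empty tokens and stop words.
      return token !== '' && toyStopStems.indexOf(token) === -1;
    });
}

console.log(tokenize('Walking in the park'));   // [ 'walk', 'park' ]
console.log(tokenize('He walked; she walks.')); // [ 'he', 'walk', 'she', 'walk' ]
```

A real stemmer handles far more cases than this suffix stripper, but the shape of the pipeline is the same: normalize, split, stem, filter.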

BM25.prototype.addDocument = function(doc) {
    if (typeof doc.id === 'undefined') {
        throw new Error('ID is a required property of documents.');
    }
    if (typeof doc.body === 'undefined') {
        throw new Error('Body is a required property of documents.');
    }

    // Raw tokenized list of words
    var tokens = BM25.Tokenize(doc.body);

    // Will hold unique terms and their counts and frequencies
    var _terms = {};

    // docObj will eventually be added to the documents database
    var docObj = {id: doc.id, tokens: tokens, body: doc.body};

    // Count number of terms
    docObj.termCount = tokens.length;

    // Increment totalDocuments
    this.totalDocuments++;

    // Readjust averageDocumentLength
    this.totalDocumentTermLength += docObj.termCount;
    this.averageDocumentLength = this.totalDocumentTermLength / this.totalDocuments;

    // Calculate term frequency
    // First get terms count
    var i, len, term;
    for (i = 0, len = tokens.length; i < len; i++) {
        term = tokens[i];
        if (!_terms[term]) {
            _terms[term] = { count: 0, freq: 0 };
        }
        _terms[term].count++;
    }

    // Then re-loop to calculate term frequency.
    // We'll also update inverse document frequencies here.
    var keys = Object.keys(_terms);
    for (i = 0, len = keys.length; i < len; i++) {
        term = keys[i];

        // Term Frequency for this document.
        _terms[term].freq = _terms[term].count / docObj.termCount;

        // Inverse Document Frequency initialization
        if (!this.terms[term]) {
            this.terms[term] = {
                n: 0, // Number of docs this term appears in, uniquely
                idf: 0
            };
        }
        this.terms[term].n++;
    }

    // Calculate inverse document frequencies
    // This is SLOWish so if you want to index a big batch of documents,
    // comment this out and run it once at the end of your addDocuments run
    // If you're only indexing a document or two at a time you can leave this in.
    // this.updateIdf();

    // Add docObj to docs db
    docObj.terms = _terms;
    this.documents[docObj.id] = docObj;
};

The addDocument() method is where the magic happens. We are essentially building and maintaining two similar data structures: this.documents and this.terms.

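this.documents is a database of all documents. For each document it stores the full original text, the document's length, and a list of every term in the document together with that term's count and frequency. With this data structure we can easily and quickly (yes, very quickly: a hash table lookup takes O(1) time) answer questions such as: how many times does the word 'walk' occur in document #3?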

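We also use a second data structure, this.terms. It represents every term in the corpus. With this data structure we can answer the following questions in O(1) time: how many documents does the word 'walk' appear in, and what is its idf value?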
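The updateIdf() method mentioned in the code comments is not shown in this part of the article, so the following is only a sketch of how an idf value could be derived from the n counter. It assumes a commonly used BM25 IDF variant; bm25Idf, corpusTerms, and corpusSize are hypothetical names, and the author's own implementation may differ.

```javascript
// Hypothetical IDF helper using a common BM25 variant:
// idf = ln(1 + (N - n + 0.5) / (n + 0.5)),
// where N is the number of documents and n the number containing the term.
function bm25Idf(totalDocuments, n) {
  return Math.log(1 + (totalDocuments - n + 0.5) / (n + 0.5));
}

// Shape mirrors this.terms in the article: one entry per term with n and idf.
var corpusTerms = {
  walk: { n: 2, idf: 0 },
  run: { n: 90, idf: 0 }
};
var corpusSize = 100;

// Recompute idf for every term in one pass over the term table.
Object.keys(corpusTerms).forEach(function (term) {
  corpusTerms[term].idf = bm25Idf(corpusSize, corpusTerms[term].n);
});

console.log(corpusTerms.walk.n, corpusTerms.walk.idf);
console.log(corpusTerms.run.n, corpusTerms.run.idf);
```

A term that appears in only a few documents gets a much larger idf than one that appears almost everywhere, and both lookups remain a single property access.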

Finally, we record the length of each document, as well as the average document length across the whole corpus.
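The bookkeeping described above can be seen in a stripped-down sketch. TinyIndex is a hypothetical stand-in rather than the article's BM25 class: it takes pre-tokenized input so that it does not depend on a stemmer, and it keeps only the fields needed to answer the questions above.

```javascript
// Minimal, hypothetical index mirroring the fields addDocument() maintains.
function TinyIndex() {
  this.documents = {};
  this.terms = {};
  this.totalDocuments = 0;
  this.totalDocumentTermLength = 0;
  this.averageDocumentLength = 0;
}

// Takes pre-tokenized input to keep the sketch independent of a stemmer.
TinyIndex.prototype.add = function (id, tokens) {
  // Per-document term counts; a prototype-less object avoids inherited keys.
  var counts = Object.create(null);
  tokens.forEach(function (t) {
    counts[t] = (counts[t] || 0) + 1;
  });

  // Corpus-wide length statistics.
  this.totalDocuments++;
  this.totalDocumentTermLength += tokens.length;
  this.averageDocumentLength = this.totalDocumentTermLength / this.totalDocuments;

  // Each unique term in this document bumps its document count once.
  var self = this;
  Object.keys(counts).forEach(function (t) {
    if (!Object.prototype.hasOwnProperty.call(self.terms, t)) {
      self.terms[t] = { n: 0, idf: 0 };
    }
    self.terms[t].n++;
  });

  this.documents[id] = { id: id, termCount: tokens.length, terms: counts };
};

var index = new TinyIndex();
index.add(1, ['walk', 'park', 'walk']);
index.add(2, ['run', 'park']);
index.add(3, ['walk', 'walk', 'walk', 'home']);

console.log(index.documents[3].terms.walk); // 3: occurrences of 'walk' in document #3
console.log(index.terms.walk.n);            // 2: documents containing 'walk'
console.log(index.averageDocumentLength);   // 3: (3 + 2 + 4) / 3
```

Each of the three lookups at the end is a plain property access, which is where the O(1) claim comes from.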

When reprinting, please credit the source: https://www.heiqu.com/wgdjsj.html
