Natural Language Processing: A Real-World Project in Practice (20170822) (2)

Date: 2021-11-15  Category: Programmer's Life

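fit(X, y[, n_jobs]) trains the model on the training set X, y. It is a wrapper around scipy.linalg.lstsq.
score(X, y[, sample_weight]) is defined as (1 - u/v), where u = ((y_true - y_pred) ** 2).sum() and v = ((y_true - y_true.mean()) ** 2).sum().
The best possible score is 1.0. Scores are usually lower than 1.0 and can even be negative, because a model can be arbitrarily worse than always predicting the mean of y. The lower the score, the worse the result.
sample_weight is a vector of shape (n_samples,) that assigns a weight to each sample. If some data points are more important, give them larger weights.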
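To make the definition concrete, here is a minimal plain-Python sketch of the score formula above. The function name r_squared and the sample data are made up for illustration; this is not scikit-learn's own code, only the same arithmetic, with the weighted mean of y_true used when weights are given.

```python
def r_squared(y_true, y_pred, sample_weight=None):
    """Return 1 - u/v, optionally weighting each sample (illustrative helper)."""
    n = len(y_true)
    w = [1.0] * n if sample_weight is None else list(sample_weight)
    # Weighted mean of the true targets.
    y_mean = sum(wi * yt for wi, yt in zip(w, y_true)) / sum(w)
    # u: residual sum of squares; v: total sum of squares around the mean.
    u = sum(wi * (yt - yp) ** 2 for wi, yt, yp in zip(w, y_true, y_pred))
    v = sum(wi * (yt - y_mean) ** 2 for wi, yt in zip(w, y_true))
    return 1.0 - u / v  # undefined when all y_true are identical (v == 0)


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                  # exact predictions
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
reversed_fit = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than the mean
# Down-weighting the one badly predicted sample raises the score.
unweighted = r_squared(y_true, [1.0, 2.0, 3.0, 5.0])
weighted = r_squared(y_true, [1.0, 2.0, 3.0, 5.0], sample_weight=[1, 1, 1, 0.1])
```

Predicting the mean everywhere gives a score of 0, and the reversed predictions give a negative score, which is why 1.0 is an upper bound but there is no lower bound.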

Code and Processing Flow: Preparing the Corpus

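Preparing the corpus means converting your collection of articles into a corpus.
Each article is normally stored as a TaggedDocument, that is, a document with tags attached.
One article corresponds to one TaggedDocument object.
A TaggedDocument holds a token list and tags:
the token list is the list of words produced by running the article through a word segmenter, and the tag here stores the original article's number.
In the code below, the variable tdocs denotes an array of TaggedDocument objects.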

Note: in earlier versions of gensim, TaggedDocument was called LabeledSentence.
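As a rough sketch of this step, the snippet below builds a tdocs list from two made-up articles. It assumes the articles have already been segmented into space-separated words (a real project would run a word segmenter first), and it falls back to a stand-in namedtuple with the same two fields, words and tags, so that it also runs where gensim is not installed.

```python
try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same fields, only so the sketch runs without gensim.
    from collections import namedtuple
    TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Hypothetical articles, already segmented into space-separated words.
articles = [
    "自然 语言 处理 很 有趣",
    "文档 向量 可以 用于 相似 检索",
]

# One TaggedDocument per article; the tag is the article's index in the list.
tdocs = [
    TaggedDocument(words=text.split(), tags=[i])
    for i, text in enumerate(articles)
]
```

Using the article's index as the tag makes it easy to map a retrieved document vector back to the original article later.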

import os
from gensim.models.doc2vec import Doc2Vec
corpus = Doc2Vec(tdocs, dm=1, dm_mean=1, size=300, window=8, min_count=2, workers=4, iter=20)
corpus.save(os.path.join(WORK_DIR, 'base-pv_dm.mdl'))

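For an introduction to this function's parameters, see the documentation below; it is entirely in English and fairly dense. Note that gensim 4.0 and later rename size to vector_size and iter to epochs.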
https://radimrehurek.com/gensim/models/doc2vec.html

dm defines the training algorithm. By default (dm=1), ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.
Note: dm abbreviates "distributed memory", the DM in PV-DM. This project uses dm=1.
size is the dimensionality of the feature vectors.
Note: this project uses 300 dimensions. The best value has to be found through experiments.
dm_mean = if 0 (default), use the sum of the context word vectors. If 1, use the mean. Only applies when dm is used in non-concatenative mode.
Note: this project sets dm_mean=1, so the context vectors are averaged.
workers = use this many worker threads to train the model (=faster training with multicore machines).
Note: on a multi-core processor, this sets the number of parallel threads; this project uses 4.
iter = number of iterations (epochs) over the corpus. The default inherited from Word2Vec is 5, but values of 10 or 20 are common in published ‘Paragraph Vector’ experiments.
Note: this project uses 20 iterations, in line with the values of 10 or 20 mentioned above.
min_count = ignore all words with total frequency lower than this.
Note: this project uses min_count=2, so words that appear only once are ignored.
window is the maximum distance between the predicted word and context words used for prediction within a document.
Note: this project uses window=8.
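The toy sketch below illustrates what min_count, window, and dm_mean mean, using plain Python. The documents, the 2-dimensional vectors, and the helper names context and combine are all made up for illustration. It is a simplification, not gensim's implementation: among other things, it leaves out the paragraph vector that PV-DM adds to the context.

```python
from collections import Counter

# Two made-up, already tokenized documents.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# min_count: keep only words whose total corpus frequency reaches the threshold.
min_count = 2
freq = Counter(word for doc in docs for word in doc)
vocab = {word for word, count in freq.items() if count >= min_count}


def context(tokens, position, window):
    """Words at most `window` positions away from tokens[position]."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


ctx = context(docs[0], position=2, window=2)  # neighbours of "sat"

# Toy 2-dimensional word vectors (invented values).
vectors = {"the": [1.0, 0.0], "cat": [0.0, 2.0], "on": [3.0, 1.0]}


def combine(words, dm_mean):
    """Sum the context vectors (dm_mean=0) or average them (dm_mean=1)."""
    totals = [sum(dim) for dim in zip(*(vectors[w] for w in words))]
    return [t / len(words) for t in totals] if dm_mean else totals


summed = combine(ctx, dm_mean=0)
averaged = combine(ctx, dm_mean=1)
```

Averaging keeps the combined vector on the same scale regardless of how many context words fall inside the window, which is one reason to prefer dm_mean=1 with a larger window.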

The corpus (here, the trained Doc2Vec model) also supports serialization, so it can be saved to a file on disk:

Save the object to file (also see load).
fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

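Once the corpus is built, you can run some interesting queries against it.
For example, see the sentence similarity experiment in the reference article "[Algorithm & NLP] 文本深度表示模型——word2vec&doc2vec词向量模型" (Deep Text Representation Models: word2vec & doc2vec Word Vector Models):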

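Below are sample results from sentence2vec. Sentence vectors are first trained on a Chinese sentence corpus, and the most similar sentences are then found by computing the cosine between sentence vectors. As the results show, the sentence vectors capture the semantics of a sentence remarkably well.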
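The cosine-based lookup described above can be sketched in a few lines of plain Python. The sentence ids, the 3-dimensional vectors, and the helper names cosine and most_similar are invented for illustration; in the real project the vectors would come from the trained model.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, topn=2):
    """Return the topn (id, cosine) pairs, most similar first."""
    scored = [(name, cosine(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up sentence vectors keyed by a hypothetical sentence id.
sentence_vectors = {
    "s0": [1.0, 0.0, 1.0],
    "s1": [0.9, 0.1, 1.1],
    "s2": [-1.0, 0.5, 0.0],
}

ranked = most_similar([1.0, 0.0, 1.0], sentence_vectors)
```

Cosine compares direction only, so two sentence vectors with different lengths but the same orientation still count as highly similar.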

[Figure: sentence similarity results]

[Figure: similarity search]

Please credit the source when reposting: https://www.heiqu.com/zwzjyd.html
