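Comparing the two, depth-first search can match subwords at larger edit distances, which pushes the model's error-tolerance boundary out to a wider range. Breadth-first search yields nearest neighbors whose similarity scores vary more smoothly, which makes the results friendlier to downstream tasks. For subword pairs with the same edit distance, both search modes take n-gram word-order information into account, so the relative rankings they return are largely the same.
Overall, compared with the two graph-construction and walk methods mentioned above, the node2vec model produces better word vectors, and its breadth-first results fit our business needs better.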
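To make the depth-first versus breadth-first distinction concrete, here is a minimal sketch of the second-order biased walk described in the node2vec paper [11]. It is not the implementation used in this work: the function name `node2vec_walk`, the toy graph, and the parameter values are assumptions chosen for illustration, and the graph is treated as unweighted. The return parameter `p` and the in-out parameter `q` follow the paper: a large `q` keeps the walk near its previous node (breadth-first-like), while a small `q` pushes it outward (depth-first-like).

```python
import random


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Biased second-order random walk on an unweighted graph.

    graph: dict mapping each node to the set of its neighbors (hypothetical format).
    p: return parameter; q: in-out parameter, as in the node2vec paper.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break  # dead end: stop early
        if len(walk) == 1:
            # The first step has no previous node, so sample uniformly.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in graph[prev]:
                weights.append(1.0)      # stay one hop from prev (BFS-like)
            else:
                weights.append(1.0 / q)  # move two hops from prev (DFS-like)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Toy graph for illustration only: a path a-b-c-d plus a triangle a-b-e.
toy_graph = {
    "a": {"b", "e"},
    "b": {"a", "c", "e"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": {"a", "b"},
}

rng = random.Random(0)
dfs_like = node2vec_walk(toy_graph, "a", 6, p=4.0, q=0.25, rng=rng)
bfs_like = node2vec_walk(toy_graph, "a", 6, p=0.25, q=4.0, rng=rng)
```

With a small `q` the walk tends to leave the start node's neighborhood quickly, which is consistent with depth-first sampling reaching subwords at larger edit distances; with a large `q` it lingers among close neighbors.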
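5. Summary
This article reviewed the main methods currently used in NLP for learning distributed text representations. To address similarity matching for homophones, easily confused words, and similar text in Chinese search scenarios, we proposed a word-vector training method based on graph computing, so that words that are close in terms of Chinese morphology are also close to each other in the vector space.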
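We compared how the resulting word embeddings perform in business applications across two graph-construction styles (combination style and fasttext style) and three edge-sampling methods (node2vec depth-first, node2vec breadth-first, and metapath2vec). In doing so we explored how graph computing can be applied to text representation learning, and the results have helped improve business performance.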
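This work is now deployed in the search service of the Tencent Cloud enterprise profiling product (腾讯云企业画像). We plan to keep exploring related directions, for example adding stroke information to model the structure of individual characters, in the hope of improving similarity matching for typos, and using graph neural networks such as GCN, GraphSAGE, or GAT, in the hope of improving embedding quality. Ideas, criticism, and corrections from others in the industry are welcome.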
References:
[1] Bengio Y, Ducharme R, Vincent P, et al. A neural probabilistic language model[J]. Journal of machine learning research, 2003, 3: 1137-1155.
[2] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[C]//Advances in neural information processing systems. 2013: 3111-3119.
[3] Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint arXiv:1301.3781, 2013.
[4] Peters M E, Neumann M, Iyyer M, et al. Deep contextualized word representations[J]. arXiv preprint arXiv:1802.05365, 2018.
[5] Devlin J, Chang M W, Lee K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
[6] Cao S, Lu W, Zhou J, et al. cw2vec: Learning chinese word embeddings with stroke n-gram information[C]//Thirty-second AAAI conference on artificial intelligence. 2018.
[7] Yin R, Wang Q, Li P, et al. Multi-granularity chinese word embedding[C]//Proceedings of the 2016 conference on empirical methods in natural language processing. 2016: 981-986.
[8] Xu J, Liu J, Zhang L, et al. Improve chinese word embeddings by exploiting internal structure[C]//Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016: 1041-1050.
[9] Joulin A, Grave E, Bojanowski P, et al. Bag of tricks for efficient text classification[J]. arXiv preprint arXiv:1607.01759, 2016.
[10] Perozzi B, Al-Rfou R, Skiena S. Deepwalk: Online learning of social representations[C]//Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014: 701-710.
[11] Grover A, Leskovec J. node2vec: Scalable feature learning for networks[C]//Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016: 855-864.
[12] Tang J, Qu M, Wang M, et al. Line: Large-scale information network embedding[C]//Proceedings of the 24th international conference on world wide web. 2015: 1067-1077.
[13] Dong Y, Chawla N V, Swami A. metapath2vec: Scalable representation learning for heterogeneous networks[C]//Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 2017: 135-144.