Chinese Text Classification with TfidfVectorizer (Dataset: Fudan Chinese Corpus)
1. Training word vectors
For data preprocessing, see Chinese Text Classification with TfidfVectorizer (Dataset: Fudan Chinese Corpus). With segmentation done, we now have train_jieba.txt and test_jieba.txt; let's preview the first few lines:
```python
# Preview the first 10 lines of the jieba-segmented training file.
fenci_path = '/content/drive/My Drive/NLP/dataset/Fudan/train_jieba.txt'
with open(fenci_path, 'r', encoding='utf-8') as fp:
    # Iterate the file object directly instead of readlines(),
    # so the whole file is never loaded into memory at once.
    for i, line in enumerate(fp):
        print(line)
        if i == 9:
            break
```
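The heading announces training word vectors, but the snippet above only previews the segmented file. Below is a minimal sketch of what that training step could look like with gensim's Word2Vec (gensim 4.x); the hyperparameter values, the word2vec.model save path, and the assumption that each line of train_jieba.txt is plain whitespace-separated tokens are illustrative, not from the original:

```python
# A minimal sketch of the word-vector training step, assuming gensim 4.x
# and that each line of train_jieba.txt holds whitespace-separated tokens
# (if a class label is prepended to each line, strip it before training).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

fenci_path = '/content/drive/My Drive/NLP/dataset/Fudan/train_jieba.txt'

# LineSentence yields one token list per line, streaming from disk.
sentences = LineSentence(fenci_path)

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors (example value)
    window=5,         # context window size (example value)
    min_count=5,      # ignore words seen fewer than 5 times (example value)
    workers=4,        # parallel training threads (example value)
)

# Hypothetical save path, mirroring the dataset directory used above.
model.save('/content/drive/My Drive/NLP/dataset/Fudan/word2vec.model')

# Look up the trained vector for any in-vocabulary word, e.g.:
# vec = model.wv['中国']
```

LineSentence streams the corpus from disk one line at a time, so the full file never has to fit in memory; once training finishes, individual word vectors are available through model.wv.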