【Python】词频统计

需求:一篇文章,出现了哪些词?哪些词出现得最多?

英文文本词频统计

英文文本:Hamlet 分析词频

统计英文词频分为两步:

文本去噪及归一化

使用字典表达词频

代码:

#CalHamletV1.py def getText(): txt = open("hamlet.txt", "r").read() txt = txt.lower() for ch in \'!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~\': txt = txt.replace(ch, " ") #将文本中特殊字符替换为空格 return txt hamletTxt = getText() words = hamletTxt.split() counts = {} for word in words: counts[word] = counts.get(word,0) + 1 items = list(counts.items()) items.sort(key=lambda x:x[1], reverse=True) for i in range(10): word, count = items[i] print ("{0:<10}{1:>5}".format(word, count))

运行结果:

the 1138 and 965 to 754 of 669 you 550 i 542 a 542 my 514 hamlet 462 in 436 中文文本词频统计

中文文本:《三国演义》分析人物

统计中文词频分为两步:

中文文本分词

使用字典表达词频

#CalThreeKingdomsV1.py import jieba txt = open("threekingdoms.txt", "r", encoding=\'utf-8\').read() words = jieba.lcut(txt) counts = {} for word in words: if len(word) == 1: continue else: counts[word] = counts.get(word,0) + 1 items = list(counts.items()) items.sort(key=lambda x:x[1], reverse=True) for i in range(15): word, count = items[i] print ("{0:<10}{1:>5}".format(word, count))

运行结果:

曹操 953 孔明 836 将军 772 却说 656 玄德 585 关公 510 丞相 491 二人 469 不可 440 荆州 425 玄德曰 390 孔明曰 390 不能 384 如此 378 张飞 358

能很明显的看到有一些不相关或重复的信息

优化版本

统计中文词频分为三步:

中文文本分词

使用字典表达词频

扩展程序解决问题

我们将不相关或重复的信息放在 excludes 集合里面进行排除。

#CalThreeKingdomsV2.py import jieba excludes = {"将军","却说","荆州","二人","不可","不能","如此"} txt = open("threekingdoms.txt", "r", encoding=\'utf-8\').read() words = jieba.lcut(txt) counts = {} for word in words: if len(word) == 1: continue elif word == "诸葛亮" or word == "孔明曰": rword = "孔明" elif word == "关公" or word == "云长": rword = "关羽" elif word == "玄德" or word == "玄德曰": rword = "刘备" elif word == "孟德" or word == "丞相": rword = "曹操" else: rword = word counts[rword] = counts.get(rword,0) + 1 for word in excludes: del counts[word] items = list(counts.items()) items.sort(key=lambda x:x[1], reverse=True) for i in range(10): word, count = items[i] print ("{0:<10}{1:>5}".format(word, count)) 考研英语词频统计

将词频统计应用到考研英语中,我们可以统计出出现次数较多的关键单词。
文本链接: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA 密码: fw3r

# CalHamletV1.py def getText(): txt = open("86_17_1_2.txt", "r").read() txt = txt.lower() for ch in \'!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~\': txt = txt.replace(ch, " ") #将文本中特殊字符替换为空格 return txt pyTxt = getText() #获得没有任何标点的txt文件 words = pyTxt.split() #获得单词 counts = {} #字典,键值对 excludes = {"the", "a", "of", "to", "and", "in", "b", "c", "d", "is",\ "was", "are", "have", "were", "had", "that", "for", "it",\ "on", "be", "as", "with", "by", "not", "their", "they",\ "from", "more", "but", "or", "you", "at", "has", "we", "an",\ "this", "can", "which", "will", "your", "one", "he", "his", "all", "people", "should", "than", "points", "there", "i", "what", "about", "new", "if", "”",\ "its", "been", "part", "so", "who", "would", "answer", "some", "our", "may", "most", "do", "when", "1", "text", "section", "2", "many", "time", "into", \ "10", "no", "other", "up", "following", "【答案】", "only", "out", "each", "much", "them", "such", "world", "these", "sheet", "life", "how", "because", "3", "even", \ "work", "directions", "use", "could", "now", "first", "make", "years", "way", "20", "those", "over", "also", "best", "two", "well", "15", "us", "write", "4", "5", "being", "social", "read", "like", "according", "just", "take", "paragraph", "any", "english", "good", "after", "own", "year", "must", "american", "less", "her", "between", "then", "children", "before", "very", "human", "long", "while", "often", "my", "too", \ "40", "four", "research", "author", "questions", "still", "last", "business", "education", "need", "information", "public", "says", "passage", "reading", "through", "women", "she", "health", "example", "help", "get", "different", "him", "mark", "might", "off", "job", "30", "writing", "choose", "words", "economic", "become", "science", "society", "without", "made", "high", "students", "few", "better", "since", "6", "rather", "however", "great", "where", "culture", "come", \ "both", "three", "same", "government", "old", "find", "number", "means", "study", "put", "8", "change", "does", "today", "think", "future", "school", "yet", "man", "things", "far", "line", "7", "13", "50", "used", "states", "down", "12", "14", "16", "end", "11", "making", "9", "another", "young", "system", "important", "letter", "17", "chinese", "every", "see", "s", "test", "word", "century", "language", "little", \ "give", "said", "25", "state", "problems", "sentence", "food", "translation", "given", "child", "18", "longer", "question", "back", "don’t", "19", "against", "always", "answers", "know", "having", "among", "instead", "comprehension", "large", "35", "want", "likely", "keep", "family", "go", "why", "41", "home", "law", "place", "look", "day", "men", "22", "26", "45", "it’s", "others", "companies", "countries", "once", "money", "24", "though", \ "27", "29", "31", "say", "national", "ii", "23", "based", "found", "28", "32", "past", "living", "university", "scientific", "–", "36", "38", "working", "around", "data", "right", "21", "jobs", "33", "34", "possible", "feel", "process", "effect", "growth", "probably", "seems", "fact", "below", "37", "39", "history", "technology", "never", "sentences", "47", "true", "scientists", "power", "thought", "during", "48", "early", "parents", \ "something", "market", "times", "46", "certain", "whether", "000", "did", "enough", "problem", "least", "federal", "age", "idea", "learn", "common", "political", "pay", "view", "going", "attention", "happiness", "moral", "show", "live", "until", "52", "49", "ago", "percent", "stress", "43", "44", "42", "meaning", "51", "e", "iii", "u", "60", "anything", "53", "55", "cultural", "nothing", "short", "100", "water", "car", "56", "58", "【解析】", "54", "59", "57", "v", "。","63", "64", "65", "61", "62", "66", "70", "75", "f", "【考点分析】", "67", "here", "68", "71", "72", "69", "73", "74", "选项a", "ourselves", "teachers", "helps", "参考范文", "gdp", "yourself", "gone", "150"} for word in words: if word not in excludes: counts[word] = counts.get(word,0) + 1 items = list(counts.items()) items.sort(key=lambda x:x[1], reverse=True) for i in range(10): word, count = items[i] print ("{0:<10}{1:>5}".format(word, count)) x = len(counts) print(x) r = 0 next = eval(input("1继续")) while next == 1: r += 100 for i in range(r, r+100): word, count = items[i] print ("\"{}\"".format(word), end = ", ") next = eval(input("1继续"))

内容版权声明:除非注明,否则皆为本站原创文章。

转载注明出处:https://www.heiqu.com/zgfgzj.html