A First Look at Python Web Scraping

This article summarizes the hands-on Python web scraping course on NetEase Cloud Classroom (网易云课堂). Readers who are interested can watch the video course (course link).

What is a crawler

A crawler is a program that automatically fetches information from the internet.

Unstructured data

Data with no fixed format, such as the contents of a web page. It must be converted into structured data through an ETL (Extract, Transform, Load) process before it can be used.

Tool installation

Anaconda

pip install requests
pip install BeautifulSoup4
pip install jupyter

Start Jupyter:

jupyter notebook

requests: a library for fetching web resources.

Fetching a page:

import requests

url = ''  # the target URL goes here
res = requests.get(url)
res.encoding = 'utf-8'
print(res.text)

Reading the page into BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(res.text, 'html.parser')
print(soup.text)

Use the select method to find HTML elements by tag name, id, or class; the return value is a list:

select('h1')
select('a')

For an element with id = 'thehead':

select('#thehead')

Reading an attribute from each matched element:

alink = soup.select('a')
for link in alink:
    print(link['href'])
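To see what each selector form returns, here is a minimal, self-contained sketch; the HTML fragment is invented purely for illustration:

from bs4 import BeautifulSoup

# A made-up HTML fragment, just to exercise the selectors.
html = '''
<h1 id="thehead">Headline</h1>
<ul class="news-list">
  <li><a href="http://example.com/a">Story A</a></li>
  <li><a href="http://example.com/b">Story B</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('h1'))          # by tag name: a list of matching elements
print(soup.select('#thehead'))    # by id
print(soup.select('.news-list'))  # by class
for link in soup.select('a'):     # attributes are read like dictionary keys
    print(link['href'])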

Examples

1. Get the time, title, and link of each news item on the Sina Shaanxi homepage:

import requests
from bs4 import BeautifulSoup

res = requests.get('http://sx.sina.com.cn/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
for newslist in soup.select('.news-list.cur'):
    for news in newslist:
        for li in news.select('li'):
            title = li.select('h2')[0].text
            href = li.select('a')[0]['href']
            time = li.select('.fl')[0].text
            print(time, title, href)
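One caveat: iterating over a tag (for news in newslist) yields every child node, including bare text nodes, which have no select method, so this loop can raise AttributeError if the markup contains stray text between elements. A hedged variant that skips non-tag children, with the same selectors and page:

import requests
from bs4 import BeautifulSoup
from bs4.element import Tag

res = requests.get('http://sx.sina.com.cn/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
for newslist in soup.select('.news-list.cur'):
    for news in newslist:
        if not isinstance(news, Tag):  # skip NavigableString children
            continue
        for li in news.select('li'):
            print(li.select('.fl')[0].text,
                  li.select('h2')[0].text,
                  li.select('a')[0]['href'])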

2. Get an article's title, source, time, and body:

import requests
from bs4 import BeautifulSoup
from datetime import datetime

res = requests.get('http://sx.sina.com.cn/news/b/2018-06-02/detail-ihcikcew5095240.shtml')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
h1 = soup.select('h1')[0].text
source = soup.select('.source-time span span')[0].text
timesource = soup.select('.source-time')[0].contents[0].text
date = datetime.strptime(timesource, '%Y-%m-%d %H:%M')
article = []
for p in soup.select('.article-body p')[:-1]:
    article.append(p.text.strip())
' '.join(article)

This can be written more compactly as:

' '.join([p.text.strip() for p in soup.select('.article-body p')[:-1]])

Notes:

The datetime module parses and formats times.
[:-1] drops the last element of the list.
strip() removes the specified characters from both ends of a string (whitespace, including newlines, by default).
' '.join(article) joins the list elements with single spaces.
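A quick self-contained illustration of these pieces, using invented sample values:

from datetime import datetime

# Parse a date string in the same format the article page uses.
timesource = '2018-06-02 10:30'
date = datetime.strptime(timesource, '%Y-%m-%d %H:%M')
print(date)  # 2018-06-02 10:30:00

# Drop the last element, trim each string, and join with spaces.
paragraphs = ['  First paragraph. ', 'Second paragraph.\n', 'trailing boilerplate']
print(' '.join(p.strip() for p in paragraphs[:-1]))
# First paragraph. Second paragraph.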

3. Get an article's comment count. The count is written into the page by JavaScript, so it cannot be extracted with the methods above; under the JS tab of the browser's developer tools, find the request that returns the comment data:

import requests
import json

comments = requests.get('http://comment5.news.sina.com.cn/cmnt/count?format=js&newslist=sx:comos-hcikcew5095240:0')
# The response has the form "var data={...}"; remove the prefix, then parse the JSON.
jd = json.loads(comments.text.strip('var data ='))
jd['result']['count']['sx:comos-hcikcew5095240:0']['total']
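Note that strip('var data =') treats its argument as a set of characters to trim from both ends, not as a literal prefix; it happens to work here because none of those characters appear at the start or end of the JSON payload itself. A slightly more defensive sketch, assuming the response still has the form var data={...}:

import re
import json
import requests

res = requests.get('http://comment5.news.sina.com.cn/cmnt/count?format=js&newslist=sx:comos-hcikcew5095240:0')
m = re.search(r'\{.*\}', res.text, re.DOTALL)  # grab the outermost JSON object
if m:
    jd = json.loads(m.group(0))
    print(jd['result']['count']['sx:comos-hcikcew5095240:0']['total'])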

4. Wrap the comment-count logic in a function:

import re
import json
import requests

commenturl = 'http://comment5.news.sina.com.cn/cmnt/count?format=js&newslist=sx:comos-{}:0'

def getCommentCounts(url):
    # Extract the news id from the article URL.
    m = re.search('detail-i(.+).shtml', url)
    newsid = m.group(1)
    comments = requests.get(commenturl.format(newsid))
    jd = json.loads(comments.text.strip('var data ='))
    return jd['result']['count']['sx:comos-' + newsid + ':0']['total']

news = 'http://sx.sina.com.cn/news/b/2018-06-01/detail-ihcikcev8756673.shtml'
getCommentCounts(news)
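The regular expression is the key step: it pulls the news id out of the article URL so it can be substituted into commenturl via format. Checking it in isolation, with the URL from the example above:

import re

news = 'http://sx.sina.com.cn/news/b/2018-06-01/detail-ihcikcev8756673.shtml'
m = re.search('detail-i(.+).shtml', news)
print(m.group(0))  # detail-ihcikcev8756673.shtml  (the whole match)
print(m.group(1))  # hcikcev8756673  (the captured news id)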

5. A complete function that takes an article URL and returns all of the article's information (title, time, source, body, and comment count):

import requests
import json
import re
from bs4 import BeautifulSoup
from datetime import datetime

commenturl = 'http://comment5.news.sina.com.cn/cmnt/count?format=js&newslist=sx:comos-{}:0'

def getCommentCounts(url):
    m = re.search('detail-i(.+).shtml', url)
    newsid = m.group(1)
    comments = requests.get(commenturl.format(newsid))
    jd = json.loads(comments.text.strip('var data ='))
    return jd['result']['count']['sx:comos-' + newsid + ':0']['total']

def getNewsDetail(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select('h1')[0].text
    result['newssource'] = soup.select('.source-time span span')[0].text
    timesource = soup.select('.source-time')[0].contents[0].text
    result['date'] = datetime.strptime(timesource, '%Y-%m-%d %H:%M')
    result['article'] = ' '.join([p.text.strip() for p in soup.select('.article-body p')[:-1]])
    result['comments'] = getCommentCounts(newsurl)
    return result

news = 'http://sx.sina.com.cn/news/b/2018-06-02/detail-ihcikcew8995238.shtml'
getNewsDetail(news)
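With getNewsDetail defined, the list page from example 1 can feed it article URLs. A rough usage sketch, assuming the definitions above are already in scope; the selector comes from example 1, and the filter keeps only links whose URLs the comment-count regex can parse:

res = requests.get('http://sx.sina.com.cn/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

seen = set()
details = []
for link in soup.select('.news-list.cur a'):
    href = link.get('href', '')
    # Keep only detail pages, and skip duplicate links to the same article.
    if 'detail-i' in href and href.endswith('.shtml') and href not in seen:
        seen.add(href)
        details.append(getNewsDetail(href))
print(len(details), 'articles collected')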
