Introduction to Web Crawlers and the requests Module (2)

We can also route HTTP or HTTPS requests through a proxy from within the program by passing the proxies keyword argument:

    import requests

    # Map each URL scheme to the proxy that should handle it.
    proxies = {
        "http": "http://10.10.1.10:3128",
        "https": "http://10.10.1.10:1080",
    }
    requests.get("http://baidu.com", proxies=proxies)
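If many requests should go through the same proxies, one option is to set the mapping once on a Session instead of passing proxies on every call. The following is a minimal sketch that reuses the proxy addresses above; make_proxied_session is a hypothetical helper name, not part of requests, and no request is actually sent:

```python
import requests


def make_proxied_session(proxies):
    """Return a Session whose requests default to the given proxy mapping.

    Hypothetical helper for illustration; not part of the requests API.
    """
    session = requests.Session()
    # Session.proxies is a plain dict; a per-request proxies= argument
    # still takes precedence over these defaults.
    session.proxies.update(proxies)
    return session


proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
session = make_proxied_session(proxies)
print(session.proxies["https"])
```

Every later session.get(...) or session.post(...) would then use these proxies unless a call overrides them.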

About sessions

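Sometimes we must log in to a site before we can request certain URLs. A session handles this case: first log in through the site's login API, which establishes the session, then use that same session to request the other URLs: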

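    import requests

    s = requests.Session()
    login_data = {"form_email": "youremail@example.com", "form_password": "yourpassword"}
    # Log in first; cookies returned by the server are stored on the session.
    s.post("http://baidu.com/testLogin", data=login_data)
    # Later requests through the same session send those cookies back.
    r = s.get("http://baidu.com/notification/")
    print(r.text)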

Here, form_email and form_password are the name attributes of the corresponding fields in Douban's login form.
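The reason a session keeps you logged in is that it stores cookies and attaches them to later requests. The sketch below illustrates this without any network access: it plants a cookie by hand (a stand-in for what a login response would set) and then only prepares, but does not send, a follow-up request. The cookie name sessionid, its value, and the example.com host are assumptions made for illustration:

```python
import requests

session = requests.Session()
# Stand-in for the Set-Cookie header a successful login would return.
session.cookies.set("sessionid", "abc123", domain="example.com", path="/")

# Build (but do not send) a follow-up request through the same session.
request = requests.Request("GET", "http://example.com/notification/")
prepared = session.prepare_request(request)

# The session has merged its stored cookie into the outgoing headers.
print(prepared.headers.get("Cookie"))
```

A bare requests.get(...) call has no such stored state, which is why the login and the follow-up request need to share one Session object.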

Examples

Crawling Douban with requests, re, and multiprocessing.Process:

    import json
    import re
    import time
    from multiprocessing import Process

    import requests


    def get_page(url):
        """Send the request and return the response."""
        response = requests.get(url)
        return response


    def parser(res):
        """Parse the response text and return an iterator of regex matches."""
        # Pattern reproduced from the source; it only matches if the page's
        # actual markup has this structure and href prefix.
        reg = re.compile(
            '<div>.*?<a href="http://www.likecs.com/(?P<url>.*?)">.*?<span>(?P<title>.*?)'
            '</span>.*?<span.*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)人评价</span>',
            re.S)
        ret_iter = reg.finditer(res.text)
        return ret_iter


    def store(ret_iter):
        """Append each match to douban_re.txt as one JSON object per line."""
        lis = []
        for i in ret_iter:
            dic = {}
            dic['url'] = i.group('url')
            dic['title'] = i.group('title')
            dic['rating_num'] = i.group('rating_num')
            dic['comment_num'] = i.group('comment_num')
            lis.append(dic)
        with open('douban_re.txt', 'a', encoding='utf8') as f:
            for i in lis:
                f.write(json.dumps(i, ensure_ascii=False) + '\n')


    def spider_movie(url):
        res = get_page(url)
        ret_iter = parser(res)
        store(ret_iter)


    if __name__ == '__main__':
        start = time.time()
        p_list = []
        for i in range(10):
            # Parentheses are required: "..." % i * 25 would format first and
            # then repeat the whole string 25 times.
            url = "https://movie.douban.com/top250?start=%s&filter=" % (i * 25)
            p = Process(target=spider_movie, args=(url,))
            p.start()
            p_list.append(p)
        for i in p_list:
            i.join()
        print(time.time() - start)
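The two parts of this crawler that are easiest to get wrong, building the page URLs and extracting fields with named groups, can be checked offline. The sketch below runs the same style of pattern against a small, hypothetical HTML fragment; the markup in SAMPLE_HTML and the helper names page_urls and parse are assumptions made for illustration, and the real Douban page is more complex:

```python
import json
import re

# Hypothetical, simplified markup used only to exercise the pattern.
SAMPLE_HTML = """
<div><a href="/subject/1/"><span>Movie A</span></a>
<span class="rating">9.7</span><span>1000人评价</span></div>
<div><a href="/subject/2/"><span>Movie B</span></a>
<span class="rating">9.6</span><span>800人评价</span></div>
"""

PATTERN = re.compile(
    r'<div>.*?<a href="(?P<url>.*?)">.*?<span>(?P<title>.*?)</span>'
    r'.*?<span.*?>(?P<rating_num>.*?)</span>'
    r'.*?<span>(?P<comment_num>.*?)人评价</span>',
    re.S,  # let "." match newlines so one record can span several lines
)


def page_urls(pages=10, page_size=25):
    """Build one URL per result page: start=0, 25, 50, ..."""
    base = "https://movie.douban.com/top250?start=%s&filter="
    # Note the parentheses: base % i * page_size would repeat the string.
    return [base % (i * page_size) for i in range(pages)]


def parse(html):
    """Return one dict per match, keyed by the pattern's group names."""
    return [m.groupdict() for m in PATTERN.finditer(html)]


records = parse(SAMPLE_HTML)
for record in records:
    # ensure_ascii=False keeps non-ASCII titles readable in the output.
    print(json.dumps(record, ensure_ascii=False))
print(page_urls()[:2])
```

Using groupdict() is a small design choice: it yields the same dictionary that store() builds by hand with four group() calls.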

GitHub: simulate a login and save the page returned after logging in:

    import re

    import requests

    # Step 1: request the login page to obtain the token that the POST
    # request is validated against.
    # session = requests.session()
    res = requests.get("https://github.com/login")

    authenticity_token = re.findall('name="authenticity_token" value="(.*?)"', res.text)[0]
    print(authenticity_token)

    # Step 2: build the data for the POST request.
    data = {
        "login": "yuanchenqi0316@163.com",
        "password": "yuanchenqi0316",
        "commit": "Sign in",
        "utf8": "✓",
        "authenticity_token": authenticity_token
    }

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
    }

    # Pass the cookies from the first response so the token and the POST
    # belong to the same visit.
    res = requests.post("https://github.com/session", data=data, headers=headers, cookies=res.cookies.get_dict())

    # Save the returned page for inspection.
    with open("github.html", "wb") as f:
        f.write(res.content)
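The commented-out session line hints at an alternative: with a Session, the cookies from the first request are carried over automatically, so they do not need to be passed by hand. The sketch below restructures the same flow that way, assuming the same URLs and form fields as the example above (the live site's login form may differ). The names extract_token and login are hypothetical helpers, and only the token extraction is actually executed here, on a made-up form fragment:

```python
import re

import requests

TOKEN_RE = re.compile(r'name="authenticity_token" value="(.*?)"')


def extract_token(html):
    """Return the first authenticity_token found in html, or None."""
    match = TOKEN_RE.search(html)
    return match.group(1) if match else None


def login(session, username, password):
    """Fetch the login form, then post credentials through the same session.

    Assumes the form fields used in the example above.
    """
    page = session.get("https://github.com/login", timeout=10)
    token = extract_token(page.text)
    if token is None:
        # findall(...)[0] would raise IndexError here; fail more clearly.
        raise ValueError("authenticity_token not found in login page")
    data = {
        "login": username,
        "password": password,
        "commit": "Sign in",
        "utf8": "✓",
        "authenticity_token": token,
    }
    # Cookies set by the GET above are sent automatically by the session.
    return session.post("https://github.com/session", data=data, timeout=10)


# Offline check of the token extraction on a hypothetical form fragment.
sample = '<input type="hidden" name="authenticity_token" value="tok123" />'
print(extract_token(sample))
```

To run the full flow, one would call login(requests.Session(), "your-login", "your-password") and write the returned response's content to a file, as in the original example.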

When reposting, cite the source: https://www.heiqu.com/zgyjwj.html