Learning Web Scraping Together: A Summary of Common urllib Methods

Date: 2021-05-02  Category: Programmer's Life (程序人生)

1. Reading cookies

    import http.cookiejar as cj
    import urllib.request as request

    cookie = cj.CookieJar()
    handler = request.HTTPCookieProcessor(cookie)
    opener = request.build_opener(handler)
    response = opener.open('http://www.bigdata17.com')
    for item in cookie:
        print(item.name + "=" + item.value)

2. Saving cookies to a file

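    import http.cookiejar as cj
    import urllib.request as request

    filename = 'baidu_cookies.txt'
    cookies = cj.MozillaCookieJar(filename)
    handler = request.HTTPCookieProcessor(cookies)
    opener = request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    # Also keep session cookies and cookies that have already expired.
    cookies.save(ignore_discard=True, ignore_expires=True)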
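Cookies saved this way can be read back with MozillaCookieJar.load(). The following is a minimal offline sketch of that round trip, not part of the original article: the cookie session=abc123 for example.com, the helper names make_demo_cookie and load_cookies, and the temporary file name are all made up for illustration. In real use the jar would be filled by opener.open(...) as shown above.

```python
import http.cookiejar as cj
import os
import tempfile
import urllib.request as request


def make_demo_cookie():
    """Build a hypothetical session cookie by hand (illustration only)."""
    return cj.Cookie(
        version=0, name='session', value='abc123',
        port=None, port_specified=False,
        domain='example.com', domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def load_cookies(filename):
    """Load cookies from a Mozilla-format file and return the jar."""
    jar = cj.MozillaCookieJar(filename)
    # Use the same flags as save(); otherwise session cookies are skipped.
    jar.load(ignore_discard=True, ignore_expires=True)
    return jar


# Save a jar to a temporary file, then read it back.
filename = os.path.join(tempfile.mkdtemp(), 'demo_cookies.txt')
saved = cj.MozillaCookieJar(filename)
saved.set_cookie(make_demo_cookie())
saved.save(ignore_discard=True, ignore_expires=True)

loaded = load_cookies(filename)
loaded_pairs = [(c.name, c.value) for c in loaded]
print(loaded_pairs)

# An opener built on the loaded jar sends these cookies with matching requests.
opener = request.build_opener(request.HTTPCookieProcessor(loaded))
```

Passing ignore_discard=True to both save() and load() matters here, because a cookie without an expiry time counts as a session cookie and is otherwise dropped.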

3. Handling exceptions

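The urllib.error module defines the URLError and HTTPError classes. HTTPError is a subclass of URLError and additionally reports the HTTP status code. Both can be used to handle exceptions raised by the request module, and both have a reason attribute that records why the error occurred.
Handling exceptions with URLError: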

    from urllib import request, error

    try:
        response = request.urlopen('http://www.bigdata17.com/index.htm')
    except error.URLError as e:
        print(e.reason)

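Handling exceptions with HTTPError:
This class specifically handles errors from HTTP requests. An HTTP response carries a status code, so HTTPError has a code attribute. The response also includes headers, so HTTPError has a headers attribute as well. Because HTTPError inherits from URLError, it also has the reason attribute.
Code: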

    from urllib import request, error

    try:
        response = request.urlopen('http://www.bigdata17.com/index.htm')
    except error.HTTPError as e:
        print(e.reason)
        print(e.code)
        print(e.headers)
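Because HTTPError is a subclass of URLError, a handler that wants to treat the two cases differently has to catch HTTPError first. The sketch below shows one way to do that. The fetch function, its open_url parameter, and the two stand-in openers are hypothetical names introduced only so the example runs without a network connection.

```python
from email.message import Message
from urllib import error, request


def fetch(url, open_url=request.urlopen):
    """Return ('ok', body), ('http', status_code) or ('url', reason)."""
    try:
        with open_url(url) as response:
            return 'ok', response.read()
    except error.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        return 'http', e.code
    except error.URLError as e:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return 'url', e.reason


def raise_404(url):
    """Stand-in for urlopen that always fails with HTTP 404 (illustration only)."""
    raise error.HTTPError(url, 404, 'Not Found', Message(), None)


def raise_unreachable(url):
    """Stand-in for urlopen that always fails before any response arrives."""
    raise error.URLError('connection refused')


print(fetch('http://example.com/missing', open_url=raise_404))
print(fetch('http://example.com/', open_url=raise_unreachable))
```

If the two except clauses were swapped, the URLError clause would also catch every HTTPError and the status code branch would never run.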

4. Parsing URLs
The urllib.parse module provides many functions for parsing URLs.
urlparse() is the function dedicated to parsing a URL. First, look at what it returns:

    from urllib.parse import urlparse

    result = urlparse('http://www.bigdata17.com')
    print(result)

The code above prints:

ParseResult(scheme='http', netloc='www.bigdata17.com', path='', params='', query='', fragment='')

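As shown, urlparse() returns a ParseResult object, which has six attributes: scheme, netloc, path, params, query, and fragment. scheme is the protocol, such as http, https, or ftp. netloc is the network location, typically the site's domain name. path is the path of the page being requested. params holds the parameters attached to the path, query is the query string, and fragment is the anchor.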
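ParseResult is a named tuple, so its fields can be read by name, by index, or by unpacking. The short sketch below illustrates this. The port number 8080 is not in the original article and is added here only to show how the hostname and port attributes split netloc.

```python
from urllib.parse import urlparse

# The port 8080 is an assumption added for illustration.
result = urlparse('http://www.bigdata17.com:8080/22.html;user=bigdata17?id=10#content')

print(result.scheme)    # access by attribute name
print(result.netloc)    # host and port together
print(result[2])        # access by index: the path
print(result.hostname)  # netloc without the port
print(result.port)      # the port as an int

# Unpacking works because ParseResult is a 6-item tuple.
scheme, netloc, path, params, query, fragment = result
```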

How does urlparse() map a URL onto these six attributes?
Consider the next example:

    from urllib.parse import urlparse

    result = urlparse('http://www.bigdata17.com/22.html;user=bigdata17?id=10#content')
    print(result)

Running it gives:

ParseResult(scheme='http', netloc='www.bigdata17.com', path='/22.html', params='user=bigdata17', query='id=10', fragment='content')

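As the result shows, everything before :// is the scheme. The part from :// up to the next / is netloc, the network location (here the domain name). From that / up to the semicolon (;) is path, the path of the page being requested. From ; up to the question mark (?) is params. From ? up to the hash sign (#) is query, the query string. Whatever follows # is fragment, the anchor.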
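The same module can take the pieces further apart and put them back together. As a minimal sketch using the example URL from this section: parse_qs() turns the query string into a dict, urlunparse() rebuilds the URL from its six parts, and _replace() produces a modified copy. The variable names and the value id=11 are chosen here for illustration.

```python
from urllib.parse import parse_qs, urlparse, urlunparse

url = 'http://www.bigdata17.com/22.html;user=bigdata17?id=10#content'
parts = urlparse(url)

# Each query key maps to a list, because a key may appear more than once.
query = parse_qs(parts.query)
print(query)

# Reassembling the six parts gives back the original URL.
rebuilt = urlunparse(parts)
print(rebuilt == url)

# ParseResult is immutable; _replace() returns a modified copy.
next_page = parts._replace(query='id=11', fragment='').geturl()
print(next_page)
```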

5. The urlopen() method
For HTTP and HTTPS URLs, this method returns an http.client.HTTPResponse object:

    import urllib.request as request

    response = request.urlopen('http://www.bigdata17.com')
    print(response)
    # Output: <http.client.HTTPResponse object at 0x000002A9655BBF28>

An HTTPResponse object provides methods such as read() and getheaders().

The read() method reads the content of the page:

    import urllib.request as request

    response = request.urlopen('http://www.bigdata17.com')
    print(response.read().decode('utf-8'))

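read() returns raw bytes, so pay attention to the encoding the site uses and pass it to decode(); otherwise the text comes out garbled. For example, Baidu uses utf-8 and NetEase uses gbk.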
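Rather than hard-coding the encoding per site, the charset declared in the Content-Type response header can be used when the server sends one. The sketch below is one possible approach, not the article's own method. The helper names decode_body and fetch_text and the utf-8 fallback are assumptions, and a page that declares its charset only in an HTML meta tag is not handled.

```python
from email.message import Message
from urllib import request


def decode_body(body, charset, fallback='utf-8'):
    """Decode bytes with the declared charset, or the fallback if none was declared."""
    return body.decode(charset or fallback, errors='replace')


def fetch_text(url):
    """Fetch a page and decode it using the charset from its Content-Type header."""
    with request.urlopen(url) as response:
        # response.headers is an email.message.Message subclass.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


# Offline check with a hand-built header and GBK-encoded bytes.
headers = Message()
headers['Content-Type'] = 'text/html; charset=GBK'
declared = headers.get_content_charset()
text = decode_body('网易'.encode('gbk'), declared)
print(declared, text)
```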

The getheaders() method returns the response headers as a list of (name, value) tuples:

    import urllib.request as request

    response = request.urlopen('http://www.bigdata17.com')
    print(response.getheaders())

Result:

    [('Server', 'nginx/1.12.2'),
     ('Date', 'Mon, 12 Nov 2018 15:45:22 GMT'),
     ('Content-Type', 'text/html'),
     ('Content-Length', '38274'),
     ('Last-Modified', 'Thu, 08 Nov 2018 00:35:52 GMT'),
     ('Connection', 'close'),
     ('ETag', '"5be384e8-9582"'),
     ('Accept-Ranges', 'bytes')]
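A list of tuples is awkward when only one header is needed. As a small sketch, the pairs can be folded into a dict with lowercase keys. The helper name headers_to_dict is hypothetical, and the sample values are a few of the pairs from the output above.

```python
def headers_to_dict(header_pairs):
    """Convert getheaders()-style (name, value) pairs to a dict with lowercase keys.

    If a header appears more than once, the last value wins.
    """
    return {name.lower(): value for name, value in header_pairs}


# A few of the pairs from the sample output above.
sample = [
    ('Server', 'nginx/1.12.2'),
    ('Content-Type', 'text/html'),
    ('Content-Length', '38274'),
]
headers = headers_to_dict(sample)
print(headers['content-type'])
print(int(headers['content-length']))
```

For a single lookup on a live response, HTTPResponse also offers getheader(name), which avoids building the dict.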

Next, look at the parameters urlopen() accepts:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

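Only url is required; the other parameters are optional. data carries the data to send to the site being scraped, such as a username, password, or captcha. It is typically passed as URL-encoded bytes, and when it is supplied the request is sent as POST instead of GET. timeout sets the request timeout in seconds.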
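When timeout is set, a slow site raises an exception instead of blocking the scraper indefinitely. The sketch below shows one way to recognize that case. The names is_timeout and fetch and the default of 5 seconds are assumptions for illustration. A timeout can surface either as socket.timeout directly or wrapped in a URLError, so both are checked.

```python
import socket
from urllib import error, request


def is_timeout(exc):
    """Return True if exc represents a timed-out request."""
    if isinstance(exc, socket.timeout):
        return True  # raised directly, e.g. while reading the response
    # Connection failures are wrapped in URLError, with the cause in .reason.
    return isinstance(exc, error.URLError) and isinstance(exc.reason, socket.timeout)


def fetch(url, timeout=5.0):
    """Return the response body, or None if the request timed out."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:  # URLError and socket.timeout both derive from OSError
        if is_timeout(exc):
            return None
        raise  # any other failure is left to the caller


# Offline check with hand-built exceptions.
wrapped = error.URLError(socket.timeout('timed out'))
print(is_timeout(wrapped))
print(is_timeout(error.URLError('connection refused')))
```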

Using the data parameter:

    >>> import urllib.parse as parse
    >>> import urllib.request as request
    >>> data = bytes(parse.urlencode({'username': 'bigdata17'}), encoding='utf8')
    >>> print(data)
    b'username=bigdata17'
    >>> response = request.urlopen('http://httpbin.org/post', data=data)
    >>> print(response.read())
    b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "username": "bigdata17"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Connection": "close", \n "Content-Length": "18", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.7"\n }, \n "json": null, \n "origin": "183.134.52.58", \n "url": "http://httpbin.org/post"\n}\n'

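The same POST can be prepared as a Request object, which makes it possible to inspect the method and attach headers before anything is sent. This is a minimal sketch that builds the request without contacting the server. The User-Agent value is a made-up placeholder, not something from the original article.

```python
import urllib.parse as parse
import urllib.request as request

form = {'username': 'bigdata17'}
data = parse.urlencode(form).encode('utf-8')

req = request.Request(
    'http://httpbin.org/post',
    data=data,
    headers={'User-Agent': 'Mozilla/5.0 (demo)'},  # hypothetical value
)

print(req.get_method())               # POST, because data is not None
print(req.data)
print(req.get_header('User-agent'))   # Request stores header names capitalized

# To actually send it: request.urlopen(req, timeout=10)
```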
When reposting, please credit the source: https://www.heiqu.com/wsxgwd.html
