## Getting Started with Web Crawlers: Scrapy Framework Basics, Rules and LinkExtractors (Part 11) (2)

Date: 2021-05-21 Category: Programmer Life

A Link Extractor only needs to be instantiated once. Its extract_links method is then called many times, once per response, to extract the links from each one.

class scrapy.linkextractors.LinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)

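Main parameters:

allow: URLs that match the regular expression (or list of regular expressions) given here are extracted. If it is empty, every URL matches.

deny: URLs that match this regular expression (or list of regular expressions) are never extracted.

allow_domains: the domains from which links will be extracted.

deny_domains: the domains from which links will never be extracted.

restrict_xpaths: XPath expressions that limit the regions of the page in which links are looked for. They work together with allow to filter the links.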
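The way these filters combine can be sketched with the standard library alone. This is a simplified model of the behaviour described above, not Scrapy's own code; the function name link_allowed is an assumption made for this illustration, and restrict_xpaths is left out because it acts on the page rather than on the URL.

```python
import re
from urllib.parse import urlparse


def link_allowed(url, allow=(), deny=(), allow_domains=(), deny_domains=()):
    """Simplified model of the URL filters (hypothetical helper, not Scrapy code)."""
    host = urlparse(url).hostname or ""

    def in_domains(domains):
        # A host belongs to a domain if it equals it or is one of its subdomains.
        return any(host == d or host.endswith("." + d) for d in domains)

    # An empty allow list matches everything; otherwise one pattern must match.
    if allow and not any(re.search(p, url) for p in allow):
        return False
    # Any deny match rejects the URL, even if allow matched.
    if any(re.search(p, url) for p in deny):
        return False
    if allow_domains and not in_domains(allow_domains):
        return False
    if in_domains(deny_domains):
        return False
    return True


print(link_allowed("http://hr.tencent.com/position.php?start=10", allow=[r"start=\d+"]))
print(link_allowed("http://hr.tencent.com/about.html", allow=[r"start=\d+"]))
```

The patterns are applied with a search, so a pattern only needs to match somewhere inside the URL, not the whole URL.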

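4 rules: suited to site-wide crawling

rules contains one or more Rule objects, each of which defines a particular behaviour for crawling the site. If several rules match the same link, the first one, in the order the rules are defined in this collection, is used.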
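The "first rule wins" behaviour can be modelled in a few lines of plain Python. This is only a sketch of the ordering logic: the rule names, the patterns, and the example.com URLs are hypothetical, and a (name, pattern) pair stands in for a real Rule object.

```python
import re

# Hypothetical rules as (name, pattern) pairs, in the order they are defined.
rules = [
    ("detail", r"position_detail\.php"),
    ("any_position", r"position"),
]


def assign_links(urls, rules):
    """Give each URL to the first rule whose pattern matches it."""
    seen = set()
    assigned = {}
    for name, pattern in rules:
        for url in urls:
            # A URL already claimed by an earlier rule is skipped.
            if url not in seen and re.search(pattern, url):
                seen.add(url)
                assigned[url] = name
    return assigned


urls = [
    "http://example.com/position_detail.php?id=1",
    "http://example.com/position.php?start=10",
]
result = assign_links(urls, rules)
for url, name in result.items():
    print(name, url)
```

Both patterns match the first URL, but it goes to the detail rule because that rule is listed first. This is why more specific rules are usually placed before more general ones.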

class scrapy.spiders.Rule( link_extractor, callback = None, cb_kwargs = None, follow = None, process_links = None, process_request = None )

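link_extractor: a Link Extractor object that defines which links should be extracted.

callback: the callback function to call for each link obtained from link_extractor. It can also be given as the name of a spider method, as a string. The callback receives a response as its first argument.

Note: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the crawl spider will fail to run.

follow: a boolean that specifies whether the links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.

process_links: specifies which function of the spider is called with the list of links obtained from link_extractor. It is mainly used for filtering.

process_request: specifies which function of the spider is called for every request extracted by this rule. It is used to filter requests.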
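Two of these parameters are easy to illustrate in isolation. The sketch below models the default value of follow and shows what a process_links style filter might look like. The helper names default_follow and drop_logout_links are assumptions made for this example, and SimpleNamespace objects stand in for Scrapy's Link objects, of which only the url attribute is modelled.

```python
from types import SimpleNamespace


def default_follow(callback, follow=None):
    """Model of the default described above: follow only when no callback is set."""
    return follow if follow is not None else callback is None


def drop_logout_links(links):
    """A process_links style filter: keep every link except logout URLs."""
    return [link for link in links if "logout" not in link.url]


# Stand-ins for Link objects (hypothetical URLs).
links = [
    SimpleNamespace(url="http://example.com/position.php?start=10"),
    SimpleNamespace(url="http://example.com/logout"),
]
kept = [link.url for link in drop_logout_links(links)]

print(default_follow(None))          # no callback, follow not given
print(default_follow("parse_item"))  # callback given, follow not given
print(kept)
```

In a real spider the filter would normally be a method taking (self, links), referenced from the rule by name, for example process_links='drop_logout_links'.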

5 Crawling rules

Continuing with the Tencent recruitment site as the example, here is how to use CrawlSpider together with rules:

First, run:

scrapy shell "http://hr.tencent.com/position.php?&start=0#a"

Import LinkExtractor and create a LinkExtractor instance:

from scrapy.linkextractors import LinkExtractor

page_lx = LinkExtractor(allow=(r'position.php?&start=\d+',))

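> allow: one of the most important parameters of a LinkExtractor. It is a regular expression (or a list of regular expressions); only URLs that match it are extracted. If it is not given (or is empty), every link matches.
>
> deny: used the same way as allow, except that URLs matching this regular expression are not extracted. It takes precedence over allow. If it is not given (or is None), no links are excluded.

Call the extract_links() method of the LinkExtractor instance to see what matches:

page_lx.extract_links(response)

Nothing is found: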

[]

Mind the escape characters, then try the match again:

page_lx = LinkExtractor(allow=(r'position\.php\?&start=\d+',))  # or simply: LinkExtractor(allow=(r'start=\d+',))

page_lx.extract_links(response)
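The reason the first pattern finds nothing can be checked with the re module alone, since the allow patterns are ordinary regular expressions applied with a search. The sample URL below is an assumed example of a pagination link on that page.

```python
import re

# Assumed example of a pagination link.
url = "http://hr.tencent.com/position.php?&start=10#a"

unescaped = r"position.php?&start=\d+"   # "?" makes the preceding "p" optional
escaped = r"position\.php\?&start=\d+"   # "\." and "\?" match the literal characters
short = r"start=\d+"                     # enough to identify the pagination links

no_match = re.search(unescaped, url)
full_match = re.search(escaped, url)
short_match = re.search(short, url)

print(no_match)
print(full_match.group())
print(short_match.group())
```

In the unescaped pattern the "?" is a quantifier, so nothing in the pattern can match the literal "?" in the URL and the search fails. Escaping it, or using the shorter pattern that avoids the special characters altogether, fixes the match.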

## CrawlSpider version

Once the test in scrapy shell is done, change the code as follows:

```python
# Extract links matching 'http://hr.tencent.com/position.php?&start=\d+'
page_lx = LinkExtractor(allow=(r'start=\d+',))

rules = [
    # Extract the matching links, analyse them with the spider's parse method,
    # and follow them (with no callback, follow would default to True)
    Rule(page_lx, callback='parse', follow=True)
]
```

Is this correct?

No! Never set callback to parse. To stress it once more: CrawlSpider uses the parse method to implement its own logic, so the callback must not share a name with CrawlSpider's parse method. If parse is overridden, the crawl spider will fail to run.

```python
# -*- coding: utf-8 -*-
import scrapy  # only needed by the commented-out manual pagination below
from scrapy.spiders import CrawlSpider, Rule     # rules for extracting links
from scrapy.linkextractors import LinkExtractor  # link extraction
from Tencent import items


class MytencentSpider(CrawlSpider):
    name = 'myTencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php?lid=2218&start=0#a']

    page_lx = LinkExtractor(allow=(r'start=\d+',))
    rules = [
        Rule(page_lx, callback='parseContent', follow=True)
    ]

    # Named parseContent instead of parse(self, response)
    def parseContent(self, response):
        for data in response.xpath('//tr[@class="even"] | //tr[@class="odd"]'):
            item = items.TencentItem()
            item["jobTitle"] = data.xpath("./td[1]/a/text()")[0].extract()
            item["jobLink"] = "https://hr.tencent.com/" + data.xpath("./td[1]/a/@href")[0].extract()
            # Same XPath as jobTitle in the original; this may be a copy-paste slip,
            # so check it against the column layout of the actual page.
            item["jobCategories"] = data.xpath("./td[1]/a/text()")[0].extract()
            item["number"] = data.xpath("./td[2]/text()")[0].extract()
            item["location"] = data.xpath("./td[3]/text()")[0].extract()
            item["releasetime"] = data.xpath("./td[4]/text()")[0].extract()
            yield item

        # Manual pagination, left commented out in the original:
        # for i in range(1, 200):
        #     newurl = "https://hr.tencent.com/position.php?lid=2218&start=%d#a" % (i*10)
        #     yield scrapy.Request(newurl, callback=self.parse)
```

Run: scrapy crawl myTencent

6 The robots protocol

When reposting, please credit the source: https://www.heiqu.com/wpgzzx.html
