Scrapy Crawler Usage Guide: The Complete Tutorial (3)

Date: 2020-06-15 Category: Programmer's Life

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings

    runner = CrawlerRunner(get_project_settings())
    dfs = set()
    for domain in ['scrapinghub.com', 'insophia.com']:
        d = runner.crawl('followall', domain=domain)
        dfs.add(d)

    defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until all crawling jobs are finished

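Avoiding getting banned

Use a pool of user agents and rotate through them, picking one for each request. The pool should contain the user agent strings of common browsers (a quick Google search turns up plenty).

Disable cookies (see COOKIES_ENABLED). Some sites use cookies to spot crawler behavior.

Set a download delay of 2 seconds or higher. See the DOWNLOAD_DELAY setting.

If feasible, fetch data from Google cache rather than hitting the site directly.

Use a pool of rotating IPs, for example the free Tor project or a paid service such as ProxyMesh.

Use a highly distributed downloader that works around bans for you, so that you only need to focus on parsing pages. One example is Crawlera.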
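The user agent pool described above could be implemented as a small downloader middleware. The following is only a minimal sketch: the class name RotatingUserAgentMiddleware and the contents of USER_AGENT_POOL are assumptions made for illustration, and the pool should be replaced with user agent strings you have collected yourself.

```python
import itertools

# Hypothetical pool for illustration; replace with user agent strings
# of common browsers that you have collected yourself.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/83.0.4103.97 Safari/537.36",
]


class RotatingUserAgentMiddleware:
    """Downloader middleware sketch: cycle through a pool of user agents."""

    def __init__(self, user_agents=None):
        # itertools.cycle yields the pool entries in order, forever.
        self._user_agents = itertools.cycle(list(user_agents or USER_AGENT_POOL))

    def process_request(self, request, spider):
        # Overwrite the header with the next user agent in the rotation.
        request.headers["User-Agent"] = next(self._user_agents)
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

To try it, the class would be registered in the DOWNLOADER_MIDDLEWARES setting under its import path (for example a hypothetical myproject.middlewares.RotatingUserAgentMiddleware) with a priority number of your choice.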

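Tuning for broad crawls

Increase concurrency: CONCURRENT_REQUESTS = 100

Disable cookies: COOKIES_ENABLED = False

Disable retries: RETRY_ENABLED = False

Reduce the download timeout: DOWNLOAD_TIMEOUT = 15

Disable redirects: REDIRECT_ENABLED = False

Enable crawling of "Ajax Crawlable Pages": AJAXCRAWL_ENABLED = True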
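The settings listed above can be collected in one place. A minimal sketch, assuming they go into the project's settings.py and using exactly the values given in the text (tune them for your own hardware and target sites):

```python
# settings.py (fragment): the tuning values listed above, gathered together

CONCURRENT_REQUESTS = 100   # allow more requests in flight at the same time
COOKIES_ENABLED = False     # do not store or send cookies
RETRY_ENABLED = False       # do not retry failed requests
DOWNLOAD_TIMEOUT = 15       # seconds to wait before abandoning a download
REDIRECT_ENABLED = False    # do not follow HTTP redirects
AJAXCRAWL_ENABLED = True    # enable crawling of "Ajax Crawlable Pages"
```

Disabling retries and redirects trades completeness for speed, so these values fit crawls where missing a few pages is acceptable.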

Useful Firefox add-ons for scraping

Firebug

XPather

XPath Checker

Tamper Data

Firecookie

Automatic throttling: AUTOTHROTTLE_ENABLED = True
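AutoThrottle has a few companion settings that control how it adjusts the delay. The fragment below is a sketch for settings.py; only AUTOTHROTTLE_ENABLED comes from the text above, and the other values are illustrative assumptions to adjust for the site being crawled.

```python
# settings.py (fragment): AutoThrottle with illustrative companion values

AUTOTHROTTLE_ENABLED = True            # from the text above
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound on the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
AUTOTHROTTLE_DEBUG = False             # set to True to log throttling stats per response
```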

Please credit the source when reposting: https://www.heiqu.com/5ba79f84ad89114534c6c13e0697f29e.html
