Scrapy 爬虫 使用指南 完全教程(2)

>>>sel.xpath('//li//@href').extract() >>>sel.xpath('//li[re:test(@class, "item-\d$")]//@href').extract()

XPATH TIPS

Avoid using contains(.//text(), ‘search text’) in your XPath conditions. Use contains(., ‘search text’) instead.

Beware of the difference between //node[1] and (//node)[1]

When selecting by class, be as specific as necessary,When querying by class, consider using CSS

Learn to use all the different axes

Useful trick to get text content

Item Loaders populate items

def parse(self, response): l = ItemLoader(item=Product(), response=response) l.add_xpath('name', '//div[@class="product_name"]') l.add_xpath('name', '//div[@class="product_title"]') l.add_xpath('price', '//p[@id="price"]') l.add_css('stock', 'p#stock]') l.add_value('last_updated', 'today') # you can also use literal values return l.load_item()

Item Pipeline

清理HTML数据

验证爬取的数据(检查item包含某些字段)

查重(并丢弃)

将爬取结果保存到数据库中

编写你自己的item pipeline

每个item pipeline组件都需要调用该方法,这个方法必须返回一个 Item (或任何继承类)对象, 或是抛出 DropItem 异常,被丢弃的item将不会被之后的pipeline组件所处理。
参数:

item (Item 对象) – 被爬取的item

spider (Spider 对象) – 爬取该item的spider

Write items to MongoDB

import pymongo class MongoPipeline(object): def__init__(self, mongo_uri, mongo_db): self.mongo_uri = mongo_uri self.mongo_db = mongo_db @classmethod def from_crawler(cls, crawler): return cls( mongo_uri=crawler.settings.get('MONGO_URI'), mongo_db=crawler.settings.get('MONGO_DATABASE', 'items') ) def open_spider(self, spider): self.client = pymongo.MongoClient(self.mongo_uri) self.db = self.client[self.mongo_db] def close_spider(self, spider): self.client.close() def process_item(self, item, spider): collection_name = item.__class__.__name__ self.db[collection_name].insert(dict(item)) return item

为了启用一个Item Pipeline组件,你必须将它的类添加到 ITEM_PIPELINES 配置,就像下面这个例子:

ITEM_PIPELINES = { 'myproject.pipelines.PricePipeline': 300, 'myproject.pipelines.JsonWriterPipeline': 800, }

分配给每个类的整型值,确定了他们运行的顺序,item按数字从低到高的顺序,通过pipeline,通常将这些数字定义在0-1000范围内。

实践经验 同一进程运行多个spider

内容版权声明:除非注明,否则皆为本站原创文章。

转载注明出处:https://www.heiqu.com/5ba79f84ad89114534c6c13e0697f29e.html