[root@localhost scrapy]# scrapy fetch --help
Usage
=====
  scrapy fetch [options] <url>

Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging

Options
=======
--help, -h              show this help message and exit
--spider=SPIDER         use this spider
--headers               print response HTTP headers instead of body

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--lsprof=FILE           write lsprof profiling stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
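For instance, to print only the response HTTP headers and suppress the log output, the two options documented above can be combined. This is a hypothetical invocation for illustration; the URL is a placeholder, not part of the original session:

[root@localhost scrapy]# scrapy fetch --nolog --headers http://example.com/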
As the usage message indicates, supply a URL and the command will fetch that page's data, as shown below:

[root@localhost scrapy]# scrapy fetch > install.html
2011-12-05 23:40:04+0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: scrapybot)
2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled item pipelines:
2011-12-05 23:40:05+0800 [default] INFO: Spider opened
2011-12-05 23:40:05+0800 [default] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2011-12-05 23:40:05+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-12-05 23:40:05+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-12-05 23:40:07+0800 [default] DEBUG: Crawled (200) <GET > (referer: None)
2011-12-05 23:40:07+0800 [default] INFO: Closing spider (finished)
2011-12-05 23:40:07+0800 [default] INFO: Dumping spider stats:
    {'downloader/request_bytes': 227,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 22676,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2011, 12, 5, 15, 40, 7, 918833),
     'scheduler/memory_enqueued': 1,
     'start_time': datetime.datetime(2011, 12, 5, 15, 40, 5, 5749)}
2011-12-05 23:40:07+0800 [default] INFO: Spider closed (finished)
2011-12-05 23:40:07+0800 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 17711104, 'memusage/startup': 17711104}
[root@localhost scrapy]# ll install.html
-rw-r--r-- 1 root root 22404 Dec  5 23:40 install.html
[root@localhost scrapy]#
As you can see, we have successfully fetched a web page.
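The transcript above uses the command-line tool shipped with Scrapy 0.14. For reference, the same fetch-and-save step can also be sketched programmatically; the following is a minimal sketch assuming a current Scrapy release (whose API differs from 0.14), with http://example.com/ standing in as a placeholder URL:

# A minimal sketch of `scrapy fetch <url> > install.html` done in code,
# assuming a current Scrapy release; http://example.com/ is a placeholder.
import scrapy
from scrapy.crawler import CrawlerProcess

class FetchSpider(scrapy.Spider):
    name = "fetch"
    start_urls = ["http://example.com/"]  # placeholder, not the original URL

    def parse(self, response):
        # Write the raw response body to a file, mirroring the shell
        # redirection used in the transcript above.
        with open("install.html", "wb") as f:
            f.write(response.body)

process = CrawlerProcess()
process.crawl(FetchSpider)
process.start()  # blocks until the crawl finishes

Running this script downloads the page through the Scrapy downloader, just as the `scrapy fetch` command does, and leaves the body in install.html.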