A Simple Python Crawler (3)

Complete Example

Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

```python
from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie">Elsie</a>,
<a href="http://example.com/lacie">Lacie</a> and
<a href="http://example.com/tillie">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print('Get all links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print("Get Lacie's link")
link_node1 = soup.find('a', href='http://example.com/lacie')
print(link_node1.name, link_node1['href'], link_node1.get_text())

# Fuzzy match on "ill" with a regular expression
print('Regex match')
link_node2 = soup.find('a', href=re.compile(r"ill"))
print(link_node2.name, link_node2['href'], link_node2.get_text())

# Get the content of the <p> tag with the given class
print('Get paragraph text')
p_node = soup.find('p', class_="title")
print(p_node.name, p_node.get_text())
```

Crawler Development Example

1. Determine the target: decide which pages and what information to crawl
2. Analyze the target: work out the URL format to limit the scope of the crawl
3. Analyze the format of the data to be extracted
4. Analyze the page encoding and specify it when parsing
5. Write the code
6. Run the crawler


Goal: crawl 1000 pages linked from the Baidu Baike "Python" entry and extract each page's title and summary.

Data Analysis

Entry page:

https://baike.baidu.com/item/Python/407313.html

URL format:

- Entry-page URL: /view/125370.html

Data format:

- Title: `<dd class="lemmaWgt-lemmaTitle-title"><h1>***</h1></dd>`
- Summary: `<div class="lemma-summary">***</div>`

Page encoding:

UTF-8
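
This analysis maps directly onto BeautifulSoup calls. Below is a minimal parsing sketch based on the selectors listed under "Data format"; the function name `extract_data` is illustrative and not part of the project's code.

```python
from bs4 import BeautifulSoup

def extract_data(html_cont):
    """Extract the title and summary from one entry page (hypothetical helper)."""
    soup = BeautifulSoup(html_cont, 'html.parser')
    # Title: <dd class="lemmaWgt-lemmaTitle-title"><h1>***</h1></dd>
    title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title')
    # Summary: <div class="lemma-summary">***</div>
    summary_node = soup.find('div', class_='lemma-summary')
    return {
        'title': title_node.find('h1').get_text() if title_node else '',
        'summary': summary_node.get_text() if summary_node else '',
    }
```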

Source Code
Directory structure

spider_main (crawler scheduler)

html_downloader (page downloader)

url_manager (URL manager)

html_parser (page parser)

html_outputer (collects and outputs the crawled data)

Only spider_main.py is listed below; sketches of the other four modules follow it.

spider_main.py

```python
# coding: utf-8
from baidubk import url_manager, html_downloader, html_outputer, html_parser


class SpiderMain(object):
    def __init__(self):
        # Wire up the four collaborating modules
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownloader()
        self.parser = html_parser.HtmlParser()
        self.outputer = html_outputer.HtmlOutput()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        # While there are URLs left to crawl:
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d : %s' % (count, new_url))
                # Download the page
                html_cont = self.downloader.download(new_url)
                # Parse the page to get a list of new URLs and the page data
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                if count == 10:  # stop early while testing; raise for the full 1000-page crawl
                    break
                count = count + 1
            except Exception:
                print('craw failed')
        self.outputer.output_html()


if __name__ == "__main__":
    root_url = "https://baike.baidu.com/item/Python/407313.html"
    obj_spider = SpiderMain()
    # The craw method starts the crawler
    obj_spider.craw(root_url)
```
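
The remaining four modules are not listed in the original post. The sketch below is inferred purely from how spider_main.py calls them: the class names (UrlManager, HtmlDownloader, HtmlParser, HtmlOutput) and method signatures come from those calls, while the urllib usage, the link pattern, and the output format are assumptions.

```python
# coding: utf-8
# Hypothetical sketches of url_manager, html_downloader, html_parser and
# html_outputer, inferred from the calls made in spider_main.py.
import re
import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup


class UrlManager(object):
    """url_manager: tracks URLs waiting to be crawled vs. already crawled."""
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


class HtmlDownloader(object):
    """html_downloader: fetches a page and returns its HTML as text."""
    def download(self, url):
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read().decode('utf-8')  # the pages are UTF-8


class HtmlParser(object):
    """html_parser: extracts new entry URLs and the title/summary data."""
    def parse(self, page_url, html_cont):
        soup = BeautifulSoup(html_cont, 'html.parser')
        # Collect links to other entry pages (assumed /item/ or /view/ paths)
        new_urls = set()
        for link in soup.find_all('a', href=re.compile(r'/(item|view)/')):
            new_urls.add(urljoin(page_url, link['href']))
        # Extract data using the selectors identified in the analysis step
        title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title')
        summary_node = soup.find('div', class_='lemma-summary')
        new_data = {
            'url': page_url,
            'title': title_node.get_text() if title_node else '',
            'summary': summary_node.get_text() if summary_node else '',
        }
        return new_urls, new_data


class HtmlOutput(object):
    """html_outputer: collects the crawled data and writes output.html."""
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data:
            self.datas.append(data)

    def output_html(self):
        with open('output.html', 'w', encoding='utf-8') as fout:
            fout.write('<html><body><table>')
            for data in self.datas:
                fout.write('<tr><td>%s</td><td>%s</td><td>%s</td></tr>'
                           % (data['url'], data['title'], data['summary']))
            fout.write('</table></body></html>')
```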
