Python Web Scraping (7): BeautifulSoup

Date: 2021-06-21 Category: Programmer's Life

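Today I'd like to introduce a very handy Python scraping library: beautifulsoup4. Chinese-language documentation for beautifulsoup4 is also available for reference.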

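First, install the library with pip. You will also need the lxml parser; used together, the two make it easy to process HTML documents and extract the information you need. You can run pip list to see which packages you already have installed. One thing to watch out for: the command must be pip install beautifulsoup4. Do not forget the 4, or you will get an error like this: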

    print "Unit tests have failed!"

    SyntaxError: Missing parentheses in call to 'print'

    Command "python setup.py egg_info" failed with error code 1 in C:\Users\ADMINI~1\AppData\Local\Temp\pip-build-4g6q3fil\...

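In Python 3, print is a function and must be called with parentheses, so the error tells us this is a version incompatibility: without the 4, pip fetches the old BeautifulSoup 3 package, whose setup script is written in Python 2 syntax. Python 3 needs beautifulsoup4. I ran into exactly this problem on my first install, but fortunately spotted and fixed it quickly. When the installation succeeds, pip prints a "Successfully installed" message.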
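As a quick check that the right packages ended up installed, here is a minimal sketch that looks up the installed distribution versions. It assumes Python 3.8 or newer (for `importlib.metadata`, which is newer than the Python 3.6.4 used in the session in this post), and `installed_version` is a hypothetical helper name introduced only for this illustration. Note that the distribution is called beautifulsoup4 even though the import name is bs4.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; dist_name is the pip name
    (for example "beautifulsoup4"), not the import name ("bs4").
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages this post relies on.
for dist in ("beautifulsoup4", "lxml"):
    print(dist, installed_version(dist) or "not installed")
```

If beautifulsoup4 shows up as not installed, rerun pip install beautifulsoup4 in the same environment that runs this script.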

    C:\Users\Administrator\Desktop
    λ ipython
    Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)]
    Type 'copyright', 'credits' or 'license' for more information
    IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.
    # Import the package
    In [1]: from bs4 import BeautifulSoup

    In [2]: html="""\
       ...: <!DOCTYPE HTML> <html> <head> <meta charset="utf-8"> <title>我的博客(CCColby.com)</title> </head> <body> <video width="320" height="240" controls> <source src="movie.mp4" type="video/mp4"> <source src="movie.ogg" type="video/ogg"> 你的浏览器不支持 video 标签。 </video> </body> </html>
       ...: """
    # Create the object; if no parser is specified, a warning is shown
    In [3]: soup=BeautifulSoup(html)
    c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

    The code that caused this warning is on line 193 of the file c:\users\administrator\appdata\local\programs\python\python36\lib\runpy.py. To get rid of this warning, change code that looks like this:

     BeautifulSoup(YOUR_MARKUP})

    to this:

     BeautifulSoup(YOUR_MARKUP, "lxml")

      markup_type=markup_type))
    # Specify 'lxml' as the parser
    In [4]: soup=BeautifulSoup(html,"lxml")
    # Pretty-print the soup object
    In [5]: print(soup.prettify())
    <!DOCTYPE HTML>
    <html>
     <head>
      <meta charset="utf-8"/>
      <title>
       我的博客(CCColby.com)
      </title>
     </head>
     <body>
      <video controls="" height="240" width="320">
       <source src="movie.mp4" type="video/mp4">
        <source src="movie.ogg" type="video/ogg">
         你的浏览器不支持 video 标签。
        </source>
       </source>
      </video>
     </body>
    </html>
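The session above stops at pretty-printing the document. As a minimal sketch of the "extract the information you need" step, the script below pulls the page title and the video sources out of the same markup. It passes "html.parser" instead of "lxml"; that is an assumption made here so the example runs without lxml installed, and swapping in "lxml" should return the same values for this markup. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Same sample page as in the session, laid out over several lines.
html = """\
<!DOCTYPE HTML>
<html>
<head><meta charset="utf-8"><title>我的博客(CCColby.com)</title></head>
<body>
<video width="320" height="240" controls>
  <source src="movie.mp4" type="video/mp4">
  <source src="movie.ogg" type="video/ogg">
  你的浏览器不支持 video 标签。
</video>
</body>
</html>
"""

# Name the parser explicitly to avoid the UserWarning shown in the session.
# "html.parser" ships with Python (assumption: chosen here instead of lxml).
soup = BeautifulSoup(html, "html.parser")

# Text inside the <title> tag.
title = soup.title.string

# (src, type) pairs for every <source> tag, wherever it sits in the tree.
sources = [(tag["src"], tag["type"]) for tag in soup.find_all("source")]

# Attributes are read from a tag with dictionary-style access.
video = soup.find("video")

print(title)
print(sources)
print(video["width"], video["height"])
```

Because find_all searches the whole tree, it returns both source tags regardless of whether the chosen parser nests them, as lxml does in the output above, or treats them as siblings.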

When reposting, please credit the source: https://www.heiqu.com/zyzgpy.html
