智一面's interview question bank offers Python test questions.
Link: http://www.gtalent.cn/exam/interview?token=906315a76b5c14231889351088713f76
First, install the third-party package Beautiful Soup:
pip install beautifulsoup4
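It is best to run the code in a virtual environment: it is easier to manage, and it avoids version conflicts between third-party packages, which saves some minor hassle.
This walkthrough runs on urllib from the Python standard library. A later version will surely use the requests library, since everyone likes a good tool...
Scraping the Douban site: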
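As a rough sketch of the virtual-environment advice, the standard-library venv module can create the environment from Python itself. The helper name create_env and the directory name .venv are my own placeholders, not something from the original post.

```python
import os
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment and return the path of its interpreter.

    create_env is a hypothetical helper name used only for this sketch.
    """
    env_path = Path(env_dir)
    venv.create(env_path, with_pip=with_pip)
    # The interpreter lives in Scripts\ on Windows and in bin/ elsewhere.
    scripts_dir = "Scripts" if os.name == "nt" else "bin"
    executable = "python.exe" if os.name == "nt" else "python"
    return env_path / scripts_dir / executable


# Possible usage (".venv" is an assumed directory name):
#   python_path = create_env(".venv")
#   subprocess.run([str(python_path), "-m", "pip", "install", "beautifulsoup4"], check=True)
```

Installing through the environment's own interpreter keeps beautifulsoup4 out of the system-wide site-packages.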
import urllib.request
from bs4 import BeautifulSoup
url = "https://movie.douban.com/chart"
response = urllib.request.urlopen(url)
html = response.read().decode('utf8')
bs = BeautifulSoup(html, "html.parser")
print(bs.body)
The code above looks fine, but running it raises the following error:
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)>
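Judging from the error, the SSL certificate is the cause. After some searching, the fix most people use is to turn off SSL verification globally...
With that line added, the code looks like this: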
import ssl
import urllib.request
from bs4 import BeautifulSoup
url = "https://movie.douban.com/chart"
ssl._create_default_https_context = ssl._create_unverified_context
response = urllib.request.urlopen(url)
html = response.read().decode('utf8')
bs = BeautifulSoup(html, "html.parser")
print(bs.body)
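Import the ssl library.
ssl._create_default_https_context = ssl._create_unverified_context turns off certificate verification globally.
It assigns the private function _create_unverified_context to _create_default_https_context, so every later HTTPS request in the process skips verification. It is a forced override that relies on private functions, and personally I find that rather uncomfortable. urllib does have a narrower option: urlopen accepts a context argument, so an ssl.SSLContext can be supplied for a single call. The other route is to switch to the requests library...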
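A minimal sketch of the per-call alternative, using only public ssl and urllib calls. The helper names build_ssl_context and fetch are made up for illustration, and this only covers the certificate part of the problem.

```python
import ssl
import urllib.request


def build_ssl_context(verify=True):
    """Return an SSLContext; verification stays on unless verify is False."""
    context = ssl.create_default_context()
    if not verify:
        # Order matters: check_hostname has to be switched off before
        # verify_mode may be set to CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def fetch(url, verify=True, timeout=10):
    """Fetch url with a context that applies to this call only (hypothetical helper)."""
    context = build_ssl_context(verify)
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read().decode("utf8")


# Possible usage; nothing global is patched:
#   html = fetch("https://movie.douban.com/chart", verify=False)
```

Because the context is passed to one urlopen call, other HTTPS requests in the same process keep normal certificate verification.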
I ran the seemingly fine code again, and another error appeared:
urllib.error.HTTPError: HTTP Error 418:
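After some reading, this turns out to be caused by the site's anti-scraping mechanism; setting a User-Agent is enough...
Then I found that urlopen has no parameter for passing headers.
So the next step is to build a request object: put the headers into urllib.request.Request and pass the resulting Request object to urlopen: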
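import ssl
import urllib.request
from bs4 import BeautifulSoup
url = "https://movie.douban.com/chart"
# Browser-like User-Agent so the site does not answer with HTTP 418
headers = {"user-agent":
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}
# Turn off certificate verification globally (relies on private ssl functions)
ssl._create_default_https_context = ssl._create_unverified_context
# urlopen takes no headers argument, so wrap the URL and headers in a Request
url_obj = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(url_obj)
html = response.read().decode('utf8')
# Parse with the built-in html.parser and print the body element
bs = BeautifulSoup(html, "html.parser")
print(bs.body)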
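The request step can also be wrapped in a small helper that reports HTTP errors such as the 418 above instead of crashing. This is only a sketch: build_request and fetch_html are names I made up, and certificate verification is left at its default here, so on a machine with the certificate problem the URLError branch would be the one that fires.

```python
import urllib.error
import urllib.request

# Same User-Agent string as in the script above.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Wrap the URL and a User-Agent header in a Request object."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Return the decoded page, or None if the request fails (hypothetical helper)."""
    request = build_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf8")
    except urllib.error.HTTPError as error:
        # HTTPError is a subclass of URLError, so it is caught first.
        print(f"server answered with HTTP {error.code}: {error.reason}")
    except urllib.error.URLError as error:
        print(f"request failed before a response arrived: {error.reason}")
    return None
```

One detail worth knowing: Request stores header names capitalized, so the header is read back with request.get_header("User-agent").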
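html.parser is the HTML parser built into Python 3; lxml and html5lib are alternative parsers, and both are third-party packages that have to be installed with pip.
bs.body returns the body element of the HTML; bs.html, bs.h1, bs.title and so on return the first tag with that name in the same way...
Part of the output from running the script: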
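To make the parser choice and the tag shortcuts concrete, here is a small sketch that tries the parsers in order and falls back to the built-in one. make_soup and the tiny HTML string are my own illustrations; FeatureNotFound is the exception Beautiful Soup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, parsers=("lxml", "html5lib", "html.parser")):
    """Parse html with the first available parser; return (soup, parser_name)."""
    for name in parsers:
        try:
            return BeautifulSoup(html, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Made-up page, only for demonstration.
page = "<html><head><title>demo</title></head><body><h1>Hello</h1></body></html>"
soup, used = make_soup(page)
print(used)               # name of whichever parser was found first
print(soup.title.string)  # text inside <title>
print(soup.h1.get_text()) # text inside the first <h1>
print(soup.h3)            # a tag that is absent yields None
```

Since html.parser ships with Python, the last entry in the tuple should always be available.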
<h2>分类排行榜 · · · · · ·<img src="https://img3.doubanio.com/f/shire/e49eca1517424a941871a2667a8957fd6c72d632/pics/new_menu.gif" style=" position: absolute;"/></h2>
<div class="types">
<span><a href="/typerank?type_name=剧情&type=11&interval_id=100:90&action=">剧情</a></span>
<span><a href="/typerank?type_name=喜剧&type=24&interval_id=100:90&action=">喜剧</a></span>
<span><a href="/typerank?type_name=动作&type=5&interval_id=100:90&action=">动作</a></span>
<span><a href="/typerank?type_name=爱情&type=13&interval_id=100:90&action=">爱情</a></span>
<span><a href="/typerank?type_name=科幻&type=17&interval_id=100:90&action=">科幻</a></span>
<span><a href="/typerank?type_name=动画&type=25&interval_id=100:90&action=">动画</a></span>
<span><a href="/typerank?type_name=悬疑&type=10&interval_id=100:90&action=">悬疑</a></span>
<span><a href="/typerank?type_name=惊悚&type=19&interval_id=100:90&action=">惊悚</a></span>
<span><a href="/typerank?type_name=恐怖&type=20&interval_id=100:90&action=">恐怖</a></span>
<span><a href="/typerank?type_name=纪录片&type=1&interval_id=100:90&action=">纪录片</a></span>
<span><a href="/typerank?type_name=短片&type=23&interval_id=100:90&action=">短片</a></span>
<span><a href="/typerank?type_name=情色&type=6&interval_id=100:90&action=">情色</a></span>
<span><a href="/typerank?type_name=同性&type=26&interval_id=100:90&action=">同性</a></span>
<span><a href="/typerank?type_name=音乐&type=14&interval_id=100:90&action=">音乐</a></span>
<span><a href="/typerank?type_name=歌舞&type=7&interval_id=100:90&action=">歌舞</a></span>
<span><a href="/typerank?type_name=家庭&type=28&interval_id=100:90&action=">家庭</a></span>
<span><a href="/typerank?type_name=儿童&type=8&interval_id=100:90&action=">儿童</a></span>
<span><a href="/typerank?type_name=传记&type=2&interval_id=100:90&action=">传记</a></span>
<span><a href="/typerank?type_name=历史&type=4&interval_id=100:90&action=">历史</a></span>
<span><a href="/typerank?type_name=战争&type=22&interval_id=100:90&action=">战争</a></span>
<span><a href="/typerank?type_name=犯罪&type=3&interval_id=100:90&action=">犯罪</a></span>
<span><a href="/typerank?type_name=西部&type=27&interval_id=100:90&action=">西部</a></span>
<span><a href="/typerank?type_name=奇幻&type=16&interval_id=100:90&action=">奇幻</a></span>
<span><a href="/typerank?type_name=冒险&type=15&interval_id=100:90&action=">冒险</a></span>
<span><a href="/typerank?type_name=灾难&type=12&interval_id=100:90&action=">灾难</a></span>
<span><a href="/typerank?type_name=武侠&type=29&interval_id=100:90&action=">武侠</a></span>
<span><a href="/typerank?type_name=古装&type=30&interval_id=100:90&action=">古装</a></span>
<span><a href="/typerank?type_name=运动&type=18&interval_id=100:90&action=">运动</a></span>
<span><a href="/typerank?type_name=黑色电影&type=31&interval_id=100:90&action=">黑色电影</a></span>
</div>
</div>
<!-- douban ad begin -->
<div id="dale_movie_chart_top_right"></div>
<!-- douban ad end -->
<div class="movie_top" id="ranking">
<div class="movie_top" id="ranking">
<h2>一周口碑榜· · · · · · <span class="box_chart_num color-gray">5月7日 更新</span></h2>
<ul class="content" id="listCont2">
<li class="clearfix">
<div class="no">1</div>
<div class="name">
<a class="" href="https://movie.douban.com/subject/35417202/" onclick="moreurl(this, {from:'mv_week'})">
地球改变之年
</a>
</div>
<span class="">
<div class="stay">0</div>
</span>
</li>
<li class="clearfix">
<div class="no">2</div>
<div class="name">
<a class="" href="https://movie.douban.com/subject/34679722/" onclick="moreurl(this, {from:'mv_week'})">
女人
</a>
</div>
<span class="">
<div class="stay">0</div>
</span>
</li>
<li class="clearfix">
<div class="no">3</div>
<div class="name">
<a class="" href="https://movie.douban.com/subject/34429100/" onclick="moreurl(this, {from:'mv_week'})">
音乐
</a>
</div>
<span class="">
<div class="stay">0</div>
</span>
</li>
<li class="clearfix">
<div class="no">4</div>
<div class="name">
<a class="" href="https://movie.douban.com/subject/34845781/" onclick="moreurl(this, {from:'mv_week'})">
鬼灭之刃 剧场版 无限列车篇
</a>
</div>
<span class="">
<div class="stay">0</div>
</span>
</li>
<li class="clearfix">
<div class="no">5</div>
<div class="name">
<a class="" href="https://movie.douban.com/subject/35287558/" onclick="moreurl(this, {from:'mv_week'})">
有答案的男子
</a>
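Printing the whole body is noisy, so a natural next step is to pull just the rank, title and link out of the weekly chart. The sketch below runs on a trimmed copy of the output above rather than on the live site. The function name extract_weekly_chart is hypothetical, and the ids and class names are taken from this one output, so they may not match what the site serves today.

```python
from bs4 import BeautifulSoup

# Trimmed sample copied from the output above (first two entries only).
SAMPLE_HTML = """
<ul class="content" id="listCont2">
<li class="clearfix">
<div class="no">1</div>
<div class="name">
<a class="" href="https://movie.douban.com/subject/35417202/">
地球改变之年
</a>
</div>
</li>
<li class="clearfix">
<div class="no">2</div>
<div class="name">
<a class="" href="https://movie.douban.com/subject/34679722/">
女人
</a>
</div>
</li>
</ul>
"""


def extract_weekly_chart(html):
    """Return a list of (rank, title, url) tuples from the weekly chart list."""
    soup = BeautifulSoup(html, "html.parser")
    chart = soup.find("ul", id="listCont2")
    if chart is None:
        return []
    rows = []
    for item in chart.find_all("li"):
        rank = item.find("div", class_="no")
        name = item.find("div", class_="name")
        link = name.find("a") if name is not None else None
        if rank is None or link is None:
            continue  # skip entries that do not have the expected shape
        rows.append((int(rank.get_text(strip=True)),
                     link.get_text(strip=True),
                     link["href"]))
    return rows


for row in extract_weekly_chart(SAMPLE_HTML):
    print(row)
```

The same function could be given the html string from the urllib script, assuming the page structure has not changed.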
————————————————
Our Python discussion group: 941108876