智一面 offers Python test questions as part of its interview question sets.
Link: http://www.gtalent.cn/exam/interview?token=906315a76b5c14231889351088713f76

First, install the third-party package Beautiful Soup:

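pip install beautifulsoup4
It is best to run the code in a virtual environment: dependencies are easier to manage, and version conflicts between third-party packages are avoided.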

 

This walkthrough uses urllib from the Python standard library. Later on I will certainly switch to the requests library; good tools are popular for a reason.

 

Scraping the Douban site:

import urllib.request
from bs4 import BeautifulSoup
url = "https://movie.douban.com/chart"
response = urllib.request.urlopen(url)
html = response.read().decode('utf8')
bs = BeautifulSoup(html, "html.parser")
print(bs.body)
The code above looks fine, but running it raises the following error:

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)>
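The error points to an SSL certificate problem. After some searching, the fix most people suggest is to disable SSL certificate verification globally.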

 

With that change added, the code looks like this:

import ssl
import urllib.request
from bs4 import BeautifulSoup
url = "https://movie.douban.com/chart"
ssl._create_default_https_context = ssl._create_unverified_context
response = urllib.request.urlopen(url)
html = response.read().decode('utf8')
bs = BeautifulSoup(html, "html.parser")
print(bs.body)
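Import the ssl module.
Disable certificate verification globally.
The line ssl._create_default_https_context = ssl._create_unverified_context is what disables verification globally.
It assigns the private function _create_unverified_context to _create_default_https_context, which forces verification off for every HTTPS request in the process. Relying on private functions like this feels quite uncomfortable, and it also removes the protection that certificate checks provide. urllib does offer a narrower option, since urlopen accepts an SSL context through its context parameter; the global override is kept here for simplicity, and switching to the requests library is another route.
The code looks fine again, but running it produces another error: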
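As a hedged alternative to the global override, here is a minimal sketch that scopes the decision to a single request by passing an ssl.SSLContext to urlopen through its context parameter. The helper names build_context and fetch are illustrative and not from the original post, and whether skipping verification is acceptable at all depends on your environment.

```python
import ssl
import urllib.request


def build_context(verify=True):
    """Return an SSLContext; verify=False skips certificate checks."""
    context = ssl.create_default_context()
    if not verify:
        # check_hostname must be turned off before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def fetch(url, verify=True, timeout=10):
    """Fetch a URL using a per-request SSL context (hypothetical helper)."""
    context = build_context(verify=verify)
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read().decode("utf8")
```

Calling fetch(url, verify=False) relaxes verification for that one call only; other HTTPS requests in the process keep the default behaviour.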

urllib.error.HTTPError: HTTP Error 418:
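According to what I found, this happens because the site has anti-scraping measures in place; setting a User-Agent header is enough to get past it.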

It then turns out that urlopen has no parameter for passing headers.

So the headers go into a Request object instead, and that Request object is passed to urlopen:

import ssl
import urllib.request
from bs4 import BeautifulSoup
url = "https://movie.douban.com/chart"
headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}
ssl._create_default_https_context = ssl._create_unverified_context
url_obj = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(url_obj)
html = response.read().decode('utf8')
bs = BeautifulSoup(html, "html.parser")
print(bs.body)
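html.parser is the parser built into the Python 3 standard library. Beautiful Soup also supports the lxml and html5lib parsers, both of which must be installed with pip.
bs.body returns the body element of the HTML; bs.html, bs.h1, bs.title and so on work the same way, each returning the first matching tag.
Part of the output from running the Douban script is shown below: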
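To make the tag-attribute access concrete, here is a small sketch run against a made-up HTML string rather than the live Douban page. The sample document and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A made-up document used only for illustration.
sample_html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>First heading</h1>
    <h1>Second heading</h1>
  </body>
</html>
"""

bs = BeautifulSoup(sample_html, "html.parser")

# Attribute access such as bs.title or bs.h1 returns the first matching tag.
title_text = bs.title.string
first_h1 = bs.h1.get_text()

# find_all returns every match, not just the first one.
all_h1 = [tag.get_text() for tag in bs.find_all("h1")]

# Accessing a tag that does not exist gives None rather than raising.
missing = bs.h2
```

Because attribute access only ever returns the first match, find_all is the safer choice whenever a page may contain several tags of the same name.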

<h2>分类排行榜 · · · · · ·<img src="https://img3.doubanio.com/f/shire/e49eca1517424a941871a2667a8957fd6c72d632/pics/new_menu.gif" style=" position: absolute;"/></h2>
<div class="types">
<span><a href="/typerank?type_name=剧情&amp;type=11&amp;interval_id=100:90&amp;action=">剧情</a></span>
<span><a href="/typerank?type_name=喜剧&amp;type=24&amp;interval_id=100:90&amp;action=">喜剧</a></span>
<span><a href="/typerank?type_name=动作&amp;type=5&amp;interval_id=100:90&amp;action=">动作</a></span>
<span><a href="/typerank?type_name=爱情&amp;type=13&amp;interval_id=100:90&amp;action=">爱情</a></span>
<span><a href="/typerank?type_name=科幻&amp;type=17&amp;interval_id=100:90&amp;action=">科幻</a></span>
<span><a href="/typerank?type_name=动画&amp;type=25&amp;interval_id=100:90&amp;action=">动画</a></span>
<span><a href="/typerank?type_name=悬疑&amp;type=10&amp;interval_id=100:90&amp;action=">悬疑</a></span>
<span><a href="/typerank?type_name=惊悚&amp;type=19&amp;interval_id=100:90&amp;action=">惊悚</a></span>
<span><a href="/typerank?type_name=恐怖&amp;type=20&amp;interval_id=100:90&amp;action=">恐怖</a></span>
<span><a href="/typerank?type_name=纪录片&amp;type=1&amp;interval_id=100:90&amp;action=">纪录片</a></span>
<span><a href="/typerank?type_name=短片&amp;type=23&amp;interval_id=100:90&amp;action=">短片</a></span>
<span><a href="/typerank?type_name=情色&amp;type=6&amp;interval_id=100:90&amp;action=">情色</a></span>
<span><a href="/typerank?type_name=同性&amp;type=26&amp;interval_id=100:90&amp;action=">同性</a></span>
<span><a href="/typerank?type_name=音乐&amp;type=14&amp;interval_id=100:90&amp;action=">音乐</a></span>
<span><a href="/typerank?type_name=歌舞&amp;type=7&amp;interval_id=100:90&amp;action=">歌舞</a></span>
<span><a href="/typerank?type_name=家庭&amp;type=28&amp;interval_id=100:90&amp;action=">家庭</a></span>
<span><a href="/typerank?type_name=儿童&amp;type=8&amp;interval_id=100:90&amp;action=">儿童</a></span>
<span><a href="/typerank?type_name=传记&amp;type=2&amp;interval_id=100:90&amp;action=">传记</a></span>
<span><a href="/typerank?type_name=历史&amp;type=4&amp;interval_id=100:90&amp;action=">历史</a></span>
<span><a href="/typerank?type_name=战争&amp;type=22&amp;interval_id=100:90&amp;action=">战争</a></span>
<span><a href="/typerank?type_name=犯罪&amp;type=3&amp;interval_id=100:90&amp;action=">犯罪</a></span>
<span><a href="/typerank?type_name=西部&amp;type=27&amp;interval_id=100:90&amp;action=">西部</a></span>
<span><a href="/typerank?type_name=奇幻&amp;type=16&amp;interval_id=100:90&amp;action=">奇幻</a></span>
<span><a href="/typerank?type_name=冒险&amp;type=15&amp;interval_id=100:90&amp;action=">冒险</a></span>
<span><a href="/typerank?type_name=灾难&amp;type=12&amp;interval_id=100:90&amp;action=">灾难</a></span>
<span><a href="/typerank?type_name=武侠&amp;type=29&amp;interval_id=100:90&amp;action=">武侠</a></span>
<span><a href="/typerank?type_name=古装&amp;type=30&amp;interval_id=100:90&amp;action=">古装</a></span>
<span><a href="/typerank?type_name=运动&amp;type=18&amp;interval_id=100:90&amp;action=">运动</a></span>
<span><a href="/typerank?type_name=黑色电影&amp;type=31&amp;interval_id=100:90&amp;action=">黑色电影</a></span>
</div>
</div>
<!-- douban ad begin -->
<div id="dale_movie_chart_top_right"></div>
<!-- douban ad end -->
<div class="movie_top" id="ranking">
<div class="movie_top" id="ranking">
<h2>一周口碑榜· · · · · · <span class="box_chart_num color-gray">5月7日 更新</span></h2>
<ul class="content" id="listCont2">
<li class="clearfix">
<div class="no">1</div>
<div class="name">
<a class="" href="https://movie.douban.com/subject/35417202/" onclick="moreurl(this, {from:'mv_week'})">
                    地球改变之年
</a>
</div>
<span class="">
<div class="stay">0</div>
</span>
</li>
<li class="clearfix">
<div class="no">2</div>
<div class="name">
<a class="" href="https://movie.douban.com/subject/34679722/" onclick="moreurl(this, {from:'mv_week'})">
                    女人
</a>
</div>
<span class="">
<div class="stay">0</div>
</span>
</li>
<li class="clearfix">
<div class="no">3</div>
<div class="name">
<a class="" href="https://movie.douban.com/subject/34429100/" onclick="moreurl(this, {from:'mv_week'})">
                    音乐
</a>
</div>
<span class="">
<div class="stay">0</div>
</span>
</li>
<li class="clearfix">
<div class="no">4</div>
<div class="name">
<a class="" href="https://movie.douban.com/subject/34845781/" onclick="moreurl(this, {from:'mv_week'})">
                    鬼灭之刃 剧场版 无限列车篇
</a>
</div>
<span class="">
<div class="stay">0</div>
</span>
</li>
<li class="clearfix">
<div class="no">5</div>
<div class="name">
<a class="" href="https://movie.douban.com/subject/35287558/" onclick="moreurl(this, {from:'mv_week'})">
                    有答案的男子
</a>
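Printing bs.body gives the raw markup; as a possible next step, the sketch below pulls the rank, title, and link out of the weekly list. It runs on a trimmed copy of the output above, the function name extract_weekly_chart is hypothetical, and the id and class names it looks for are assumed to match the page as it was when this output was captured.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the weekly chart markup shown in the output above.
sample_html = """
<ul class="content" id="listCont2">
<li class="clearfix">
<div class="no">1</div>
<div class="name">
<a class="" href="https://movie.douban.com/subject/35417202/">
                    地球改变之年
</a>
</div>
</li>
<li class="clearfix">
<div class="no">2</div>
<div class="name">
<a class="" href="https://movie.douban.com/subject/34679722/">
                    女人
</a>
</div>
</li>
</ul>
"""


def extract_weekly_chart(html):
    """Return (rank, title, url) tuples from the weekly chart list."""
    soup = BeautifulSoup(html, "html.parser")
    chart_list = soup.find("ul", id="listCont2")
    if chart_list is None:
        return []
    rows = []
    for item in chart_list.find_all("li"):
        rank = item.find("div", class_="no")
        name = item.find("div", class_="name")
        link = name.find("a") if name is not None else None
        if rank is None or link is None:
            continue  # skip entries that do not have the expected structure
        rows.append((int(rank.get_text(strip=True)),
                     link.get_text(strip=True),
                     link["href"]))
    return rows


chart = extract_weekly_chart(sample_html)
for rank, title, url in chart:
    print(rank, title, url)
```

The same function could be given the html string fetched earlier in place of sample_html, provided the page structure has not changed.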
                
