Everyone loves Python: crawl a novel and automatically generate a .txt file named after the book [gtalent]

1381 05-12

智一面 (Zhiyimian) offers Python test questions among its interview questions.
Try it at: http://www.gtalent.cn/exam/interview?token=52cf92de494f4a8b6165d817a7279966

Environment:
Operating system: Ubuntu 16
Python version: 3.5.2
Target site: 笔趣阁 (Biquge)
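I went to Biquge today to look for a novel. I remembered that the site used to offer finished books as txt downloads, but the two I checked had no download option, so they could only be read online.
So I wrote a small crawler that fetches a novel from Biquge and automatically generates a txt file named after the book. It is not of much practical value: mobile data is cheap now, and reading a novel online costs very little of it.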
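Usage:
As shown in the screenshot, first find a novel you like on Biquge and open it. The page lists all of the novel's chapters, and each link leads to the corresponding chapter.
The crawler works by looping over the chapter URLs, building each full address, fetching the chapter text, and writing it to a file.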
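The loop described above can be sketched without touching the network. This is only an illustration under assumptions: `crawl`, `fake_fetch`, and the chapter paths are hypothetical names and data, and `fake_fetch` stands in for the real HTTP request and HTML parsing.

```python
import io
from urllib.parse import urljoin

BASE_URL = "http://www.biquge.com.tw"


def crawl(chapter_paths, fetch_chapter, out):
    """Fetch each chapter in order and append it to the open text file `out`."""
    for path in chapter_paths:
        # Chapter links are site-relative, so join them onto the base URL.
        full_url = urljoin(BASE_URL, path)
        title, body = fetch_chapter(full_url)
        out.write(title + "\n")
        # One output line per source line, with surrounding whitespace removed.
        for line in body.split("\r\n"):
            out.write(line.strip() + "\n")
        # Blank lines separate one chapter from the next.
        out.write("\n\n\n")


def fake_fetch(url):
    # Hypothetical stand-in for downloading and parsing a chapter page.
    return "Title of " + url, "  first line\r\n  second line"


buffer = io.StringIO()
crawl(["/19_19001/1.html", "/19_19001/2.html"], fake_fetch, buffer)
print(buffer.getvalue())
```

Passing the fetch function in as a parameter keeps the loop testable; in the real script that role is played by `requests` and BeautifulSoup.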
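The program takes two command-line arguments. The first is the URL of the novel; for example, 《巅峰文明》 in the screenshot is at http://www.biquge.com.tw/19_19001/
so the command is: python3 xxx.py http://www.biquge.com.tw/19_19001/ /home/xxx/xiaoshuo/
Here /home/xxx/xiaoshuo/ is the directory in which the generated txt file is saved. Give only the directory; the file name defaults to the novel's title.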
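As an alternative to reading `sys.argv` by hand, the same two arguments could be declared with the standard library's `argparse`, which also produces a usage message when one is missing. This is a sketch, not the script's actual interface; `parse_args` and the argument names `url` and `directory` are chosen here for illustration.

```python
import argparse


def parse_args(argv):
    """Parse the book URL and the output directory from a list of arguments."""
    parser = argparse.ArgumentParser(
        description="Save a novel as a txt file named after the book.")
    parser.add_argument("url", help="URL of the novel's chapter-list page")
    parser.add_argument("directory", help="directory in which the txt file is saved")
    return parser.parse_args(argv)


# In a real script, call parse_args(sys.argv[1:]) instead of passing a fixed list.
args = parse_args(["http://www.biquge.com.tw/19_19001/", "/home/xxx/xiaoshuo/"])
print(args.url)
print(args.directory)
```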
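While crawling, the program prints its progress for each chapter.
Wait until every chapter has been fetched, then open the generated txt file.
The code is as follows: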

# coding=utf-8
import os
import sys

import bs4
import requests

if len(sys.argv) < 3:
    print("usage: python3 xxx.py <book url> <output directory>")
    sys.exit(1)

# URL of the book's chapter-list page
url = sys.argv[1]
# Directory in which the generated file is saved
directory = sys.argv[2]
print('start spider...')

# Base URL; chapter links are site-relative, so they are appended to it
baseurl = "http://www.biquge.com.tw"


def fetch_soup(page_url):
    """Download a page and parse it with BeautifulSoup."""
    res = requests.get(page_url)
    res.raise_for_status()
    # The pages are GBK-encoded; set this explicitly so res.text is decoded correctly
    res.encoding = 'gbk'
    return bs4.BeautifulSoup(res.text, 'lxml')


soup = fetch_soup(url)
# Book title
bookname = soup.select('#info > h1')[0].getText().strip()
# The dl under #list holds one dd per chapter; the href of the a tag in each dd is that chapter's URL
chapters = soup.select('#list > dl > dd')
print(len(chapters))

# Chapters are appended to one file named after the book, written as UTF-8
with open(os.path.join(directory, bookname + ".txt"), 'a', encoding='utf-8') as myfile:
    for chapter in chapters:
        atag = chapter.select('a')[0]
        chapter_url = atag.get('href')
        # Build the full URL and fetch the chapters one by one
        article_soup = fetch_soup(baseurl + chapter_url)
        # The chapter text lives in the div with id "content"
        article_content = article_soup.select('#content')[0]
        # Chapter title
        article_title = article_soup.select('.bookname > h1')[0].getText()
        print('Fetching: ' + article_title + ' ...')
        # Split the chapter on \r\n into a list of lines
        lines = article_content.getText().split('\r\n')
        print('Chapter fetched, writing to file...')
        myfile.write(article_title + '\n')
        for line in lines:
            myfile.write(line.strip() + '\n')
        myfile.write('\n\n\n')
        print('Chapter written to file!')
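The chapter pages are GBK-encoded, so text that has been decoded as ISO-8859-1 comes out garbled. Because ISO-8859-1 maps every byte to exactly one character, the garbling is reversible. The sketch below shows that round trip with built-in codecs only; the sample string is the book title from the usage example, and the variable names are illustrative.

```python
title = "巅峰文明"

# Bytes as a GBK-encoded page would carry them.
raw = title.encode("gbk")

# Decoding those bytes with the wrong codec yields unreadable characters.
garbled = raw.decode("iso-8859-1")

# Re-encoding with the same wrong codec restores the original bytes,
# which can then be decoded with the correct one.
recovered = garbled.encode("iso-8859-1").decode("gbk")

# Decoding with the correct codec from the start avoids the detour.
direct = raw.decode("gbk")

print(repr(garbled))
print(recovered == title, direct == title)
```

With `requests`, assigning `response.encoding = 'gbk'` before reading `response.text` gives the direct path, which avoids handling garbled intermediate strings.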


Our Python discussion group: 941108876

Tags: python