Scraping a Novel with Requests and bs4

Author: System | Date: 2024-08-17 | Categories: All, Web Scraping | Word count: 1233


The basic steps for scraping a novel with Requests and BeautifulSoup (bs4) are:

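1. Identify the URL of the novel's chapter list on the target site.
2. Send an HTTP request to fetch the page content.
3. Parse the page with BeautifulSoup and extract the chapter titles and content.
4. Loop over the chapters, repeating steps 2 and 3 for each one.
5. Save the novel's content to local files.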
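
Step 3 can be tried offline before any request is sent to a real site. The sketch below parses a small inline HTML snippet and resolves each link to an absolute URL. The markup, the `chapter-link` class name, and the helper `extract_chapters` are assumptions made for illustration; a real site will need different selectors.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup; a real site will use different tags and class names.
SAMPLE_HTML = """
<ul>
  <li><a class="chapter-link" href="/novel/1.html">Chapter 1</a></li>
  <li><a class="chapter-link" href="/novel/2.html">Chapter 2</a></li>
  <li><a class="other" href="/about.html">About</a></li>
</ul>
"""


def extract_chapters(html, base_url):
    """Return (title, absolute_url) pairs for every chapter link in html."""
    soup = BeautifulSoup(html, "html.parser")
    chapters = []
    for link in soup.find_all("a", class_="chapter-link"):
        href = link.get("href")
        if not href:
            continue  # skip anchors that carry no target
        # urljoin turns relative hrefs into absolute URLs
        chapters.append((link.get_text(strip=True), urljoin(base_url, href)))
    return chapters


chapters = extract_chapters(SAMPLE_HTML, "http://example.com/novel/chapter-list")
for title, url in chapters:
    print(title, url)
```

Keeping the parsing in a function that takes an HTML string makes it easy to check the selectors against a saved copy of the page before crawling.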

The following simple example shows how Requests and BeautifulSoup can be used to scrape a hypothetical novel site:




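import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# URL of the novel's chapter list (placeholder)
url = 'http://example.com/novel/chapter-list'

# Send the HTTP request
response = requests.get(url, timeout=10)
response.raise_for_status()  # Raise an exception if the request failed

# Parse the page
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the chapter links
chapters = soup.find_all('a', class_='chapter-link')

# Loop over the chapters and save each one
for index, chapter in enumerate(chapters, start=1):
    # Resolve relative links against the chapter list URL
    chapter_url = urljoin(url, chapter['href'])
    chapter_title = chapter.get_text(strip=True)

    # Fetch the chapter page
    chapter_response = requests.get(chapter_url, timeout=10)
    chapter_response.raise_for_status()

    # Parse the chapter content
    chapter_soup = BeautifulSoup(chapter_response.text, 'html.parser')
    content_div = chapter_soup.find('div', class_='chapter-content')
    if content_div is None:
        print(f'Chapter "{chapter_title}" has no content block; skipped.')
        continue
    chapter_content = content_div.get_text('\n', strip=True)

    # Save the chapter; the index prefix keeps file names unique,
    # and '/' is replaced because it is not allowed in file names
    safe_title = chapter_title.replace('/', '_')
    with open(f'{index:04d}_{safe_title}.txt', 'w', encoding='utf-8') as file:
        file.write(chapter_content)
    print(f'Chapter "{chapter_title}" has been saved.')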

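Note that in practice the selectors must be adjusted to match the target site's HTML structure. The site's crawling policy should also be respected, and requests should be spaced out so the server is not put under excessive load. When saving content, make sure file names are unique and handle character encoding explicitly.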
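
The points about server load, file naming, and encoding can be sketched as two small helpers. This is a minimal illustration under stated assumptions, not a definitive implementation: `safe_filename`, `fetch_text`, and the one-second `REQUEST_DELAY_SECONDS` are hypothetical names and values, and the fallback to `apparent_encoding` assumes the site may serve a non-UTF-8 charset (such as GBK) without declaring it.

```python
import re
import time

import requests

REQUEST_DELAY_SECONDS = 1.0  # assumed value; tune it to the site's policy


def safe_filename(index, title, max_length=80):
    """Build a unique, filesystem-friendly name such as '0001_Title.txt'."""
    # Replace characters that are invalid on common filesystems.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", title).strip(" .")
    cleaned = cleaned[:max_length] or "untitled"
    # The zero-padded index keeps names unique and sorted in reading order.
    return f"{index:04d}_{cleaned}.txt"


def fetch_text(session, url):
    """Fetch a page, pause afterwards, and return its decoded text."""
    response = session.get(url, timeout=10)
    response.raise_for_status()
    # If the server declares no charset, let Requests guess from the body.
    if "charset" not in response.headers.get("Content-Type", "").lower():
        response.encoding = response.apparent_encoding
    time.sleep(REQUEST_DELAY_SECONDS)  # space out requests
    return response.text


print(safe_filename(1, "Chapter 1: What/Why?"))
```

A caller would create one `requests.Session()`, pass it to `fetch_text` for every chapter, and write the result with `open(safe_filename(i, title), 'w', encoding='utf-8')`.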
