Python: The Simplest Crawler, Scraping Novels with the Scrapy Framework

Author: System  Date: August 10, 2024  Categories: All, Crawlers  Word count: 878

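import scrapy


class MySpider(scrapy.Spider):
    name = 'novel_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/novels']

    def parse(self, response):
        # Extract the link to each novel from the listing page
        novel_urls = response.css('a.novel_link::attr(href)').getall()
        for url in novel_urls:
            # response.follow resolves relative URLs against the current page
            yield response.follow(url, callback=self.parse_novel)

    def parse_novel(self, response):
        # Extract the novel's title and body text
        title = response.css('h1.novel_title::text').get()
        # Join every text node inside the content div so the item holds plain text, not raw HTML
        content = ''.join(response.css('div.novel_content ::text').getall()).strip()
        yield {
            'title': title,
            'content': content,
        }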

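This simple example shows how to define a spider with the Scrapy framework, extract links from a page, and parse each novel page. In the parse_novel method, it extracts the novel's title and content and yields a dictionary containing them. The spider is named novel_spider, is restricted to the domain example.com, and starts from http://example.com/novels. In practice, you need to adjust the CSS selectors to match the structure of the target site.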
