Python Web Scraping with the Scrapy Framework, Part 24: A Complete Hands-On Distributed Crawler with scrapy_redis [Full XXTop250 Crawl]
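# Import the Scrapy and scrapy_redis components
import scrapy
from scrapy_redis.spiders import RedisSpider


# scrapy_redis has no items module, so the item class is defined here
class MovieItem(scrapy.Item):
    title = scrapy.Field()
    rating = scrapy.Field()
    link = scrapy.Field()
    cover = scrapy.Field()


class DoubanMovieSpider(RedisSpider):
    name = 'douban_movie'
    allowed_domains = ['movie.douban.com']
    # Redis key from which the spider reads its start URLs
    redis_key = 'douban_movie:start_urls'

    def parse(self, response):
        # Each <li> under .grid_view is one movie entry
        # (selectors are kept from the original; verify them against the live page)
        movie_list = response.css('.grid_view li')
        for movie in movie_list:
            title = movie.css('.title::text').extract_first()
            rating = movie.css('.rating_num::text').extract_first()
            link = movie.css('a::attr(href)').extract_first()
            cover = movie.css('.cover::attr(src)').extract_first()
            item = MovieItem()
            item['title'] = title
            item['rating'] = rating
            item['link'] = link
            item['cover'] = cover
            yield item

        # Follow the next-page link, if there is one
        next_page = response.css('.paginator a.next::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page, self.parse)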
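A RedisSpider does not start from a start_urls list. It waits until a URL appears under the Redis key named by redis_key. The following is a minimal sketch of seeding that key from Python. The helper name seed_start_urls, the start URL, and the local Redis address are assumptions for illustration and do not come from the original code; the sketch also assumes the scrapy_redis default of storing start URLs in a Redis list.

```python
def seed_start_urls(client, key, urls):
    """Push each URL onto the Redis list `key` and return how many were pushed.

    `client` is any object with an lpush(key, *values) method,
    for example a redis.Redis instance from the redis-py package.
    """
    urls = list(urls)
    if urls:
        # LPUSH accepts several values in one call
        client.lpush(key, *urls)
    return len(urls)


# Usage against a local Redis server (address and URL are assumptions):
#   import redis
#   client = redis.Redis(host="localhost", port=6379)
#   seed_start_urls(client, "douban_movie:start_urls",
#                   ["https://movie.douban.com/top250"])
```

The same thing can be done from the command line with redis-cli lpush followed by the key and the URL. Every spider process connected to the same Redis instance picks up work from this shared key.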
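The DoubanMovieSpider code implements a simple distributed crawler, built on Scrapy-Redis, for the Top 250 movie ratings. It subclasses RedisSpider instead of scrapy.Spider, so its start URLs come from the Redis key named by redis_key rather than from a hard-coded list. Its parse method extracts each movie's title, rating, link, and cover image, and follows the next-page link to handle pagination. The example is an entry-level distributed crawler, suitable for learning how to crawl web data with Scrapy-Redis.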
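The spider class alone does not make the crawl distributed: the project settings also need to route scheduling and duplicate filtering through Redis. The original post does not show its settings, so the following is only a sketch of a typical settings.py fragment. The Redis address is an assumption, and the pipeline entry is optional.

```python
# settings.py (sketch): share the request queue and dupefilter through Redis

# Use the scrapy_redis scheduler so all spider processes share one queue
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across processes with a Redis-backed fingerprint set
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprints in Redis when a spider stops,
# so a crawl can be paused and resumed
SCHEDULER_PERSIST = True

# Optional: also store scraped items in Redis
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Assumed address of a local Redis server; change it for your environment
REDIS_URL = "redis://localhost:6379"
```

With these settings in place, the same spider can be started on several machines (for example with scrapy crawl douban_movie), and all of them take requests from the shared Redis queue.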