Getting Started with Scrapy: A Simple Static Crawler (Scraping Movies from 电影天堂)
Below is a simple Scrapy spider example that scrapes movie information from the 电影天堂 site. Note that a real crawler should respect the site's robots.txt and use a reasonable download delay between requests, to avoid putting excessive load on the server.
First, create a new Scrapy project:
scrapy startproject movie_spider
Then, define the item fields in items.py:
# movie_spider/items.py
import scrapy

class MovieItem(scrapy.Item):
    # define the fields for your item here, like:
    name = scrapy.Field()
    link = scrapy.Field()
    info = scrapy.Field()
    size = scrapy.Field()
    magnet = scrapy.Field()
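A scrapy.Item behaves like a dict with a fixed set of allowed keys. As a rough illustration of what one populated MovieItem might hold (a plain dict stands in for the Item here so the snippet runs without Scrapy installed, and all values are made up):

```python
import json

# A plain dict standing in for a populated MovieItem (illustrative values only)
item = {
    "name": "Example Movie",
    "link": "http://www.dy2018.com/i/example.html",
    "info": "A short plot summary...",
    "size": "1.2GB",
    "magnet": "magnet:?xt=urn:btih:0000000000000000000000000000000000000000",
}

# The pipeline later serializes items exactly like this, one JSON object per line
line = json.dumps(item, ensure_ascii=False)
print(line)
```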
Next, write the spider:
# movie_spider/spiders/movie_spider.py
import scrapy

from movie_spider.items import MovieItem

class MovieSpider(scrapy.Spider):
    name = 'movie_spider'
    allowed_domains = ['www.dy2018.com']
    start_urls = ['http://www.dy2018.com/']

    def parse(self, response):
        # Extract the links to the movie detail pages
        movie_links = response.css('div.index_movies ul li a::attr(href)').extract()
        for link in movie_links:
            movie_url = response.urljoin(link)
            yield scrapy.Request(movie_url, callback=self.parse_movie)

        # Pagination is simplified here: only the first page is crawled
        # next_page_url = response.css('a.next_page::attr(href)').extract_first()
        # if next_page_url:
        #     yield response.follow(next_page_url, self.parse)

    def parse_movie(self, response):
        item = MovieItem()
        # Extract the movie title
        item['name'] = response.css('div.content_info_top h1::text').extract_first()
        # Record the detail-page URL
        item['link'] = response.url
        # Extract the description text
        info = response.css('div.content_info p::text').extract()
        item['info'] = ''.join(info).strip()
        # Extract the download links and keep the first magnet link
        download_links = response.css('div.downlist a::attr(href)').extract()
        for link in download_links:
            if 'magnet' in link:
                item['magnet'] = link
                break
        # Extract the file-size / torrent text
        item['size'] = response.css('div.downlist a::text').extract_first()
        # Hand the item over to the pipeline
        yield item
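The link handling in parse and parse_movie can be exercised without a live crawl. A minimal sketch using urllib.parse.urljoin (the same joining logic response.urljoin wraps) on a hypothetical list of extracted hrefs:

```python
from urllib.parse import urljoin

base = 'http://www.dy2018.com/'
# Hypothetical hrefs, as the CSS selector might return them
hrefs = ['/i/100001.html', '/i/100002.html']
movie_urls = [urljoin(base, h) for h in hrefs]
print(movie_urls[0])  # http://www.dy2018.com/i/100001.html

# Keep only the first magnet link, as parse_movie does
download_links = [
    'ftp://example.com/movie.mkv',
    'magnet:?xt=urn:btih:abc123',
    'magnet:?xt=urn:btih:def456',
]
magnet = next((l for l in download_links if 'magnet' in l), None)
print(magnet)  # magnet:?xt=urn:btih:abc123
```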
Finally, save the scraped items. Write a pipeline in pipelines.py:
# movie_spider/pipelines.py
import json

class MoviePipeline(object):
    def process_item(self, item, spider):
        # Append each item as one JSON line; replace this with database code if needed
        with open('movies.json', 'a+', encoding='utf-8') as f:
            f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
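Because process_item appends one JSON object per line, the output file is in JSON Lines format and must be read back line by line rather than with a single json.load. A sketch of the round trip (using a temporary file instead of movies.json):

```python
import json
import os
import tempfile

items = [
    {"name": "Movie A", "size": "1.2GB"},
    {"name": "Movie B", "size": "900MB"},
]

path = os.path.join(tempfile.mkdtemp(), "movies.json")

# Append items the same way the pipeline does
with open(path, "a+", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

# Read back: one json.loads call per line
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # 2
```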
Then enable the pipeline in settings.py:

# movie_spider/settings.py
ITEM_PIPELINES = {
    'movie_spider.pipelines.MoviePipeline': 300,
}