Below is example code that uses the Scrapy framework to crawl image information from Dangdang (dangdang.com).
First, create a Scrapy project:
```bash
scrapy startproject dangdang_images
cd dangdang_images
```
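For orientation, `scrapy startproject` generates the standard scaffold sketched below; the files edited in the following steps live inside the inner `dangdang_images/` package, and the spider file is created by hand under `spiders/`:

```text
dangdang_images/
├── scrapy.cfg
└── dangdang_images/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── dangdang_spider.py   # created manually in the steps below
```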
Next, define an Item to store the scraped data:
```python
# items.py
import scrapy


class DangdangImageItem(scrapy.Item):
    # URLs of the images to download
    image_urls = scrapy.Field()
    # Local paths of the images after download
    image_paths = scrapy.Field()
```
Next, write the spider:
```python
# spiders/dangdang_spider.py
import scrapy

from ..items import DangdangImageItem


class DangdangSpider(scrapy.Spider):
    name = 'dangdang'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://category.dangdang.com/pg1-cid4002135.html']

    def parse(self, response):
        for product in response.css('ul.bigimg li'):
            item = DangdangImageItem()
            # Extract the image URLs; urljoin turns relative or protocol-relative
            # links into the absolute URLs the images pipeline requires
            image_urls = product.css('img::attr(src)').extract()
            item['image_urls'] = [response.urljoin(url) for url in image_urls]
            yield item

        # Extract the next-page link and follow it
        next_page = response.css('div.paging_next a::attr(href)').extract_first()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```
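Note that Dangdang's listing pages commonly lazy-load product thumbnails, so the real image URL may sit in a `data-original` attribute while `src` only holds a placeholder. This is an assumption about the page markup and should be verified in the browser; if it holds, the extraction inside the loop can be adjusted roughly as follows:

```python
# Prefer data-original (lazy-load attribute) and fall back to src;
# the attribute name is an assumption, inspect the live page to confirm.
image_urls = []
for img in product.css('img'):
    url = img.attrib.get('data-original') or img.attrib.get('src')
    if url:
        image_urls.append(response.urljoin(url))
item['image_urls'] = image_urls
```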
Finally, set up Scrapy's images pipeline to download the images (it relies on Pillow, the maintained fork of PIL):
```python
# pipelines.py
import os

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class DangdangImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Issue one download request per image URL on the item
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # Keep only the paths of images that downloaded successfully
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('All images failed downloading.')
        item['image_paths'] = image_paths
        return item

    def file_path(self, request, response=None, info=None, *, item=None):
        # Name the saved file after the last segment of the image URL
        image_name = os.path.basename(request.url.split('/')[-1])
        return 'full/{0}'.format(image_name)
```
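For reference, the `results` argument of `item_completed` is a list of `(success, info)` tuples; on success the dict holds the downloaded file's `url`, `path`, and `checksum`. The values below are illustrative only:

```python
# Illustrative shape of `results` (values are made up);
# for a failed download the second element is a Failure object instead of a dict.
results = [
    (True, {
        'url': 'http://img3x0.ddimg.cn/example.jpg',
        'path': 'full/example.jpg',
        'checksum': 'd41d8cd98f00b204e9800998ecf8427e',
    }),
]
```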
Enable the images pipeline and set the image storage path in settings.py:
```python
# settings.py
ITEM_PIPELINES = {
    'dangdang_images.pipelines.DangdangImagesPipeline': 1,
}
IMAGES_STORE = 'path_to_your_images_directory'
```
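The images pipeline depends on Pillow, so install it first with `pip install Pillow`. A few optional settings can also be added here; the values shown are illustrative, not requirements:

```python
# Optional tuning for the images pipeline (illustrative values)
IMAGES_EXPIRES = 90        # skip re-downloading images fetched within the last 90 days
IMAGES_MIN_HEIGHT = 50     # discard tiny placeholder images
IMAGES_MIN_WIDTH = 50
```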
The code above implements a simple Scrapy spider that collects image links from Dangdang product listing pages and downloads them through Scrapy's images pipeline, showing how to scrape images with Scrapy and save them locally.
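With everything in place, the crawl is started from the project root; the spider name matches the `name` attribute defined in the spider class:

```bash
scrapy crawl dangdang
```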