An outline of the Scrapy spider development workflow:
- Create a new Scrapy project:

    scrapy startproject myproject
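This generates the standard project skeleton, roughly as follows (the exact files may vary slightly between Scrapy versions):

    myproject/
        scrapy.cfg            # deployment configuration
        myproject/
            __init__.py
            items.py          # Item definitions go here
            middlewares.py    # spider and downloader middlewares
            pipelines.py     # Item Pipelines go here
            settings.py       # project settings
            spiders/          # your spiders live here
                __init__.py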
- Define the Item:

    import scrapy

    class MyItem(scrapy.Item):
        # declare the fields to be scraped
        name = scrapy.Field()
        price = scrapy.Field()
        stock = scrapy.Field()
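An Item behaves like a dict, except that only the declared fields are allowed; assigning to an undeclared key raises a KeyError, which catches typos early. A quick usage sketch:

    product = MyItem(name='Widget')
    product['price'] = '9.99'
    print(product['name'])    # 'Widget'
    product['color'] = 'red'  # raises KeyError: undeclared field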
- Write the spider:

    import scrapy
    from myproject.items import MyItem

    class MySpider(scrapy.Spider):
        name = 'my_spider'
        start_urls = ['http://example.com/']

        def parse(self, response):
            # yield one item per product block on the page
            for product_div in response.css('div.product'):
                product = MyItem()
                product['name'] = product_div.css('div.name ::text').get()
                product['price'] = product_div.css('div.price ::text').get()
                product['stock'] = product_div.css('div.stock ::text').get()
                yield product
            # follow the pagination link, if there is one
            next_page_url = response.css('div.pager a.next ::attr(href)').get()
            if next_page_url is not None:
                yield response.follow(next_page_url, self.parse)
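While writing selectors like these, it helps to try them interactively in scrapy shell before running the full spider (the URL below is just the placeholder from the example):

    scrapy shell 'http://example.com/'
    >>> response.css('div.product')
    >>> response.css('div.pager a.next ::attr(href)').get()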
- Write an Item Pipeline:

    class MyPipeline:
        def process_item(self, item, spider):
            # process the item here, e.g. clean it up or save it to a database
            return item
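To make "save it to a database" concrete, here is a minimal sketch of a pipeline that writes items to SQLite; the file, table, and column names are assumptions for this example:

    import sqlite3

    class SQLitePipeline:
        def open_spider(self, spider):
            # runs once when the spider starts
            self.conn = sqlite3.connect('products.db')  # hypothetical file name
            self.conn.execute(
                'CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT, stock TEXT)'
            )

        def process_item(self, item, spider):
            self.conn.execute(
                'INSERT INTO products VALUES (?, ?, ?)',
                (item.get('name'), item.get('price'), item.get('stock')),
            )
            return item

        def close_spider(self, spider):
            # runs once when the spider finishes
            self.conn.commit()
            self.conn.close()

Like MyPipeline, it would be enabled by adding 'myproject.pipelines.SQLitePipeline' to ITEM_PIPELINES in the next step.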
- Configure settings.py to enable the pipeline:

    ITEM_PIPELINES = {
        'myproject.pipelines.MyPipeline': 300,  # lower numbers run first (0-1000)
    }
- Run the spider:

    scrapy crawl my_spider
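During development it is often handy to dump the scraped items straight to a file using Scrapy's built-in feed exports:

    scrapy crawl my_spider -o products.json   # format inferred from the extension (.json/.csv/.jl)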
The steps above give a simplified overview of Scrapy spider development. Real sites usually require more work, such as adding request headers, handling cookies, parsing dynamically rendered pages, rotating the User-Agent, and dealing with anti-scraping measures; a short settings sketch for a few of these follows.
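As a minimal sketch, a few settings.py options commonly tweaked for such cases (the values are illustrative, not recommendations):

    # settings.py (illustrative values)
    USER_AGENT = 'Mozilla/5.0 (compatible; mybot/0.1)'  # static UA; rotation needs a middleware
    DEFAULT_REQUEST_HEADERS = {
        'Accept-Language': 'en',
    }
    DOWNLOAD_DELAY = 1            # pause between requests to the same site
    COOKIES_ENABLED = True        # Scrapy manages cookies by default
    AUTOTHROTTLE_ENABLED = True   # adapt the delay to server response times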