In Scrapy, the Python crawling framework, middleware is an extension mechanism that lets you hook into how the crawler's requests and responses are processed.
Below is a simple example of Scrapy downloader middlewares: one rotates the request's User-Agent, one attaches a proxy, one post-processes responses, and one customizes robots.txt handling:
import random
from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware
 
class RandomUserAgentMiddleware(object):
    """
    随机更换请求的User-Agent
    """
    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent
 
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            user_agent=crawler.settings.get('USER_AGENT')
        )
 
    def process_request(self, request, spider):
        # Pick a random User-Agent from USER_AGENT_LIST; fall back to the
        # project-wide USER_AGENT captured in from_crawler if the list is missing or empty.
        user_agent_list = spider.settings.getlist('USER_AGENT_LIST')
        user_agent = random.choice(user_agent_list) if user_agent_list else self.user_agent
        request.headers['User-Agent'] = user_agent
 
class ProxyMiddleware(object):
    """
    代理IP中间件
    """
    def process_request(self, request, spider):
        # Route the request through the proxy configured in settings, if one is set.
        proxy = spider.settings.get('PROXY')
        if proxy:
            request.meta['proxy'] = proxy
 
class CustomDownloaderMiddleware(object):
    """
    自定义下载器中间件
    """
    def process_response(self, request, response, spider):
        # 自定义处理下载器得到的响应
        return response
 
class CustomRobotsMiddleware(RobotsTxtMiddleware):
    """
    自定义的Robots.txt中间件
    """
    def process_request(self, request, spider):
        # Add custom robots.txt handling here before delegating to the built-in logic.
        return super().process_request(request, spider)
In this example we define three middlewares: RandomUserAgentMiddleware rotates the request's User-Agent, ProxyMiddleware sets a proxy, and CustomDownloaderMiddleware post-processes responses. We also subclass RobotsTxtMiddleware as CustomRobotsMiddleware to customize robots.txt handling.
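To make the process_response hook more concrete, here is a minimal sketch of a response-handling middleware. The 429 check, the one-retry limit, and the throttle_retried meta key are assumptions chosen purely for illustration and are not part of the example above; the sketch only relies on the documented contract that process_response returns a Response, returns a Request to be rescheduled, or raises IgnoreRequest:

class ThrottleAwareMiddleware(object):
    """
    Sketch: re-queue a request once if the server answers with HTTP 429.
    """
    def process_response(self, request, response, spider):
        if response.status == 429 and not request.meta.get('throttle_retried'):
            # Returning a Request tells Scrapy to discard this response and
            # send the new request through the downloader chain again.
            retry_request = request.replace(dont_filter=True)
            retry_request.meta['throttle_retried'] = True
            return retry_request
        # Returning the response hands it on to the next middleware and the spider.
        return response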
To use these middlewares in Scrapy, enable them in your project's settings.py, for example:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'myproject.middlewares.ProxyMiddleware': 410,
    'myproject.middlewares.CustomDownloaderMiddleware': 420,
    'myproject.middlewares.CustomRobotsMiddleware': 430,
}
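One caveat: Scrapy already enables scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware by default (at order 100), so the configuration above would run both the built-in middleware and CustomRobotsMiddleware. If the subclass is meant to replace the built-in behaviour rather than run alongside it, the usual approach is to disable the default entry by mapping it to None, roughly:

DOWNLOADER_MIDDLEWARES = {
    # ... the four entries above ...
    # Disable the built-in robots.txt middleware so only the custom subclass runs.
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None,
}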
 
USER_AGENT_LIST = [
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    # ... other User-Agent strings
]
 
PROXY = 'http://12.34.56.78:9010'
In this configuration, each middleware is assigned an order value. For process_request, middlewares with lower values run first (they sit closer to the engine), while those with higher values sit closer to the downloader and run first for process_response. USER_AGENT_LIST and PROXY likewise need to be defined in settings.py, as in the snippet above.
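To verify that the User-Agent rotation and the proxy actually take effect, one option is a throwaway spider that logs the headers the remote side received. The sketch below uses httpbin.org as an assumed header-echo service, and the spider name and the per-spider USER_AGENT_LIST override are likewise illustrative rather than part of the original project:

import scrapy


class HeaderCheckSpider(scrapy.Spider):
    name = 'header_check'
    start_urls = ['https://httpbin.org/headers']

    # custom_settings lets a single spider override project-wide settings,
    # e.g. to test with a one-entry USER_AGENT_LIST.
    custom_settings = {
        'USER_AGENT_LIST': [
            'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
        ],
    }

    def parse(self, response):
        # httpbin.org/headers echoes the request headers back as JSON, so the
        # User-Agent picked by RandomUserAgentMiddleware appears in the body.
        self.logger.info(response.text)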