Building an Efficient Crawler with Python Queues and the Producer-Consumer Pattern

Author: System | Date: August 7, 2024 | Category: All, Crawlers | Word count: 1146

This article was last modified 339 days ago; some of its content may be out of date.




import queue
import threading
import requests
from bs4 import BeautifulSoup
 
# Initialize a first-in, first-out (FIFO) queue
url_queue = queue.Queue()
 
def producer(url_queue, max_pages):
    """Page producer: put the URLs to crawl into the queue."""
    for i in range(max_pages):
        url_queue.put(f'https://example.com/page/{i+1}')
 
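def consumer(url_queue):
    """Page consumer: take URLs from the queue and crawl them."""
    while True:
        url = url_queue.get()
        try:
            # A timeout keeps a stalled server from blocking this worker forever
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                # Parse the page content
                soup = BeautifulSoup(response.text, 'html.parser')
                # Process soup and extract the data you need
                # ...
                print(f'Crawled: {url}')
        except requests.RequestException as exc:
            print(f'Failed: {url} ({exc})')
        finally:
            # Always mark the task done so url_queue.join() can return
            url_queue.task_done()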
 
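# Maximum number of pages to crawl
max_pages = 5
 
# Create the producer thread
producer_thread = threading.Thread(target=producer, args=(url_queue, max_pages))
producer_thread.start()
 
# Create 10 consumer threads; daemon=True lets the program exit even though
# each consumer loops forever
for _ in range(10):
    threading.Thread(target=consumer, args=(url_queue,), daemon=True).start()
 
# Make sure every URL has been enqueued, then wait for all tasks to complete
producer_thread.join()
url_queue.join()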

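This example uses Python's queue module to create a thread-safe queue that holds the URLs waiting to be crawled. The producer function adds page URLs to the queue, while the consumer function takes URLs from it, requests each page with the requests library, and parses the content with BeautifulSoup. The threading module supplies the threads that run the producer and the consumers side by side. Because crawling is I/O-bound, each thread spends most of its time waiting on the network, so running several consumers concurrently shortens the total crawl time.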
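One detail the pattern leaves open is how the consumers know when to stop. A common approach, sketched below, is to have the producer enqueue one sentinel value per consumer after the last URL. This is only an illustration under stated assumptions: `SENTINEL`, `fetch`, and `crawl` are hypothetical names that do not appear in the original code, and `fetch` is a stand-in for `requests.get` so the sketch runs without network access.

```python
import queue
import threading

SENTINEL = None  # hypothetical stop marker; any unique object would work


def fetch(url):
    """Offline stand-in for requests.get (an assumption for this sketch)."""
    return f"<html>{url}</html>"


def producer(url_queue, max_pages, num_consumers):
    """Enqueue the page URLs, then one sentinel per consumer."""
    for i in range(max_pages):
        url_queue.put(f"https://example.com/page/{i + 1}")
    for _ in range(num_consumers):
        url_queue.put(SENTINEL)


def consumer(url_queue, results, lock):
    """Process URLs until a sentinel arrives, then return."""
    while True:
        url = url_queue.get()
        try:
            if url is SENTINEL:
                return
            html = fetch(url)
            # The lock makes the shared-list update explicit and safe
            with lock:
                results.append((url, len(html)))
        finally:
            # Runs for sentinels too, so the queue's task count stays balanced
            url_queue.task_done()


def crawl(max_pages=5, num_consumers=3):
    """Run one producer and several consumers; return the sorted results."""
    url_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=consumer, args=(url_queue, results, lock))
        for _ in range(num_consumers)
    ]
    for worker in workers:
        worker.start()
    producer(url_queue, max_pages, num_consumers)
    # Every consumer receives exactly one sentinel, so every join() returns
    for worker in workers:
        worker.join()
    return sorted(results)
```

Calling `crawl(max_pages=5, num_consumers=3)` returns one `(url, length)` pair per page, and every worker thread has exited by the time it returns. Compared with daemon threads, sentinels let each consumer finish its current page before shutting down.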
