Multithreaded Crawling with Queue and Multiprocess Crawling with multiprocessing

This article was last modified 338 days ago, so some of its content may be out of date.

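In Python, you can use queue.Queue to build a multithreaded crawler, and the multiprocessing module to spread the work across multiple processes. Here is a simple example that shows both tools in use: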




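import queue
import threading
import multiprocessing
import requests

# Shared worker loop: drain URLs from the queue until it is empty
def crawl_from_queue(q):
    while True:
        try:
            # get_nowait() avoids the race between empty() and get()
            url = q.get_nowait()
        except queue.Empty:
            break
        try:
            response = requests.get(url, timeout=10)
            # For this example we simply print the response body
            print(response.text)
        except requests.RequestException as exc:
            print(f"Failed to fetch {url}: {exc}")

# Multithreaded crawler function
def threaded_crawler(q):
    crawl_from_queue(q)

# Multiprocess crawler function
def multiprocess_crawler(q):
    crawl_from_queue(q)

# Main program
def main():
    # 10 example URLs
    urls = [f"http://example.com/{i}" for i in range(10)]

    # Multithreaded crawl: queue.Queue is shared by the threads of one process
    thread_queue = queue.Queue()
    for url in urls:
        thread_queue.put(url)

    threads = []
    for _ in range(5):  # use 5 threads
        t = threading.Thread(target=threaded_crawler, args=(thread_queue,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()  # wait for every thread to finish

    # Multiprocess crawl: queue.Queue cannot be pickled, so use a Manager queue
    with multiprocessing.Manager() as manager:
        process_queue = manager.Queue()
        for url in urls:
            process_queue.put(url)
        # Create a process pool and pass the same queue proxy to each task
        with multiprocessing.Pool(processes=5) as pool:
            pool.map(multiprocess_crawler, [process_queue] * 5)

if __name__ == "__main__":
    main()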

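This example uses a queue to hold the URLs waiting to be crawled, then starts several threads and several processes; the intent is for each worker to take URLs from the queue and fetch them. Keep in mind that queue.Queue is only shared between the threads of a single process. It holds thread locks and cannot be pickled, so it cannot be handed to multiprocessing.Pool workers; a process-safe queue such as multiprocessing.Manager().Queue() is needed there.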
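To make the worker pattern easier to try without touching the network, here is a minimal sketch of the thread side only. The names fake_fetch, worker, and crawl_with_threads are hypothetical and not part of the example above; fake_fetch simply stands in for requests.get so the code runs offline.

```python
import queue
import threading

def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url).text, so no network is needed
    return f"body of {url}"

def worker(tasks, results):
    # Keep taking URLs until the task queue is empty
    while True:
        try:
            url = tasks.get_nowait()
        except queue.Empty:
            return
        results.put((url, fake_fetch(url)))

def crawl_with_threads(urls, num_threads=5):
    tasks = queue.Queue()
    results = queue.Queue()
    for url in urls:
        tasks.put(url)

    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # All workers have finished, so draining the result queue here is safe
    collected = {}
    while not results.empty():
        url, body = results.get()
        collected[url] = body
    return collected

if __name__ == "__main__":
    pages = crawl_with_threads([f"http://example.com/{i}" for i in range(10)])
    print(len(pages), "pages collected")
```

Collecting results in a second queue, rather than printing inside the worker, keeps the output of different threads from interleaving and lets the caller decide what to do with each page.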

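Note that this is only a simplified example; a real crawler will likely need more robust error handling, request optimization, and a distribution strategy. In addition, because a crawler can violate a site's robots.txt rules and the server's limits on concurrent requests, make sure its behavior complies with the site's policies and throttle the request rate appropriately.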
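As a rough illustration of those two precautions, the sketch below combines a robots.txt check with a simple shared throttle. The Throttle class, the build_robot_parser helper, the sample robots.txt text, and the "my-crawler" user agent are all assumptions made for this illustration; only RobotFileParser comes from the standard library. A real crawler would load the site's actual robots.txt instead of a hard-coded string.

```python
import threading
import time
from urllib.robotparser import RobotFileParser

class Throttle:
    """Hypothetical helper: at most one request per min_interval seconds, across threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay > 0:
            time.sleep(delay)

def build_robot_parser(robots_txt):
    # Parse robots.txt rules from a string instead of downloading them
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# Sample rules, assumed for this sketch only
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    robots = build_robot_parser(SAMPLE_ROBOTS)
    throttle = Throttle(min_interval=0.5)
    for url in ["http://example.com/1", "http://example.com/private/2"]:
        if not robots.can_fetch("my-crawler", url):
            print("Skipping disallowed URL:", url)
            continue
        throttle.wait()
        print("Would fetch:", url)  # a real crawler would call requests.get(url) here
```

Each worker thread would call throttle.wait() just before its request, so the limit applies to the crawler as a whole rather than per thread. Sharing one limit across processes would need a different mechanism, since this object lives in a single process.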
