Integrating Scrapy and Selenium (All in One Article)

Author: System | Date: August 16, 2024 | Categories: All, Web Scraping | Word count: 1138





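import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider process its own arguments first
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()  # or another browser driver

    def parse(self, response):
        # Load the same URL in the browser so that its JavaScript runs
        self.driver.get(response.url)
        wait = WebDriverWait(self.driver, 10)

        # Suppose a button must be clicked to load more content;
        # wait until it is clickable instead of looking it up immediately
        click_button = wait.until(
            EC.element_to_be_clickable((By.ID, 'click_button'))
        )
        click_button.click()

        # Wait for the content to finish loading
        content_element = wait.until(
            EC.presence_of_element_located((By.ID, 'content'))
        )

        # Extract the content
        content = content_element.text

        yield {
            'content': content,
            # other fields to extract
        }

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes
        self.driver.quit()  # close the browser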

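This example shows how to initialize a Selenium WebDriver inside a Scrapy spider and use it during the crawl to simulate user interaction (clicking a button to load more content). When the crawl finishes, the browser is closed so that no browser process is left behind and resources are not leaked. The approach suits websites whose content is loaded dynamically by JavaScript.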
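
The point about closing the browser can also be enforced outside a spider, for instance when trying out a page interactively. The following is a minimal sketch, not part of the original example: `managed_driver` is a hypothetical helper name, and `FakeDriver` is a stand-in that only mimics `get()` and `quit()` so the sketch runs without a real browser. The idea is simply that `quit()` sits in a `finally` clause, so it runs even when the scraping code raises.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver by calling factory() and always quit it on exit.

    Hypothetical helper: `factory` is any zero-argument callable that
    returns an object with a quit() method, e.g. webdriver.Chrome.
    """
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and also when the body raises
        driver.quit()


class FakeDriver:
    """Stand-in for a real WebDriver (assumption, for illustration only)."""

    def __init__(self):
        self.url = None
        self.closed = False

    def get(self, url):
        # A real driver would load the page here
        self.url = url

    def quit(self):
        # A real driver would shut down the browser process here
        self.closed = True


# Usage: the driver is quit as soon as the with-block ends
with managed_driver(FakeDriver) as driver:
    driver.get("http://example.com")
```

With Selenium installed, passing `webdriver.Chrome` instead of `FakeDriver` should behave the same way. Inside a Scrapy spider, the spider's own shutdown hook remains the natural place to call `quit()`, since the driver has to live for the whole crawl.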
