Python Web Scraping: Scrapy

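Scrapy is an open-source, cross-platform Python framework for writing web crawlers. The simple spider below collects every link on a page and, if the site is paginated, follows the link to the next page.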

First, install Scrapy:




pip install scrapy

Then create a new Scrapy project:




scrapy startproject myspider

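Next, define your spider. Save the code below in a file inside the project's spiders directory, for example myspider/myspider/spiders/myspider.py (the file name is only an example):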




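import scrapy


class MySpider(scrapy.Spider):
    # Name used on the command line: scrapy crawl myspider
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Emit one item per href found in an <a> element on this page
        for url in response.css('a::attr(href)').getall():
            yield {'url': url}

        # Follow the link to the next page, assuming the pagination link is an <a> with class "next"
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)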
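
Note that parse yields each href exactly as it appears in the page, so many of the collected values will be relative (for example /about or page2.html). The sketch below uses only the standard library to show how such values resolve against the URL of the page they came from. Inside a spider, response.urljoin and response.follow handle this for you. The helper name to_absolute and the sample URLs are made up for illustration.

```python
from urllib.parse import urljoin


def to_absolute(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical page URL and hrefs, chosen only to show the different cases
page_url = "http://example.com/articles/index.html"
hrefs = ["/about", "page2.html", "https://other.example/x", "#top"]

for absolute in to_absolute(page_url, hrefs):
    print(absolute)
```

If you want absolute URLs in the output file, one option is to yield {'url': response.urljoin(url)} in parse instead of the raw href.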

Run the spider from inside the project directory:




scrapy crawl myspider -o links.csv

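This spider extracts every link on the start URL (http://example.com) and writes them to `links.csv`. If the site is paginated, the spider also follows the next-page link and repeats the process. This is only a simple example; a real spider usually needs more involved parsing and processing tailored to the structure of the target site.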
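
Because the same link often appears on several pages, the exported file is likely to contain duplicates. As a rough sketch of one possible post-processing step, the snippet below reads the url column and keeps each value once, in first-seen order. It assumes the CSV has a header row named url, matching the item key used in the spider. The function name unique_links and the sample rows are invented for illustration.

```python
import csv
import io


def unique_links(csv_file):
    """Return the distinct values of the 'url' column, in first-seen order."""
    seen = set()
    ordered = []
    for row in csv.DictReader(csv_file):
        url = (row.get("url") or "").strip()
        if url and url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


# In-memory stand-in for the exported file (made-up rows). With a real crawl:
#   with open("links.csv", newline="", encoding="utf-8") as f:
#       links = unique_links(f)
sample = io.StringIO("url\n/about\n/contact\n/about\nhttp://example.com/page2\n")
links = unique_links(sample)
print(links)
```

Passing a file object rather than a path keeps the function easy to try out on in-memory data before pointing it at the real export.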
