[Python] Web Crawling with Scrapy
Scrapy is a Python framework for building web crawlers. The following example walks through creating a simple spider with Scrapy.
First, install Scrapy:
pip install scrapy
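If the installation succeeded, running scrapy version should print the installed Scrapy version:
scrapy version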
Create a new Scrapy project:
scrapy startproject myspider
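This generates a project skeleton roughly like the following (the exact files can vary slightly between Scrapy versions):
myspider/
    scrapy.cfg            # deploy configuration
    myspider/             # the project's Python package
        __init__.py
        items.py          # item definitions (edited below)
        middlewares.py
        pipelines.py
        settings.py       # project settings
        spiders/          # spider modules go here
            __init__.py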
Define the item fields in items.py:
import scrapy

class MyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
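As a quick sketch of how the item behaves (the field values here are made up purely for illustration), an Item can be built and read much like a dict:
from myspider.items import MyItem

item = MyItem(title='Example post', link='http://example.com/posts/1')
item['desc'] = 'A short description'   # fields can also be set like dict keys
print(item['title'])                   # -> Example post
print(dict(item))                      # items convert cleanly to plain dicts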
Write the spider (spiders/myspider.py):
import scrapy

from myspider.items import MyItem

class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/posts']

    def parse(self, response):
        # Extract the title, link and description of each post on the page
        for post in response.css('.post-link'):
            yield MyItem(
                title=post.css('a ::text').extract_first(),
                link=post.css('a ::attr(href)').extract_first(),
                desc=post.css('.post-desc ::text').extract_first(),
            )
        # Follow the pagination link, if present, and parse the next page the same way
        next_page = response.css('.next-page ::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
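Before running the spider, a few options in myspider/settings.py are often worth adjusting; the values below are only illustrative, not required for this example:
# myspider/settings.py (illustrative values)
ROBOTSTXT_OBEY = True       # respect the target site's robots.txt (the default in new projects)
DOWNLOAD_DELAY = 1          # wait one second between requests to avoid hammering the site
USER_AGENT = 'myspider (+http://example.com)'   # identify the crawler to the server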
Run the spider from inside the project directory:
scrapy crawl myspider -o items.json
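As an alternative to the command line, here is a rough sketch of running the same spider from a plain Python script with Scrapy's CrawlerProcess; the hypothetical run_spider.py is assumed to sit next to scrapy.cfg so the project settings can be located:
# run_spider.py - a minimal sketch, not the only way to run a spider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myspider.spiders.myspider import MySpider

settings = get_project_settings()
settings.set('FEEDS', {'items.json': {'format': 'json'}})  # same output as -o items.json

process = CrawlerProcess(settings)
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes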
This spider crawls the blog posts on example.com, extracting each post's title, link, and description, and saves the results to a JSON file. It is only a simple example; a real crawler will usually need more elaborate parsing tailored to the structure of the target site.