[Python] Web Crawling with Scrapy
Scrapy is a Python framework for building web crawlers. The following example walks through creating a simple spider with Scrapy.
First, install Scrapy:
pip install scrapy
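If the installation succeeded, running scrapy version should print the installed Scrapy version:
scrapy version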
Create a new Scrapy project:
scrapy startproject myspider
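This generates a project skeleton roughly like the following (the exact files can vary slightly between Scrapy versions):
myspider/
    scrapy.cfg            # deploy configuration
    myspider/             # the project's Python package
        __init__.py
        items.py          # item definitions (edited below)
        middlewares.py
        pipelines.py
        settings.py       # project settings
        spiders/          # spider modules go here
            __init__.py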
Define the item fields in items.py:
import scrapy

class MyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
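As a quick sketch of how the item behaves (the field values here are made up purely for illustration), an Item can be built and read much like a dict:
from myspider.items import MyItem

item = MyItem(title='Example post', link='http://example.com/posts/1')
item['desc'] = 'A short description'   # fields can also be set like dict keys
print(item['title'])                   # -> Example post
print(dict(item))                      # items convert cleanly to plain dicts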
Write the spider (spiders/myspider.py):
import scrapy

from myspider.items import MyItem

class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/posts']

    def parse(self, response):
        # Extract the title, link and description of each post on the page
        for post in response.css('.post-link'):
            yield MyItem(
                title=post.css('a ::text').extract_first(),
                link=post.css('a ::attr(href)').extract_first(),
                desc=post.css('.post-desc ::text').extract_first(),
            )
        # Follow the pagination link, if present, and parse the next page the same way
        next_page = response.css('.next-page ::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
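Before running the spider, a few options in myspider/settings.py are often worth adjusting; the values below are only illustrative, not required for this example:
# myspider/settings.py (illustrative values)
ROBOTSTXT_OBEY = True       # respect the target site's robots.txt (the default in new projects)
DOWNLOAD_DELAY = 1          # wait one second between requests to avoid hammering the site
USER_AGENT = 'myspider (+http://example.com)'   # identify the crawler to the server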
Run the spider from inside the project directory:
scrapy crawl myspider -o items.json
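As an alternative to the command line, here is a rough sketch of running the same spider from a plain Python script with Scrapy's CrawlerProcess; the hypothetical run_spider.py is assumed to sit next to scrapy.cfg so the project settings can be located:
# run_spider.py - a minimal sketch, not the only way to run a spider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myspider.spiders.myspider import MySpider

settings = get_project_settings()
settings.set('FEEDS', {'items.json': {'format': 'json'}})  # same output as -o items.json

process = CrawlerProcess(settings)
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes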
This spider crawls the blog posts on example.com, extracting each post's title, link, and description, and saves the results to a JSON file. It is only a simple example; a real crawler will usually need more elaborate parsing tailored to the structure of the target site.