Campus Data Collection with Scrapy and MongoDB
Below is a simplified example showing how to use Scrapy and MongoDB to scrape data from a university website.
First, install the required libraries:
pip install scrapy pymongo
Then create a Scrapy project and generate a spider:
scrapy startproject campus_scraper
cd campus_scraper
scrapy genspider university_spider university.edu
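These commands create the standard Scrapy project layout; the exact files may vary slightly by Scrapy version, but it generally looks like this:

campus_scraper/
    scrapy.cfg
    campus_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            university_spider.py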
Define the data items you want to scrape in campus_scraper/items.py:
import scrapy

class CampusItem(scrapy.Item):
    # One scraped page: its headline, URL, and a short description
    title = scrapy.Field()
    link = scrapy.Field()
    description = scrapy.Field()
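Scrapy items behave much like dictionaries, which is why the pipeline below can convert them with dict(item). A quick illustration (the field values here are made up):

item = CampusItem(title='Admissions', link='http://www.university.edu/1234/')
item['description'] = 'Application deadlines and requirements.'
print(dict(item))  # {'title': 'Admissions', 'link': '...', 'description': '...'}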
Write the crawler logic in campus_scraper/spiders/university_spider.py:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from campus_scraper.items import CampusItem

class UniversitySpider(CrawlSpider):
    name = 'university_spider'
    allowed_domains = ['university.edu']
    start_urls = ['http://www.university.edu/']

    # Follow links whose path contains a numeric segment (e.g. article IDs)
    # and hand each matched page to parse_item
    rules = (
        Rule(LinkExtractor(allow=r'/\d+/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = CampusItem()
        item['title'] = response.xpath('//h1/text()').get()
        item['link'] = response.url
        item['description'] = response.xpath('//div[@id="content"]/p/text()').get()
        return item
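Before running the full crawl, you can sanity-check the XPath expressions interactively with Scrapy's shell (using the placeholder domain from this example):

scrapy shell 'http://www.university.edu/'
>>> response.xpath('//h1/text()').get()
>>> response.xpath('//div[@id="content"]/p/text()').get()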
Set up a MongoDB pipeline in campus_scraper/pipelines.py to store the scraped data:
import pymongo

class MongoPipeline:
    collection_name = 'campus_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read connection details from the project settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store each scraped item as a document in the collection
        self.db[self.collection_name].insert_one(dict(item))
        return item
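As written, process_item inserts a new document on every crawl, so re-running the spider duplicates records. A common variation, sketched here rather than taken from the original example, is to replace MongoPipeline.process_item with an upsert keyed on the page URL:

    def process_item(self, item, spider):
        # Replace the existing document for this URL, or insert a new one
        self.db[self.collection_name].update_one(
            {'link': item['link']},
            {'$set': dict(item)},
            upsert=True,
        )
        return item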
Configure the MongoDB URI and database in settings.py:
# Replace user:password with real credentials for your MongoDB instance
MONGO_URI = 'mongodb://user:password@localhost:27017'
MONGO_DATABASE = 'my_database'

ITEM_PIPELINES = {
    'campus_scraper.pipelines.MongoPipeline': 300,
}
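While editing settings.py, it is worth setting a few crawl-behavior options as well; the values below are illustrative defaults rather than requirements:

ROBOTSTXT_OBEY = True               # honor the site's robots.txt rules
DOWNLOAD_DELAY = 1                  # pause one second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # limit parallel requests per domain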
Finally, run the spider:
scrapy crawl university_spider
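If you also want a local copy of the results alongside MongoDB, Scrapy's feed exports can write the items to a file; the -O flag (overwrite output, available since Scrapy 2.1) takes the destination path:

scrapy crawl university_spider -O items.json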
This example shows how to use Scrapy and MongoDB to store scraped data. It assumes you already have the appropriate permissions to crawl the target site.