【Python】Six Common Web Crawler Examples
A web crawler is a program that automatically extracts data from web pages, and Python is one of the most popular languages for writing one. Below are six common Python crawler examples with their source code.
- Scraping a simple page with the requests and BeautifulSoup libraries:
import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
r = requests.get(url)                          # fetch the page
soup = BeautifulSoup(r.text, 'html.parser')    # parse the HTML
print(soup.prettify())                         # pretty-print the parsed tree
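Beyond pretty-printing, BeautifulSoup can pull out specific elements. A minimal sketch continuing from the soup object above (the choice of tag and attributes is illustrative):

for link in soup.find_all('a'):
    # print each link's target URL and its visible text
    print(link.get('href'), link.get_text(strip=True))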
- Scraping more complex sites with the Scrapy framework (see the launch sketch after the spider):
import scrapy

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        # follow every link on the page and parse each target recursively
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
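Spiders normally live inside a Scrapy project and are launched with `scrapy crawl example`. As a sketch, a standalone script can also drive the spider above with CrawlerProcess:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
process.crawl(MySpider)    # the spider class defined above
process.start()            # blocks until the crawl finishes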
- Scraping JavaScript-rendered pages with Selenium (PhantomJS is deprecated and its driver was removed from Selenium, so headless Chrome is used here instead):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')    # render pages without opening a browser window
driver = webdriver.Chrome(options=options)
driver.get('http://example.com')
print(driver.page_source)             # the HTML after JavaScript has executed
driver.quit()
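JavaScript-heavy pages may still be rendering when page_source is read. An explicit wait avoids scraping too early; a sketch, with the h1 tag as an assumed target element:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# wait up to 10 seconds for an <h1> element to appear before reading the page
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'h1'))
)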
- Storing scraped data in a MySQL database with pymysql:
import pymysql

conn = pymysql.connect(host='localhost', user='user', password='passwd',
                       db='db', charset='utf8mb4')
cur = conn.cursor()
data = ('scraped value',)    # placeholder for the data produced by the crawler
# `column` is backtick-quoted because COLUMN is a reserved word in MySQL
sql = "INSERT INTO example (`column`) VALUES (%s)"
cur.execute(sql, data)
conn.commit()
cur.close()
conn.close()
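When a crawl yields many rows, cursor.executemany batches the inserts into one call. A minimal sketch, assuming the connection above is still open and the rows list is hypothetical:

rows = [('first value',), ('second value',)]    # hypothetical scraped rows
with conn.cursor() as cur:                      # pymysql cursors support the with statement
    cur.executemany("INSERT INTO example (`column`) VALUES (%s)", rows)
conn.commit()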
- Scraping pages asynchronously with aiohttp (well suited to crawling large numbers of pages):
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://example.com')
        print(html)

asyncio.run(main())    # asyncio.run supersedes the deprecated get_event_loop pattern
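The real payoff of aiohttp is concurrency: asyncio.gather runs many fetches at once over a single session. A sketch reusing the fetch coroutine above, with an illustrative URL list:

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        # launch all requests concurrently and wait for every response
        return await asyncio.gather(*(fetch(session, u) for u in urls))

pages = asyncio.run(fetch_all(['http://example.com', 'http://example.org']))
print(len(pages))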
- Storing the crawl queue and the deduplication set in Redis with redis-py:
import json
import redis
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
redis_key = settings['REDIS_ITEM_KEY']    # list key configured in the Scrapy settings
redis_conn = redis.StrictRedis(host='localhost', port=6379, db=0)

def process_item(item):
    # serialize the item and push it onto the Redis list that acts as the queue
    redis_conn.lpush(redis_key, json.dumps(dict(item)))
    return item
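For the deduplication side, a Redis set works well: SADD returns 1 only the first time a member is added, so it doubles as a seen-check. A sketch using the connection above, with a hypothetical 'seen_urls' key:

def is_new_url(url):
    # sadd returns 1 if the URL was newly added, 0 if it was already in the set
    return redis_conn.sadd('seen_urls', url) == 1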
These examples cover a range of crawling approaches and techniques; choose the one that best fits your actual needs.