Ten Python Crawler Libraries by Example: Ten Frameworks, Ten Ways to Implement a Crawler
Below are example snippets using different Python crawler libraries.
- A simple HTML-parsing crawler using the requests-html library:
from requests_html import HTMLSession
session = HTMLSession()
url = 'http://example.com'
response = session.get(url)
# Parse the response and extract the page title
title = response.html.find('title', first=True)
print(title.text)
- Parsing HTML content with BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = 'http://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Extract the page title
title = soup.find('title')
print(title.string)
- Parsing XML or HTML content with lxml:
from lxml import etree
import requests
url = 'http://example.com'
response = requests.get(url)
tree = etree.HTML(response.text)
# Extract the page title via XPath
title = tree.xpath('//title/text()')
print(title[0])
- Creating a simple crawler project with the Scrapy framework:
scrapy startproject myspider
cd myspider
scrapy genspider example example.com
Edit myspider/spiders/example.py:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    def parse(self, response):
        # Extract the page title
        title = response.css('title::text').get()
        print(title)
Run the spider:
scrapy crawl example
- Crawling JavaScript-rendered pages with Selenium (the original used PhantomJS, which is deprecated and no longer supported by modern Selenium; headless Chrome is the usual replacement):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')
driver = webdriver.Chrome(options=options)
driver.get('http://example.com')
# Extract the page title after JavaScript has rendered
title = driver.title
print(title)
driver.quit()
- Using the pyspider framework:
pyspider all
Open http://localhost:5000 in a browser and create a crawler project; pyspider generates a starter script for you automatically.
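The generated starter follows pyspider's handler template. A minimal sketch of what such a script looks like (the URL and the returned field name are illustrative; this runs inside the pyspider runtime, not standalone):

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # Schedule the entry URL for crawling
        self.crawl('http://example.com', callback=self.index_page)

    def index_page(self, response):
        # Extract the page title with pyspider's built-in PyQuery-style selector
        return {'title': response.doc('title').text()}
```

You edit and debug this handler in pyspider's web UI, which shows the fetched page and the extracted result side by side.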
- Asynchronous network requests with the aiohttp library:
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://example.com')
        print(html)

asyncio.run(main())
- Crawling pages with the Grab framework:
from grab import Grab
g = Grab()
g.go('http://example.com')
# Extract the page title (Grab's select() takes an XPath expression)
print(g.doc.select('//title').text())
- jQuery-style HTML parsing with the PyQuery library:
from pyquery import PyQuery as pq
import requests
url = 'http://example.com'
response = requests.get(url)
doc = pq(response.text)
# Extract the page title
title = doc('title')
print(title.text())