Ten Python Crawler Libraries by Example: Ten Frameworks, Ten Ways to Implement a Crawler
Below are example snippets using different Python crawler libraries.
- A simple HTML-parsing crawler using the requests-html library:
from requests_html import HTMLSession
session = HTMLSession()
url = 'http://example.com'
response = session.get(url)
# Parse the response and extract the page title
title = response.html.find('title', first=True)
print(title.text)
- Parsing HTML content with BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = 'http://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Extract the page title
title = soup.find('title')
print(title.string)
- Parsing XML or HTML content with lxml:
from lxml import etree
import requests
url = 'http://example.com'
response = requests.get(url)
tree = etree.HTML(response.text)
# Extract the page title via XPath
title = tree.xpath('//title/text()')
print(title[0])
- Creating a simple crawler project with the Scrapy framework:
scrapy startproject myspider
cd myspider
scrapy genspider example example.com
Edit myspider/spiders/example.py:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    def parse(self, response):
        # Extract the page title
        title = response.css('title::text').get()
        print(title)
Run the spider:
scrapy crawl example
- Crawling JavaScript-rendered pages with Selenium (the original used PhantomJS, which is deprecated and no longer supported by modern Selenium; headless Chrome is the usual replacement):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')
driver = webdriver.Chrome(options=options)
driver.get('http://example.com')
# Extract the page title after JavaScript has rendered
title = driver.title
print(title)
driver.quit()
- Using the pyspider framework:
pyspider all
Open http://localhost:5000 in a browser and create a crawler project; pyspider generates a starter script for you automatically.
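The generated starter follows pyspider's handler template. A minimal sketch of what such a script looks like (the URL and the returned field name are illustrative; this runs inside the pyspider runtime, not standalone):

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # Schedule the entry URL for crawling
        self.crawl('http://example.com', callback=self.index_page)

    def index_page(self, response):
        # Extract the page title with pyspider's built-in PyQuery-style selector
        return {'title': response.doc('title').text()}
```

You edit and debug this handler in pyspider's web UI, which shows the fetched page and the extracted result side by side.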
- Asynchronous network requests with the aiohttp library:
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://example.com')
        print(html)

asyncio.run(main())
- Crawling pages with the Grab framework:
from grab import Grab
g = Grab()
g.go('http://example.com')
# Extract the page title (Grab's select() takes an XPath expression)
print(g.doc.select('//title').text())
- jQuery-style HTML parsing with the PyQuery library:
from pyquery import PyQuery as pq
import requests
url = 'http://example.com'
response = requests.get(url)
doc = pq(response.text)
# Extract the page title
title = doc('title')
print(title.text())