Exploring Requests-HTML: A Powerful Web Scraping Library for Python

Author: System | Date: August 19, 2024 | Categories: All, html | Word count: 936

from requests_html import HTMLSession

# Create an HTMLSession so cookies persist and connections are kept alive
session = HTMLSession()

# The URL to scrape
url = 'http://example.com/'

# Send the request with the get method
response = session.get(url)
 
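# Check the response status
if response.status_code == 200:
    # Print the page title (the HTML object has no .title attribute, so query the <title> element)
    title = response.html.find('title', first=True)
    print(title.text if title is not None else '(no title)')

    # Find and print every link on the page
    for link in response.html.links:
        print(link)

    # Find and print the src of every image (the HTML object has no .images attribute)
    for image in response.html.find('img'):
        print(image.attrs.get('src'))

    # Use .find to locate a specific element; it returns None when nothing matches
    container = response.html.find('#container', first=True)
    if container is not None:
        print(container.text)

    # Use .render to execute the page's JavaScript (downloads Chromium on first use)
    response.html.render()

    # Save the rendered HTML to a file (the HTML object has no .save method)
    with open('example.com.html', 'w', encoding='utf-8') as f:
        f.write(response.html.html)
else:
    print('Failed to retrieve the webpage')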
 
# Clean up: close the session
session.close()

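This code shows how to use the requests-html library to fetch a simple web page, extract its title, links, and images, and then render and save the page. The library wraps HTML parsing behind a small API (CSS selectors through find, XPath through xpath, and link extraction through links and absolute_links), which keeps simple scrapers short and easy to write.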
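
The image sources printed by the code above may be relative paths. requests-html offers absolute_links for anchors, but it has no equivalent for images, so one option is to resolve them yourself. The following is a minimal sketch that uses only the standard library's urllib.parse.urljoin; the helper name absolutize and the sample values are hypothetical and chosen purely for illustration.

```python
from urllib.parse import urljoin


def absolutize(base_url, sources):
    """Resolve each (possibly relative) source against base_url, skipping empty values."""
    return [urljoin(base_url, src) for src in sources if src]


# Hypothetical src values, as they might be read from <img> elements on a page
sample_sources = [
    '/static/logo.png',                     # root-relative path
    'img/photo.jpg',                        # path relative to the page
    'https://cdn.example.org/banner.png',   # already absolute
    None,                                   # an <img> without a src attribute
]

resolved = absolutize('http://example.com/blog/', sample_sources)
for url in resolved:
    print(url)
```

In the scraper above, the same helper could be fed with the page URL and the src attribute of each element returned by response.html.find('img').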
