A Practical Guide to Web Scraping with Python

Author: System | Date: August 16, 2024 | Categories: All, Crawlers | Word count: 1119





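import os

import requests
from bs4 import BeautifulSoup


def get_soup(url):
    """
    Fetch the given URL and return a BeautifulSoup object, or None on a non-200 response.
    """
    # A timeout keeps the script from hanging on an unresponsive server.
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        return BeautifulSoup(response.text, 'html.parser')
    return None


def get_download_urls(soup):
    """
    Extract all image download links from a BeautifulSoup object.
    """
    # Assumes image links live in the href attribute of <a> tags and end in .jpg.
    # href=True skips <a> tags without an href, which would otherwise raise KeyError.
    return [tag['href'] for tag in soup.find_all('a', href=True) if tag['href'].endswith('.jpg')]


def download_images(download_urls, path='images/'):
    """
    Save each image in the list of download links to a local folder.
    """
    os.makedirs(path, exist_ok=True)  # create the target folder if it is missing
    for index, url in enumerate(download_urls):
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            with open(os.path.join(path, f'image_{index}.jpg'), 'wb') as file:
                file.write(response.content)


# Example usage
url = 'http://example.com/gallery'
soup = get_soup(url)
if soup is not None:  # get_soup returns None on a non-200 response
    download_urls = get_download_urls(soup)
    download_images(download_urls)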

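This code is a simplified example that shows how to use the requests and BeautifulSoup libraries to fetch a web page, extract image links from it, and save those images to a local folder. It assumes that every image link is an absolute URL ending in .jpg. This workflow is a basic application of web scraping and works well as a first exercise for beginners who want to understand and practice writing a crawler.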
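On real pages, href values are often relative (for example `photos/cat.jpg`), and passing them straight to requests.get would fail. The sketch below shows one way to handle that with the standard library's urllib.parse. The helper name `resolve_image_links`, its `extensions` parameter, and the sample links are assumptions made for illustration and are not part of the original code.

```python
from urllib.parse import urljoin, urlparse


def resolve_image_links(page_url, hrefs, extensions=('.jpg',)):
    """Turn raw href values into absolute image URLs (hypothetical helper)."""
    resolved = []
    for href in hrefs:
        # Relative hrefs are resolved against the URL of the page they came from.
        absolute = urljoin(page_url, href)
        # Check the extension on the URL path only, so a query string does not hide it.
        if urlparse(absolute).path.lower().endswith(extensions):
            resolved.append(absolute)
    return resolved


# Sample hrefs as they might appear in a gallery page (made-up values).
links = resolve_image_links(
    'http://example.com/gallery',
    ['photos/cat.jpg', '/static/dog.JPG?size=large', 'http://example.com/about'],
)
print(links)
```

The list this helper returns could be handed to download_images in place of the raw href values. Comparing the lowercased path rather than the full URL is a design choice that also accepts uppercase extensions and links that carry query parameters.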
