A Practical Guide to Web Scraping in Python
```python
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    """
    Fetch the given URL and return a BeautifulSoup object.
    """
    response = requests.get(url)
    if response.status_code == 200:
        return BeautifulSoup(response.text, 'html.parser')
    else:
        return None
```
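The plain `requests.get` call above can hang indefinitely on a slow server, and some sites reject clients that do not send a browser-like `User-Agent`. A hardened variant might look like the sketch below; the header value, timeout, and function name are illustrative choices, not requirements:

```python
import requests
from bs4 import BeautifulSoup

# Illustrative placeholder User-Agent; some sites filter out the default
# "python-requests" client string.
DEFAULT_HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; tutorial-bot/1.0)'}

def get_soup_safe(url, headers=DEFAULT_HEADERS, timeout=10):
    """Like get_soup, but with a timeout and explicit error handling."""
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx instead of checking status_code by hand
    except requests.RequestException:
        # covers connection errors, timeouts, and HTTP error statuses alike
        return None
    return BeautifulSoup(response.text, 'html.parser')
```

`raise_for_status` plus a single `except requests.RequestException` handles connection failures, timeouts, and HTTP errors in one place, so the caller only ever has to check for `None`.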
```python
def get_download_urls(soup):
    """
    Extract all image download links from a BeautifulSoup object.
    """
    # Assumes the image links live in the href attribute of <a> tags
    # and that the images use the .jpg extension. href=True skips
    # <a> tags without an href attribute (avoids a KeyError).
    download_urls = [tag['href'] for tag in soup.find_all('a', href=True)
                     if tag['href'].endswith('.jpg')]
    return download_urls
```
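One limitation of `get_download_urls` is that many pages use relative links such as `/img/cat.jpg`, which `requests.get` cannot fetch directly. The sketch below resolves them against the page URL with `urllib.parse.urljoin`; the function name and extension list are my own choices for illustration:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_absolute_download_urls(soup, base_url, extensions=('.jpg', '.jpeg', '.png')):
    """Variant of get_download_urls that resolves relative links against base_url."""
    # href=True skips <a> tags without an href attribute (avoids a KeyError);
    # lower() makes the extension check case-insensitive.
    return [urljoin(base_url, tag['href'])
            for tag in soup.find_all('a', href=True)
            if tag['href'].lower().endswith(extensions)]

# Small self-contained demonstration on an HTML snippet:
html = '<a href="/img/cat.jpg">cat</a> <a href="dog.PNG">dog</a> <a>no href</a>'
soup = BeautifulSoup(html, 'html.parser')
print(get_absolute_download_urls(soup, 'http://example.com/gallery/'))
# → ['http://example.com/img/cat.jpg', 'http://example.com/gallery/dog.PNG']
```

Note how `urljoin` handles both cases: a root-relative path replaces everything after the host, while a bare filename is resolved relative to the gallery directory.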
```python
import os

def download_images(download_urls, path='images/'):
    """
    Save the images behind each download link to a local folder.
    """
    os.makedirs(path, exist_ok=True)  # create the target folder if it doesn't exist
    for index, url in enumerate(download_urls):
        response = requests.get(url)
        if response.status_code == 200:
            with open(f'{path}image_{index}.jpg', 'wb') as file:
                file.write(response.content)
```
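`download_images` reads each response fully into memory and hard-codes the `.jpg` extension. A streaming variant, sketched here with illustrative helper names, writes each file in chunks and keeps whatever extension the URL carries:

```python
import os
import requests
from urllib.parse import urlparse

def filename_from_url(url, index):
    """Build a local filename, keeping the URL's extension instead of hard-coding .jpg."""
    ext = os.path.splitext(urlparse(url).path)[1] or '.jpg'  # fall back to .jpg if none
    return f'image_{index}{ext}'

def download_images_streamed(download_urls, path='images/'):
    os.makedirs(path, exist_ok=True)
    for index, url in enumerate(download_urls):
        try:
            with requests.get(url, stream=True, timeout=10) as response:
                response.raise_for_status()
                with open(os.path.join(path, filename_from_url(url, index)), 'wb') as f:
                    # write in chunks so a large image is never held in memory at once
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
        except requests.RequestException:
            continue  # skip unreachable images instead of aborting the whole run
```

The `stream=True` / `iter_content` pattern matters once images are more than a few megabytes; the per-URL `try`/`except` means one dead link no longer stops the rest of the downloads.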
```python
# Example usage
url = 'http://example.com/gallery'
soup = get_soup(url)
if soup is not None:  # the request may have failed
    download_urls = get_download_urls(soup)
    download_images(download_urls)
```
This code is a simplified example of how to use the requests and BeautifulSoup libraries to fetch a web page, parse out its image links, and save the images to a local folder. It covers the basic workflow of web scraping and makes a good first exercise for beginners learning to build crawlers.