Notes: writing a ZH scraper in Python, in practice
import requests
from bs4 import BeautifulSoup
import os
def get_html(url):
    """Fetch the page HTML; return None on any failure."""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except requests.RequestException:
        return None
def parse_html(html):
    """Parse the page and extract the title and body text."""
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('.entry-header h1')[0].text
    content = soup.select('.entry')[0].text
    return title, content
def save_note(title, content, save_dir):
    """Save the note as <title>.txt under save_dir."""
    os.makedirs(save_dir, exist_ok=True)  # create the directory if it does not exist
    with open(os.path.join(save_dir, f'{title}.txt'), 'w', encoding='utf-8') as f:
        f.write(content)
def main(url):
    html = get_html(url)
    if html is None:  # fetch failed; nothing to parse
        return
    title, content = parse_html(html)
    save_note(title, content, 'notes')
if __name__ == '__main__':
    url = 'http://example.com/some-article'  # replace with the target article's URL
    main(url)
This code implements a simple pipeline for scraping a page and saving its text content. First, get_html fetches the page with the requests library. Next, parse_html parses the HTML with BeautifulSoup and extracts the title and body. Then save_note writes the extracted title and content to the given directory. Finally, main chains these functions together into a minimal scraping flow.
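To check the extraction step without hitting a live page, the selectors parse_html relies on ('.entry-header h1' for the title, '.entry' for the body) can be tried on a small inline snippet. This is just a sketch: the markup below is hypothetical, and the built-in 'html.parser' stands in for lxml so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup mirroring the structure the scraper assumes.
sample = """
<div class="entry-header"><h1>Hello</h1></div>
<div class="entry"><p>Body text.</p></div>
"""

soup = BeautifulSoup(sample, 'html.parser')
title = soup.select('.entry-header h1')[0].text          # -> 'Hello'
content = soup.select('.entry')[0].get_text(strip=True)  # -> 'Body text.'
print(title)
print(content)
```

If the target site uses different class names, only these two selector strings need to change; the rest of the pipeline stays the same.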