Python Web Scraping: The Road from Getting Started to Getting Locked Up, Part 1
import requests
from bs4 import BeautifulSoup

# Fetch a web page and return its HTML, or None on failure
def crawl_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        else:
            return None
    except requests.RequestException:
        return None

# Parse the page and extract the information we want
def parse_soup(soup):
    # Suppose the information we want is the page title
    title = soup.find('title')
    return title.text if title else None

# Main function: wires the crawling and parsing steps together
def main():
    url = 'https://www.example.com'  # Replace with the URL you want to crawl
    html = crawl_page(url)
    if html:
        soup = BeautifulSoup(html, 'html.parser')
        parsed_info = parse_soup(soup)
        if parsed_info:
            print(f"The title of the page is: {parsed_info}")
        else:
            print("No useful data was found.")
    else:
        print("Failed to retrieve the webpage.")

if __name__ == "__main__":
    main()
This snippet shows how to use Python's requests library to fetch a web page, and how to use BeautifulSoup to parse it and extract information. The crawl_page function sends the HTTP request and retrieves the page content, while the parse_soup function parses the HTML and pulls out the data we need. Finally, main ties the two steps together and adds error handling and printed output.
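In practice you usually want the request itself to be a little more defensive. Here is a minimal hardened sketch of crawl_page; the timeout value, the User-Agent string, and the use of raise_for_status() are additions for illustration, not part of the original snippet:

import requests

def crawl_page(url, timeout=10):
    # Illustrative User-Agent; many sites reject requests with no UA at all
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; demo-crawler/0.1)'}
    try:
        # timeout keeps the request from hanging forever on a dead server
        response = requests.get(url, headers=headers, timeout=timeout)
        # raise_for_status() turns 4xx/5xx responses into exceptions,
        # so one except clause covers both network and HTTP-level errors
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return None

The behavior is the same as before on success; the difference is that slow servers and error status codes now fail fast instead of silently returning a body you probably don't want.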
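Likewise, parse_soup only grabs the <title> tag, but the same soup object can yield much more. As a quick hedged sketch of BeautifulSoup's find_all (parse_links and the sample HTML here are made up for demonstration):

from bs4 import BeautifulSoup

def parse_links(soup):
    # find_all('a') returns every anchor tag; get('href') reads the
    # attribute and returns None for anchors that have no href
    return [a.get('href') for a in soup.find_all('a') if a.get('href')]

html = '<html><body><a href="/a">A</a><a>no href</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print(parse_links(soup))  # ['/a']

Swapping parse_soup for a function like this is all it takes to turn the title-printer into the seed of a link crawler.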