Python Web Scraping: The Road from Getting Started to Getting Locked Up, Part 1
import requests
from bs4 import BeautifulSoup

# Fetch a web page and return its HTML, or None on failure
def crawl_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        else:
            return None
    except requests.RequestException:
        return None

# Parse the page and extract the information we want
def parse_soup(soup):
    # Suppose the information we want is the page title
    title = soup.find('title')
    return title.text if title else None

# Main function: wires the crawling and parsing steps together
def main():
    url = 'https://www.example.com'  # Replace with the URL you want to crawl
    html = crawl_page(url)
    if html:
        soup = BeautifulSoup(html, 'html.parser')
        parsed_info = parse_soup(soup)
        if parsed_info:
            print(f"The title of the page is: {parsed_info}")
        else:
            print("No useful data was found.")
    else:
        print("Failed to retrieve the webpage.")

if __name__ == "__main__":
    main()
This snippet shows how to use Python's requests library to fetch a web page, and how to use BeautifulSoup to parse it and extract information. The crawl_page function sends the HTTP request and retrieves the page content, while the parse_soup function parses the HTML and pulls out the data we need. Finally, main ties the two steps together and adds error handling and printed output.
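In practice you usually want the request itself to be a little more defensive. Here is a minimal hardened sketch of crawl_page; the timeout value, the User-Agent string, and the use of raise_for_status() are additions for illustration, not part of the original snippet:

import requests

def crawl_page(url, timeout=10):
    # Illustrative User-Agent; many sites reject requests with no UA at all
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; demo-crawler/0.1)'}
    try:
        # timeout keeps the request from hanging forever on a dead server
        response = requests.get(url, headers=headers, timeout=timeout)
        # raise_for_status() turns 4xx/5xx responses into exceptions,
        # so one except clause covers both network and HTTP-level errors
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return None

The behavior is the same as before on success; the difference is that slow servers and error status codes now fail fast instead of silently returning a body you probably don't want.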
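Likewise, parse_soup only grabs the <title> tag, but the same soup object can yield much more. As a quick hedged sketch of BeautifulSoup's find_all (parse_links and the sample HTML here are made up for demonstration):

from bs4 import BeautifulSoup

def parse_links(soup):
    # find_all('a') returns every anchor tag; get('href') reads the
    # attribute and returns None for anchors that have no href
    return [a.get('href') for a in soup.find_all('a') if a.get('href')]

html = '<html><body><a href="/a">A</a><a>no href</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print(parse_links(soup))  # ['/a']

Swapping parse_soup for a function like this is all it takes to turn the title-printer into the seed of a link crawler.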