23 Open-Source Python Crawler Projects with Code

Author: System | Date: August 11, 2024 | Categories: All, Crawlers | Word count: 894

This article was last modified 323 days ago, so some of its content may be out of date.

Below is a code example for the second open-source Python crawler project:




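import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_links(url):
    """Return the absolute URLs of all links found on the page at `url`."""
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException as exc:
        print('Error:', exc)
        return []
    if response.status_code != 200:
        print('Error:', response.status_code)
        return []
    soup = BeautifulSoup(response.text, 'html.parser')
    # href=True keeps only <a> tags that actually have an href attribute;
    # urljoin resolves relative links against the current page.
    return [urljoin(url, a['href']) for a in soup.find_all('a', href=True)]

def crawl(url, depth=2):
    """Depth-first crawl from `url`, following links at most `depth` levels deep."""
    stack = [(url, 0)]  # (URL, distance from the start page)
    visited = set()
    while stack:
        url, level = stack.pop()  # pop() takes the newest entry, giving DFS order
        if url in visited:
            continue
        visited.add(url)
        print('Crawling:', url)
        if level < depth:
            # Only expand pages that are still inside the depth limit.
            stack.extend((link, level + 1) for link in get_links(url))

crawl('https://www.example.com', depth=2)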

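This code implements a simple web crawler. Starting from a seed URL, it follows the links found on each page in depth-first order (the list of pending URLs is used as a stack), and the depth parameter bounds how far the crawl goes. It uses the requests library to send HTTP requests and BeautifulSoup to parse the HTML. It does not check robots.txt, rate-limit its requests, or retry failures, so treat it as a starting point for learning web scraping rather than a production crawler.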
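The traversal described above can also be tried without any network access. The following is a minimal sketch under a few assumptions: `FAKE_SITE` is a hypothetical in-memory site invented for illustration, `fetch` is a hypothetical callable standing in for `requests.get`, and the standard library's `HTMLParser` is used in place of BeautifulSoup so the example runs without third-party packages. It shows two details that are easy to get wrong: resolving relative links with `urljoin`, and tracking each URL's depth explicitly.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Resolve each href against base_url and drop any #fragment."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, h)).url for h in parser.hrefs]


def crawl(start_url, fetch, max_depth=2):
    """Depth-first crawl; fetch(url) returns HTML text, or None on failure."""
    stack = [(start_url, 0)]  # (URL, distance from the start page)
    seen = set()
    visited = []              # pages in the order they were crawled
    while stack:
        url, depth = stack.pop()
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        visited.append(url)
        html = fetch(url)
        if html is None:
            continue
        for link in extract_links(url, html):
            stack.append((link, depth + 1))
    return visited


# Hypothetical site used only for illustration (not real pages).
FAKE_SITE = {
    "https://www.example.com/": '<a href="/a">A</a> <a href="b.html#top">B</a>',
    "https://www.example.com/a": '<a href="/a/deep">deep</a> <a href="/">home</a>',
    "https://www.example.com/b.html": "<a>no href</a>",
    "https://www.example.com/a/deep": '<a href="/a/deeper">deeper</a>',
    "https://www.example.com/a/deeper": "",
}

pages = crawl("https://www.example.com/", FAKE_SITE.get, max_depth=2)
print(pages)
```

With `max_depth=2` the page at `/a/deeper` is three links away from the start page and is therefore skipped. One caveat of this design: a page is marked as seen on its first visit, so a depth-first order may reach it by a longer path first. Swapping the stack for a `collections.deque` and using `popleft()` gives breadth-first order, which always reaches each page by its shortest path.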
