[Python] A data crawler that respects robots.txt rules

Author: System | Date: August 19, 2024 | Categories: All, Crawlers | Word count: 1455





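import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def fetch_website_data(url, keyword):
    headers = {
        'User-Agent': 'Your User Agent',
        'From': 'your-email@example.com'  # Optional: contact email for the crawler's operator
    }
    try:
        # robots.txt lives at the site root; urljoin handles URLs that include a path
        robots_url = urljoin(url, '/robots.txt')
        rt_response = requests.get(robots_url, headers=headers, timeout=10)
        if rt_response.ok:
            robots_txt = rt_response.text
            # Crude heuristic, not a real robots.txt parser: refuse to crawl if the
            # file contains any Disallow rule and mentions the keyword anywhere
            if 'Disallow' in robots_txt and keyword in robots_txt:
                print(f'Sorry, the keyword "{keyword}" is not allowed to be crawled according to robots.txt')
                return None

        # Send the HTTP request for the page itself
        response = requests.get(url, headers=headers, timeout=10)
        if response.ok:
            # Parse the page with BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')
            return soup
        else:
            print(f'Error fetching data: {response.status_code}')
            return None
    except requests.exceptions.RequestException as e:
        # Covers connection errors, timeouts, and invalid URLs
        print(f'An error occurred: {e}')
        return None

# Usage example
url = 'https://www.example.com'
soup = fetch_website_data(url, 'Crawler')
if soup is not None:
    # Parse and extract data from soup here
    print(soup.title)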

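This code first downloads the site's robots.txt file. If the file contains a Disallow directive and also mentions the given keyword, the function refuses to crawl and returns None. Note that this is a plain substring match, not a full interpretation of the robots.txt rules. Otherwise, it fetches the page with the requests library and parses it with BeautifulSoup. The usage example at the end shows how to call fetch_website_data and handle the returned data. In real use, replace 'Your User Agent' and 'your-email@example.com' with valid values, and set the url variable to the site you want to crawl.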
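Because the keyword check is only a substring match, it can block pages that are allowed and miss pages that are disallowed. One possible alternative is a minimal sketch built on the standard library's urllib.robotparser, which evaluates the rules per user agent and per path. The names robots_url_for, can_crawl, SAMPLE_ROBOTS_TXT, and the user agent string 'MyCrawler' are assumptions made for illustration and do not appear in the original code.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def robots_url_for(url):
    """Return the robots.txt URL for the site that serves `url` (hypothetical helper)."""
    parts = urlparse(url)
    return f'{parts.scheme}://{parts.netloc}/robots.txt'


def can_crawl(url, user_agent):
    """Download robots.txt and ask whether `user_agent` may fetch `url`.

    Hypothetical helper; it performs a network request, so it is not called below.
    """
    parser = RobotFileParser()
    parser.set_url(robots_url_for(url))
    parser.read()
    return parser.can_fetch(user_agent, url)


# Offline demonstration with a made-up robots.txt, so the rules are visible
SAMPLE_ROBOTS_TXT = """
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

print(parser.can_fetch('MyCrawler', 'https://www.example.com/index.html'))         # True
print(parser.can_fetch('MyCrawler', 'https://www.example.com/private/data.html'))  # False
print(parser.can_fetch('BadBot', 'https://www.example.com/index.html'))            # False
```

In a crawler like the one above, a call such as can_crawl(url, headers['User-Agent']) could stand in for the keyword test before the page request is sent, assuming the User-Agent value is the name the site's rules are written against.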
