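Scraping a news site and saving it to a CSV file, using 红网 (Rednet) as the example: V2.0 (console version), upgraded with custom search keywords and date ranges, with detailed comments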

This article was last modified 433 days ago, so some of its content may be out of date.




import requests
from bs4 import BeautifulSoup
import csv
import time
import random
 
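def get_soup(url):
    """
    Fetch a page and return it as a BeautifulSoup object, or None on failure.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
        # A timeout keeps one stalled request from hanging the whole crawl.
        response = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException:
        return None
    if response.status_code == 200:
        # Pass raw bytes so BeautifulSoup can detect the page encoding itself.
        return BeautifulSoup(response.content, 'html.parser')
    return None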
 
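def get_news_info(news_url, soup):
    """
    Extract the title, publish time, source and body text from an article page.
    A field whose selector matches nothing comes back as an empty string.
    """
    def text_of(selector):
        # select_one returns None instead of raising when nothing matches.
        node = soup.select_one(selector)
        return node.get_text().strip() if node else ''

    title = text_of('.news-title')
    # Assumes the info line starts with the publish time and ends with the source.
    info = text_of('.news-info').split()
    publish_time = info[0] if info else ''
    author = info[-1] if info else ''
    content = text_of('.content')
    return title, publish_time, author, content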
 
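def main(key_word, start_date, end_date, output_file):
    """
    Main function: collect article links from the search pages, then save each article to CSV.
    """
    # NOTE: this search URL is on people.com.cn, not Rednet; point it (and the
    # CSS selectors used in this script) at the site you actually crawl.
    base_url = 'http://www.people.com.cn/search/search_new.jsp?keyword={}&startdate={}&enddate={}&pageno={}'
    max_news = 100  # crawl at most 100 articles
    news_urls = set()  # a set drops duplicate links
    for i in range(1, 11):  # crawl at most the first 10 result pages
        if len(news_urls) >= max_news:
            break  # enough links collected, so stop requesting result pages
        # Percent-encode the keyword so spaces or '&' cannot corrupt the query string.
        url = base_url.format(requests.utils.quote(key_word), start_date, end_date, i)
        soup = get_soup(url)
        if soup is None:
            continue
        for news in soup.select('.news-list li'):
            if len(news_urls) >= max_news:
                break
            link = news.select_one('a[href]')  # skip list items without a link
            if link is not None:
                news_urls.add(link['href'])
    with open(output_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['标题', '发布时间', '来源', '内容'])  # title, publish time, source, content
        for url in news_urls:
            soup = get_soup(url)
            if soup is not None:
                title, publish_time, author, content = get_news_info(url, soup)
                writer.writerow([title, publish_time, author, content])
            time.sleep(random.uniform(0.5, 1.5))  # random wait to reduce the risk of an IP ban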
 
if __name__ == '__main__':
    key_word = input("Enter a search keyword: ")
    start_date = input("Enter the start date (format YYYY-MM-DD): ")
    end_date = input("Enter the end date (format YYYY-MM-DD): ")
    output_file = 'rednet_news_{}_{}_{}.csv'.format(key_word, start_date, end_date)
    main(key_word, start_date, end_date, output_file)

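This code implements basic news scraping and saves the results to a CSV file. The user enters a search keyword, a start date and an end date at the console; the program then crawls the news that matches the keyword within that date range and writes it to the CSV file. Compared with V1.0, this version adds custom search keywords and a random wait between requests, which lowers the risk of an IP ban.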
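The script takes the keyword and both dates from the console and places them in the search URL as typed. The following is a minimal sketch, not part of the original script, of how that input could be checked before crawling. The helper names `parse_date`, `validate_range` and `build_search_url` are hypothetical, the query parameter names simply mirror the `base_url` template, and whether the target site accepts them is an assumption.

```python
from datetime import datetime
from urllib.parse import urlencode


def parse_date(text):
    """Parse a YYYY-MM-DD string into a date; raises ValueError if it is not a valid date."""
    return datetime.strptime(text.strip(), "%Y-%m-%d").date()


def validate_range(start_text, end_text):
    """Return (start, end) as date objects and reject a reversed range."""
    start, end = parse_date(start_text), parse_date(end_text)
    if start > end:
        raise ValueError(f"start date {start} is after end date {end}")
    return start, end


def build_search_url(endpoint, key_word, start, end, page):
    """Build a search URL whose query parameters are percent-encoded.

    The parameter names (keyword, startdate, enddate, pageno) are copied from
    the script's URL template; they are an assumption about the target site.
    """
    query = urlencode({
        "keyword": key_word,
        "startdate": start.isoformat(),
        "enddate": end.isoformat(),
        "pageno": page,
    })
    return f"{endpoint}?{query}"
```

For example, `validate_range("2023-01-01", "2023-01-31")` returns two `date` objects that can be passed to `build_search_url`, while a reversed range or an impossible date such as `2023-02-30` raises `ValueError` before any request is sent.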
