Scraping a news site and saving the results to a CSV file, using Rednet (红网) as the example — V2.0 (console version), upgraded with a user-defined search keyword and date range, with detailed comments
import requests
from bs4 import BeautifulSoup
import csv
import time
import random


def get_soup(url):
    """
    Fetch a page and return a BeautifulSoup object, or None if the request fails.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return BeautifulSoup(response.text, 'html.parser')
    return None
def get_news_info(news_url, soup):
    """
    Extract title, publish time, source and body text from an article page.
    """
    title = soup.select('.news-title')[0].text
    info = soup.select('.news-info')[0].text
    publish_time = info.split()[0]   # first token of the info line is the publish time
    author = info.split()[-1]        # last token is the source/author
    content = soup.select('.content')[0].text.strip()
    return title, publish_time, author, content
def main(key_word, start_date, end_date, output_file):
    """
    Main entry point: collect the result links first, then scrape each article and write it to CSV.
    """
    base_url = 'http://www.people.com.cn/search/search_new.jsp?keyword={}&startdate={}&enddate={}&pageno={}'
    news_urls = set()  # store article links in a set to avoid duplicates
    for i in range(1, 11):  # only crawl the first 10 result pages
        if len(news_urls) >= 100:  # stop paging once 100 articles have been collected
            break
        url = base_url.format(key_word, start_date, end_date, i)
        soup = get_soup(url)
        if soup is None:
            continue
        for news in soup.select('.news-list li'):
            if len(news_urls) >= 100:  # cap the crawl at 100 articles
                break
            news_url = news.select('a')[0]['href']
            news_urls.add(news_url)
    with open(output_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['标题', '发布时间', '来源', '内容'])
        for url in news_urls:
            soup = get_soup(url)
            if soup is not None:
                title, publish_time, author, content = get_news_info(url, soup)
                writer.writerow([title, publish_time, author, content])
            time.sleep(random.uniform(0.5, 1.5))  # random delay to reduce the risk of an IP ban
if __name__ == '__main__':
    key_word = input("请输入关键词进行搜索:")                # prompt for the search keyword
    start_date = input("请输入开始日期(格式YYYY-MM-DD):")   # start date, YYYY-MM-DD
    end_date = input("请输入结束日期(格式YYYY-MM-DD):")     # end date, YYYY-MM-DD
    output_file = 'rednet_news_{}_{}_{}.csv'.format(key_word, start_date, end_date)
    main(key_word, start_date, end_date, output_file)
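As a quick way to check the output (not part of the original script), the generated file can be read back with the standard csv module. The sketch below uses a hypothetical file name of the form the script produces; replace it with the file that was actually written.

import csv

output_file = 'rednet_news_关键词_2024-01-01_2024-01-31.csv'  # hypothetical name for illustration

with open(output_file, newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)  # field names come from the header row written by main()
    for row in reader:
        # each row maps the CSV headers to the scraped values
        print(row['标题'], row['发布时间'], row['来源'])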
This code implements basic news scraping and saves the results to a CSV file. The user enters a search keyword plus start and end dates at the console, and the program crawls the news matching that keyword within the given date range and writes it to CSV. Compared with V1.0, this version adds a user-defined search keyword and inserts a random delay between requests to reduce the risk of the IP being blocked.
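The random delay lowers the request rate, but a single failed request still silently drops an article. As one possible extension (not part of the script above), get_soup could send its requests through a requests.Session configured with automatic retries; the sketch below uses the standard HTTPAdapter / urllib3 Retry combination, and the retry counts and status codes are illustrative choices rather than values from the original code.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_retry_session():
    """Build a requests.Session that retries transient failures with exponential backoff."""
    retry = Retry(
        total=3,                                     # illustrative: retry up to 3 times
        backoff_factor=1,                            # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # retry on throttling / server errors
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

# Inside get_soup, session.get(url, headers=headers) would then replace requests.get(url, headers=headers).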