[Web Scraping in Practice] Scraping the Qidian Hot Rankings with Python: Never Run Out of Novels to Read!

This article was last modified 708 days ago, so some of its content may be out of date.




import requests
from bs4 import BeautifulSoup
import pandas as pd
 
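# Proxy settings. These are placeholders: replace user, password, host,
# and port with real values before running, since 'port' is not a valid port.
proxies = {
    'http': 'http://user:password@proxy.server.com:port',
    'https': 'https://user:password@proxy.server.com:port',
}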
 
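# Fetch the HTML of the Qidian ranking page
def get_novel_ranking(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    # The timeout keeps the request from hanging forever on a dead proxy
    response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    # Fail early on HTTP errors instead of parsing an error page
    response.raise_for_status()
    return response.text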
 
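# Parse the page and extract the details of each ranked novel.
# The selectors assume the ranking table layout at the time of writing.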
def parse_ranking_data(html):
    soup = BeautifulSoup(html, 'lxml')
    novel_list = []
    for item in soup.select('tr[id^="rankList-"]'):
        data = {
            'ranking': item.select_one('td:nth-of-type(1)').text.strip(),
            'novel': item.select_one('td:nth-of-type(2) a').text.strip(),
            'author': item.select_one('td:nth-of-type(3) a').text.strip(),
            'type': item.select_one('td:nth-of-type(4)').text.strip(),
            'latest_chapter': item.select_one('td:nth-of-type(5) a').text.strip(),
            'latest_update': item.select_one('td:nth-of-type(6)').text.strip(),
        }
        novel_list.append(data)
    return novel_list
 
# Save the data to a CSV file
def save_to_csv(data, filename):
    df = pd.DataFrame(data)
    # utf-8-sig writes a BOM so that Excel displays Chinese text correctly
    df.to_csv(filename, index=False, encoding='utf-8-sig')
 
# Main entry point
def main():
    url = 'https://www.qidian.com/rank'
    html = get_novel_ranking(url)
    novel_data = parse_ranking_data(html)
    save_to_csv(novel_data, '起点小说热榜.csv')
 
if __name__ == '__main__':
    main()

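This code first defines a dictionary of proxy servers. It then defines get_novel_ranking, which fetches the Qidian ranking page using the requests library and that proxy configuration. Next comes parse_ranking_data, which uses BeautifulSoup and CSS selectors to extract each novel's rank, title, author, genre, latest chapter, and last update time. Finally, save_to_csv writes the parsed results to a CSV file, and main calls these functions in order to run the whole scrape-and-save process.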
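The CSS-selector step is the part most likely to break when the page layout changes, because select_one returns None for a missing element and calling .text on None raises AttributeError. The sketch below shows one way to make the extraction tolerant of missing cells. It is only an illustration: SAMPLE_HTML is made-up markup that imitates the table structure the selectors above expect (it is not the real Qidian page), and cell_text and parse_rows are hypothetical helper names that do not appear in the original script. It uses Python's built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the layout the selectors expect (an assumption,
# not the real Qidian page). The second row has no author link on purpose.
SAMPLE_HTML = """
<table>
  <tr id="rankList-1">
    <td>1</td>
    <td><a href="#">Sample Novel</a></td>
    <td><a href="#">Sample Author</a></td>
    <td>Fantasy</td>
    <td><a href="#">Chapter 10</a></td>
    <td>2023-01-01</td>
  </tr>
  <tr id="rankList-2">
    <td>2</td>
    <td><a href="#">Another Novel</a></td>
    <td></td>
    <td>Sci-Fi</td>
    <td><a href="#">Chapter 3</a></td>
    <td>2023-01-02</td>
  </tr>
</table>
"""

# Maps each output field to the CSS selector used in the original script.
FIELD_SELECTORS = {
    'ranking': 'td:nth-of-type(1)',
    'novel': 'td:nth-of-type(2) a',
    'author': 'td:nth-of-type(3) a',
    'type': 'td:nth-of-type(4)',
    'latest_chapter': 'td:nth-of-type(5) a',
    'latest_update': 'td:nth-of-type(6)',
}


def cell_text(row, selector):
    """Return the stripped text of the first match, or '' if nothing matches."""
    node = row.select_one(selector)
    return node.get_text(strip=True) if node is not None else ''


def parse_rows(html):
    """Extract one dict per ranking row, tolerating missing cells."""
    soup = BeautifulSoup(html, 'html.parser')
    return [
        {field: cell_text(row, selector) for field, selector in FIELD_SELECTORS.items()}
        for row in soup.select('tr[id^="rankList-"]')
    ]


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

With this approach a row that lacks one element yields an empty string for that field instead of aborting the whole run, which makes it easier to spot layout changes in the resulting CSV.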
