Exploring the sina_weibo_crawler Project on GitHub: A Sina Weibo Crawler
This project is a crawler written in Python for scraping user or topic data from Sina Weibo. Below is an example of the core crawling code:
import requests
from bs4 import BeautifulSoup
import re
import time
import random


def get_page(url):
    """Fetch a page and return its HTML text, or None on failure."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None


def parse_page(html, user_id):
    """Extract each post and its metrics from one page of HTML."""
    if html is None:
        return
    soup = BeautifulSoup(html, 'lxml')
    weibo_list = soup.find_all(class_='weibo')
    for weibo in weibo_list:
        weibo_html = str(weibo)
        weibo_id = re.findall(r'weibo_id=(\d+)', weibo_html)[0]
        weibo_text = weibo.find(class_='txt').text.strip()
        # The class names below are illustrative; adjust the selectors to
        # match the actual markup of the pages being crawled.
        like_count = re.findall(r'<span class="ctt">(.*?)</span>', weibo_html, re.S)[0]
        comment_count = re.findall(r'<span class="cc">(.*?)</span>', weibo_html, re.S)[0]
        repeat_count = re.findall(r'<span class="cc">(.*?)</span>', weibo_html, re.S)[0]
        # Logic for saving the post data...


def crawl_user_weibo(user_id, start_page, end_page):
    """Crawl pages start_page through end_page of a user's Weibo."""
    for page in range(start_page, end_page + 1):
        # The pagination parameter here is illustrative and may need to be
        # adapted to the endpoint actually being crawled.
        url = 'https://weibo.com/p/100505{}/info?mod=pedit_more&page={}'.format(user_id, page)
        html = get_page(url)
        parse_page(html, user_id)
        time.sleep(random.uniform(1, 3))  # random delay to avoid being blocked


if __name__ == '__main__':
    user_id = '1234567890'  # replace with the target user's ID
    start_page = 1          # first page to crawl
    end_page = 10           # last page to crawl
    crawl_user_weibo(user_id, start_page, end_page)
This code provides a simplified crawler framework covering the basic steps of requesting pages, parsing them, and saving the data. To use it, supply the correct user ID, set the start and end pages to crawl, and implement the data-saving logic yourself.
Note that this code is for learning purposes only. Any real use must comply with Sina Weibo's crawling policies, and it must not be used for illegal activities.
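The snippet above leaves the data-saving step as a placeholder comment. As one possible way to complete it, the sketch below appends each parsed post to a CSV file; the save_weibo function, its parameters, and the output filename weibo_data.csv are illustrative assumptions rather than part of the original project.

import csv
import os


def save_weibo(user_id, weibo_id, weibo_text, like_count, comment_count, repeat_count,
               filename='weibo_data.csv'):
    """Append one parsed post to a CSV file, writing a header row on first use."""
    file_exists = os.path.exists(filename)
    with open(filename, 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        if not file_exists:
            writer.writerow(['user_id', 'weibo_id', 'text', 'likes', 'comments', 'reposts'])
        writer.writerow([user_id, weibo_id, weibo_text,
                         like_count, comment_count, repeat_count])

With a helper like this, the placeholder comment inside parse_page could be replaced by a call such as save_weibo(user_id, weibo_id, weibo_text, like_count, comment_count, repeat_count).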