Tech Blog

2024-08-16

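Ajax stands for "Asynchronous JavaScript and XML". It is a technique for exchanging data asynchronously from within a web page: the page talks to the server in the background, so part of its content can be updated without reloading the whole page.

Below is a simple example that uses Ajax to call an API and implement a chatbot:

HTML: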




<input type="text" id="userInput" placeholder="Enter your message" />
<button id="sendBtn">Send</button>
<div id="chatLog"></div>

JavaScript:




document.getElementById('sendBtn').addEventListener('click', function () {
    var inputEl = document.getElementById('userInput');
    var userInput = inputEl.value.trim();
    if (!userInput) { return; }  // ignore empty messages

    // Append a message using text nodes so user/bot text is never parsed as HTML (avoids XSS)
    function appendMessage(sender, text) {
        var p = document.createElement('p');
        var strong = document.createElement('strong');
        strong.textContent = sender + ':';
        p.appendChild(strong);
        p.appendChild(document.createTextNode(' ' + text));
        document.getElementById('chatLog').appendChild(p);
    }

    var xhr = new XMLHttpRequest();
    xhr.open('POST', 'https://api.chatbot.com/message', true);  // true = asynchronous
    xhr.setRequestHeader('Content-Type', 'application/json');
    xhr.onreadystatechange = function () {
        // readyState 4 = request finished; status 200 = OK
        if (xhr.readyState === 4 && xhr.status === 200) {
            var response = JSON.parse(xhr.responseText);
            appendMessage('User', userInput);
            appendMessage('Bot', response.message);
            inputEl.value = '';
        }
    };
    xhr.send(JSON.stringify({ message: userInput }));
});

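In this example, when the user clicks the Send button, the message in the input box is sent to the chatbot API as an Ajax POST request. Once the server responds successfully, the user's message and the chatbot's reply are appended to the chat log, and the input box is cleared for the next message.

Note: the API URL and the request data format in the code above are only examples. Replace them with the details of the API you actually use.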


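Node.js Chapter 56 (Web Crawlers)

System

2024-08-16

All, Crawlers

Chapter 56 covers web crawlers. Here we use Node.js to build a simple one.

First, we need to install a library called axios, a promise-based HTTP client that lets us send HTTP requests.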




npm install axios

The following is a simple crawler that fetches a web page and prints its content:




const axios = require('axios');
 
axios.get('https://www.example.com')
  .then(response => {
    console.log(response.data);
  })
  .catch(error => {
    console.error(error);
  });

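In this example, we call axios.get() to send an HTTP GET request to the given URL. The .then() handler processes the response and prints the page content to the console. If the request fails, the .catch() handler prints the error.

This is only a very basic crawler. A real crawler may need to handle more complex cases, such as crawling multiple pages, dealing with sites rendered by JavaScript, handling login and authentication, and obeying the site's robots.txt file.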
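Of the points above, obeying robots.txt is the easiest to show in isolation. The following is a minimal sketch in Python (the language most examples on this page use), relying only on the standard library's urllib.robotparser. The robots.txt content and the "my-crawler" user agent are made up for illustration, and the rules are parsed from an inline string so the sketch runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "my-crawler") -> bool:
    """Return True if user_agent may fetch url under the parsed rules."""
    return parser.can_fetch(user_agent, url)


robots = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(robots, "https://www.example.com/index.html"))
print(is_allowed(robots, "https://www.example.com/private/data.html"))
```

Against a live site, the same parser can load the rules itself: call set_url() with the site's robots.txt address and then read() before checking URLs.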

For more complex crawlers, you may also want a library such as cheerio to parse the HTML and extract the data you need.




npm install cheerio

Here is a simple example that uses cheerio:




const axios = require('axios');
const cheerio = require('cheerio');
 
axios.get('https://www.example.com')
  .then(response => {
    const $ = cheerio.load(response.data);
    const content = $('#content').text();
    console.log(content);
  })
  .catch(error => {
    console.error(error);
  });

In this example, we call cheerio.load() to parse the returned HTML, then use the jQuery-style selector $('#content') to get the text content of the element whose ID is content.


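Exploring a Weibo Crawler Project on GitHub: sina_weibo_crawler

System

2024-08-16

All, Crawlers

This project is a Weibo crawler written in Python that collects data about users or topics on Sina Weibo. The core crawler code looks like this: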




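import random
import re
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

def get_page(url):
    """Fetch a page and return its HTML, or None on a network error or non-200 response."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
    except requests.RequestException:
        return None
    if response.status_code == 200:
        return response.text
    return None

def first_match(pattern, text):
    """Return the first regex capture found in text, or None if nothing matches."""
    matches = re.findall(pattern, text, re.S)
    return matches[0] if matches else None

def parse_page(html, user_id):
    if html is None:
        return
    soup = BeautifulSoup(html, 'lxml')
    # The class names and patterns below are placeholders; adjust them to the real page markup.
    for weibo in soup.find_all(class_='weibo'):
        raw = str(weibo)
        weibo_id = first_match(r'weibo_id=(\d+)', raw)
        txt = weibo.find(class_='txt')
        weibo_text = txt.text.strip() if txt else ''
        like_count = first_match(r'<span class="ctt">(.*?)</span>', raw)
        comment_count = first_match(r'<span class="cc">(.*?)</span>', raw)
        # NOTE: this is the same pattern as comment_count; the repost count needs its own selector.
        repeat_count = first_match(r'<span class="cc">(.*?)</span>', raw)
        # Logic for saving the weibo data...

def crawl_user_weibo(user_id, start_page, end_page):
    for i in range(start_page, end_page + 1):
        # NOTE: this URL does not depend on the page number i, so every iteration
        # requests the same page; add the site's real pagination parameter here.
        url = 'https://weibo.com/p/100505{}/info?mod=pedit_more'.format(user_id)
        html = get_page(url)
        parse_page(html, user_id)
        time.sleep(random.uniform(1, 3))  # random delay to reduce the risk of being blocked

if __name__ == '__main__':
    user_id = '1234567890'  # replace with the target user's ID
    start_page = 1  # first page
    end_page = 10  # last page
    crawl_user_weibo(user_id, start_page, end_page)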

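This code provides a simplified framework for a Weibo crawler, covering the basic steps of requesting a page, parsing it, and saving the data. In practice, you need to supply a valid user ID, set the first and last pages to crawl, and implement the save logic.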
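The save step is left unimplemented in the code above. One possible shape for it, as a sketch only, is to write the parsed fields to CSV with the standard library. The field names below mirror the variables in parse_page, and the sample record is invented for the demo.

```python
import csv
import io

# Field names mirror the variables in parse_page; the column layout is an assumption.
FIELDS = ["weibo_id", "weibo_text", "like_count", "comment_count", "repeat_count"]


def save_weibos(records, file_obj):
    """Write a list of dict records to file_obj as CSV with a header row."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Invented sample record, written to an in-memory buffer instead of a real file.
sample = [
    {"weibo_id": "1", "weibo_text": "hello", "like_count": "3",
     "comment_count": "1", "repeat_count": "0"},
]
buffer = io.StringIO()
save_weibos(sample, buffer)
print(buffer.getvalue())
```

When writing to a real file, open it with newline="" (as the csv module documentation recommends) and an explicit encoding such as utf-8 so that Chinese text is stored correctly.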

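Note that this code is for learning purposes only. Real use must comply with Sina Weibo's crawling policy, and the code must not be used for illegal activities.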


Crawling and Visualizing University Tieba Data with Python and Flask

System

2024-08-16

All, Crawlers




from flask import Flask, render_template, request
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt
from io import BytesIO
import base64
 
app = Flask(__name__)
 
@app.route('/')
def index():
    return render_template('index.html')
 
@app.route('/stickers', methods=['POST'])
def stickers():
    board_name = request.form['board_name']
    start_page = int(request.form['start_page'])  # form values arrive as strings
    end_page = int(request.form['end_page'])
    df = crawl_stickers(board_name, start_page, end_page)
    image_data = plot_sticker_rating(df)
    return render_template('stickers.html', image_data=image_data)
 
def crawl_stickers(board_name, start_page, end_page):
    # Crawler implementation goes here: fetch the Tieba data and store it in a DataFrame
    pass
 
def plot_sticker_rating(df):
    # Plotting implementation goes here: draw a bar chart from the popularity_score field and convert the image to a base64 string
    pass
 
if __name__ == '__main__':
    app.run(debug=True)

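This example is a simplified skeleton showing how to build a small web application with Flask that combines crawling and visualization. In a real application, you need to implement the crawl_stickers and plot_sticker_rating functions to do the actual data crawling and analysis.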
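The last step of plot_sticker_rating, turning image bytes into a string the template can embed, can be sketched on its own. This is a minimal illustration, not the original implementation: to_data_uri is a hypothetical helper, and the 8-byte PNG signature stands in for a real chart. With matplotlib, the bytes would typically come from saving the figure into a BytesIO buffer with plt.savefig(buffer, format='png').

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URI usable in an <img src="..."> attribute."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The 8-byte PNG file signature stands in for a rendered chart.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
image_data = to_data_uri(PNG_SIGNATURE)
print(image_data)
```

In the template, the value can then be used directly as the src attribute of an img element.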


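Scraping xhs Images with Selenium: 03 Getting the Images from an Image-and-Text Post

System

2024-08-16

All, Crawlers

When scraping images with Selenium, you first locate the elements that contain the images, then read an attribute of each element to get the image URL. Below is a simple example, assuming we want every image link on a page that contains images: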




from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
 
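# Start the browser
driver = webdriver.Chrome()
 
try:
    # Open the page
    driver.get('http://example.com/article-with-images')
 
    # Wait up to 10 seconds for the image elements to appear in the DOM
    wait = WebDriverWait(driver, 10)
    images = wait.until(EC.presence_of_all_elements_located((By.TAG_NAME, 'img')))
 
    # Loop over the image elements and print each image URL
    for img in images:
        src = img.get_attribute('src')
        print(src)
 
    time.sleep(5)  # pause for 5 seconds so the result can be inspected
finally:
    # Close the browser even if an error occurred
    driver.quit()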

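Before running this code, make sure the Selenium library and a matching WebDriver (such as ChromeDriver) are installed and that the driver path is configured correctly. Selenium 4.6 and later can usually locate or download a matching driver automatically through Selenium Manager. The code opens the given page, waits for the image elements to be present, and prints the src attribute of each one, which is the image URL.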
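The printed values are not always ready to download: get_attribute can return None, some images are inline data: URIs, and attributes other than src (for example a lazy-loading attribute, which is an assumption about the target page) may hold relative paths. The sketch below cleans such a list with the standard library. normalize_image_urls is a hypothetical helper and the sample values are made up.

```python
from urllib.parse import urljoin


def normalize_image_urls(page_url, srcs):
    """Resolve src values against page_url, dropping empties, data: URIs, and duplicates."""
    seen = set()
    result = []
    for src in srcs:
        if not src or src.startswith("data:"):
            continue  # skip missing attributes and inline images
        absolute = urljoin(page_url, src)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Made-up attribute values of the kinds a real page might produce.
page = "http://example.com/article-with-images"
raw = ["/img/a.png", None, "data:image/gif;base64,R0lGOD", "b.jpg",
       "/img/a.png", "//cdn.example.com/c.webp"]
urls = normalize_image_urls(page, raw)
print(urls)
```

The order of first appearance is preserved, so the result follows the order of the images on the page.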


Python Crawler in Practice: Xiaohongshu (Python Xiaohongshu Crawler)

System

2024-08-16

All, Crawlers




import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
 
# Set the request headers to mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
 
# Fetch all posts on one Xiaohongshu page
def get_all_posts(url):
    # Send the GET request
    response = requests.get(url, headers=headers, timeout=10)
    # Parse the HTML
    soup = BeautifulSoup(response.text, 'lxml')
    # Extract the post elements
    posts = soup.find_all('div', class_='feed-item-root')
    return posts
 
# Extract the details of a single post
def extract_post_info(post):
    try:
        # Post title
        title = post.find('a', class_='title-box').text.strip()
        # Post link
        post_url = post.find('a', class_='title-box')['href']
        # Author name and author profile link
        author_info = post.find('div', class_='author-info').text.strip()
        author_name = re.search('(.+)', author_info).group(1)
        author_url = post.find('a', class_='author-name')['href']
        # Media type
        media_type = post.find('div', class_='media-type').text.strip()
        # Read count
        read_count = post.find('div', class_='read-count').text.strip()
        # Like count
        like_count = post.find('div', class_='like-count').text.strip()
        # Comment count
        comment_count = post.find('div', class_='comment-count').text.strip()
        # Publish time
        publish_time = post.find('div', class_='publish-time').text.strip()
        # Return everything that was extracted
        return {
            'title': title,
            'url': post_url,
            'author_name': author_name,
            'author_url': author_url,
            'media_type': media_type,
            'read_count': read_count,
            'like_count': like_count,
            'comment_count': comment_count,
            'publish_time': publish_time
        }
    except Exception as e:
        # A missing element raises here (for example AttributeError on None)
        print(f'Error extracting post info: {e}')
        return None
 
# Main function
def main(max_pages):
    # Initialize the post list and the page number
    posts = []
    page = 1
    
    # Loop over the pages
    while page <= max_pages:
        print(f"Crawling page {page}")
        # Build the page URL
        url = f'https://www.xiaohongshu.com/discovery/trending?page={page}'
        # Fetch all posts on the page
        all_posts = get_all_posts(url)
        # Extract the details of each post
        for post in all_posts:

- Read more -

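Writing a Crawler to Scrape Asteroid Data from Asterank

System

2024-08-16

All, Crawlers

The original code already provides a good example. The following is a simplified version that scrapes asteroid diameter data: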




import requests
from bs4 import BeautifulSoup
 
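# Fetch asteroid diameter data
def get_asteroid_diameters(asterank_url):
    # Send the HTTP request
    response = requests.get(asterank_url, timeout=10)
    response.raise_for_status()  # raise an exception if the request failed
 
    # Parse the response body
    soup = BeautifulSoup(response.text, 'html.parser')
 
    # Find the table that holds the diameter data
    table = soup.find('table', {'class': 'asterank'})
    if table is None:
        raise ValueError('table with class "asterank" not found; the page structure may have changed')
 
    # Initialize the list of diameters
    diameters = []
 
    # Iterate over the table rows, skipping the header row
    for row in table.find_all('tr')[1:]:
        cells = row.find_all('td')
        if len(cells) > 4:  # skip rows that have no fifth column
            # Extract the diameter (assumed to be the fifth column)
            diameters.append(cells[4].text.strip())
 
    return diameters
 
# Use the function to fetch the asteroid diameters
asterank_url = 'https://www.asterank.com/neo.php'
diameters = get_asteroid_diameters(asterank_url)
 
# Print the diameters
for diameter in diameters:
    print(diameter)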

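This code simplifies the original by removing some redundant parts and by locating and extracting the diameter data more directly. The example assumes that the table structure does not change. If it does change, the parsing code must be updated to match the new structure.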
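One way to soften the dependence on a fixed column position is to look the column up by its header text instead of by index. The sketch below shows the idea with only the standard library's html.parser. The table markup, the header label "Diameter (km)", and the values are invented for the demo and are not taken from the real page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in a document as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text chunks of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def column_by_header(html, header):
    """Return the values of the column whose header cell equals header."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        raise ValueError("no table rows found")
    head, *body = parser.rows
    index = head.index(header)  # raises ValueError if the column is missing
    return [row[index] for row in body if len(row) > index]


# Invented sample table; the names and values are placeholders.
SAMPLE = """
<table class="asterank">
  <tr><th>Name</th><th>Diameter (km)</th></tr>
  <tr><td>Asteroid A</td><td>1.5</td></tr>
  <tr><td>Asteroid B</td><td>2.25</td></tr>
</table>
"""
print(column_by_header(SAMPLE, "Diameter (km)"))
```

If the column is renamed or removed, the lookup fails loudly with a ValueError instead of silently returning data from the wrong column.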

System

2024-08-16

All, Crawlers




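import requests
from bs4 import BeautifulSoup
 
def get_weather_data(city_code):
    # URL template of the weather site; it takes a numeric city code, not a city name
    url = "http://www.weather.com.cn/weather/{}.shtml".format(city_code)
    # Send the HTTP request
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    response.encoding = response.apparent_encoding  # avoid garbled Chinese text
    # Parse the page
    soup = BeautifulSoup(response.text, 'lxml')
 
    # Extract the weather information. "#7d" is not a valid CSS selector because the
    # id starts with a digit, so an attribute selector is used instead. The class names
    # come from the original example and may need adjusting to the live page.
    forecast = '[id="7d"]'
    temps = soup.select(forecast + ' .sky .temp')
    icons = soup.select(forecast + ' .sky img')
    infos = soup.select(forecast + ' .wea')
    if not (temps and icons and infos):
        print("Weather data not found; check the selectors.")
        return
    today_temperature = temps[0].text.strip()
    today_weather_icon = icons[0]['src']
    today_weather_info = infos[0].text.strip()
 
    # Print the information
    print("City code:", city_code)
    print("Today's weather:", today_weather_icon, today_temperature, today_weather_info)
 
# Call the function with a city code, e.g. "101010100" for Beijing
get_weather_data("101010100")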

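This code uses the requests library to send the HTTP request and bs4 to parse the page, locating HTML elements with the select method. It then extracts today's weather, including the temperature and the conditions, and prints it. The example is simple and direct, which makes it a suitable first exercise for learning to write crawlers.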


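Exploring Efficient Crawling: QCCSpider, a New Frontier in Intelligent Data Collection

System

2024-08-16

All, Crawlers

QCCSpider is an intelligent data collection framework that offers a new way to handle crawler development and the data collection process. Here is an example that uses QCCSpider: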




from qccspider.spider import Spider
from qccspider.selector import Selector
from qccspider.scheduler import Scheduler
from qccspider.pipeline import Pipeline
from qccspider.middlewares import Middlewares
 
# Initialize the spider
spider = Spider()
 
# Set the spider's name
spider.name = 'example_spider'
 
# Set the start URLs
spider.start_urls = ['http://example.com/']
 
# Define the parse function
@spider.parse(Selector.type.XPATH, '//a[@href]', process_links=True)
def parse_example(self, response):
    # Extract each link and the link text
    for link in response.css('a::attr(href)').extract():
        yield {'url': link, 'text': response.css('a::text').extract_first()}
 
# Define the data pipeline
pipeline = Pipeline()
 
@pipeline.process_item(item_type='*')
def print_item(item):
    print(item)
 
# Run the spider
spider.run(pipeline=pipeline)

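This code defines a simple spider that starts from the given URL and extracts the URL and text of the links on the page. A Pipeline then prints each scraped item. The example shows the basic usage of QCCSpider, and because the parsing rules are declared with decorators, the code stays concise and easy to follow.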
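The decorator style itself can be illustrated without the framework. The sketch below is plain Python and is not QCCSpider's implementation: MiniPipeline and its methods are hypothetical names, chosen only to show how a decorator can register a handler and how items are later dispatched to it.

```python
from typing import Callable, Dict, List


class MiniPipeline:
    """Hypothetical stand-in showing decorator-based handler registration."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[dict], None]]] = {}

    def process_item(self, item_type: str = "*"):
        """Return a decorator that registers a handler for item_type ('*' = all types)."""
        def decorator(func):
            self._handlers.setdefault(item_type, []).append(func)
            return func  # hand the function back unchanged
        return decorator

    def dispatch(self, item: dict) -> None:
        """Call the handlers for the item's own type, then the '*' handlers."""
        item_type = item.get("type", "")
        # dict.fromkeys removes a duplicate key if the item type is itself "*"
        for key in dict.fromkeys((item_type, "*")):
            for handler in self._handlers.get(key, []):
                handler(item)


pipeline = MiniPipeline()
seen: List[dict] = []


@pipeline.process_item(item_type="*")
def collect(item: dict) -> None:
    seen.append(item)


pipeline.dispatch({"type": "link", "url": "http://example.com/"})
print(seen)
```

The registration happens once, at import time, when the decorator runs. The framework only has to look up the stored handlers when an item arrives.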

System

2024-08-16

All, Crawlers




import redis
 
class RedisClient:
    """
    Redis client that manages a Redis connection.
    """
    def __init__(self, host='localhost', port=6379, db=0):
        self.host = host
        self.port = port
        self.db = db
        self._connection = None
 
    def connect(self):
        """
        Open a connection to the Redis server.
        """
        self._connection = redis.Redis(host=self.host, port=self.port, db=self.db)
 
    def disconnect(self):
        """
        Close the connection to the Redis server.
        """
        if self._connection is not None:
            self._connection.close()
            self._connection = None
 
    def set(self, name, value, ex=None, px=None, nx=False, xx=False):
        """
        Set a key-value pair; takes the same parameters as redis.Redis.set.
        """
        self._connect_if_needed()
        return self._connection.set(name, value, ex=ex, px=px, nx=nx, xx=xx)
 
    def get(self, name):
        """
        Get the value of a key; takes the same parameters as redis.Redis.get.
        """
        self._connect_if_needed()
        return self._connection.get(name)
 
    def _connect_if_needed(self):
        """
        Open the connection if it is not open yet.
        """
        if self._connection is None:
            self.connect()
 
# Usage example
client = RedisClient()
client.set('key', 'value')
print(client.get('key'))  # the value comes back as bytes unless decode_responses=True is set
client.disconnect()

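This code defines a class named RedisClient that manages the connection to a Redis server. It provides methods to connect and disconnect, and it opens the connection only when one is needed. It also provides set and get methods for storing and retrieving data. An instance of the class can perform basic Redis operations without the caller opening the connection by hand, although disconnect should still be called when the work is done.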
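The lazy-connection idea can be tried without a running Redis server by injecting the connection. The sketch below is an illustration under that assumption: LazyClient, its connection_factory parameter, and FakeConnection are hypothetical names, and the context-manager methods are an addition that makes sure the connection is released.

```python
class LazyClient:
    """Sketch of lazy connection handling; connection_factory is a hypothetical hook."""

    def __init__(self, connection_factory):
        self._factory = connection_factory
        self._connection = None

    def _connect_if_needed(self):
        # Open the connection on first use only.
        if self._connection is None:
            self._connection = self._factory()
        return self._connection

    def set(self, name, value):
        return self._connect_if_needed().set(name, value)

    def get(self, name):
        return self._connect_if_needed().get(name)

    def disconnect(self):
        if self._connection is not None:
            self._connection.close()
            self._connection = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.disconnect()  # always release the connection, even on error


class FakeConnection:
    """In-memory stand-in for a Redis connection, used only for this demo."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1
        self._data = {}
        self.closed = False

    def set(self, name, value):
        self._data[name] = value
        return True

    def get(self, name):
        return self._data.get(name)

    def close(self):
        self.closed = True


with LazyClient(FakeConnection) as client:
    opened_before_use = FakeConnection.opened  # still 0: nothing is opened until first use
    client.set("key", "value")
    fetched = client.get("key")
    print(fetched)
```

With the real library, the factory could be a small function that returns redis.Redis(host=..., port=..., db=...), which keeps the connection details in one place.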
