Articles published by System

2024-08-13




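/* Base styles for the loading animation */
.loading-animation {
    width: 50px;
    height: 50px;
    border: 6px solid #f3f3f3; /* light gray ring that forms the spinner track */
    border-top: 6px solid #3498db; /* highlighted top segment; change it to any color you like */
    border-radius: 50%; /* round the border into a circle */
    animation-name: spin; /* name of the animation defined with @keyframes below */
    animation-duration: 20s; /* time for one full rotation; 1s to 2s is more typical for a spinner */
    animation-iteration-count: infinite; /* repeat forever */
    animation-timing-function: linear; /* constant rotation speed */
    margin: 50px 0; /* 50px of space above and below; this does not center the element */
}

/* Rotation animation */
@keyframes spin {
    from {transform: rotate(0deg);}
    to {transform: rotate(360deg);}
}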

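This code defines a simple loading spinner: a circular border with one highlighted segment rotates continuously to signal that something is loading. Apply the class to a block-level HTML element, for example an empty div, to show the spinner. It is a handy teaching example of an animation built with CSS alone, without any JavaScript.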


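Spring Boot Middleware Design and Development [Simplified Database and Table Sharding Component]

System

2024-08-13

All, Middleware

Below is a simplified example of the core methods of a database and table sharding component. It contains only the core functionality: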




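import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

// Note: DataSourceRoutingDataSource and DataSourceContextHolder are not shown in this excerpt;
// they are assumed to be defined elsewhere in the component.
@Component
public class SimpleShardService {

    @Autowired
    private DataSourceRoutingDataSource dataSourceRoutingDataSource;

    public void setDataSourceKey(String dataSourceKey) {
        // Bind the data source key to the current thread
        DataSourceContextHolder.setDataSourceKey(dataSourceKey);
    }

    public void clearDataSourceKey() {
        // Remove the data source key bound to the current thread
        DataSourceContextHolder.clearDataSourceKey();
    }

    public void shard(String key, String value) {
        // Work out which data source this key/value pair routes to
        String dataSourceKey = generateDataSourceKey(key, value);
        setDataSourceKey(dataSourceKey);
        try {
            // Run the business operation...
        } finally {
            // Always clear the key, even if the business operation throws,
            // so a pooled thread does not carry a stale key into the next request
            clearDataSourceKey();
        }
    }

    private String generateDataSourceKey(String key, String value) {
        // Placeholder only: real sharding logic would compute the target from key and value
        // Return the computed data source key
        return "db_" + key + "_" + value;
    }
}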

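In this example, the SimpleShardService class provides methods to set and clear the data source key, plus a shard method that wraps the routing logic around a business operation. The generateDataSourceKey method only simulates deriving a data source key from key and value; in a real application it must implement the project's actual sharding rules.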
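The paragraph above leaves the actual sharding rule open. One common rule is to hash the sharding value and take it modulo the number of databases and tables. The sketch below illustrates that idea in Python; the function name route_shard, the db_/user_ naming scheme, and the default counts are assumptions made for illustration and do not come from the original component.

```python
import zlib
from typing import Tuple


def route_shard(sharding_value: str, db_count: int = 2, table_count: int = 4) -> Tuple[str, str]:
    """Map a sharding value to a (database, table) pair by hashing.

    The naming scheme (db_00, user_000, ...) is hypothetical.
    """
    if db_count <= 0 or table_count <= 0:
        raise ValueError("db_count and table_count must be positive")
    # crc32 gives the same result in every process, unlike Python's built-in
    # hash() for strings, which is randomized per interpreter run.
    slot = zlib.crc32(sharding_value.encode("utf-8")) % (db_count * table_count)
    db_index = slot // table_count    # which database the slot falls in
    table_index = slot % table_count  # which table inside that database
    return f"db_{db_index:02d}", f"user_{table_index:03d}"


if __name__ == "__main__":
    print(route_shard("user-10086"))
```

The same value always routes to the same pair, which is the property the data source key needs; changing db_count or table_count later moves existing values to different slots, so a real component would also need a migration plan.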


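Performance Testing: Monitoring and Tuning Nginx Middleware

System

2024-08-13

All, Middleware

Nginx is a high-performance HTTP server and reverse proxy, and it can also act as an IMAP/POP3/SMTP proxy. During performance testing it is important to monitor and tune Nginx itself. Some commonly used monitoring and tuning techniques follow:

Configure monitoring metrics: make sure Nginx is built with the status module (ngx_http_stub_status_module, enabled with --with-http_stub_status_module at build time) so that performance data can be read through stub_status.
Use the Nginx status page: configure a dedicated location block to serve the status data.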




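server {
    listen 80;
    server_name localhost;

    location /nginx_status {
        stub_status on;  # since nginx 1.7.5 a bare "stub_status;" also works
        access_log off;
        allow 127.0.0.1; # allow access from the local machine only
        deny all;        # reject every other IP address
    }
}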

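Monitoring tools: integrate Nginx with third-party tools such as Nagios, New Relic, or Prometheus to watch performance metrics in real time.
Tune the configuration: adjust the Nginx configuration based on what monitoring shows. Commonly tuned parameters include worker_processes, worker_connections, and keepalive_timeout.
Load testing: use tools such as ab (ApacheBench) or wrk to put Nginx under load and identify bottlenecks.
Review the logs: analyze the Nginx access log and error log to spot likely performance bottlenecks.
Optimize the configuration file: check where location blocks sit and the order in which they match, choose case-sensitive (~) or case-insensitive (~*) regex matching deliberately, and remove redundant directives.
Use opentracing: by integrating opentracing with Jaeger or another distributed tracing system, you can trace how requests flow through the system.
Enable gzip compression: gzip reduces the amount of data sent over the network and speeds up page loads.
Enable HTTP/2: where supported, HTTP/2 can improve performance because it multiplexes many requests over a single connection.
Drop unnecessary modules: build or load only the Nginx modules you need to reduce resource consumption.
Use caching: configure a suitable caching policy to reduce load on backend services.
Use additional modules: for specific scenarios, consider optional or third-party modules. For example, ngx_http_geoip_module (an optional module that is not built by default) handles geolocation data.
Monitor system resources: watch CPU, memory, network, and disk usage to make sure the host running Nginx does not become the bottleneck.
Auto-scaling: a container orchestration platform such as Kubernetes can scale Nginx instances (for example, Nginx running in Docker containers) automatically.
Update regularly: keep Nginx up to date to pick up the latest performance improvements and security fixes.

These are basic methods for monitoring and tuning Nginx; adapt them to your actual environment and requirements.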
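To feed the status page into a monitoring script, the text it returns has to be parsed. The sketch below parses the plain-text format that stub_status serves; the sample numbers are illustrative, and the function name parse_stub_status and the "dropped" field are assumptions made for this example. Fetching the text from /nginx_status is left out so the sketch runs without a server.

```python
import re

# Sample text in the format served by stub_status (numbers are illustrative).
SAMPLE_STATUS = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text):
    """Parse stub_status output into a dict of integer counters."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    # The totals line holds three numbers: accepts, handled, requests.
    totals = re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.MULTILINE)
    states = re.search(r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text)
    if not (active and totals and states):
        raise ValueError("unexpected stub_status format")
    accepts, handled, requests = (int(n) for n in totals.groups())
    reading, writing, waiting = (int(n) for n in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
        # Accepted but not handled: connections dropped, for example
        # because the worker_connections limit was reached.
        "dropped": accepts - handled,
    }


if __name__ == "__main__":
    print(parse_stub_status(SAMPLE_STATUS))
```

Sampling these counters at a fixed interval and taking differences gives rates such as requests per second, which are more useful for tuning than the raw totals.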


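How to Write a Web Crawler in C#

System

2024-08-13

All, Crawlers

To implement a web crawler in C#, you can use the HttpClient class to send HTTP requests and fetch page content. The simple example below uses HttpClient to download a page and print it: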




using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main(string[] args)
    {
        using (HttpClient client = new HttpClient())
        {
            // Set a User-Agent; some servers inspect it to restrict crawlers.
            // ParseAdd parses the string and adds it to the header collection.
            client.DefaultRequestHeaders.UserAgent.ParseAdd("MyCrawler/1.0");

            // URL of the page to crawl
            string url = "http://example.com";

            try
            {
                // Send the HTTP GET request
                HttpResponseMessage response = await client.GetAsync(url);

                // Throw if the status code does not indicate success
                response.EnsureSuccessStatusCode();

                // Read the response body as a string
                string content = await response.Content.ReadAsStringAsync();

                // Print the page content
                Console.WriteLine(content);
            }
            catch (HttpRequestException e)
            {
                Console.WriteLine("HTTP request error: " + e.Message);
            }
        }
    }
}

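In this example we use HttpClient's asynchronous methods to send the request and read the response. This avoids blocking the calling thread while waiting on the network, so other work can proceed in the meantime.

Note that a real crawler usually has to handle more complex situations, such as parsing page content, crawling multiple pages, pages rendered by JavaScript, cookies, login authentication, and anti-crawling measures such as CAPTCHAs. This example is only a simple starting point.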


A First Multiprocess Crawler in Python

System

2024-08-13

All, Crawlers

Below is a simple multiprocess crawler in Python that uses the multiprocessing library:




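import requests
from lxml import etree
from multiprocessing import Pool

def fetch_page(url):
    # Return the page HTML, or None if the request fails
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return None
    if response.status_code == 200:
        return response.text
    return None

def parse_page(html):
    # A failed fetch yields None; skip parsing in that case
    if not html:
        return None
    tree = etree.HTML(html)
    # Assume the content we want is the page title
    title = tree.xpath('//title/text()')
    return title[0] if title else None

def main():
    urls = ['http://example.com/page1', 'http://example.com/page2', 'http://example.com/page3']
    with Pool(processes=4) as pool:
        # Both map calls must run inside the with block:
        # the pool is terminated as soon as the block exits
        results = pool.map(fetch_page, urls)
        parsed_results = pool.map(parse_page, results)
    for title in parsed_results:
        print(title)

if __name__ == "__main__":
    main()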

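This code defines the basic skeleton of a multiprocess crawler:

The fetch_page function downloads the page content.
The parse_page function parses the page content.
The main function creates a process pool and uses it to run the fetching and parsing tasks in parallel.

Adjust the implementation details of fetch_page and parse_page to fit your needs.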


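Practical Python Example Code: Web Crawling, Data Analysis, Machine Learning, Image Processing

System

2024-08-13

All, Crawlers

The topic is broad, so this excerpt covers one concrete case: a simple Python crawler that collects the links on a web page and runs a basic analysis on them.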




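import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urlparse

def get_domain(href):
    # Host part of an absolute URL without a leading "www."; empty for relative links
    netloc = urlparse(href).netloc
    return netloc[4:] if netloc.startswith('www.') else netloc

# Target page
url = 'https://example.com'

# Send the HTTP request
response = requests.get(url, timeout=10)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the page with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find all links
    links = soup.find_all('a')
    
    # Extract the text and URL of each link
    data = [[link.get_text(), link.get('href')] for link in links]
    
    # Convert to a pandas DataFrame
    df = pd.DataFrame(data, columns=['Text', 'URL'])
    
    # Simple analysis: count how often each domain appears.
    # Links without an href or without a host (relative links) are skipped.
    domains = df['URL'].dropna().map(get_domain)
    domain_counts = domains[domains != ''].value_counts()
    print(domain_counts)
    
    # Save the extracted links to a CSV file
    df.to_csv('example_links.csv', index=False)
else:
    print('Failed to retrieve the webpage')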

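This code shows how to fetch a page with the requests library, parse it with BeautifulSoup, extract the links, and analyze them with pandas. It prints the per-domain counts and saves the extracted links to a CSV file. The example is short and direct, which makes it a reasonable starting point for beginners learning web crawling.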

System

2024-08-13

All, Crawlers




import requests
from bs4 import BeautifulSoup
import csv
import time
import random

def get_soup(url):
    """
    Fetch a page and return it as a BeautifulSoup object, or None on failure
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
        response = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException:
        return None
    if response.status_code == 200:
        return BeautifulSoup(response.text, 'html.parser')
    return None

def get_news_info(news_url, soup):
    """
    Extract the fields of a news detail page.
    The CSS selectors depend on the site's markup; select(...)[0] raises
    IndexError when a page does not match them.
    """
    title = soup.select('.news-title')[0].text
    info = soup.select('.news-info')[0].text
    publish_time = info.split()[0]
    author = info.split()[-1]
    content = soup.select('.content')[0].text.strip()
    return title, publish_time, author, content

def main(key_word, start_date, end_date, output_file):
    """
    Main function that drives the crawl
    """
    base_url = 'http://www.people.com.cn/search/search_new.jsp?keyword={}&startdate={}&enddate={}&pageno={}'
    news_urls = set()  # a set avoids storing duplicate links
    for i in range(1, 11):  # assume only the first 10 result pages are crawled
        if len(news_urls) >= 100:  # assume only the first 100 articles are crawled
            break  # stop requesting further result pages once the limit is reached
        url = base_url.format(key_word, start_date, end_date, i)
        soup = get_soup(url)
        if soup is None:
            continue
        for news in soup.select('.news-list li'):
            if len(news_urls) >= 100:
                break
            anchors = news.select('a')
            if anchors and anchors[0].get('href'):  # skip list items without a link
                news_urls.add(anchors[0]['href'])
    with open(output_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Title', 'Published', 'Source', 'Content'])
        for url in news_urls:
            soup = get_soup(url)
            if soup is not None:
                try:
                    title, publish_time, author, content = get_news_info(url, soup)
                    writer.writerow([title, publish_time, author, content])
                except IndexError:
                    pass  # page layout did not match the expected selectors
            time.sleep(random.uniform(0.5, 1.5))  # random delay to reduce the risk of an IP ban

if __name__ == '__main__':
    key_word = input("Enter a search keyword: ")
    start_date = input("Enter the start date (format YYYY-MM-DD): ")
    end_date = input("Enter the end date (format YYYY-MM-DD): ")
    output_file = 'rednet_news_{}_{}_{}.csv'.format(key_word, start_date, end_date)
    main(key_word, start_date, end_date, output_file)

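This code implements basic news crawling and saves the results to a CSV file. The user enters a search keyword, a start date, and an end date on the console; the program then crawls the news matching that keyword within the date range and writes it to the CSV file. Compared with version 1.0, this version adds custom search keywords and a random delay between requests to reduce the risk of an IP ban.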
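Because the keyword comes straight from user input, characters such as spaces or & can change the meaning of the query string when it is assembled with plain string formatting. A minimal sketch of a safer way to build the search URL with the standard library follows. The endpoint and parameter names are taken from base_url in the code above; the helper name build_search_url is an assumption for illustration.

```python
from urllib.parse import urlencode

# Endpoint and parameter names copied from base_url in the crawler above.
SEARCH_ENDPOINT = "http://www.people.com.cn/search/search_new.jsp"


def build_search_url(keyword, start_date, end_date, page):
    """Build the search URL with every parameter percent-encoded."""
    query = urlencode({
        "keyword": keyword,
        "startdate": start_date,
        "enddate": end_date,
        "pageno": page,
    })
    return f"{SEARCH_ENDPOINT}?{query}"


if __name__ == "__main__":
    # A keyword containing a space and an ampersand stays a single parameter.
    print(build_search_url("a b&c", "2024-01-01", "2024-01-31", 1))
```

With this helper, the loop in main could call build_search_url(key_word, start_date, end_date, i) instead of base_url.format(...).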


[Crawler Learning] Scraping Baidu Hot Search Data with PHP

System

2024-08-13

All, Crawlers




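<?php
// Initialize a cURL session
$ch = curl_init();

// Set cURL options
curl_setopt($ch, CURLOPT_URL, "https://top.baidu.com/board?tab=realtime"); // page to request
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response from curl_exec() as a string instead of printing it
curl_setopt($ch, CURLOPT_HEADER, 0); // leave response headers out of the output

// Execute the request
$content = curl_exec($ch);
if ($content === false || $content === '') {
    // Stop early if the request failed or returned nothing
    $error = curl_error($ch);
    curl_close($ch);
    exit("cURL error: " . $error . "\n");
}

// Close the cURL session
curl_close($ch);

// Parse the page with the DOM extension. Real-world HTML is rarely well-formed,
// so collect libxml warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($content);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Query the hot search entries. This XPath depends on the page's current markup
// and may need adjusting if it returns no nodes.
$hotSearches = $xpath->query("//div[@class='title']/div[@class='content']/ul/li/a");

// Loop over the entries and print each hot search term
foreach ($hotSearches as $hotSearch) {
    echo $hotSearch->nodeValue . "\n";
}
?>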

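This code uses PHP's cURL functions to send an HTTP request and fetch the Baidu hot search page, then uses the DOM parser to extract the hot search terms and print them. It shows how to write a simple web crawler in PHP.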


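Graduation Project: A JD.com Crawler in Python

System

2024-08-13

All, Crawlers

Below is a simplified example of a crawler for JD.com product information, written in Python with the requests and BeautifulSoup libraries.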




import requests
from bs4 import BeautifulSoup
import csv

def crawl_jd(keyword, page_num):
    # Collected products
    products = []

    for i in range(1, page_num+1):
        print(f"Crawling page {i}...")
        url = f"https://search.jd.com/Search?keyword={keyword}&page={i}"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            # The 'lxml' parser requires the lxml package to be installed
            soup = BeautifulSoup(response.text, 'lxml')
            product_list = soup.find_all('div', class_='gl-item')
            for product in product_list:
                # Extract the product name and price
                name_tag = product.find('div', class_='p-name')
                price_tag = product.find('div', class_='p-price')
                if not (name_tag and name_tag.a and price_tag and price_tag.strong):
                    continue  # skip items that lack a name or a price
                name = name_tag.a.text.strip()
                price = price_tag.strong.text.strip()
                products.append({'name': name, 'price': price})
        else:
            print("Crawl failed")
            break

    return products

def save_to_csv(products, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'price'])
        writer.writeheader()
        for product in products:
            writer.writerow(product)

if __name__ == '__main__':
    keyword = '手机'  # "mobile phone"; replace with the product keyword you want to search for
    page_num = 2  # number of pages to crawl
    products = crawl_jd(keyword, page_num)
    save_to_csv(products, 'jd_products.csv')

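This code implements basic crawling: it collects product information for a given keyword and saves it to a CSV file. Note that it is intended for learning and testing only. In real use you should comply with the relevant laws and regulations and with JD.com's crawler policy.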
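One practical way to respect a site's crawler policy is to consult its robots.txt before requesting a URL. The sketch below uses Python's standard urllib.robotparser for that check. The rules in SAMPLE_ROBOTS, the shop.example host, and the MyCrawler/1.0 user agent are made up for illustration; they are not JD.com's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only to demonstrate the check.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
Allow: /
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(rules, user_agent, url):
    """Return True if the parsed rules let user_agent fetch url."""
    return rules.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_robot_rules(SAMPLE_ROBOTS)
    for url in ("https://shop.example/search?keyword=phone",
                "https://shop.example/item/1.html"):
        print(url, is_allowed(rules, "MyCrawler/1.0", url))
```

In a live crawler, RobotFileParser.set_url followed by read() would load the real robots.txt from the target site, and each request URL would be checked before it is fetched.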


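Differences Between Writing Crawlers in Go and in Python

System

2024-08-13

All, Crawlers

The main differences come from language features and library support. Python is well suited to writing concise crawlers, while Go offers strong concurrency and an HTTP client in its standard library (net/http); HTML parsing is available through the supplementary golang.org/x/net/html package.

Below is a comparison of a simple crawler written in Python and in Go:

Python example (using requests and beautifulsoup4):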




import requests
from bs4 import BeautifulSoup
 
def crawl_page(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup.prettify()
    else:
        return "Error: {}".format(response.status_code)
 
url = "https://example.com"
print(crawl_page(url))

Go example (using the net/http standard library and golang.org/x/net/html):




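package main

import (
    "bytes"
    "fmt"
    "net/http"
    "os"

    "golang.org/x/net/html"
)

func crawlPage(url string) (string, error) {
    resp, err := http.Get(url)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    // Treat non-200 responses as errors, like the Python version does
    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("unexpected status: %s", resp.Status)
    }

    // Parse the response body into a node tree
    doc, err := html.Parse(resp.Body)
    if err != nil {
        return "", err
    }

    // Serialize the parsed tree back to HTML text
    var buf bytes.Buffer
    if err := html.Render(&buf, doc); err != nil {
        return "", err
    }
    return buf.String(), nil
}

func main() {
    url := "https://example.com"
    if content, err := crawlPage(url); err != nil {
        fmt.Fprintf(os.Stderr, "Error: %s\n", err)
    } else {
        fmt.Println(content)
    }
}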

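In Python, the requests library sends the HTTP request and beautifulsoup4 parses the HTML. In Go, the standard library package net/http sends the request and golang.org/x/net/html parses the HTML.

In both languages the parser turns the document into a tree of nodes that you then traverse. Go's html package exposes that tree at a low level, so extracting data means walking nodes yourself, while beautifulsoup4 and lxml offer higher-level helpers such as find_all and XPath queries. Go covers HTTP in its standard library and needs golang.org/x/net/html only for parsing; Python typically relies on third-party packages such as requests, beautifulsoup4, and lxml for both steps.

For concurrency, Go has built-in support: goroutines and channels make it straightforward to write a concurrent crawler. Python needs the threading or multiprocessing libraries, or asyncio (Python 3.4+) together with aiohttp, to write asynchronous code.

In summary, Python is better suited to rapid development and prototyping, while Go is better suited to large-scale crawlers that need high performance.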
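For comparison with goroutines, here is a minimal sketch of concurrent fetching on the Python side using a thread pool from the standard library. The helper name crawl_concurrently and the choice to pass the fetch function in as a parameter are assumptions for illustration; passing it in lets the sketch run with a stand-in function and no network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_concurrently(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL on a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    raised while fetching it, so one failure does not stop the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going when a single URL fails
                results[url] = exc
    return results


if __name__ == "__main__":
    # Stand-in fetch function; a real crawler could pass a wrapper around
    # requests.get, such as the crawl_page function shown earlier.
    def fake_fetch(url):
        return f"<html>{url}</html>"

    pages = crawl_concurrently(["https://example.com/a", "https://example.com/b"], fake_fetch)
    for url, page in sorted(pages.items()):
        print(url, len(page))
```

Threads fit this workload because a crawler spends most of its time waiting on the network; for CPU-heavy parsing, the multiprocessing approach mentioned above is the usual alternative.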
