分类后端技术下的文章

2024-08-25

在使用Flink SQL连接Elasticsearch（ES）作为sink时，如果你指定了主键（primary key），但数据仍然被覆盖，可能的原因和解决方法如下：

原因1：Flink SQL的Elasticsearch sink默认情况下使用_id字段作为主键。如果你的数据中没有_id字段，或者字段名不是_id，Flink可能不会识别你指定的字段作为主键。

解决方法：确保你的数据中有一个字段名为_id，这个字段将作为Elasticsearch的文档主键。如果你的主键字段名不是_id，你可以在Flink SQL DDL中指定字段作为主键。

原因2：Elasticsearch的写操作默认是create，这意味着每次写入时，如果_id已存在，则会创建一个新的文档，覆盖旧的文档。

解决方法：要解决这个问题，你需要将Elasticsearch的写操作设置为update。在Flink的Elasticsearch sink中，可以通过设置sink.bulk-flush.backoff.type为UPDATE来实现。

请确保在Flink的配置中添加如下设置：




'sink.bulk-flush.max-actions': '1'
'sink.bulk-flush.max-size': '1mb'
'sink.bulk-flush.interval': '1s'
'sink.bulk-flush.backoff.type': 'UPDATE'
'sink.bulk-flush.backoff.max-retries': '1'

这样配置后，当Flink尝试写入数据到Elasticsearch时，如果_id已存在，它将尝试更新现有文档而不是覆盖它。如果你的数据中包含了_id字段，并且你已经在Flink SQL DDL中正确指定了主键，这些设置应该可以避免数据被覆盖的问题。

System

2024-08-25

所有,elasticsearch




// 引入相关库
import * as THREE from 'three';
import Stats from 'three/examples/jsm/libs/stats.module.js';
import { OrbitControls } from 'three/examples/jsm/controls/OrbitControls.js';
import { GLTFLoader } from 'three/examples/jsm/loaders/GLTFLoader.js';
 
// 场景、摄像机、渲染器等初始化代码略...
 
// 加载模型
const loader = new GLTFLoader();
loader.load( 'models/animals/horse.gltf', function ( gltf ) {
 
    // 获取模型中的角色
    const horse = gltf.scene.children[ 0 ];
    horse.scale.set( 0.01, 0.01, 0.01 ); // 缩小模型
    scene.add( horse );
 
    // 设置动画和关键帧
    const mixer = new THREE.AnimationMixer( horse );
    const action = mixer.clipAction( gltf.animations[ 0 ] );
 
    // 播放动画
    action.play();
 
    // 更新动画
    mixer.timeScale = 0.8; // 放慢动画
 
    // 创建关键帧
    const keyframes = [];
    keyframes.push({ time: 0, position: horse.position.clone() }); // 记录当前位置
 
    // 设置动画更新时的回调函数
    const clock = new THREE.Clock();
    const update = function ( deltaTime ) {
        const time = clock.getElapsedTime();
 
        // 每隔一定时间记录关键帧
        if ( time > keyframes[ keyframes.length - 1 ].time + 2 ) {
            keyframes.push({
                time: time,
                position: horse.position.clone()
            });
        }
 
        // 插值计算关键帧之间的位置
        const keys = keyframes.length;
        if ( keys > 1 ) {
            const prevKey = keyframes[ keys - 2 ];
            const nextKey = keyframes[ keys - 1 ];
            const p = ( time - prevKey.time ) / ( nextKey.time - prevKey.time );
            horse.position.lerpVectors( prevKey.position, nextKey.position, p );
        }
 
        // 更新动画
        mixer.update( deltaTime );
 
        // 渲染场景
        renderer.render( scene, camera );
    };
 
    // 监听浏览器窗口大小变化
    window.addEventListener( 'resize', onWindowResize );
 
    // 渲染循环
    function animate() {
        requestAnimationFrame( animate );
        update();
    }
 
    animate();
 
}, undefined, function ( error ) {
    console.error( error );
} );
 
// 其他函数和监听器略...

这段代码示例展示了如何在Three.js中加载一个glTF模型，并设置其动画和关键帧记录。它演示了如何使用THREE.AnimationMixer来播放和控制模型的动画，并使用关键帧数组来记录和插值计算模型的位置。最后，它提供了一个update函数，该函数在每一帧调用，用于更新动画状态和渲染场景。

- 阅读更多 -

在Ubuntu 16.04上安装和配置Elasticsearch的方法

System

2024-08-25

所有,elasticsearch




#!/bin/bash
 
# 更新系统包索引
sudo apt-get update
 
# 安装依赖包
sudo apt-get install -y openjdk-8-jdk
 
# 添加Elasticsearch PPA
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
 
# 安装Elasticsearch
sudo apt-get install -y elasticsearch
 
# 启动Elasticsearch服务
sudo systemctl start elasticsearch.service
 
# 设置Elasticsearch服务开机自启
sudo systemctl enable elasticsearch.service
 
# 检查Elasticsearch服务状态
sudo systemctl status elasticsearch.service

这段代码首先更新了系统的包索引，然后安装了OpenJDK 8，并添加了Elasticsearch的PPA仓库。接着，安装了Elasticsearch，并启动了服务。最后，设置了Elasticsearch在开机时自动启动，并检查了服务的状态。这样就完成了在Ubuntu 16.04上安装和配置Elasticsearch的过程。

- 阅读更多 -

【Kubenates新增gpu节点调度】

System

2024-08-25

所有,elasticsearch

在Kubernetes中，要使得GPU节点能够调度，需要确保集群中安装了NVIDIA的GPU驱动和相关的device plugin。以下是一个简单的步骤指导和示例代码，用于确保GPU节点可以被Kubernetes调度。

确保GPU驱动安装正确。
确保Kubernetes集群中的kubelet配置了--feature-gates=Accelerators=true。
确保安装了NVIDIA的device plugin。

示例代码（在GPU节点上）:




# nvidia-device-plugin-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-daemonset
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-daemonset
    spec:
      containers:
      - name: nvidia-device-plugin-container
        image: nvidia/k8s-device-plugin:1.0.0-beta
        volumeMounts:
          - name: device-plugin-socket
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin-socket
          hostPath:
            path: /var/lib/kubelet/device-plugins

部署device plugin:




kubectl apply -f nvidia-device-plugin-daemonset.yaml

确保GPU资源在Pod规格中被请求：




apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-devel
    resources:
      limits:
        nvidia.com/gpu: 1 # 请求1个GPU

这样，Kubernetes集群就会调度GPU资源给请求它们的Pod。确保你的节点标签正确，以便调度器可以按期望的方式工作。

- 阅读更多 -

基于cmake为项目自动获取git分支tag的版本号和commitid

System

2024-08-25

所有,elasticsearch

要使用CMake自动获取Git分支的标签版本号和提交ID，你可以使用Git命令行工具和CMake的execute_process指令。以下是一个简单的CMake脚本示例，它将获取当前分支的最新标签和最新提交ID。




# 查找Git可执行文件的位置
find_package(Git)
 
# 确保Git可用
if(NOT GIT_FOUND)
  message(FATAL_ERROR "Git must be installed to get version information.")
endif()
 
# 定义函数来获取Git标签和提交ID
function(get_git_tag_and_commit_id VERSION_VAR COMMIT_ID_VAR)
  # 获取最新的标签
  execute_process(
    COMMAND ${GIT_EXECUTABLE} describe --tags --abbrev=0
    WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
    OUTPUT_VARIABLE ${VERSION_VAR}
    OUTPUT_STRIP_TRAILING_WHITESPACE
  )
 
  # 获取最新的提交ID
  execute_process(
    COMMAND ${GIT_EXECUTABLE} rev-parse HEAD
    WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
    OUTPUT_VARIABLE ${COMMIT_ID_VAR}
    OUTPUT_STRIP_TRAILING_WHITESPACE
  )
endfunction()
 
# 调用函数，获取版本号和提交ID
get_git_tag_and_commit_id(PROJECT_VERSION PROJECT_COMMIT_ID)
 
# 打印结果
message(STATUS "Version: ${PROJECT_VERSION}")
message(STATUS "Commit ID: ${PROJECT_COMMIT_ID}")

在CMakeLists.txt中使用上述代码，它会自动获取当前Git仓库的最新标签和最新提交ID，并将它们存储在变量PROJECT_VERSION和PROJECT_COMMIT_ID中。然后，你可以在CMake配置中使用这些变量，例如在安装头文件中定义版本号。

- 阅读更多 -

Python 玩转 Elasticsearch的一些技巧与经验

System

2024-08-25

所有,elasticsearch




from datetime import datetime
import elasticsearch
 
# 连接到Elasticsearch
es = elasticsearch.Elasticsearch(hosts=['localhost:9200'])
 
# 创建一个新的Elasticsearch文档
def create_es_doc(index_name, doc_id, doc_data):
    doc = {
        'doc': doc_data,
        '_index': index_name,
        '_id': doc_id,
        '_source': doc_data
    }
    res = es.index(body=doc)
    print(f"Document {doc_id} created: {res['result']}")
 
# 更新Elasticsearch文档
def update_es_doc(index_name, doc_id, doc_data):
    doc = {
        'doc': doc_data,
        '_index': index_name,
        '_id': doc_id
    }
    res = es.update(body=doc)
    print(f"Document {doc_id} updated: {res['result']}")
 
# 获取Elasticsearch文档
def get_es_doc(index_name, doc_id):
    res = es.get(index=index_name, id=doc_id)
    print(f"Document {doc_id} retrieved: {res['_source']}")
 
# 删除Elasticsearch文档
def delete_es_doc(index_name, doc_id):
    res = es.delete(index=index_name, id=doc_id)
    print(f"Document {doc_id} deleted: {res['result']}")
 
# 创建一个新的Elasticsearch索引
def create_es_index(index_name):
    res = es.indices.create(index=index_name, ignore=400)
    print(f"Index {index_name} created: {res['acknowledged']}")
 
# 删除Elasticsearch索引
def delete_es_index(index_name):
    res = es.indices.delete(index=index_name, ignore=[400, 404])
    print(f"Index {index_name} deleted: {res['acknowledged']}")
 
# 使用示例
index_name = 'example_index'
doc_id = 'example_doc'
doc_data = {
    'title': 'Python Elasticsearch Example',
    'content': 'This is an example document for Elasticsearch',
    'date': datetime.now()
}
 
create_es_index(index_name)
create_es_doc(index_name, doc_id, doc_data)
update_es_doc(index_name, doc_id, doc_data)
get_es_doc(index_name, doc_id)
delete_es_doc(index_name, doc_id)
delete_es_index(index_name)

这段代码展示了如何使用Python和elasticsearch库来与Elasticsearch进行交互。代码中包含了创建索引、创建和更新文档、获取文档以及删除文档的基本操作，并提供了使用这些操作的示例。

- 阅读更多 -

Elasticsearch ：ES基本概念和原理简单介绍

System

2024-08-25

所有,elasticsearch

Elasticsearch是一个基于Lucene库的开源搜索和分析引擎，设计用于云计算中，能够达到实时搜索，高可用，扩展性和管理的复杂Heap大小数据。它提供了一个分布式多用户能力的全文搜索引擎，基于RESTful web接口。

Elasticsearch的基本概念和原理包括：

索引(Index): 一个Elasticsearch索引是一个文档的集合。
文档(Document): 一个Elasticsearch文档是一个可被索引的基本信息单元，类似于传统数据库中的一行数据。
类型(Type): 一个索引中的不同文档类型。
节点(Node): 运行Elasticsearch服务的服务器称为节点。
集群(Cluster): 由多个节点组成的网络，内部节点间通信，自动发现其他节点。
分片与副本(Shards & Replicas): 数据分布式存储和高可用性的方式。
分析器(Analyzers): 文本分析工具，用于文本分词和处理。

示例代码（使用Elasticsearch的Python客户端）：




from elasticsearch import Elasticsearch
 
# 连接到Elasticsearch
es = Elasticsearch("http://localhost:9200")
 
# 创建一个索引
es.indices.create(index='my_index', ignore=400)
 
# 添加一个文档到索引
doc = {
    'name': 'John Doe',
    'age': 30,
    'about': 'I love to go rock climbing'
}
res = es.index(index='my_index', id=1, document=doc)
 
# 搜索索引中的文档
res = es.search(index='my_index', query={'match': {'about': 'climbing'}})
 
print(res['hits']['hits'])

这段代码展示了如何使用Elasticsearch Python客户端连接到Elasticsearch服务，创建一个索引，添加一个文档，并执行一个基本的搜索查询。

- 阅读更多 -

从零手写实现 nginx-21-modules 模块

System

2024-08-25

所有,elasticsearch

实现一个Nginx模块涉及多个步骤，包括模块初始化、配置解析、日志记录等。以下是一个简单的Nginx模块骨架，它可以作为开始手写模块的起点。




#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>
 
// 模块的 context 结构体
typedef struct {
    ngx_str_t test_str;
} ngx_http_mytest_conf_t;
 
// 解析配置项的回调函数
static char *ngx_http_mytest(ngx_conf_t *cf, ngx_command_t *cmd, void *conf) {
    ngx_http_core_loc_conf_t *clcf;
    clcf = ngx_http_conf_get_module_loc_conf(cf, ngx_http_core_module);
    clcf->handler = ngx_http_mytest_handler;
 
    // 获取配置项参数
    ngx_str_t *value = cf->args->elts;
    ngx_http_mytest_conf_t *lcf = conf;
    lcf->test_str = value[1];
 
    return NGX_CONF_OK;
}
 
// 模块处理请求的回调函数
static ngx_int_t ngx_http_mytest_handler(ngx_http_request_t *r) {
    if (!(r->method & (NGX_HTTP_GET|NGX_HTTP_HEAD))) {
        return NGX_HTTP_NOT_ALLOWED;
    }
 
    ngx_int_t rc = ngx_http_discard_request_body(r);
    if (rc != NGX_OK) {
        return rc;
    }
 
    // 设置响应头
    ngx_str_t *test_str = ...; // 从配置中获取字符串
    ngx_table_elt_t *h = ngx_list_push(&r->headers_out.headers);
    if (h == NULL) {
        return NGX_HTTP_INTERNAL_SERVER_ERROR;
    }
 
    h->key = ngx_http_mytest_header_name;
    h->value = *test_str;
    h->hash = 1; // 计算hash值
 
    // 发送响应
    ngx_int_t status = NGX_HTTP_OK;
    ngx_str_t response = ngx_http_mytest_response;
    ngx_buf_t *b = ngx_create_temp_buf(r.pool, response.len);
    if (b == NULL) {
        return NGX_HTTP_INTERNAL_SERVER_ERROR;
    }
 
    ngx_memcpy(b->pos, response.data, response.len);
    b->last = b->pos + response.len;
    b->memory = 1;
    b->last_buf = 1;
 
    r->headers_out.status = status;
    r->headers_out.content_length_n = response.len;
    r->headers_out.content_type_len = sizeof("text/plain") - 1;
    ngx_str_set(&r->headers_out.content_type, "text/plain");
 
    return ngx_http_send_header(r) == NGX_ERROR
        ? NGX_HTTP_INTERNAL_SERVER_ERROR
        : ngx_http_output_filter(r, b);
}
 
// 模块的定义
ngx_module_t ngx_http_mytest_module = {
    ..., // 模块的各种方法
    ngx_http_mytest_create_conf, // 创建配置结构体的方法
    ngx_http_mytest_init_conf // 初始化配置结构体的方法
};
 
// 配置指令结构体
static ngx_command_t ngx_http_

System

2024-08-25

所有,elasticsearch

在Windows环境下，使用Elasticsearch进行Word, Excel, PDF文件的全文索引和检索，可以通过以下步骤实现：

安装Elasticsearch和Kibana。
设置Elasticsearch的ingest node功能，以支持文件附件的处理。
使用Logstash或者其他工具来处理文档文件，并将它们索引到Elasticsearch中。
使用Elasticsearch的查询API进行全文检索。

以下是一个简化的例子，展示如何使用Logstash索引Word文件：

安装Logstash。
创建一个Logstash配置文件，如logstash-simple.conf，用于Word文件的索引：




input {
  file {
    path => "C:\path\to\your\documents\*.docx"
    start_position => "beginning"
  }
}
 
filter {
  mutate {
    split => ["message", " "]
  }
  word_delimiter {
    generate_word_parts => true
    generate_number_parts => true
    catenate_all => true
  }
  lowercase
}
 
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "word_index"
    document_type => "word_doc"
  }
}

运行Logstash：




logstash -f logstash-simple.conf

使用Elasticsearch的查询API进行全文检索。例如，使用curl查询：




curl -X GET "localhost:9200/word_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "message": "search text"
    }
  }
}
'

请注意，这个例子是一个简化的示例，实际应用中可能需要考虑更多的配置和细节。对于Excel和PDF文件，你可能需要不同的插件或自定义处理方式，因为它们的格式复杂性不同。对于PDF，可能需要使用专门的库来提取文本，而对于Excel，你可能需要将其转换为CSV格式再进行索引。

- 阅读更多 -

【聚类】K-modes和K-prototypes——适合离散数据的聚类方法

System

2024-08-25

所有,elasticsearch

K-modes和K-prototypes是两种常用的聚类方法，适用于离散数据。以下是Python中使用scikit-learn库实现K-modes和K-prototypes聚类的示例代码：




from sklearn.cluster import KMeans, k_prototypes
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score
import numpy as np
 
# 生成示例数据
data = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 2, 2],
    [1, 2, 2, 2, 2],
    [2, 2, 2, 2, 1],
    [2, 2, 2, 1, 1],
    [2, 1, 1, 1, 1]
])
 
# K-modes聚类
k_modes_clusterer = KMeans(n_clusters=3)
k_modes_clusterer.fit(data)
k_modes_labels = k_modes_clusterer.labels_
k_modes_centroids = k_modes_clusterer.cluster_centers_
 
# K-prototypes聚类
k_prototypes_clusterer = k_prototypes(n_clusters=3, data=data)
k_prototypes_labels = k_prototypes_clusterer.labels_
k_prototypes_centroids = k_prototypes_clusterer.cluster_centers_
 
# 评估聚类效果
print("K-modes AMI:", adjusted_mutual_info_score(np.argmax(data, axis=1), k_modes_labels))
print("K-modes ARI:", adjusted_rand_score(np.argmax(data, axis=1), k_modes_labels))
print("K-prototypes AMI:", adjusted_mutual_info_score(np.argmax(data, axis=1), k_prototypes_labels))
print("K-prototypes ARI:", adjusted_rand_score(np.argmax(data, axis=1), k_prototypes_labels))

这段代码首先导入必要的库，生成示例离散数据，然后使用K-modes和K-prototypes算法进行聚类，并使用调整后的互信息（AMI）和调整后的Rand指数（ARI）评估聚类效果。

- 阅读更多 -