Articles tagged "full-text search"

2025-06-20

All, elasticsearch

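Intended for developers in intelligent document analysis, legal tech, financial report processing, and enterprise knowledge search who are building next-generation search systems that can handle complex documents such as PDF, Word, HTML, and email.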

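Why process complex documents?
Unstructured.io: overview and strengths
Elasticsearch as a vector database
Architecture overview: complex documents → search engine
Document processing and vector generation
Elasticsearch vector index configuration and search
End-to-end code example: from document to search results
Common problems and performance tuning
Summary and recommended practices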

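1. Why process complex documents?

Enterprises hold large numbers of loosely structured documents in many formats, for example:

Contracts (PDF, DOCX)
Technical manuals (HTML, PPT)
Email (.eml)
Scans (images that need OCR)

Where traditional full-text search systems struggle:

Many formats, each needing its own parser
Nested content structure, so results cannot be scoped to a paragraph
Users ask in natural language, which calls for semantic matching

What is needed, therefore:

Uniform content extraction
One vector per chunk
Semantic retrieval in a vector database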

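2. Unstructured.io: overview and strengths

Unstructured.io is an open-source tool for structuring documents. It extracts content from many formats through one interface.

Supported formats

Type	Examples
Documents	PDF, DOCX, PPTX
Web pages	HTML
Email	.eml, .msg
Images	PNG, JPG (via OCR)

Output format

Each piece of content is extracted as a JSON object with metadata attached (position, page number, element type, and so on):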

{
  "type": "NarrativeText",
  "element_id": "uuid-1234",
  "text": "本合同适用于...",
  "metadata": {
    "page_number": 3
  }
}

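Features

Extracts content as discrete elements that can be grouped into chunks
Detects structure automatically: titles, tables, images, body text, and more
Works well as a preprocessing step for vector search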
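
To make the chunking idea concrete, here is a minimal sketch in plain Python that groups extracted elements into sections, starting a new chunk at each Title element. The input dicts mimic the JSON shape shown above. `group_by_title` and the sample data are hypothetical and are not part of the `unstructured` API, which ships its own chunking utilities.

```python
from typing import Dict, List, Optional


def group_by_title(elements: List[Dict]) -> List[Dict]:
    """Merge consecutive elements into chunks, opening a new chunk at each Title."""
    chunks: List[Dict] = []
    current: Optional[Dict] = None
    for el in elements:
        text = el.get("text", "").strip()
        if not text:
            continue  # skip empty elements
        is_title = el.get("type") == "Title"
        if is_title or current is None:
            current = {
                "title": text if is_title else None,
                "page": el.get("metadata", {}).get("page_number"),
                "texts": [],
            }
            chunks.append(current)
            if is_title:
                continue  # the title itself is stored in "title", not in "texts"
        current["texts"].append(text)
    return chunks


# Hypothetical sample input, shaped like the extracted JSON elements.
sample = [
    {"type": "Title", "text": "Scope", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "This contract applies to ...", "metadata": {"page_number": 1}},
    {"type": "Title", "text": "Liability", "metadata": {"page_number": 3}},
    {"type": "NarrativeText", "text": "The breaching party shall ...", "metadata": {"page_number": 3}},
    {"type": "NarrativeText", "text": "Damages are capped at ...", "metadata": {"page_number": 3}},
]

chunks = group_by_title(sample)
for chunk in chunks:
    print(chunk["title"], chunk["page"], len(chunk["texts"]))
```

Embedding one vector per chunk rather than per element keeps a heading together with the paragraphs it introduces, which tends to give the embedding model more context.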

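3. Elasticsearch as a vector database

The `dense_vector` field type has existed since Elasticsearch 7.x, and 8.0 added native approximate kNN search. Elasticsearch 8.x supports:

Exact kNN and approximate kNN (HNSW)
A maximum vector dimension that depends on the version (1024 in early 8.x releases, later raised to 2048 and then 4096; check the documentation for your release)
`dense_vector` fields queried with the `knn` search option

It is usually paired with an embedding model to implement semantic search:

Text → vector (through the model)
Vector → Elasticsearch retrieval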
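
As a rough illustration of what exact kNN computes, the sketch below scores a query vector against every stored vector with cosine similarity and keeps the top k. It is plain Python with toy two-dimensional vectors, and `cosine` and `exact_knn` are hypothetical helpers rather than Elasticsearch APIs. HNSW approximates this ranking without scanning every vector.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_knn(query: Sequence[float],
              vectors: List[Sequence[float]],
              k: int) -> List[Tuple[int, float]]:
    """Brute-force search: score every vector, return the k best (index, score) pairs."""
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: vector 0 points the same way as the query, vector 2 is close to it.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
top = exact_knn([1.0, 0.0], toy_vectors, k=2)
print(top)
```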

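4. Overall architecture (text diagram)

       +------------------------------+
       |     PDF/DOCX files, etc.     |
       +--------------+---------------+
                      ↓
       +------------------------------+
       |       Unstructured.io        |  ← structure extraction & chunking
       +--------------+---------------+
                      ↓
       +------------------------------+
       |       Embedding model        |  ← turns chunks into vectors (e.g. BGE/MPNet)
       +--------------+---------------+
                      ↓
       +------------------------------+
       |  Elasticsearch vector index  |
       +--------------+---------------+
                      ↓
       +------------------------------+
       |   Natural-language search    |
       +------------------------------+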

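5. Document processing and vector generation

5.1 Extract document structure with `unstructured`

Install (the `pdf` extra pulls in the PDF parsing dependencies):

pip install "unstructured[pdf]"

Parse a PDF: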

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("contract.pdf")
for el in elements:
    print(el.text, el.metadata.page_number)

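5.2 Convert text to vectors with an embedding model

Install sentence-transformers (model weights are downloaded from Hugging Face on first use):

pip install sentence-transformers

Example: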

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-zh")
vectors = [model.encode(el.text) for el in elements if el.text.strip()]

6. Elasticsearch vector index configuration and search

6.1 Mapping

PUT /document_index
{
  "mappings": {
    "properties": {
      "text": { "type": "text" },
      "embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 128
        }
      },
      "page": { "type": "integer" },
      "file_id": { "type": "keyword" }
    }
  }
}

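6.2 Writing vectors

from elasticsearch import Elasticsearch

# The 8.x client needs an explicit node URL; adjust it for your cluster.
es = Elasticsearch("http://localhost:9200")

# `vectors` only covers non-empty elements, so pair it with the same
# filtered list instead of indexing it by position in `elements`.
non_empty = [el for el in elements if el.text.strip()]

for el, vector in zip(non_empty, vectors):
    doc = {
        "text": el.text,
        "embedding": vector.tolist(),  # NumPy array -> plain list for JSON
        "page": el.metadata.page_number,
        "file_id": "contract_2025"
    }
    es.index(index="document_index", document=doc)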

7. End-to-end code (simplified)

from unstructured.partition.pdf import partition_pdf
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch

# Extract elements and drop the empty ones up front, so that elements
# and vectors stay aligned by position.
elements = [el for el in partition_pdf("contract.pdf") if el.text.strip()]

# Embed the text
model = SentenceTransformer("BAAI/bge-base-zh")
vectors = model.encode([el.text for el in elements])

# Write to Elasticsearch (the 8.x client needs an explicit node URL)
es = Elasticsearch("http://localhost:9200")

for el, vector in zip(elements, vectors):
    es.index(index="document_index", document={
        "text": el.text,
        "embedding": vector.tolist(),  # NumPy array -> plain list for JSON
        "page": el.metadata.page_number,
        "file_id": "contract_2025"
    })

8. Natural-language search example

User input: "合同中关于违约责任的条款是什么？" ("What does the contract say about liability for breach?")

Search code

query = "违约责任条款"
query_vector = model.encode(query)

resp = es.search(index="document_index", knn={
    "field": "embedding",
    "query_vector": query_vector.tolist(),
    "k": 5,
    "num_candidates": 100
})

for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"])
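
When one index holds chunks from many files, it often helps to restrict the kNN search to a single document. The sketch below builds the `knn` section with an optional `filter` clause on the `file_id` keyword field from the mapping above. `build_knn` is a hypothetical helper, and the field names are taken from this article's mapping.

```python
from typing import Dict, Optional, Sequence


def build_knn(query_vector: Sequence[float],
              k: int = 5,
              num_candidates: int = 100,
              file_id: Optional[str] = None) -> Dict:
    """Build the `knn` search section, optionally restricted to one file."""
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    knn: Dict = {
        "field": "embedding",
        # Coerce to plain floats so NumPy values serialize as JSON.
        "query_vector": [float(x) for x in query_vector],
        "k": k,
        "num_candidates": num_candidates,
    }
    if file_id is not None:
        # The filter is applied during the kNN search, so k hits can still be returned.
        knn["filter"] = {"term": {"file_id": file_id}}
    return knn


knn_section = build_knn([0.1, 0.2, 0.3], k=3, num_candidates=50, file_id="contract_2025")
print(knn_section)
```

With a live client this would be passed as `es.search(index="document_index", knn=build_knn(query_vector, file_id="contract_2025"))`.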

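9. Common problems and tuning tips

Problem	Cause	Fix
Inaccurate results	Too few candidates considered	Increase `num_candidates`
PDF structure cannot be read	Non-standard PDF	Fall back to parsing with `pdfplumber`
Slow writes	High-dimensional vectors plus network latency	Write in batches with the bulk API
Slow queries	`dense_vector` field is not indexed	Set `index: true` and use HNSW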
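
For the slow-write row, a minimal sketch of batching is shown below. `build_actions` yields one bulk action per chunk and is plain Python. `bulk_index` hands the actions to `elasticsearch.helpers.bulk` and imports the client lazily, so the first part runs without a cluster. Both function names and the sample data are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Sequence


def build_actions(index: str,
                  texts: Sequence[str],
                  vectors: Sequence[Sequence[float]],
                  pages: Sequence[int],
                  file_id: str) -> Iterator[Dict]:
    """Yield one bulk action per chunk; the inputs must be aligned by position."""
    for text, vector, page in zip(texts, vectors, pages):
        yield {
            "_index": index,
            "_source": {
                "text": text,
                "embedding": [float(x) for x in vector],
                "page": page,
                "file_id": file_id,
            },
        }


def bulk_index(es, actions: Iterable[Dict], chunk_size: int = 500):
    """Send the actions in batches of chunk_size instead of one request per document."""
    from elasticsearch import helpers  # imported here so build_actions works without the client
    return helpers.bulk(es, actions, chunk_size=chunk_size)


actions = list(build_actions(
    "document_index",
    ["first chunk", "second chunk"],
    [[0.1, 0.2], [0.3, 0.4]],
    [1, 2],
    "contract_2025",
))
print(len(actions), actions[0]["_source"]["page"])
```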

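10. Summary and recommended practices

Module	Recommended tool / configuration
Document parsing	`unstructured`
Text embedding	`BAAI/bge-base-zh`, `sentence-transformers`
Vector search	Elasticsearch 8.x + HNSW
Query interface	REST API / LangChain integration
Extensions	OCR, table structure extraction, and similar modules can be added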

MongoDB full-text search

2024-09-06

MongoDB provides text indexes, which are queried with the $text operator.

The following simple example shows how to create a text index in MongoDB and query it:




// Assume a collection named articles, with a text index wanted on the title field
 
// 1. Create the text index
db.articles.createIndex({ title: "text" });
 
// 2. Search with $text
db.articles.find({ $text: { $search: "java spring" } });
 
// Terms are ORed: this returns documents whose title contains "java" or "spring"

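Notes:

A text index only covers fields whose values are strings or arrays of strings.
$text matching is case-insensitive by default; pass $caseSensitive: true to change that.
Text indexes tokenize and stem a fixed set of languages (English by default, selectable with default_language), and Chinese is not among them. For Chinese, or for more sophisticated analysis, consider a dedicated search engine such as Elasticsearch.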
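
Since bare terms in `$search` are ORed, it can help to build the search string deliberately. The sketch below assembles a `$text` filter from optional terms, one required phrase (wrapped in double quotes), and excluded terms (prefixed with `-`). `build_text_filter` is a hypothetical helper; the resulting dict could be passed to `find()` in PyMongo or the shell.

```python
from typing import Dict, Iterable, Optional


def build_text_filter(terms: Iterable[str] = (),
                      phrase: Optional[str] = None,
                      exclude: Iterable[str] = ()) -> Dict:
    """Build a $text filter: bare terms are ORed, a quoted phrase is required,
    and terms prefixed with '-' are excluded."""
    parts = list(terms)
    if phrase:
        parts.append('"%s"' % phrase)
    parts.extend("-%s" % term for term in exclude)
    return {"$text": {"$search": " ".join(parts)}}


either = build_text_filter(["java", "spring"])
strict = build_text_filter(["java"], phrase="spring boot", exclude=["legacy"])
print(either)
print(strict)
```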

2024-09-06

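In Elasticsearch, the Query DSL (Domain Specific Language) lets you compose a wide range of queries. Two common query types, plus the related highlight option, are:

term query: finds documents whose field contains exactly the given value; the value is not analyzed.
match query: used for full-text search; it analyzes the query string first, then finds matching documents.
highlight: not a query type but an option on the search request; it returns the fragments of each document that matched the query.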

term query




GET /_search
{
  "query": {
    "term": {
      "user.id": "kimchy"
    }
  }
}

match query




GET /_search
{
  "query": {
    "match": {
      "message": "quick brown fox"
    }
  }
}

highlight option




GET /_search
{
  "query": {
    "match": {
      "message": "quick brown fox"
    }
  },
  "highlight": {
    "fields": {
      "message": {}
    }
  }
}

The examples above show term, match, and highlight in use: term is for exact-value lookups, match is for full-text search, and highlight marks the matching fragments in the search results.
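
The same request bodies can be built as plain dicts before being sent from a client. This is a minimal sketch. `term_query` and `match_query` are hypothetical helpers, and the field names come from the examples above.

```python
import json
from typing import Any, Dict


def term_query(field: str, value: Any) -> Dict:
    """Exact-value lookup; the value is not analyzed."""
    return {"query": {"term": {field: value}}}


def match_query(field: str, text: str, highlight: bool = False) -> Dict:
    """Full-text match; optionally ask for highlighted fragments of the same field."""
    body: Dict = {"query": {"match": {field: text}}}
    if highlight:
        body["highlight"] = {"fields": {field: {}}}
    return body


term_body = term_query("user.id", "kimchy")
match_body = match_query("message", "quick brown fox", highlight=True)
print(json.dumps(match_body, indent=2))
```

Note that highlight sits beside `query` in the body rather than inside it, which is why it is better described as a request option than as a query type.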

Integrating the Lucene full-text search engine with Spring Boot

2024-09-06

All, java

The basic steps for creating and searching a Lucene index from Spring Boot are as follows:

Add the dependencies to pom.xml:




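<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>YOUR_LUCENE_VERSION</version>
</dependency>
<!-- QueryParser, used by the searcher, lives in a separate module -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>YOUR_LUCENE_VERSION</version>
</dependency>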

Create the index:




@Component
public class LuceneIndexer {
 
    private static final String INDEX_DIR = "path/to/index/dir";
 
    public void index(List<DocumentData> documents) throws IOException {
        Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexWriterConfig config = new IndexWriterConfig();
        IndexWriter writer = new IndexWriter(dir, config);
 
        for (DocumentData document : documents) {
            Document luceneDocument = new Document();
            luceneDocument.add(new StringField("id", document.getId(), Field.Store.YES));
            luceneDocument.add(new TextField("content", document.getContent(), Field.Store.YES));
            writer.addDocument(luceneDocument);
        }
 
        writer.close();
    }
}

Search the index:




@Component
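public class LuceneSearcher {
 
    // Must point at the directory that LuceneIndexer writes to.
    private static final String INDEX_DIR = "path/to/index/dir";
 
    // QueryParser.parse throws the checked ParseException on malformed input.
    public List<SearchResult> search(String query) throws IOException, ParseException {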
        List<SearchResult> results = new ArrayList<>();
        Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
 
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query luceneQuery = parser.parse(query);
 
        TopDocs topDocs = searcher.search(luceneQuery, 10);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
 
        for (ScoreDoc scoreDoc : scoreDocs) {
            Document doc = searcher.doc(scoreDoc.doc);
            SearchResult result = new SearchResult();
            result.setId(doc.get("id"));
            result.setContent(doc.get("content"));
            results.add(result);
        }
 
        reader.close();
        return results;
    }
}

Use a Spring Boot CommandLineRunner to create and search the index:




@SpringBootApplication
public class LuceneApplication implements CommandLineRunner {
 
    @Autowired
    private LuceneIndexer indexer;
    @Autowired
    private LuceneSearcher searcher;
 
    public static void main(String[] args) {
        SpringApplication.run(LuceneApplication.class, args);
    }
 
    @Override
    public void run(String... args) throws Exception {

2024-09-02

All, java

In a Spring Boot project, full-text search with highlighting can be implemented with Spring Data Elasticsearch as follows:




import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageImpl;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.elasticsearch.core.ElasticsearchOperations;
import org.springframework.data.elasticsearch.core.SearchHits;
import org.springframework.data.elasticsearch.core.query.NativeSearchQuery;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;
import org.springframework.stereotype.Service;
 
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
 
// Written against the Spring Data Elasticsearch 4.x API (NativeSearchQueryBuilder).
@Service
public class SearchService {
 
    @Autowired
    private ElasticsearchOperations elasticsearchOperations;
 
    public Page<Item> search(String query, int page, int size) {
        PageRequest pageable = PageRequest.of(page, size);
        NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
                .withQuery(QueryBuilders.matchQuery("fieldName", query))
                .withHighlightFields(
                        new HighlightBuilder.Field("fieldName").preTags("<em>").postTags("</em>")
                )
                .withPageable(pageable)
                .build();
 
        SearchHits<Item> searchHits = elasticsearchOperations.search(searchQuery, Item.class);
        List<Item> content = searchHits.getSearchHits().stream()
                .map(searchHit -> {
                    Item item = searchHit.getContent();
                    Map<String, List<String>> highlightFields = searchHit.getHighlightFields();
                    if (highlightFields.containsKey("fieldName")) {
                        // Replace the raw value with the first highlighted fragment.
                        item.setFieldName(highlightFields.get("fieldName").get(0));
                    }
                    return item;
                })
                .collect(Collectors.toList());
 
        return new PageImpl<>(content, pageable, searchHits.getTotalHits());
    }
}
 
class Item {
    // Entity fields and accessors
    private String fieldName;
 
    public String getFieldName() {
        return fieldName;
    }
 
    public void setFieldName(String fieldName) {
        this.fieldName = fieldName;
    }
 
    // Other fields and methods
}

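In this example, the SearchService class exposes a search method that builds a keyword query with NativeSearchQueryBuilder and enables highlighting on the field through withHighlightFields. It then executes the search and, for each hit, copies the highlighted text back into the entity object.

Note that this example assumes you already have an Elasticsearch node…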

MongoDB full-text search: fast, precise text search

2024-09-02




from pymongo import MongoClient
 
# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
db = client['text_search']
 
# Create a wildcard text index that covers every string field
db.articles.create_index([("$**", "text")])
 
# Run the text search
search_results = db.articles.find({"$text": {"$search": "Python MongoDB"}})
 
# Print the results
for result in search_results:
    print(result)

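This code creates a text index on a collection named articles and runs a text search for the keywords "Python" and "MongoDB". The terms are ORed, so a document matching either one is returned. The code first connects to a local MongoDB instance, then creates the index, and finally runs the search and prints the results.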

2024-09-02




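import sqlite3
 
# Connect to the SQLite database
conn = sqlite3.connect('example.db')
 
# "simple" is a third-party FTS5 tokenizer, not a built-in one, so its
# extension must be loaded first. The library path below is an assumption.
conn.enable_load_extension(True)
conn.load_extension('./libsimple')
cursor = conn.cursor()
 
# Create the FTS5 virtual table with the simple tokenizer.
# executescript() is needed to run several statements at once, and FTS5
# columns take no type declarations.
cursor.executescript('''
DROP TABLE IF EXISTS fts;
CREATE VIRTUAL TABLE fts USING fts5(
    content,
    tokenize='simple'
);
''')
 
# Insert rows into the virtual table
cursor.execute('INSERT INTO fts(content) VALUES (?)', ('你好 世界',))
cursor.execute('INSERT INTO fts(content) VALUES (?)', ('你好 世界 编程',))
 
# Run the full-text query
cursor.execute('SELECT * FROM fts WHERE fts MATCH ?', ('"你好" OR "世界"',))
 
# Print the results
for row in cursor.fetchall():
    print(row)
 
# Commit the transaction and close the connection
conn.commit()
conn.close()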

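This code shows full-text search in SQLite using FTS5 with the simple tokenizer, a third-party extension that supports Chinese and pinyin matching. It first creates an FTS5 virtual table that uses the simple tokenizer, then inserts two rows, and finally runs a full-text query for rows containing "你好" or "世界". The query here matches on the Chinese words themselves and does not exercise pinyin matching.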
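
For comparison, here is a minimal sketch of the same flow using FTS5's built-in `unicode61` tokenizer, so no extension has to be loaded. It assumes your Python's SQLite build includes FTS5, and it only works for this data because the sample text is already split into words with spaces. The table name `docs` and column name `body` are hypothetical.

```python
import sqlite3

# In-memory database, so the example leaves no file behind.
conn = sqlite3.connect(":memory:")

# FTS5 columns are declared without types; unicode61 is FTS5's default tokenizer.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='unicode61')")
conn.executemany(
    "INSERT INTO docs(body) VALUES (?)",
    [("你好 世界",), ("你好 世界 编程",), ("hello world",)],
)

# OR query: rows containing either token.
rows_any = conn.execute(
    "SELECT body FROM docs WHERE docs MATCH ?", ('"你好" OR "世界"',)
).fetchall()

# Single-token query: only the row that contains 编程.
rows_prog = conn.execute(
    "SELECT body FROM docs WHERE docs MATCH ?", ('"编程"',)
).fetchall()

print(len(rows_any), rows_prog)
conn.close()
```

Unsegmented Chinese text would become one long token under `unicode61`, which is the gap that a dedicated Chinese tokenizer fills.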

[Most complete guide] How to implement full-text search in Python Django [easy to follow]

2024-09-01

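In Django, search can be built on the ORM's own lookups (for example icontains, or django.contrib.postgres.search on PostgreSQL), or on an external library such as Haystack, which integrates full-text search engines like Elasticsearch or Whoosh. Below is a basic example using Haystack with Whoosh:

Install the required packages (Haystack is published on PyPI as django-haystack):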




pip install django
pip install django-haystack
pip install whoosh

Add haystack and the Whoosh engine to the Django project's settings.py:




import os
 
INSTALLED_APPS = [
    # ...
    'haystack',
    # ...
]
 
# Full-text search configuration
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        'PATH': os.path.join(BASE_DIR, 'whoosh_index'),
    },
}

Create the Whoosh search index:

Create a search_indexes.py file in the app directory and define the index for the model:




import datetime
from haystack import indexes
from myapp.models import MyModel
 
class MyModelIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
 
    def get_model(self):
        return MyModel
 
    def index_queryset(self, using=None):
        return self.get_model().objects.all()

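Create the index template, which specifies the fields to index. Haystack looks for it under the templates directory at search/indexes/myapp/mymodel_text.txt (app label, then the lowercased model name):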




{{ object.field1 }}
{{ object.field2 }}
{# add more fields as needed #}

Run the command that builds the index:




python manage.py rebuild_index

Use the search in a view:




from haystack.forms import SearchForm
from django.template.response import TemplateResponse
 
def search(request):
    form = SearchForm(request.GET, load_all=True)
    results = form.search()
    return TemplateResponse(request, 'search.html', {'form': form, 'results': results})

Create the search form template search.html:




<form method="get" action="/search/">
    <input type="text" name="q" />
    <input type="submit" value="Search" />
</form>
{% if results %}
    <ul>
        {% for result in results %}
            <li><a href="{{ result.object.get_absolute_url }}">{{ result.object.title }}</a></li>
        {% empty %}
            <li>No results found.</li>
        {% endfor %}
    </ul>
{% endif %}

With that in place, the Django application has working full-text search. Remember to adapt search_indexes.py and the index template to your actual model fields.

2024-08-26

All, elasticsearch




from datetime import datetime
from elasticsearch import Elasticsearch
 
# Elasticsearch client connected to a local node
es = Elasticsearch("http://localhost:9200")
 
# Create a new index
def create_index(index_name):
    mappings = {
        "properties": {
            "title": {
                "type": "text"
            },
            "content": {
                "type": "text"
            },
            "create_date": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss"
            }
        }
    }
    # The 8.x client takes mappings directly; the body parameter is deprecated.
    response = es.indices.create(index=index_name, mappings=mappings)
    print(response)
 
# Add a document to the index
def add_document(index_name, title, content, create_date=None):
    if not create_date:
        create_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    document = {
        "title": title,
        "content": content,
        "create_date": create_date
    }
    response = es.index(index=index_name, document=document)
    print(response)
 
# Use the functions
create_index("my_index")
add_document("my_index", "Document Title", "This is the document content.", "2023-01-01 12:00:00")

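This code first defines create_index, which takes an index name and creates a new index with the document fields and their types. It then defines add_document, which takes the index name, title, content, and creation date, and adds the document to that index. Finally, it shows both functions in use by creating an index and adding one document.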

2024-08-26

All, html

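To full-text search many file formats in Elasticsearch, you can use ingest nodes together with the attachment processor (the ingest-attachment plugin, which is bundled by default in recent 8.x releases). The steps and example below show how to configure Elasticsearch to index and search the contents of attached files.

Make sure the cluster has at least one ingest node.
Install the ingest-attachment plugin if your version does not bundle it.
Create an ingest pipeline, and an index template that defines the mappings and sets that pipeline as the default.
Index documents through the ingest pipeline.
Run full-text searches.

The configuration and indexing requests look like this: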




# 1. Create the ingest pipeline
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1,
        "ignore_missing": true
      }
    }
  ]
}
 
# 2. Create the index template (scoped to the attachment indices only)
PUT _template/attachment_template
{
  "index_patterns": ["my_attachments*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.default_pipeline": "attachment"
  },
  "mappings": {
    "properties": {
      "attachment": {
        "properties": {
          "content": { "type": "text" }
        }
      }
    }
  }
}
 
# 3. Index a document: "data" must hold the Base64-encoded file content, not a path
POST /my_attachments/_doc?pipeline=attachment
{
  "data": "<Base64-encoded content of /path/to/your/document.pdf>"
}
 
# 4. Search the extracted text
GET /my_attachments/_search
{
  "query": {
    "match": {
      "attachment.content": "search text"
    }
  }
}

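Note that Elasticsearch does not read files from disk: the client must read /path/to/your/document.pdf (replace this with your real file), Base64-encode its bytes, and send the result in the data field. The attachment processor then parses the file and stores the extracted text under attachment.content, where it can be searched.

Make sure the cluster has enough resources to parse and index large files, because this process can consume a lot of memory and CPU.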
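
The client-side encoding step can be sketched as follows. `build_attachment_doc` is a hypothetical helper, the `filename` field is an extra added for illustration, and the sample bytes stand in for a real file read with `open(path, "rb")`.

```python
import base64
import json
from typing import Dict


def build_attachment_doc(raw: bytes, filename: str) -> Dict[str, str]:
    """Wrap raw file bytes in the shape the attachment pipeline expects."""
    return {
        "filename": filename,  # illustrative extra field, not required by the processor
        "data": base64.b64encode(raw).decode("ascii"),
    }


# Stand-in bytes; in practice: raw = open("/path/to/your/document.pdf", "rb").read()
sample_bytes = b"%PDF-1.4 sample bytes"
doc = build_attachment_doc(sample_bytes, "document.pdf")
print(json.dumps(doc))
```

With the Python client, the document could then be sent as `es.index(index="my_attachments", pipeline="attachment", document=doc)`.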