Answer

Semantic chunking of long articles in Python is a task that draws on natural language processing (NLP) techniques. Below are the main approaches, implemented step by step with code examples:

I. Core Idea

The goal of semantic chunking is to split text into semantically coherent segments rather than cutting it into fixed-length pieces. The main approaches fall into two categories:

- Rule-based methods (fast, but require domain adaptation)
- Deep-learning-based methods (more accurate, but computationally more expensive)

II. Rule-Based Chunking Methods

1. Sentence splitting + context merging
import spacy

def semantic_chunking_rule(text, max_chunk_size=500):
    """Split the text into sentences with spaCy, then greedily merge
    consecutive sentences until a chunk reaches max_chunk_size characters."""
    nlp = spacy.load("zh_core_web_sm")  # Chinese spaCy model
    doc = nlp(text)

    chunks = []
    current_chunk = []
    current_length = 0

    for sent in doc.sents:
        sent_length = len(sent.text)
        if current_length + sent_length <= max_chunk_size:
            current_chunk.append(sent.text)
            current_length += sent_length
        else:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sent.text]
            current_length = sent_length

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

# Usage example
text = "Long article content..."
chunks = semantic_chunking_rule(text)
2. Topic-keyword-based chunking
from collections import defaultdict

def keyword_based_chunking(text, keywords=["然而", "总之", "综上所述"]):
    # The default keywords are Chinese discourse markers ("however", "in short",
    # "to sum up"); a chunk boundary is placed after any paragraph containing one of them.
    chunks = []
    buffer = []

    for paragraph in text.split("\n"):
        buffer.append(paragraph)
        if any(keyword in paragraph for keyword in keywords):
            chunks.append("\n".join(buffer))
            buffer = []

    if buffer:
        chunks.append("\n".join(buffer))

    return chunks
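A quick usage sketch follows; the sample paragraphs are made up and rely on the default Chinese discourse-marker keywords above:

# Hypothetical usage: paragraphs separated by newlines; a boundary falls after the paragraph containing "总之"
sample = "第一段介绍背景。\n第二段展开论述。\n总之,本节给出了结论。\n新的主题从这里开始。"
print(keyword_based_chunking(sample))
# Expected: two chunks, split after the paragraph containing "总之"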
III. Deep-Learning-Based Chunking Methods

1. Computing similarity with Sentence Transformers
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

def semantic_split(text, threshold=0.85):
    # threshold is a cosine-similarity cutoff and may need tuning per corpus
    paragraphs = [p for p in text.split("\n") if p.strip()]
    if not paragraphs:
        return []
    # Normalize embeddings so the dot products below are cosine similarities
    embeddings = model.encode(paragraphs, normalize_embeddings=True)

    chunks = []
    current_chunk = [paragraphs[0]]

    for i in range(1, len(paragraphs)):
        # Cosine similarity between consecutive paragraphs (embeddings are unit-normalized)
        similarity = np.dot(embeddings[i-1], embeddings[i])
        if similarity >= threshold:
            current_chunk.append(paragraphs[i])
        else:
            chunks.append("\n".join(current_chunk))
            current_chunk = [paragraphs[i]]

    chunks.append("\n".join(current_chunk))
    return chunks
2. Using a pretrained language model (BERT)
from transformers import AutoTokenizer, AutoModel
import torch

# Loaded for the optional similarity-based merging step (see below); not used by the sliding-window split itself
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def bert_semantic_chunking(text, window_size=3, stride=2):
    # Sentence splitting via spaCy (reuses the spacy import from the first example)
    nlp = spacy.load("zh_core_web_sm")
    sentences = [sent.text for sent in nlp(text).sents]

    chunks = []
    for i in range(0, len(sentences), stride):
        chunk = sentences[i:i+window_size]
        chunks.append(" ".join(chunk))

    # Optional: merge adjacent chunks whose BERT embeddings are very similar (see the sketch below)
    return chunks
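The optional merging step mentioned in the comment is not implemented above. Here is a minimal sketch, assuming mean-pooled last_hidden_state vectors as a simple sentence embedding and an illustrative merge_threshold; bert_embed and merge_similar_chunks are hypothetical helper names, not library functions:

def bert_embed(texts):
    # Mean-pool the last hidden states as a simple embedding (a common heuristic, not the only option)
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    emb = outputs.last_hidden_state.mean(dim=1)
    return torch.nn.functional.normalize(emb, dim=1)

def merge_similar_chunks(chunks, merge_threshold=0.9):  # merge_threshold is an illustrative value
    if not chunks:
        return []
    embs = bert_embed(chunks)
    merged = [chunks[0]]
    for i in range(1, len(chunks)):
        # Merge a chunk into its predecessor when consecutive chunks are highly similar
        if float(torch.dot(embs[i - 1], embs[i])) >= merge_threshold:
            merged[-1] = merged[-1] + " " + chunks[i]
        else:
            merged.append(chunks[i])
    return merged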
IV. Advanced Approach: Hybrid Strategy
def hybrid_chunking(text):
    # Stage 1: coarse splitting with the rule-based method
    coarse_chunks = semantic_chunking_rule(text)

    # Stage 2: semantic refinement of chunks that are still too long
    final_chunks = []
    for chunk in coarse_chunks:
        if len(chunk) > 1000:
            sub_chunks = semantic_split(chunk)
            final_chunks.extend(sub_chunks)
        else:
            final_chunks.append(chunk)

    return final_chunks
V. Key Optimization Techniques

Dynamic chunk size

def dynamic_window(text, min_size=200, max_size=600):
    # Adjust the window size according to punctuation density (see the sketch below)
    pass
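The stub above only names the idea. A minimal sketch of one way to fill it in is shown below, assuming that a higher density of sentence-ending punctuation allows a smaller window; the 0.02 reference density and the linear mapping are illustrative choices, not fixed rules:

def dynamic_window(text, min_size=200, max_size=600):
    # Density of sentence-ending punctuation per character
    enders = "。!?!?"
    density = sum(text.count(ch) for ch in enders) / max(len(text), 1)
    # Map density linearly into [min_size, max_size]: denser punctuation -> smaller window.
    # 0.02 (one sentence ender per ~50 characters) is an illustrative "high density" reference point.
    ratio = min(density / 0.02, 1.0)
    window = int(max_size - ratio * (max_size - min_size))
    # Reuse the rule-based splitter with the adapted window size
    return semantic_chunking_rule(text, max_chunk_size=window)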

Entity-aware chunking

def entity_aware_chunking(text):
    nlp = spacy.load("zh_core_web_sm")
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    # Make sure named entities are not split across chunk boundaries (see the sketch below)
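The stub above stops at extracting entities. A minimal sketch of the boundary rule follows, assuming fixed-size character windows whose cut points are pushed forward past any named-entity span; the max_chunk_size default is carried over from the earlier rule-based splitter as an assumption:

def entity_aware_chunking(text, max_chunk_size=500):
    nlp = spacy.load("zh_core_web_sm")
    doc = nlp(text)
    # Character spans covered by named entities; a chunk boundary must not fall inside any of them
    entity_spans = [(ent.start_char, ent.end_char) for ent in doc.ents]

    def inside_entity(pos):
        return any(start < pos < end for start, end in entity_spans)

    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chunk_size, len(text))
        # Push the cut point forward until it no longer lands inside an entity
        while end < len(text) and inside_entity(end):
            end += 1
        chunks.append(text[start:end])
        start = end
    return chunks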

Multi-granularity chunking

import re

def multi_scale_chunking(text):
    nlp = spacy.load("zh_core_web_sm")
    doc = nlp(text)
    return {
        "sentence_level": [sent.text for sent in doc.sents],
        "paragraph_level": text.split("\n\n"),
        # Section headers like "第一部分" ("Part One") are assumed; adjust the pattern to your documents
        "section_level": re.split(r'\n第[一二三四]部分', text)
    }
VI. Evaluation Metrics
def evaluate_chunks(text, chunks):
    # Uses the SentenceTransformer `model` defined above; embeddings are normalized
    # so that dot products are cosine similarities.

    # Semantic cohesion between adjacent chunks
    cohesion_scores = []
    for i in range(len(chunks) - 1):
        emb1 = model.encode(chunks[i], normalize_embeddings=True)
        emb2 = model.encode(chunks[i + 1], normalize_embeddings=True)
        cohesion_scores.append(np.dot(emb1, emb2))

    # Information coverage: how similar each chunk is to the full text
    original_emb = model.encode(text, normalize_embeddings=True)
    chunk_embs = model.encode(chunks, normalize_embeddings=True)
    reconstruction_score = np.mean([np.dot(original_emb, ce) for ce in chunk_embs])

    return {
        "avg_cohesion": np.mean(cohesion_scores),
        "reconstruction_score": reconstruction_score
    }
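A quick usage note, assuming the `text` and `chunks` variables from the earlier usage example are in scope:

# Hypothetical usage with the text/chunks produced earlier
metrics = evaluate_chunks(text, chunks)
print(metrics["avg_cohesion"], metrics["reconstruction_score"])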
VII. Recommended Libraries

LangChain (built-in text splitters):

# Note: SemanticChunker is provided by the langchain_experimental package,
# not by langchain.text_splitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()
text_splitter = SemanticChunker(embeddings)
docs = text_splitter.create_documents([text])

TextTiling (classic algorithm):

from nltk.tokenize import texttiling

# TextTiling expects paragraphs separated by blank lines and whitespace-tokenized
# (e.g. English) text, so Chinese input should be word-segmented first.
tt = texttiling.TextTilingTokenizer()
chunks = tt.tokenize(text)
Selection Guide

Method          | Best suited for                        | Strengths                     | Weaknesses
Rule-based      | Rapid prototyping, structured text     | Fast, easy to interpret       | High domain dependence
Deep learning   | Unstructured text, high-accuracy needs | Strong semantic understanding | High computational cost
Hybrid strategy | Production environments                | Balances speed and accuracy   | High implementation complexity

Choose the approach that fits your actual needs: start with the rule-based method and gradually introduce semantic-analysis components. For applications that depend on high-quality chunks, such as RAG, Sentence Transformers combined with a dynamic-window strategy is recommended, as sketched below.
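As a rough illustration of that recommendation, a hypothetical helper (rag_chunking is not a library function) could combine the dynamic_window sketch and semantic_split from above; the 1000-character re-split threshold is an illustrative value:

def rag_chunking(text):
    # Coarse split with an adaptive, punctuation-aware window
    coarse = dynamic_window(text)
    refined = []
    for chunk in coarse:
        # Refine oversized chunks with the Sentence Transformers splitter
        if len(chunk) > 1000:
            refined.extend(semantic_split(chunk))
        else:
            refined.append(chunk)
    return refined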
