Semantic chunking of long articles in Python is a task that draws on natural language processing (NLP) techniques. Below is a step-by-step guide with code examples.
## 1. Core Idea

The goal of semantic chunking is to split text into **semantically coherent segments** rather than cutting it at fixed lengths. The main approaches fall into two families:

- Rule-based methods (fast, but require domain adaptation)
- Deep-learning-based methods (more accurate, but computationally more expensive)

## 2. Rule-Based Chunking

### 2.1 Sentence splitting with context merging

```python
import spacy

def semantic_chunking_rule(text, max_chunk_size=500):
    nlp = spacy.load("zh_core_web_sm")  # Chinese model
    doc = nlp(text)
    chunks = []
    current_chunk = []
    current_length = 0
    for sent in doc.sents:
        sent_length = len(sent.text)
        if current_length + sent_length <= max_chunk_size:
            current_chunk.append(sent.text)
            current_length += sent_length
        else:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sent.text]
            current_length = sent_length
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

# Usage example
text = "Long article text..."
chunks = semantic_chunking_rule(text)
```

### 2.2 Keyword-based chunking

```python
def keyword_based_chunking(text, keywords=["然而", "总之", "综上所述"]):
    # Default keywords are Chinese discourse markers ("however", "in short", "in summary")
    chunks = []
    buffer = []
    for paragraph in text.split("\n"):
        buffer.append(paragraph)
        if any(keyword in paragraph for keyword in keywords):
            chunks.append("\n".join(buffer))
            buffer = []
    if buffer:
        chunks.append("\n".join(buffer))
    return chunks
```

## 3. Deep-Learning-Based Chunking

### 3.1 Similarity-based splitting with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

def semantic_split(text, threshold=0.85):
    paragraphs = [p for p in text.split("\n") if p.strip()]
    embeddings = model.encode(paragraphs)
    # Normalize so the dot product below is the cosine similarity
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    chunks = []
    current_chunk = [paragraphs[0]]
    for i in range(1, len(paragraphs)):
        similarity = np.dot(embeddings[i - 1], embeddings[i])
        if similarity >= threshold:
            current_chunk.append(paragraphs[i])
        else:
            chunks.append("\n".join(current_chunk))
            current_chunk = [paragraphs[i]]
    chunks.append("\n".join(current_chunk))
    return chunks
```

### 3.2 Pretrained language models (BERT)

```python
from transformers import AutoTokenizer, AutoModel
import torch
import spacy

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
bert_model = AutoModel.from_pretrained("bert-base-chinese")  # renamed to avoid clashing with the SentenceTransformer `model` above
nlp = spacy.load("zh_core_web_sm")  # sentence splitter, as in section 2.1

def bert_semantic_chunking(text, window_size=3, stride=2):
    sentences = [sent.text for sent in nlp(text).sents]
    chunks = []
    for i in range(0, len(sentences), stride):
        chunk = sentences[i:i + window_size]
        chunks.append(" ".join(chunk))
    # Optional: merge adjacent chunks by BERT semantic similarity (see the sketch below)
    return chunks
```
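The optional merging step is only mentioned in the comment above, not implemented. A minimal sketch is given below, assuming mean-pooled `bert-base-chinese` embeddings and a cosine-similarity threshold; the helper names `embed_with_bert` and `merge_similar_chunks` and the 0.9 threshold are illustrative, not part of the original.

```python
def embed_with_bert(texts, batch_size=8):
    """Mean-pool BERT last hidden states into one vector per text (illustrative helper)."""
    vectors = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True,
                              truncation=True, max_length=512, return_tensors="pt")
            hidden = bert_model(**batch).last_hidden_state        # (B, T, H)
            mask = batch["attention_mask"].unsqueeze(-1)          # (B, T, 1)
            vectors.append((hidden * mask).sum(1) / mask.sum(1))  # masked mean pooling
    return torch.cat(vectors)

def merge_similar_chunks(chunks, threshold=0.9):
    """Merge adjacent sliding-window chunks whose BERT embeddings are very similar."""
    embs = torch.nn.functional.normalize(embed_with_bert(chunks), dim=1)
    merged = [chunks[0]]
    for i in range(1, len(chunks)):
        if torch.dot(embs[i - 1], embs[i]).item() >= threshold:
            merged[-1] = merged[-1] + " " + chunks[i]
        else:
            merged.append(chunks[i])
    return merged
```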

## 4. Advanced: Hybrid Strategy

```python
def hybrid_chunking(text):
    # Stage 1: coarse split with the rule-based method
    coarse_chunks = semantic_chunking_rule(text)
    # Stage 2: refine oversized chunks with the semantic splitter
    final_chunks = []
    for chunk in coarse_chunks:
        if len(chunk) > 1000:
            sub_chunks = semantic_split(chunk)
            final_chunks.extend(sub_chunks)
        else:
            final_chunks.append(chunk)
    return final_chunks
```

## 5. Key Optimization Techniques

**Dynamic chunk size** (a possible implementation is sketched after the stub):

```python
def dynamic_window(text, min_size=200, max_size=600):
    # Adjust the window size based on punctuation density
    pass
```
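The stub above is left unimplemented in the original. One possible interpretation, sketched below, shrinks the target chunk size when sentence-ending punctuation is dense (short sentences) and grows it when punctuation is sparse; the density heuristic and the `dynamic_window_sketch` name are assumptions.

```python
import re

def dynamic_window_sketch(text, min_size=200, max_size=600):
    # Punctuation density: sentence-ending marks per character
    density = len(re.findall(r"[。!?!?]", text)) / max(len(text), 1)
    # Assumed heuristic: a density of 0.02 (one sentence end per 50 chars) maps to min_size
    target = int(max_size - (max_size - min_size) * min(density / 0.02, 1.0))
    chunks, current = [], ""
    for sent in re.split(r"(?<=[。!?!?])", text):
        if current and len(current) + len(sent) > target:
            chunks.append(current)
            current = sent
        else:
            current += sent
    if current:
        chunks.append(current)
    return chunks
```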
**Entity-aware chunking** (a completed sketch follows the stub):

```python
def entity_aware_chunking(text):
    nlp = spacy.load("zh_core_web_sm")
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    # Make sure entities are never split across chunk boundaries
```
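The original stops at the comment. The sketch below is one way to honor it: cut the text into fixed-size windows, but push each cut point forward whenever it would land inside a named-entity span. The function name `entity_aware_chunking_sketch` and the windowing scheme are assumptions.

```python
def entity_aware_chunking_sketch(text, max_chunk_size=500):
    nlp = spacy.load("zh_core_web_sm")
    doc = nlp(text)
    # Character spans of named entities that must stay within a single chunk
    entity_spans = [(ent.start_char, ent.end_char) for ent in doc.ents]

    def inside_entity(pos):
        return any(start < pos < end for start, end in entity_spans)

    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chunk_size, len(text))
        # Move the cut point forward until it no longer splits an entity
        while end < len(text) and inside_entity(end):
            end += 1
        chunks.append(text[start:end])
        start = end
    return chunks
```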
**Multi-granularity chunking**:

```python
import re

def multi_scale_chunking(text):
    nlp = spacy.load("zh_core_web_sm")
    doc = nlp(text)
    return {
        "sentence_level": [sent.text for sent in doc.sents],
        "paragraph_level": text.split("\n\n"),
        "section_level": re.split(r'\n第[一二三四]部分', text),  # split on Chinese "Part N" headings
    }
```

## 6. Evaluation Metrics

```python
def evaluate_chunks(chunks, text):
    """Evaluate chunks with the SentenceTransformer `model` from section 3.1."""
    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Semantic cohesion between adjacent chunks
    cohesion_scores = []
    for i in range(len(chunks) - 1):
        emb1 = model.encode(chunks[i])
        emb2 = model.encode(chunks[i + 1])
        cohesion_scores.append(cosine(emb1, emb2))

    # Information coverage: similarity of each chunk to the full document
    original_emb = model.encode(text)
    chunk_embs = model.encode(chunks)
    reconstruction_score = np.mean([cosine(original_emb, ce) for ce in chunk_embs])

    return {
        "avg_cohesion": np.mean(cohesion_scores),
        "reconstruction_score": reconstruction_score,
    }
```

## 7. Recommended Libraries

**LangChain** (built-in text splitters):
```python
# Note: SemanticChunker is provided by the langchain_experimental package
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()
text_splitter = SemanticChunker(embeddings)
docs = text_splitter.create_documents([text])
```
**TextTiling** (classic algorithm):

```python
from nltk.tokenize import texttiling

tt = texttiling.TextTilingTokenizer()
chunks = tt.tokenize(text)
```

## Selection Guide
| Approach | Best suited for | Pros | Cons |
|---|---|---|---|
| Rule-based | Rapid prototyping<br>Structured text | Fast<br>Easy to interpret | Strong domain dependence |
| Deep learning | Unstructured text<br>High-accuracy needs | Strong semantic understanding | High compute cost |
| Hybrid | Production systems | Balances speed and accuracy | More complex to implement |
Pick the method that matches your actual needs: start with the rule-based approach and introduce semantic components incrementally. For applications such as RAG that require high-quality chunks, Sentence Transformers combined with a dynamic window strategy is recommended, as sketched below.
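As an illustration of that recommendation (not code from the original), the sketch below reuses `semantic_split` from section 3.1 and the `dynamic_window_sketch` helper from section 5; the `rag_chunking` name and the 600-character cap are assumptions.

```python
def rag_chunking(text, threshold=0.85, max_chunk_size=600):
    # Embedding-based split first; oversized chunks are re-split with the
    # punctuation-density-aware dynamic window
    final_chunks = []
    for chunk in semantic_split(text, threshold=threshold):
        if len(chunk) > max_chunk_size:
            final_chunks.extend(dynamic_window_sketch(chunk, max_size=max_chunk_size))
        else:
            final_chunks.append(chunk)
    return final_chunks
```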