Semantic chunking of long articles in Python draws on natural language processing (NLP) techniques. Below is a step-by-step walkthrough with code examples.
## 1. Core Idea

The goal of semantic chunking is to split text into **semantically coherent segments**, rather than cutting at fixed lengths. Approaches fall into two categories:
- **Rule-based methods** (fast, but require domain adaptation)
- **Deep-learning-based methods** (accurate, but computationally more expensive)

## 2. Rule-Based Chunking

### 2.1 Sentence splitting with context merging

```python
import spacy

def semantic_chunking_rule(text, max_chunk_size=500):
    nlp = spacy.load("zh_core_web_sm")  # Chinese model
    doc = nlp(text)
    chunks = []
    current_chunk = []
    current_length = 0
    for sent in doc.sents:
        sent_length = len(sent.text)
        if current_length + sent_length <= max_chunk_size:
            current_chunk.append(sent.text)
            current_length += sent_length
        else:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sent.text]
            current_length = sent_length
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

# Usage
text = "Long article text..."
chunks = semantic_chunking_rule(text)
```

### 2.2 Keyword-based chunking
```python
def keyword_based_chunking(text, keywords=("然而", "总之", "综上所述")):
    # The default keywords are Chinese discourse markers:
    # "however", "in short", "to sum up"
    chunks = []
    buffer = []
    for paragraph in text.split("\n"):
        buffer.append(paragraph)
        if any(keyword in paragraph for keyword in keywords):
            chunks.append("\n".join(buffer))
            buffer = []
    if buffer:
        chunks.append("\n".join(buffer))
    return chunks
```

## 3. Deep-Learning-Based Chunking

### 3.1 Similarity via Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')

def semantic_split(text, threshold=0.85):
    paragraphs = [p for p in text.split("\n") if p.strip()]
    # Normalize so the dot product below equals cosine similarity
    embeddings = model.encode(paragraphs, normalize_embeddings=True)
    chunks = []
    current_chunk = [paragraphs[0]]
    for i in range(1, len(paragraphs)):
        similarity = np.dot(embeddings[i - 1], embeddings[i])
        if similarity >= threshold:
            current_chunk.append(paragraphs[i])
        else:
            chunks.append("\n".join(current_chunk))
            current_chunk = [paragraphs[i]]
    chunks.append("\n".join(current_chunk))
    return chunks
```

### 3.2 Pretrained language models (BERT)
```python
import spacy
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")
nlp = spacy.load("zh_core_web_sm")  # for sentence segmentation

def bert_semantic_chunking(text, window_size=3, stride=2):
    sentences = [sent.text for sent in nlp(text).sents]
    chunks = []
    for i in range(0, len(sentences), stride):
        chunk = sentences[i:i + window_size]
        chunks.append(" ".join(chunk))
    # Optional: merge chunks based on BERT semantic similarity
    return chunks
```

## 4. Advanced: Hybrid Strategy
```python
def hybrid_chunking(text):
    # Stage 1: coarse split by rules
    coarse_chunks = semantic_chunking_rule(text)
    # Stage 2: semantic refinement of oversized chunks
    final_chunks = []
    for chunk in coarse_chunks:
        if len(chunk) > 1000:
            sub_chunks = semantic_split(chunk)
            final_chunks.extend(sub_chunks)
        else:
            final_chunks.append(chunk)
    return final_chunks
```

## 5. Key Optimization Techniques
**Dynamic chunk size:**
```python
def dynamic_window(text, min_size=200, max_size=600):
    # Adjust the window according to punctuation density
    pass
```
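One way to fill in this stub (the density heuristic, the 0.05 density cap, and the punctuation set are illustrative assumptions, not from the original):

```python
def dynamic_window_size(text, min_size=200, max_size=600):
    """Pick a chunk size inversely related to punctuation density:
    dense punctuation implies short sentences, so smaller chunks
    stay semantically tight; sparse punctuation allows larger ones."""
    if not text:
        return max_size
    enders = "。!?.!?;;"  # assumed set of sentence-ending punctuation
    density = sum(text.count(p) for p in enders) / len(text)
    # Map density in [0, 0.05] linearly onto [max_size, min_size]
    scale = min(density / 0.05, 1.0)
    return int(max_size - scale * (max_size - min_size))
```

The returned size could then be passed to `semantic_chunking_rule` as `max_chunk_size`.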
**Entity-aware chunking:**
```python
import spacy

def entity_aware_chunking(text):
    nlp = spacy.load("zh_core_web_sm")
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    # Ensure entities are not split across chunk boundaries
```
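The "don't split entities" idea can be sketched independently of spaCy by passing entity character offsets in explicitly, e.g. `(ent.start_char, ent.end_char)` for each `ent` in `doc.ents`; the boundary-shifting rule below is an assumption for illustration:

```python
def chunk_avoiding_entities(text, entity_spans, chunk_size=100):
    """Cut text into roughly chunk_size pieces; if a cut would land
    inside an entity span, push it past the end of that entity so
    the entity stays in one chunk.

    entity_spans: list of (start, end) character offsets.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        for ent_start, ent_end in entity_spans:
            if ent_start < end < ent_end:
                end = ent_end  # move the boundary past the entity
                break
        chunks.append(text[start:end])
        start = end
    return chunks
```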
**Multi-granularity chunking:**
```python
import re
import spacy

def multi_scale_chunking(text):
    doc = spacy.load("zh_core_web_sm")(text)
    return {
        "sentence_level": [sent.text for sent in doc.sents],
        "paragraph_level": text.split("\n\n"),
        # Split on Chinese section markers like "第一部分" ("Part 1")
        "section_level": re.split(r'\n第[一二三四]部分', text),
    }
```

## 6. Evaluation Metrics
```python
import numpy as np

def evaluate_chunks(chunks, text, model):
    # Cohesion between adjacent chunks (cosine similarity)
    cohesion_scores = []
    for i in range(len(chunks) - 1):
        emb1 = model.encode(chunks[i], normalize_embeddings=True)
        emb2 = model.encode(chunks[i + 1], normalize_embeddings=True)
        cohesion_scores.append(np.dot(emb1, emb2))
    # Information coverage: how well the chunks represent the full text
    original_emb = model.encode(text, normalize_embeddings=True)
    chunk_embs = model.encode(chunks, normalize_embeddings=True)
    reconstruction_score = np.mean([np.dot(original_emb, ce) for ce in chunk_embs])
    return {
        "avg_cohesion": np.mean(cohesion_scores),
        "reconstruction_score": reconstruction_score,
    }
```

## 7. Recommended Libraries
**LangChain** (built-in text splitters):
```python
# SemanticChunker lives in the langchain_experimental package
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()
text_splitter = SemanticChunker(embeddings)
docs = text_splitter.create_documents([text])
```
**TextTiling** (classic algorithm):
```python
from nltk.tokenize.texttiling import TextTilingTokenizer

tt = TextTilingTokenizer()
chunks = tt.tokenize(text)
```

## Choosing an Approach
| Approach | Best For | Pros | Cons |
| --- | --- | --- | --- |
| Rule-based | Rapid prototyping, structured text | Fast, easy to interpret | Highly domain-dependent |
| Deep learning | Unstructured text, high-accuracy needs | Strong semantic understanding | High compute cost |
| Hybrid | Production environments | Balances speed and accuracy | More complex to implement |
Choose the method that matches your actual needs: start with the rule-based approach and introduce semantic-analysis components incrementally. For applications that demand high-quality chunks, such as RAG, Sentence Transformers combined with a dynamic-window strategy is recommended.
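As a concrete sketch of that recommendation, similarity-driven splitting can be combined with a dynamic size cap; the encoder is pluggable (with sentence-transformers you would pass `lambda ps: model.encode(ps, normalize_embeddings=True)`), and the threshold and cap values are illustrative:

```python
import numpy as np

def rag_chunk(paragraphs, encode, threshold=0.85, max_chars=600):
    """Start a new chunk when adjacent similarity drops below the
    threshold, or when the current chunk would exceed max_chars.

    encode: maps a list of strings to a 2-D embedding array.
    """
    if not paragraphs:
        return []
    embs = np.asarray(encode(paragraphs), dtype=float)
    # Normalize rows so dot products are cosine similarities
    embs = embs / (np.linalg.norm(embs, axis=1, keepdims=True) + 1e-12)
    chunks, current, size = [], [paragraphs[0]], len(paragraphs[0])
    for i in range(1, len(paragraphs)):
        sim = float(np.dot(embs[i - 1], embs[i]))
        too_big = size + len(paragraphs[i]) > max_chars
        if sim >= threshold and not too_big:
            current.append(paragraphs[i])
            size += len(paragraphs[i])
        else:
            chunks.append("\n".join(current))
            current, size = [paragraphs[i]], len(paragraphs[i])
    chunks.append("\n".join(current))
    return chunks
```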