Introduction
Semantic chunking goes beyond simple character or token splitting. It groups sentences according to their meaning using embeddings, improving the relevance of contexts provided to LLMs. In 2026, RAG systems demand high precision to prevent hallucinations and optimize token costs. This approach uses cosine similarity and clustering algorithms to create coherent segments. You will learn to build a complete pipeline, from text extraction to exporting optimized chunks. Each step includes performance and scalability considerations tailored for production environments.
Prerequisites
- Python 3.11+
- Strong knowledge of NLP and vector embeddings
- Access to a GPU or TPU for intensive computations
- Familiarity with scikit-learn and sentence-transformers
Installing Dependencies
pip install sentence-transformers scikit-learn numpy pandas
This command installs the libraries needed to generate embeddings and perform semantic clustering. Use scikit-learn 1.2 or later: the metric parameter of AgglomerativeClustering, used in the hierarchical clustering step of this tutorial, replaced the older affinity parameter in that release.
Loading the Embeddings Model
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def get_embeddings(sentences: list[str]) -> np.ndarray:
    # Returns one L2-normalized embedding per sentence (384 dimensions for this model)
    return model.encode(sentences, convert_to_numpy=True, normalize_embeddings=True)

The all-MiniLM-L6-v2 model offers a good speed/quality tradeoff. Because the embeddings are normalized to unit length, the cosine similarity of two sentences is simply their dot product, with no division by the vector norms.
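The effect of normalization can be checked numerically. The sketch below uses small made-up vectors in place of real model output (an assumption, to avoid loading the model) and compares the plain dot product of unit-length vectors with the full cosine formula applied to the raw vectors.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions
raw = np.array([[1.0, 2.0, 2.0],
                [2.0, 0.0, 1.0],
                [0.5, 1.0, 1.0]])

# L2-normalize each row, mirroring what normalize_embeddings=True does
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# For unit vectors, the dot product is the cosine similarity
dot_sim = unit @ unit.T

# Full cosine formula on the raw vectors, for comparison
norms = np.linalg.norm(raw, axis=1)
full_sim = (raw @ raw.T) / np.outer(norms, norms)

# Both routes give the same matrix
print(np.allclose(dot_sim, full_sim))
```

Rows 0 and 2 point in the same direction, so their similarity is 1 regardless of their different lengths.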
Computing the Similarity Matrix
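import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def compute_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    # Entry [i, j] is the cosine similarity between sentence i and sentence j
    return cosine_similarity(embeddings)

This matrix captures the semantic relationship between every pair of sentences. It serves as the foundation for identifying natural boundaries between chunks. Its size grows quadratically with the number of sentences, which matters for long documents.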
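The threshold-based chunking described later in this article only compares consecutive sentences, so it reads just the first superdiagonal of this matrix. As a sketch for long documents, and assuming the embeddings are already unit-normalized, those n-1 values can be computed directly without building the full n x n matrix. The helper name adjacent_similarities is hypothetical.

```python
import numpy as np

def adjacent_similarities(embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity between each sentence and the next one.

    Assumes rows are unit-normalized, so the dot product equals cosine similarity.
    """
    # Row-wise dot product of embeddings[i] with embeddings[i + 1]
    return np.einsum('ij,ij->i', embeddings[:-1], embeddings[1:])

# Made-up unit vectors standing in for real sentence embeddings
emb = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
adj = adjacent_similarities(emb)

# Same values as the first superdiagonal of the full similarity matrix
full = emb @ emb.T
print(np.allclose(adj, np.diagonal(full, offset=1)))
```

This keeps memory linear in the number of sentences, at the cost of losing the non-adjacent similarities that clustering approaches need.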
Implementing Semantic Chunking
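import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def semantic_chunking(sentences: list[str], embeddings: np.ndarray, threshold: float = 0.65) -> list[list[str]]:
    # Guard against empty input, which would otherwise fail on sentences[0]
    if not sentences:
        return []
    sim_matrix = cosine_similarity(embeddings)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        # Compare each sentence with the one immediately before it
        if sim_matrix[i-1, i] >= threshold:
            current_chunk.append(sentences[i])
        else:
            # Similarity dropped below the threshold: close the chunk and start a new one
            chunks.append(current_chunk)
            current_chunk = [sentences[i]]
    # Flush the last open chunk
    chunks.append(current_chunk)
    return chunks

The algorithm processes sentences in order and starts a new chunk whenever the similarity between two consecutive sentences falls below the threshold. Adjust the threshold based on the semantic density of your corpus.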
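One possible way to adapt the threshold to a corpus, offered here as an alternative that the article itself does not prescribe, is to derive it from a percentile of the consecutive-sentence similarities instead of fixing it at 0.65. In the sketch below, the helper names, the 25th percentile, and the similarity values are all hypothetical; split_on_threshold applies the same cut rule as the function above to precomputed similarities.

```python
import numpy as np

def percentile_threshold(adjacent_sims: np.ndarray, percentile: float = 25.0) -> float:
    """Threshold below which roughly `percentile` percent of consecutive-sentence
    similarities fall; those positions become chunk boundaries."""
    return float(np.percentile(adjacent_sims, percentile))

def split_on_threshold(sentences: list[str], adjacent_sims, threshold: float) -> list[list[str]]:
    """Cut between two sentences whenever their similarity is below the threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], adjacent_sims):
        if sim >= threshold:
            current.append(sentence)
        else:
            chunks.append(current)
            current = [sentence]
    chunks.append(current)
    return chunks

# Made-up similarities between s0-s1, s1-s2, s2-s3, s3-s4
sentences = ["s0", "s1", "s2", "s3", "s4"]
sims = np.array([0.9, 0.8, 0.2, 0.85])

threshold = percentile_threshold(sims, 25.0)
chunks = split_on_threshold(sentences, sims, threshold)
print(threshold, chunks)
```

A percentile always produces some boundaries, even in a document with a single topic, so it is worth validating the result against a fixed threshold on a representative sample.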
Advanced Hierarchical Clustering
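import numpy as np
from sklearn.cluster import AgglomerativeClustering

def hierarchical_semantic_chunking(embeddings: np.ndarray, n_clusters: int = 5) -> list[int]:
    # The metric parameter requires scikit-learn 1.2+; n_clusters must not exceed the number of sentences
    clustering = AgglomerativeClustering(n_clusters=n_clusters, metric='cosine', linkage='average')
    # One cluster label per sentence, converted to a plain list to match the return type
    return clustering.fit_predict(embeddings).tolist()

Hierarchical clustering produces variable-sized groups that reflect the document's overall semantic structure, which suits long and complex texts. Note that the function returns one cluster label per sentence, and sentences sharing a label are not guaranteed to be adjacent in the text.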
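Because the function above returns one label per sentence, not text segments, some post-processing is needed to obtain chunks. One possible approach, sketched below under the assumption that chunks must stay contiguous in reading order, is to group consecutive sentences that share a label. The helper name labels_to_chunks and the sample labels are hypothetical.

```python
from itertools import groupby

def labels_to_chunks(sentences: list[str], labels: list[int]) -> list[list[str]]:
    """Group consecutive sentences that share the same cluster label."""
    if len(sentences) != len(labels):
        raise ValueError("sentences and labels must have the same length")
    chunks = []
    # groupby only merges adjacent equal keys, which preserves reading order
    for _, group in groupby(zip(labels, sentences), key=lambda pair: pair[0]):
        chunks.append([sentence for _, sentence in group])
    return chunks

# Hypothetical labels, in the shape hierarchical_semantic_chunking returns them
sentences = ["a", "b", "c", "d", "e"]
labels = [0, 0, 1, 0, 0]

result = labels_to_chunks(sentences, labels)
print(result)
```

With this rule, a cluster that appears in two separate places (label 0 here) yields two chunks, so the final chunk count can exceed n_clusters.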
Best Practices
- Always normalize embeddings before computing similarity
- Test multiple thresholds on a representative sample
- Keep a minimum of 3 sentences per chunk to preserve context
- Monitor chunk size variance in production
- Version the embedding models used
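Two of these practices, the minimum chunk size and the size-variance monitoring, can be sketched as a post-processing step. This is a minimal illustration and not a prescribed implementation: the helper names are hypothetical, and merging a short chunk into its neighbor is only one of several reasonable policies.

```python
from statistics import mean, pvariance

def merge_small_chunks(chunks: list[list[str]], min_sentences: int = 3) -> list[list[str]]:
    """Merge chunks shorter than min_sentences into a neighboring chunk."""
    merged: list[list[str]] = []
    for chunk in chunks:
        # If the previous chunk is still too small, extend it with this one
        if merged and len(merged[-1]) < min_sentences:
            merged[-1] = merged[-1] + chunk
        else:
            merged.append(list(chunk))
    # A short trailing chunk is folded back into the one before it
    if len(merged) > 1 and len(merged[-1]) < min_sentences:
        tail = merged.pop()
        merged[-1] = merged[-1] + tail
    return merged

def chunk_size_stats(chunks: list[list[str]]) -> dict:
    """Summary statistics to log per document; assumes at least one chunk."""
    sizes = [len(chunk) for chunk in chunks]
    return {"count": len(sizes), "mean": mean(sizes), "variance": pvariance(sizes)}

chunks = [["a"], ["b", "c"], ["d", "e", "f", "g"], ["h"]]
merged = merge_small_chunks(chunks)
stats = chunk_size_stats(merged)
print(merged, stats)
```

Merging ignores the similarity threshold, so a merged chunk may span a topic change; tracking how often merges occur can indicate that the threshold is too strict.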
Common Mistakes to Avoid
- Using a fixed threshold without domain-specific validation
- Forgetting to handle very short sentences that distort similarity
- Ignoring parallelization when processing large volumes
- Not handling cases where all embeddings are nearly identical
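Two of these pitfalls lend themselves to simple guards, sketched below. The helper names and the default values (4 words, a spread of 0.05) are hypothetical and would need validation on your own corpus. The first helper attaches very short sentences to the preceding one before embedding; the second flags documents whose consecutive-sentence similarities barely vary, in which case a threshold carries little signal.

```python
def merge_short_sentences(sentences: list[str], min_words: int = 4) -> list[str]:
    """Attach very short sentences to the preceding sentence before embedding."""
    merged: list[str] = []
    for sentence in sentences:
        # A short first sentence has no predecessor and is kept as-is
        if merged and len(sentence.split()) < min_words:
            merged[-1] = f"{merged[-1]} {sentence}"
        else:
            merged.append(sentence)
    return merged

def similarities_are_flat(adjacent_sims: list[float], min_spread: float = 0.05) -> bool:
    """True when consecutive-sentence similarities span less than min_spread."""
    if not adjacent_sims:
        return True
    return max(adjacent_sims) - min(adjacent_sims) < min_spread

cleaned = merge_short_sentences([
    "Embeddings map text to vectors.",
    "Yes.",
    "Short inputs carry little signal on their own.",
])
flat = similarities_are_flat([0.97, 0.98, 0.975])
print(cleaned, flat)
```

When similarities_are_flat returns True, one option is to fall back to fixed-size chunks for that document instead of forcing semantic boundaries.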
Further Learning
Deepen these techniques with our Learni training programs dedicated to advanced RAG and NLP.