We need to talk about RAG chunk size. I’ve seen too many developers throw a Large Language Model at a pile of proprietary data and expect magic. Specifically, they assume the model will “just figure it out.” However, the bottleneck isn’t usually the LLM—it’s the retrieval logic. If you feed your vector database garbage fragments, you’ll get garbage answers.
I recently reviewed a project where the retrieval was consistently failing on simple distinctions. The system couldn’t tell the difference between “Project Alpha” and “Project Beta” because the text was split mid-sentence. It’s a classic gotcha. When your RAG chunk size is misconfigured, you lose the semantic context that makes the vector search effective in the first place.
The Granularity Trap: Small vs. Large Chunks
In a typical Retrieval-Augmented Generation pipeline, you aren’t searching the whole document. You’re searching “chunks.” The size of these units—whether measured in characters, tokens, or words—determines what the embedding model actually “sees.”
- Small Chunks (e.g., 80 chars): These are highly specific but suffer from extreme context loss. They often return sentence fragments that are semantically useless.
- Medium Chunks (e.g., 220 chars): Often the “goldilocks” zone, but they can create dangerous ambiguities. As highlighted in recent experiments, medium chunks can result in nearly identical cosine similarity scores for very different answers.
- Large Chunks (e.g., 500+ chars): These provide robust context and stable rankings. The trade-off is precision: you risk retrieving a “wall of text” that contains the answer buried in irrelevant noise.
If you’re integrating AI into a WordPress environment—perhaps via a custom knowledge base for WooCommerce—you need a reliable way to split your content. Below is a naive approach I see often, followed by a more robust recursive strategy.
Refactoring Your Text Splitting Logic
Don’t just split on a fixed character count with str_split(). It’s a bottleneck for your AI’s accuracy because it ignores word boundaries entirely. Specifically, it hacks words and sentences in half. Instead, use a boundary-aware splitter that respects whitespace or punctuation.
<?php
/**
 * Naive Approach: The "Context Killer"
 * str_split() cuts at a fixed byte count, so it often splits
 * words in half and ruins the embedding.
 */
function bbioon_naive_split( $text, $size ) {
	return str_split( $text, $size );
}
/**
 * Senior Approach: Boundary-Aware Splitting with Overlap
 * Backs off to the last space so words stay intact, and overlaps
 * consecutive chunks so context survives the boundary.
 */
function bbioon_recursive_split( $text, $max_size, $overlap = 50 ) {
	$chunks      = [];
	$text_length = strlen( $text );
	$pointer     = 0;

	while ( $pointer < $text_length ) {
		$chunk = substr( $text, $pointer, $max_size );

		// If this isn't the final chunk, back off to the last space
		// so we never cut a word in half.
		if ( ( $pointer + $max_size ) < $text_length ) {
			$last_space = strrpos( $chunk, ' ' );
			if ( false !== $last_space && $last_space > 0 ) {
				$chunk = substr( $chunk, 0, $last_space );
			}
		}

		$chunks[] = trim( $chunk );

		// Advance by at least one character so an overlap larger than
		// the chunk can never push the pointer backwards (infinite loop).
		$pointer += max( 1, strlen( $chunk ) - $overlap );
	}

	return $chunks;
}
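If your content is clean prose rather than raw fragments, another strategy worth testing splits on sentence boundaries first and then packs whole sentences into chunks. This is a sketch under my own assumptions (the function name and the sentence-ending regex are mine, not a standard API), so adapt the regex to your punctuation conventions:

```php
<?php
/**
 * Sketch: sentence-first chunking. Split on sentence enders, then
 * greedily pack whole sentences into chunks up to $max_size characters.
 * Note: a single sentence longer than $max_size becomes its own chunk.
 */
function bbioon_sentence_pack( string $text, int $max_size ): array {
	$sentences = preg_split( '/(?<=[.!?])\s+/', trim( $text ) );
	$chunks    = [];
	$current   = '';

	foreach ( $sentences as $sentence ) {
		$candidate = '' === $current ? $sentence : $current . ' ' . $sentence;
		if ( strlen( $candidate ) <= $max_size || '' === $current ) {
			$current = $candidate;
		} else {
			$chunks[] = $current;
			$current  = $sentence;
		}
	}

	if ( '' !== $current ) {
		$chunks[] = $current;
	}

	return $chunks;
}
```

Because every chunk starts and ends on a sentence boundary, the “Project Alpha” vs. “Project Beta” confusion from mid-sentence splits largely disappears, at the cost of slightly uneven chunk lengths.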
Why Cosine Similarity Isn’t a Measure of Correctness
A common mistake is treating the similarity score as a “confidence” metric. It’s not. It is simply a measure of relative proximity in vector space. When multiple chunks have nearly identical scores—say 0.873 vs 0.874—the system is essentially guessing. This “instability” is exactly why RAG chunk size is an experimental variable, not a fixed constant. Furthermore, larger chunks often separate these scores more clearly, making the Top-1 result more reliable.
For more on building robust AI workflows, check out my guide on building a robust custom AI assistant, or explore Cohere’s documentation on chunking strategies.
Look, if this RAG chunk size stuff is eating up your dev hours, let me handle it. I’ve been wrestling with WordPress since the 4.x days.
Final Takeaway: Test, Don’t Guess
There is no “perfect” chunk size. It depends entirely on your data structure. If your documentation is bullet-point heavy, small chunks might work. If it’s narrative-heavy, you need 500+ characters. Stop guessing and start running experiments on your retrieval rankings. Refactor your splitter, adjust your overlap, and monitor those similarity deltas.
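One cheap metric to log during those experiments is the gap between your Top-1 and Top-2 similarity scores. Here is a minimal sketch, assuming you already have the scores for the retrieved chunks (the helper name and sample numbers are mine, for illustration):

```php
<?php
/**
 * Given similarity scores for retrieved chunks, return the gap between
 * Top-1 and Top-2. A tiny delta means the ranking is unstable for this
 * chunk size; INF means there was only one candidate.
 */
function bbioon_top1_delta( array $scores ): float {
	rsort( $scores ); // sort descending by value
	return count( $scores ) > 1 ? $scores[0] - $scores[1] : INF;
}

// Example: compare runs at two hypothetical chunk sizes.
$delta_medium = bbioon_top1_delta( [ 0.874, 0.873, 0.791 ] ); // ~0.001: ambiguous
$delta_large  = bbioon_top1_delta( [ 0.861, 0.702, 0.688 ] ); // ~0.159: clear winner
```

Track that delta as you vary chunk size and overlap, and you’ll have actual evidence for your configuration instead of a gut feeling.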