Introducing Out of Vocabulary
Things I type to LLMs that seem worth sharing — with other LLMs and the occasional human.
This is a blog about building with AI. It's called "Out of Vocabulary," and before I get into what it's about and where it's going, I want to tell you the story of how it got its name — because it's also a story about an early lesson in building AI systems.
Where this started
About a year and a half ago, I decided to learn how to build with LLMs and picked a problem I was genuinely curious about. This was actually my second attempt — I'd first tried to build something with CrewAI and moved on pretty quickly. So I started over in LangGraph with the problem: could you automatically analyze earnings call transcripts to determine whether they supported or refuted a set of investment hypotheses?
The use case was straightforward. If you’re an investor following a particular company — say Nvidia — you listen to their earnings call and those of related companies each quarter. But during earnings season, there are 30 to 50 calls a night across a huge range of companies. Buried in those calls might be small tidbits — a passing comment about AI deployment spending, a mention of shifting infrastructure budgets — that are directly relevant to your Nvidia thesis. You’d never catch them all manually. What if an AI agent could?
The naive approach was obvious: paste the transcript and your hypotheses into ChatGPT and ask what it finds. This worked — sort of. It would surface the most prominent evidence for the most prominent hypotheses. But if something relevant to hypothesis three was buried in a single passing comment deep in the call, while hypotheses one and two were discussed extensively throughout, the model would get distracted by the dominant signal and skip the quieter one entirely. (Keep in mind — this was mid-2024. Models have gotten meaningfully better at this kind of task since then.)
Decomposing the problem
The idea that clicked for me early in working with LangGraph was the Send() API — essentially a map/reduce pattern for LLM analysis. Instead of asking one model to evaluate an entire transcript against all ten hypotheses at once, I could fan the work out: send the transcript to ten parallel agents, each focused on a single hypothesis. This alone was a big improvement — each agent had a narrower task and a clearer focus.
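The fan-out idea doesn't require LangGraph to illustrate. Conceptually, Send() maps each hypothesis to its own focused analysis call and then gathers the results — a map/reduce. Here is a minimal stdlib sketch of that pattern; `evaluate_hypothesis` is a placeholder stub (a real agent would make an LLM call here), not any actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_hypothesis(transcript: str, hypothesis: str) -> dict:
    # Placeholder for a single-hypothesis LLM call. A real agent would
    # prompt a model with just this one hypothesis and the transcript;
    # here a naive substring check stands in.
    found = hypothesis.lower() in transcript.lower()
    return {"hypothesis": hypothesis, "supported": found}

def analyze(transcript: str, hypotheses: list[str]) -> list[dict]:
    # Map step: one narrow task per hypothesis, run in parallel,
    # mirroring the Send() fan-out described above.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(evaluate_hypothesis, transcript, h)
                   for h in hypotheses]
        # Reduce step: gather the per-hypothesis verdicts in order.
        return [f.result() for f in futures]
```

The point is the shape, not the stub: each worker sees one hypothesis, so no single prompt has to juggle all ten at once.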
But the transcripts were long — roughly an hour of conversation — and even with the per-hypothesis decomposition, I was still seeing the same attention problem at a smaller scale. A transcript might hammer home one piece of evidence for hypothesis one throughout the call, but mention a second relevant point only once, deep in a Q&A response. The model would miss it.
The natural next step was to break the transcript itself into logical sections and analyze each one independently. Earnings calls have a pretty consistent structure: opening remarks, CEO commentary, CFO financial review, then a long Q&A section where a moderator calls on analysts one at a time. If I could chunk the transcript along those natural boundaries, each chunk would be small and focused enough that the model wouldn’t lose the thread.
The chunking problem
This turned out to be much harder than I expected. At the time — this was mid-2024, peak RAG era — the standard approaches to chunking were designed for retrieval pipelines: split by token count, maybe with some overlap, maybe try to break at sentence or paragraph boundaries. These were fine for RAG but weren’t designed for what I needed. They might split an answer into two chunks where the second half builds on the first — and in doing so, lose the context of what question was being answered. Or they’d lump a question in with the previous response instead of the following one, putting the context in exactly the wrong place. The tools were just looking for sentence ends and paragraph breaks — they had no concept of the semantic structure of the conversation.
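To make the failure mode concrete, the retrieval-era chunkers boil down to something like this sketch — a fixed window slid with some overlap, blind to who is speaking or what is being asked:

```python
def split_by_chars(text: str, size: int, overlap: int) -> list[str]:
    # Classic retrieval-style chunking: fixed-size windows with overlap.
    # It knows nothing about speakers, questions, or answers, so a Q&A
    # exchange can be cut anywhere.
    step = max(1, size - overlap)  # guard against a non-advancing window
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks
```

Run on a tiny Q&A exchange, the question and its answer routinely land in different chunks — exactly the context loss described above.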
So I thought: why not have an LLM do the chunking? Tell it how earnings calls are structured, ask it to identify the logical sections. I’d seen Greg Kamradt’s “5 Levels of Text Splitting” framework making the rounds — the top level was what he called agentic chunking, where you essentially ask an LLM to be the chunker. That seemed like exactly the right approach.
It was — in theory. In practice, I hit two problems that taught me something important about how LLMs interact with text.
Problem one: reproduction fidelity. When I asked the LLM to return each chunk as text, it would subtly alter the transcript. Nothing dramatic — it might remove extra line breaks, split a run-on sentence into two, clean up formatting. Honestly, the changes often improved the text. But I was using these chunks as source material for downstream analysis, and I needed them to be the exact original text. Even small changes meant I couldn’t guarantee fidelity, and at scale I was worried those small changes might occasionally become bigger ones.
Problem two: counting. The obvious workaround was to have the LLM tell me where each section started and ended — character positions — so I could slice the original transcript myself. But LLMs are terrible at counting characters. They’d be off by dozens or hundreds of positions, and the resulting slices were useless.
I tried a bunch of approaches. Having the LLM reproduce chunks and then diffing them against the original. Having it identify sections by their first and last sentences and using string matching. Iterative feedback loops where I’d show it the diffs and ask it to correct them. That last one was particularly frustrating — it would fix three errors and introduce three different ones, endlessly cycling.
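The diff check from the first approach is simple to sketch with the stdlib's `difflib`: reproduce a chunk, compare it line-by-line against the original, and surface any drift. This is a minimal illustration of that verification step, not the exact code I used:

```python
import difflib

def fidelity_report(original: str, reproduced: str) -> list[str]:
    # Return the lines where an LLM-reproduced chunk deviates from the
    # source text; an empty list means the reproduction is exact.
    diff = difflib.unified_diff(
        original.splitlines(), reproduced.splitlines(),
        fromfile="original", tofile="reproduced", lineterm="")
    return [line for line in diff
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]
```

The check itself was reliable — the problem was what came after it: feeding the report back to the model fixed some deviations and introduced others.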
The spaCy trick
The breakthrough came from a different part of the AI world entirely.
I’d been thinking about how existing chunking tools worked — looking for periods, line breaks, punctuation — and their limitations. Abbreviations like “Mr.” would trip them up, creating false sentence boundaries. What I needed was something that could reliably identify sentence boundaries in messy, real-world text.
That led me to spaCy. For those unfamiliar, spaCy is a natural language processing library — a tool from the pre-LLM era of AI that handles tasks like tokenization, part-of-speech tagging, and sentence boundary detection. It uses pre-trained models that understand the structure of language at a grammatical level, not through statistical next-token prediction.
What I did was simple: I used spaCy to identify every sentence in the transcript, then prepended a number to each one. So the LLM didn’t see raw text — it saw:
[1] Good afternoon, and welcome to the Q3 earnings call. [2] I’m joined today by our CEO and CFO. [3] Before we begin, I’d like to remind everyone that...
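Producing that numbering takes only a few lines. Here is a self-contained sketch — with the caveat that the real pipeline used spaCy's `doc.sents` for the boundaries, while this version uses a naive regex splitter so it runs without a downloaded model (the regex would stumble on abbreviations like "Mr.", which is exactly why spaCy was used instead):

```python
import re

def number_sentences(text: str) -> tuple[str, list[str]]:
    # Naive stand-in for spaCy sentence segmentation: split after
    # terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    # Prepend a stable [n] label to each sentence; these labels are
    # what the LLM anchors to instead of counting characters.
    numbered = " ".join(f"[{i}] {s}" for i, s in enumerate(sentences, 1))
    return numbered, sentences
```

Keeping the original `sentences` list alongside the numbered text matters: the numbers go to the LLM, but the untouched sentences are what get reassembled into chunks later.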
Now when I asked the LLM to identify the logical sections of the call, it didn’t need to reproduce any text or count any characters. It just had to say: “The CEO’s opening remarks run from sentence 14 to sentence 47. The CFO section runs from sentence 48 to sentence 103.” The LLM was anchoring to labels that were already in the text rather than trying to do the exact positional counting it’s fundamentally not built for.
This worked remarkably well. The chunking became fast and reliable. I could enforce guardrails — verify that every sentence number appeared in exactly one chunk, flag any gaps. The downstream analysis improved dramatically because each chunk was now a coherent, complete section of the conversation, and the text sent to the analysis agents was the exact original transcript, not an LLM’s approximation of it.
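The guardrails and the exact-text reassembly described above might look like this sketch, with chunk ranges expressed as 1-indexed `(start, end)` sentence numbers — the form an LLM would return them in:

```python
def validate_chunks(ranges: list[tuple[int, int]], n_sentences: int) -> list[str]:
    # Guardrail: every sentence number must appear in exactly one chunk.
    # Report overlaps and gaps; an empty list means a clean partition.
    seen = set()
    problems = []
    for start, end in ranges:
        for i in range(start, end + 1):
            if i in seen:
                problems.append(f"sentence {i} appears in more than one chunk")
            seen.add(i)
    missing = [i for i in range(1, n_sentences + 1) if i not in seen]
    if missing:
        problems.append(f"gap: sentences {missing} are in no chunk")
    return problems

def slice_chunks(sentences: list[str], ranges: list[tuple[int, int]]) -> list[str]:
    # Reassemble each chunk from the *original* sentences, so the text
    # sent downstream is exact, not an LLM's reproduction of it.
    return [" ".join(sentences[start - 1:end]) for start, end in ranges]
```

Because the chunks are sliced from the stored sentence list rather than echoed back by the model, reproduction fidelity stops being a concern entirely.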
Two worlds in conversation
What I love about this solution is that it came from putting two different disciplines into conversation with each other. LLMs and traditional NLP come from related but distinct lineages within AI. The LLM brought the semantic understanding — it knew what a CEO’s opening remarks looked like and where one topic ended and another began. spaCy brought the structural precision — it could reliably decompose text into sentences in a way that was deterministic and exact. Neither could solve the problem alone. Together, they nailed it.
It reminds me of the most interesting classes I took in college — the ones cross-listed across departments. I took a class on psychology and economics that didn’t go particularly deep on either discipline in isolation, but taught you to see how insights from one could reshape assumptions in the other. Building with AI feels like that constantly. The best solutions often come from pulling in ideas that weren’t designed for your specific problem but turn out to be exactly what you need.
And that's where the name comes from
While I was deep in spaCy’s documentation, I ran into the term “out of vocabulary” — it refers to words a model encounters that aren’t in the vocabulary it learned during training. Words it’s never seen before. The model has to figure out what to do with something it has no prior knowledge of.
I liked the resonance. This blog is about building with AI, and a lot of what I want to write about are ideas that feel, at least to me, a little out of vocabulary — things that maybe haven’t been discussed enough, or approaches that combine familiar concepts in unfamiliar ways, or lessons learned from building things where the playbook doesn’t exist yet.
What this is
I’m going to keep this simple. “Out of Vocabulary” is where I’ll share things I’ve learned from building AI systems — ideas, patterns, mistakes, and the occasional strong opinion. Some posts will be technical. Some will be more strategic. All of them started as something I typed to an LLM that seemed worth sharing more broadly.
If I’m being honest, I almost didn’t publish any of this. When you build with LLMs, you spend a lot of time getting enthusiastic validation from your AI collaborator, and after a while you start to wonder: is this actually interesting, or has the sycophancy just convinced me it’s better than it is? I genuinely don’t know. But I figure the only way to find out is to put it out there and see if it resonates with anyone who isn’t trained to be encouraging.
So — welcome.
