Out of Vocabulary

The Context Bottleneck

Out of Vocabulary — Sat, 11 Apr 2026 00:00:03 GMT

What follows started as a brainstorming prompt to an LLM. Seemed fitting to share it here.

For most of the twentieth century, American auto manufacturers managed their suppliers the same way — write exhaustive specifications and put the contract out for bid. The assumption was simple: the more precise your instructions, the better the output.

Toyota did the opposite. Their engineers handed suppliers specifications using words like gotsu gotsu (a low-frequency, high-impact motion felt in the lower back) or buru buru (a high-frequency, low-impact vibration felt in the belly) — a specialized but deliberately imprecise vocabulary that described how a component should feel rather than exactly how it should measure. This could have been a disaster. But Toyota developed the fastest and most efficient vehicle development cycle in the industry, using fewer engineers than its competitors. The difference wasn’t that Toyota had better suppliers — they often used the very same ones the Big Three were fighting with. It was that Toyota shared deep context and motivations instead of just instructions. Supplier engineers lived full-time in Toyota’s design offices on two- to three-year rotations, absorbing not just what Toyota wanted built, but why — the design philosophy, the customer intent, the trade-offs that mattered. When a Toyota engineer used a term like gotsu gotsu, a supplier who’d spent two years immersed in that context knew exactly what it meant. A supplier working from a spec sheet at arm’s length could never match that. This is a key part of Keiretsu.

Today, practitioners agree that better context is key to producing the best results with AI. But there’s an underexplored dimension of the context problem that I haven’t seen discussed widely: not just how to give AI context, but how to share the right information across the boundaries where it gets stuck today — between departments, organizations, and the vendors they work with. We need a modern Keiretsu — one that works at AI scale.

Let me walk through how I got here.

A list of questions that looks fine but isn’t

I’ve been building an AI interview framework, and a potential customer recently sent me a list of questions they wanted my AI interviewer to use for a demo:

Overall usefulness — In your experience, how useful was Product X during your workday?
Relevance — Did the insights or alerts generated by the system feel relevant and actionable? Can you provide specific examples?
Impact on decision-making — Did the platform influence any operational decisions you made (e.g., X, Y, Z)? If so, how?
Workflow integration — How well did Product X fit into your existing workflow? Did it feel like a natural augmentation or an additional task?
Signal vs. noise — Did you feel the system helped you identify important events as they were happening (e.g., X, Y, Z)? If not, would you prefer some form of alert or notification?
Awareness of key issues — Did the platform improve your awareness of X, Y, or Z issues? If yes, in what ways?
User interface and usability — How intuitive was the interface? Were there any parts of the interface that were confusing or slowed you down?
Trust in the system — How much did you trust the insights or recommendations generated by the system? What would increase your trust?
Highest value features — Which specific features or outputs from Product X felt most valuable to you?
Missing capabilities — What important information, functionality, or workflow support did you feel was missing from the system?

These look like perfectly reasonable research questions. But here’s the problem: if you gave this same list to five different AI interview tools in a bake-off, there would be almost no difference in the quality of what each one produced. The differences would mostly be at the edges. Some AI interviewers would be more conversational (providers using OpenAI’s realtime API would excel here, and the rest would differ based on how they handle voice activity detection, or VAD, and pipeline latency). Some would transcribe better (vendors who give their speech-to-text, or STT, providers custom domain-specific vocabulary would do well, and some might even use multiple STT providers in creative ways).

But the quality of the questions themselves? That’s where the field really flattens out. There’s no background knowledge of the company, its product, what the founders care about, or prior interviews to draw on. Without that, you can’t ask meaningful, deep-hitting questions. You just get polite, generic follow-ups to polite, generic answers.

This reminded me of a quote from Amp in the context of a product pivot: “With GPT-5.3-Codex, the agent is no longer the bottleneck. Our ability to tell it what to do is.”

What kind of context actually matters?

At first I thought the answer could be context on how to conduct an interview — scan a hundred books on the art of interviewing, distill that knowledge into great prompts, maybe even fine-tune a model. Make sure the AI doesn’t ask leading questions. Make it good at connecting surface-level responses to deeper needs. This is genuinely useful, but it wasn’t the big unlock that I hoped.

It’s only part of the picture. While at Google and YouTube, I spent over 100 days speaking with users or listening to moderators do so. Hot take: the most useful interviews are not those conducted by a rigorously trained professional researcher or moderator. I want to be careful here — professional UX researchers are incredibly good at what they do, and there are domains where that rigor is exactly what you need: usability studies, academic research, longitudinal studies, and more. But when the goal is to inform a product or business decision, a purely rigorous academic approach tends to produce balanced, “correct” reports — thorough and polished, but not always the sharpest tool for someone who needs to make a bold bet.

Product, marketing, and business visionaries don’t make their best decisions when everything is balanced. The decisions that actually move businesses forward are opinionated, pointed — and right. In my experience, the most interesting and valuable research comes from someone who has both interviewing skill and — more importantly — deep, intimate knowledge of and opinions about a problem space or product.

And that’s what’s missing from creating an AI interviewer from just this list of questions. To build one that conducts truly impactful interviews, we need not only the interviewing skills, but also deep, intimate knowledge and context about the problem space and the company.

The deck gap

I could ask the prospective customer for more context — how they think about the business, where the product is headed, what hypotheses they’re operating under — and translate that into prompts for the AI interviewer. But in practice, what I’d likely get from the customer is a distilled version of that thinking (a PPT), shaped for sales calls, roadmap presentations, or investor conversations. This flow of information isn’t very efficient. A deck is often meant to accompany a voiceover. Even when it’s not, it’s a format built for humans to tell a visual story. Can a multimodal LLM consume it? Sure. But it’s far from the most efficient way. What works best is a well-constructed body of text that LLMs can access and use to generate more impactful prompts.

When we step back and think about where the world is headed, it becomes increasingly likely that the deck itself was created using a not-dissimilar well-organized set of text — given to the founder’s AI tool of choice. So why not just share the prompts used to create the deck rather than the deck itself? It’s more useful, avoids a very lossy translation, and saves everyone time going back and forth with LLMs.

There’s an even deeper issue with sharing a deck or any polished deliverable: it’s a snapshot of the conclusion, not the journey. A friend in marketing put it well: “How often do you send a brief to an agency, only for them to come back with an idea you’ve already discussed and killed?” This happens constantly. Given similar constraints, smart people (and smart AI tools) will often converge on the same answer. But sometimes that answer was already tried and abandoned for reasons that never made it into the final artifact. The discussions about why the obvious path didn’t work are often the most enlightening, and they’re precisely what gets lost. Ask someone who’s been on a project from day one to describe it and you’ll get a far richer picture than from someone who joined last week. I’d call this upstream context — and for vendors especially, it’s transformative. A sales prospect you’re pitching cares about what your product offers today and where it’s going tomorrow. But a vendor you work with benefits immensely from understanding how you got here.

Getting from here to a world where context flows freely to the AI tools that need it requires solving three problems — each interesting in its own right, but building toward the one I think represents the biggest opportunity.

Problem 1: Well-formatted context

The first problem is format. Engineering organizations had a head start here because so much of their source material was already LLM-friendly. Code is text. Config files are text. Documentation is markdown.

Other parts of an organization aren’t so lucky. Marketing lives in slide decks and brand guides full of images and layout. Sales lives in CRM notes, call recordings, and PDFs of varying quality. Strategy lives in spreadsheets and board decks. The information exists, but it’s trapped in formats that weren’t designed for LLM consumption.

It’s fascinating to watch how AI tools are handling this gap today. Most AI services seem to process PDFs by parsing out the text and sending it alongside a rendered image to a multimodal model. PowerPoint files get even more interesting treatment: some tools work directly with the underlying XML structure, which means they can modify slides programmatically but also means the “understanding” of a deck is really an understanding of its markup rather than its narrative. And in both cases, the approach is token-inefficient — what could be clean markdown ends up as a blob of junk text plus an image, polluting the context window unless someone does careful preprocessing with a subagent.

Frontier labs seem to do this best today — better document parsing, richer multimodal understanding, longer context windows. But these are all solutions that try to make LLMs better at consuming messy formats. The more interesting question might be: what if we made the formats less messy in the first place? What if, alongside your deck, you maintained a structured context document — a machine-readable source of truth about your business that any AI tool could consume efficiently?

Getting context into the right format is only the first step. Once it’s machine-readable, it still needs to be organized and sharable.

Problem 2: Well-organized context

We might be giving our prospective customer too much credit for having a “well-organized set of text” that describes their business. More likely, the deck emerged from a messy back-and-forth conversation as their AI helped them iterate toward something beautiful. Maybe they’d talked to Claude so much that their prompts were enriched by an abundance of MEMORY.md files floating around on their machine or in the cloud. But this is a far cry from a canonical, structured source of truth.

We’re already seeing early versions of this in engineering. Look at how Stripe describes the design of their “context gathering: rule files” in their minions blog post: they build per-directory, well-structured text to give LLMs the right context for coding decisions. They don’t rely on individual developers to type out large prompts. They’ve institutionalized it, canonicalized it, and made it accessible to the tools and frameworks that call the LLMs their developers use. They standardized on the rule format and sync those rules into a format Cursor or Claude Code can read as well — so their three most popular coding agents can all benefit from the same guidance.

So we see this is happening on the bleeding edge — within forward-thinking organizations, for engineering use cases — but primarily within a single organization. Well-formatted, well-organized context is a necessary foundation — but when it lives entirely inside one organization, it only solves half the problem.

Problem 3: Context sharing across boundaries

This is where I think the big opportunity lives.

The boundaries that matter aren’t just between companies — they’re everywhere information gets siloed. Within a single organization, departments often operate with limited visibility into each other’s context, and frequently for good reason. You might not want the compliance team sharing everything they work on with the entire company. But you probably do want them to share enough context that everyone can make sure they’re following the rules. Sales has context about what specific customers say they need. Product has context about what’s being built and why. Marketing has context about how the product is positioned. Each of these would make the others’ AI tools dramatically more useful — but today that context rarely flows between them in a structured, organized, machine-readable way.

The same problem exists across organizational boundaries, where it gets even harder. Even if an organization builds beautifully structured, LLM-ready context about its business (its strategy, product, market position, and users), it will be hesitant to share the full version with external partners and vendors. The concern isn’t unfounded: that context is often tangled up with personally identifiable information, proprietary competitive intelligence, or internal strategy. Yet a vendor almost never needs the PII or the raw proprietary data to do a good job. They need the insights, the strategic direction, the synthesized understanding. The challenge is that today there’s no good mechanism to separate the signal from the sensitive and share just the parts that matter.

Think about my original scenario. If the customer shared rich, structured context about their product, their users, and the key debates happening at their company, my AI interviewer could produce dramatically better results (just as a human moderator with this additional context would). Multiply this across every vendor relationship an organization has — their design agency, their market research firm, their consulting partners, their SaaS tools — and the compounding value of solving this problem becomes clear.

The design space is wide open, and the hypotheticals are fun to think about:

Scoped context views — curated slices of your internal context that expose only what’s relevant to a particular vendor relationship, with everything else redacted or omitted. Think of it like database views, but for organizational knowledge and likely created by agentic systems.
Sandboxed external agents — allow a vendor’s AI agent to operate within your enterprise environment, gather what it needs, and surface the outputs for human approval before anything crosses the boundary.
Context escrow — a neutral intermediary that holds the full picture but only passes through the specific pieces a vendor’s system requests, with audit trails and access controls.
NDA-gated full access — give vendors access to everything and trust the legal framework. This has the advantage of working today, but doesn’t scale and doesn’t address the legitimate instinct to limit exposure.
A context exchange protocol — something like an API contract, but for business knowledge. Organizations publish structured metadata about what context they can share, and vendor tools negotiate access as it’s needed.
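To make the first idea on that list a little more concrete, here is a minimal sketch of a scoped context view. Everything in it is hypothetical: the ContextItem fields, the topic names, and the tag labels are illustrative assumptions, not an existing format or product. It only shows the "database view" shape, where a vendor sees a filtered slice and never the full store.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContextItem:
    """One piece of organizational context (all fields are hypothetical)."""
    topic: str
    text: str
    tags: frozenset = field(default_factory=frozenset)


def scoped_view(items, allowed_topics, blocked_tags):
    """Return only the items a given vendor relationship should see."""
    return [
        item for item in items
        # Keep an item only if its topic is in scope and it carries no blocked tag.
        if item.topic in allowed_topics and not (item.tags & blocked_tags)
    ]


items = [
    ContextItem("product_strategy", "We are prioritizing alert relevance over volume."),
    ContextItem("user_research", "Raw interview notes with names.", frozenset({"pii"})),
    ContextItem("finance", "Runway and fundraising details.", frozenset({"confidential"})),
]

# A view for an interview vendor: strategy and research are in scope,
# but anything tagged as PII or confidential is withheld.
vendor_view = scoped_view(
    items,
    allowed_topics={"product_strategy", "user_research"},
    blocked_tags=frozenset({"pii", "confidential"}),
)
```

A real version would need far more than tag filtering (the tagging itself is the hard part, and probably where agentic systems come in), but the interface could plausibly stay this small.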

None of these ideas are fully baked, and that’s the point. But some of the building blocks are starting to emerge. Protocols like MCP are creating standards for how AI tools connect to external systems. Features like Claude’s Skills are formalizing how reusable context gets packaged and shared with models. The plumbing for cross-boundary context exchange is being laid — but nobody has assembled it into a complete solution for the problem described here.

The problem is real and growing — as AI tools become more capable, the gap between what they could do with the right context and what they actually do with the context they’re given will only widen. The organizations that figure out how to make context flow safely and efficiently across boundaries will unlock enormous value.

Wrapping up

I recently read Alap Shah’s piece on the “Global Intelligence Crisis”, and while I don’t share his dystopian (or is it optimistic?) view of where we’re headed, one quote in particular caught my attention: “AI agents, however, share nearly perfect, continuous context.”

I don’t believe this is true today. Context management is one of the most important — and most underappreciated — skills in building AI systems. Even within a single session, managing an ever-growing context window is difficult, let alone across sessions or across organizations. But getting this right unlocks huge value.

We can format context better (Problem 1). We can organize it better (Problem 2). But the real unlock, the one that changes the game, is making the right context available across boundaries (Problem 3): between departments, between companies, or between the people who hold the knowledge and the AI tools that need it. Solve that, and you don’t just improve one AI tool. You make every AI tool in the ecosystem better at understanding the businesses they serve.

Introducing Out of Vocabulary

Thu, 02 Apr 2026 03:00:06 GMT

Things I type to LLMs that seem worth sharing — with other LLMs and the occasional human.

This is a blog about building with AI. It's called "Out of Vocabulary," and before I get into what it's about and where it's going, I want to tell you the story of how it got its name — because it's also a story about an early lesson in building AI systems.

Where this started

About a year and a half ago, I decided to learn how to build with LLMs and picked a problem I was genuinely curious about. This was actually my second attempt — I'd first tried to build something with CrewAI and moved on pretty quickly. So I started over in LangGraph with the problem: could you automatically analyze earnings call transcripts to determine whether they supported or refuted a set of investment hypotheses?

The use case was straightforward. If you’re an investor following a particular company — say Nvidia — you listen to their earnings call and those of related companies each quarter. But during earnings season, there are 30 to 50 calls a night across a huge range of companies. Buried in those calls might be small tidbits — a passing comment about AI deployment spending, a mention of shifting infrastructure budgets — that are directly relevant to your Nvidia thesis. You’d never catch them all manually. What if an AI agent could?

The naive approach was obvious: paste the transcript and your hypotheses into ChatGPT and ask what it finds. This worked — sort of. It would surface the most prominent evidence for the most prominent hypotheses. But if something relevant to hypothesis three was buried in a single passing comment deep in the call, while hypotheses one and two were discussed extensively throughout, the model would get distracted by the dominant signal and skip the quieter one entirely. (Keep in mind — this was mid-2024. Models have gotten meaningfully better at this kind of task since then.)

Decomposing the problem

The idea that clicked for me early in working with LangGraph was the Send() API — essentially a map/reduce pattern for LLM analysis. Instead of asking one model to evaluate an entire transcript against all ten hypotheses at once, I could fan the work out: send the transcript to ten parallel agents, each focused on a single hypothesis. This alone was a big improvement — each agent had a narrower task and a clearer focus.
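The shape of that fan-out can be sketched without LangGraph at all. The snippet below is not the Send() API. It is a plain-Python stand-in using a thread pool, and the analyze function is a hypothetical placeholder (a keyword match) where a single-hypothesis LLM call would go. What it shows is the map/reduce structure: one narrow task per hypothesis, then a merge.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(transcript: str, hypothesis: str) -> dict:
    """Stand-in for one focused LLM call (hypothetical).

    It only checks whether the hypothesis phrase appears on a line,
    so the sketch runs without a model.
    """
    keyword = hypothesis.lower()
    hits = [line for line in transcript.splitlines() if keyword in line.lower()]
    return {"hypothesis": hypothesis, "evidence": hits}


def fan_out(transcript: str, hypotheses: list[str]) -> list[dict]:
    # Map: one task per hypothesis, run in parallel; results keep input order.
    with ThreadPoolExecutor(max_workers=max(1, len(hypotheses))) as pool:
        results = list(pool.map(lambda h: analyze(transcript, h), hypotheses))
    # Reduce: keep only the hypotheses that turned up any evidence.
    return [r for r in results if r["evidence"]]


transcript = (
    "Data center revenue grew.\n"
    "One customer mentioned shifting infrastructure budgets."
)
findings = fan_out(transcript, ["infrastructure budgets", "gaming demand"])
```

The point of the design is that each worker sees exactly one hypothesis, so a heavily discussed topic has no chance to crowd out a quiet one.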

But the transcripts were long — roughly an hour of conversation — and even with the per-hypothesis decomposition, I was still seeing the same attention problem at a smaller scale. A transcript might hammer home one piece of evidence for hypothesis one throughout the call, but mention a second relevant point only once, deep in a Q&A response. The model would miss it.

The natural next step was to break the transcript itself into logical sections and analyze each one independently. Earnings calls have a pretty consistent structure: opening remarks, CEO commentary, CFO financial review, then a long Q&A section where a moderator calls on analysts one at a time. If I could chunk the transcript along those natural boundaries, each chunk would be small and focused enough that the model wouldn’t lose the thread.

The chunking problem

This turned out to be much harder than I expected. At the time — this was mid-2024, peak RAG era — the standard approaches to chunking were designed for retrieval pipelines: split by token count, maybe with some overlap, maybe try to break at sentence or paragraph boundaries. These were fine for RAG but weren’t designed for what I needed. They might split an answer into two chunks where the second half builds on the first — and in doing so, lose the context of what question was being answered. Or they’d lump a question in with the previous response instead of the following one, putting the context in exactly the wrong place. The tools were just looking for sentence ends and paragraph breaks — they had no concept of the semantic structure of the conversation.
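Here is a small illustration of that failure mode. It is a sketch of the generic fixed-size approach, not any particular library's splitter, and it counts characters rather than tokens to stay dependency-free. The sample exchange is made up.

```python
def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of `size` characters, ignoring structure."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


qa = (
    "Analyst: How is demand trending? "
    "CEO: Demand is strong. That said, supply remains constrained."
)

# The window boundary lands wherever the count says, so part of the CEO's
# answer ends up in a chunk that no longer contains the analyst's question.
chunks = fixed_size_chunks(qa, size=40)
```

Adding overlap softens this but doesn't fix it: the splitter still has no idea that a question and its answer belong together.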

So I thought: why not have an LLM do the chunking? Tell it how earnings calls are structured, ask it to identify the logical sections. I’d seen Greg Kamradt’s “5 Levels of Text Splitting” framework making the rounds — the top level was what he called agentic chunking, where you essentially ask an LLM to be the chunker. That seemed like exactly the right approach.

It was — in theory. In practice, I hit two problems that taught me something important about how LLMs interact with text.

Problem one: reproduction fidelity. When I asked the LLM to return each chunk as text, it would subtly alter the transcript. Nothing dramatic — it might remove extra line breaks, split a run-on sentence into two, clean up formatting. Honestly, the changes often improved the text. But I was using these chunks as source material for downstream analysis, and I needed them to be the exact original text. Even small changes meant I couldn’t guarantee fidelity, and at scale I was worried those small changes might occasionally become bigger ones.

Problem two: counting. The obvious workaround was to have the LLM tell me where each section started and ended — character positions — so I could slice the original transcript myself. But LLMs are terrible at counting characters. They’d be off by dozens or hundreds of positions, and the resulting slices were useless.

I tried a bunch of approaches. Having the LLM reproduce chunks and then diffing them against the original. Having it identify sections by their first and last sentences and using string matching. Iterative feedback loops where I’d show it the diffs and ask it to correct them. That last one was particularly frustrating — it would fix three errors and introduce three different ones, endlessly cycling.
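The first-and-last-sentence idea looks roughly like the sketch below. The function name and the sample text are hypothetical. The appeal is that the slice comes from the original string, so fidelity is guaranteed whenever it works. One plausible way it breaks (an assumption on my part, in line with the reproduction problem above) is when the model quotes an anchor sentence with a small alteration, so the exact match fails.

```python
def slice_by_anchors(original: str, first_sentence: str, last_sentence: str):
    """Return the exact original span between two anchor sentences, or None."""
    start = original.find(first_sentence)
    if start == -1:
        return None  # the quoted opening sentence isn't in the original
    end = original.find(last_sentence, start)
    if end == -1:
        return None  # the quoted closing sentence isn't found after the start
    return original[start:end + len(last_sentence)]


original = "Thank you.  Revenue grew 12%. Margins held. Next question."

# Anchors quoted verbatim: the slice is the untouched original text.
exact = slice_by_anchors(original, "Revenue grew 12%.", "Margins held.")

# Anchor "cleaned up" by the model: no match, so no slice.
altered = slice_by_anchors(original, "Revenue grew 12 percent.", "Margins held.")
```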

The spaCy trick

The breakthrough came from a different part of the AI world entirely.

I’d been thinking about how existing chunking tools worked — looking for periods, line breaks, punctuation — and their limitations. Abbreviations like “Mr.” would trip them up, creating false sentence boundaries. What I needed was something that could reliably identify sentence boundaries in messy, real-world text.

That led me to spaCy. For those unfamiliar, spaCy is a natural language processing library, a tool from the pre-LLM era of AI that handles tasks like tokenization, part-of-speech tagging, and sentence boundary detection. Its pre-trained pipelines are built to analyze the grammatical structure of text, not to predict the next token.

What I did was simple: I used spaCy to identify every sentence in the transcript, then prepended a number to each one. So the LLM didn’t see raw text — it saw:

[1] Good afternoon, and welcome to the Q3 earnings call. [2] I’m joined today by our CEO and CFO. [3] Before we begin, I’d like to remind everyone that...

Now when I asked the LLM to identify the logical sections of the call, it didn’t need to reproduce any text or count any characters. It just had to say: “The CEO’s opening remarks run from sentence 14 to sentence 47. The CFO section runs from sentence 48 to sentence 103.” The LLM was anchoring to labels that were already in the text rather than trying to do spatial reasoning it’s fundamentally not built for.
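A minimal sketch of the numbering and slicing steps follows. To keep it runnable without a model download, it uses a naive regex splitter as a stand-in for spaCy, which is an assumption of this sketch and exactly the kind of splitter that trips on "Mr.". With spaCy, the spans could instead come from something like [(s.start_char, s.end_char) for s in nlp(text).sents]. The function names are hypothetical.

```python
import re


def sentence_spans(text: str) -> list[tuple[int, int]]:
    """Naive stand-in for a real segmenter: (start, end) offsets per sentence."""
    pattern = r"\S.*?(?:[.!?](?=\s|$)|$)"
    return [m.span() for m in re.finditer(pattern, text, flags=re.S)]


def number_sentences(text: str, spans: list[tuple[int, int]]) -> str:
    """Build the labeled view the LLM sees: [1] ... [2] ..."""
    return " ".join(f"[{i}] {text[a:b]}" for i, (a, b) in enumerate(spans, start=1))


def slice_by_sentence_range(text, spans, first: int, last: int) -> str:
    """Exact original text covering sentences first..last (1-based, inclusive)."""
    if not 1 <= first <= last <= len(spans):
        raise ValueError(f"bad sentence range {first}-{last}")
    # Slice the original string by offsets, so whitespace and formatting survive.
    return text[spans[first - 1][0]:spans[last - 1][1]]


text = "Good afternoon.  Welcome to the call.\nRevenue grew. Thank you."
spans = sentence_spans(text)
labeled = number_sentences(text, spans)

# Suppose the LLM answers "the section runs from sentence 2 to sentence 3".
section = slice_by_sentence_range(text, spans, 2, 3)
```

The LLM only ever returns two integers per section. Because the chunk is cut from the original string by character offsets, the double space and line break in the source come through untouched.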

This worked remarkably well. The chunking became fast and reliable. I could enforce guardrails — verify that every sentence number appeared in exactly one chunk, flag any gaps. The downstream analysis improved dramatically because each chunk was now a coherent, complete section of the conversation, and the text sent to the analysis agents was the exact original transcript, not an LLM’s approximation of it.
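The guardrails are cheap to write once everything is expressed as sentence numbers. This is a sketch under assumptions: the function name and the message wording are hypothetical, and it takes the LLM's sections as 1-based inclusive (start, end) pairs.

```python
def validate_chunks(ranges: list[tuple[int, int]], total: int) -> list[str]:
    """Check that the ranges cover sentences 1..total exactly once each."""
    problems = []
    expected = 1  # the next sentence number that still needs a home
    for start, end in sorted(ranges):
        if start < 1 or start > end:
            problems.append(f"invalid range {start}-{end}")
            continue
        if start > expected:
            problems.append(f"gap: sentences {expected}-{start - 1} unassigned")
        elif start < expected:
            dup_end = min(end, expected - 1)
            problems.append(f"overlap: sentences {start}-{dup_end} assigned twice")
        expected = max(expected, end + 1)
    if expected <= total:
        problems.append(f"gap: sentences {expected}-{total} unassigned")
    elif expected > total + 1:
        problems.append(f"range extends past sentence {total}")
    return problems


clean = validate_chunks([(1, 13), (14, 47), (48, 103)], total=103)
gappy = validate_chunks([(1, 13), (15, 47)], total=47)
doubled = validate_chunks([(1, 13), (13, 47)], total=47)
```

An empty list means every sentence landed in exactly one chunk. Anything else can be fed back to the model or flagged for review.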

Two worlds in conversation

What I love about this solution is that it came from putting two different disciplines into conversation with each other. LLMs and traditional NLP come from related but distinct lineages within AI. The LLM brought the semantic understanding: it knew what a CEO’s opening remarks looked like and where one topic ended and another began. spaCy brought the structural precision: it could reliably decompose text into sentences in a way that was deterministic and exact. Neither could solve the problem alone. Together, they nailed it.

It reminds me of the most interesting classes I took in college — the ones cross-listed across departments. I took a class on psychology and economics that didn’t go particularly deep on either discipline in isolation, but taught you to see how insights from one could reshape assumptions in the other. Building with AI feels like that constantly. The best solutions often come from pulling in ideas that weren’t designed for your specific problem but turn out to be exactly what you need.

And that's where the name comes from

While I was deep in spaCy’s documentation, I ran into the term “out of vocabulary.” It refers to tokens a model encounters that aren’t in its vocabulary, typically because they never appeared in its training data. Words it’s never seen before. The model has to figure out what to do with something it has no prior knowledge of.

I liked the resonance. This blog is about building with AI, and a lot of what I want to write about are ideas that feel, at least to me, a little out of vocabulary — things that maybe haven’t been discussed enough, or approaches that combine familiar concepts in unfamiliar ways, or lessons learned from building things where the playbook doesn’t exist yet.

What this is

I’m going to keep this simple. “Out of Vocabulary” is where I’ll share things I’ve learned from building AI systems — ideas, patterns, mistakes, and the occasional strong opinion. Some posts will be technical. Some will be more strategic. All of them started as something I typed to an LLM that seemed worth sharing more broadly.

If I’m being honest, I almost didn’t publish any of this. When you build with LLMs, you spend a lot of time getting enthusiastic validation from your AI collaborator, and after a while you start to wonder: is this actually interesting, or has the sycophancy just convinced me it’s better than it is? I genuinely don’t know. But I figure the only way to find out is to put it out there and see if it resonates with anyone who isn’t trained to be encouraging.

So — welcome.