RAG agents have a deceptively low barrier to entry. You can set up a chatbot for your business by dropping PDFs into a bucket or letting Vertex AI crawl your site, but that speed comes with a hidden tax: the Black Box.
By storing and indexing the entire data source without discretion, these models surface patterns no one explicitly encoded, and that can occasionally reveal solutions that were previously out of reach. Mathematicians, including Terence Tao, have begun experimenting with these models on open research problems. But indiscriminate ingestion also produces logical noise and unnecessary node hops on every request. Beyond the latency, the resource waste is a blank check for cloud providers to charge exorbitant fees for tokens. If you can't see the process, you can't calculate the ROI of your infrastructure.

AI models have transformed how people interface with information, but builders shouldn't develop software through the same opaque interface they use to demo it to stakeholders. The magic trick should be entirely clear to the magician. If the magician and the audience are both blind to what's behind the curtain, can you really promise a consistent performance?
This post is the first half of that curtain-lifting for a car dealership RAG system I'm building. I'll explain how I designed the preprocessing layer and why I made the choices I did. I haven't yet measured retrieval quality end-to-end. The inference pipeline comes next time, and that's where the numbers will live.
Retrieval Augmented Generation (RAG) works in three stages. Retrieval scans curated documents to find the data slices relevant to your prompt. Augmentation ranks those slices and bundles them with your original query into a single prompt. Generation is the final stage, where the LLM reads that combined prompt and produces a response. The benefit of RAG is that you can ground answers in your own data while keeping that data private.
Vector embeddings and traditional SQL
AI models often consume unstructured data to surface new patterns. This has huge applications in industries like finance where information is contextual and unstructured. But many businesses have structured transactional data, the kind you'd use for analytics or product filtering. Car inventory is one of those. A listing already knows its make, model, year, mileage, and price.
Filtering is what SQL was built for. Range queries on price, year, and horsepower run in microseconds against B-tree indexes. A vector search can't beat that on exact dimensions, and throwing embeddings at structured data is how you end up paying cloud providers to re-derive facts you already have in columns.
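To make that concrete, here's a minimal sketch of the structured side. The listings table and its column names are hypothetical, chosen to mirror the dimensions this post names, but the pattern is plain Postgres:

```sql
-- Hypothetical listings table; columns mirror the filter dimensions
-- discussed in this post.
CREATE TABLE listings (
    id           bigserial PRIMARY KEY,
    make         text NOT NULL,
    model        text NOT NULL,
    body_type    text NOT NULL,      -- sedan, suv, ...
    fuel_type    text NOT NULL,      -- hybrid, electric, gas, ...
    transmission text NOT NULL,      -- automatic, manual
    year         int  NOT NULL,
    mileage      int  NOT NULL,
    price        numeric NOT NULL,
    horsepower   int,
    description  text                -- free-text listing copy
);

-- B-tree indexes make range predicates cheap.
CREATE INDEX listings_price_idx ON listings (price);
CREATE INDEX listings_year_idx  ON listings (year);

-- An exact-dimension query that no vector search needs to touch.
SELECT id, make, model, price
FROM listings
WHERE price BETWEEN 15000 AND 25000
  AND year >= 2019
  AND horsepower >= 180;
```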
Vector embeddings earn their cost on the questions SQL can't answer: the ones where the customer's intent doesn't map to any column you thought to create. "Reliable in the snow" isn't a body_type. "Fun to drive on weekends" isn't a fuel_type. These are open-ended descriptions, and the only way to match them against inventory is to project both the customer's phrase and each listing's description into a shared space where semantic similarity becomes geometric closeness.
Your filter columns are short because you decided ahead of time which categories matter. Embeddings are long because the model didn't get to make that decision. It has to represent every possible dimension of meaning a query might care about, because nobody told it which ones a customer would search for. Filter columns are pre-compiled schemas. Embeddings are pre-paid flexibility. The hybrid architecture this post describes is how you pay for each where it makes sense: SQL where the schema is known, embeddings where the questions are open-ended.
Consider a customer who needs an automatic car that drives safely on snowy hills. The relevant concept is "sends torque to all four wheels," which real listings express as either AWD (all-wheel drive, common on automatics) or 4WD (four-wheel drive, common on trucks). A keyword search for "AWD" misses the trucks; a keyword search for "4WD" misses the sedans. An embedding model that has seen both terms used in similar contexts places their vectors close together, and a single semantic query surfaces both categories. That is what vectors are for: collapsing vocabulary gaps between things that mean the same thing to the customer.
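Here's a sketch of how that query looks in the hybrid setup, assuming the pgvector extension and an embedding column populated at ingest. The dimension (384) is a stand-in for whatever your embedding model produces:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Each listing's description is embedded once, at ingest time.
ALTER TABLE listings ADD COLUMN description_embedding vector(384);
CREATE INDEX listings_embedding_idx
    ON listings USING hnsw (description_embedding vector_cosine_ops);

-- Hybrid query: exact dimensions stay in SQL, intent matching moves to
-- vector space. $1 is the customer's phrase ("safe automatic for snowy
-- hills") embedded application-side. Listings that say AWD and listings
-- that say 4WD both land near that vector, so both surface.
SELECT id, make, model, price
FROM listings
WHERE transmission = 'automatic'          -- exact dimension: B-tree filter
ORDER BY description_embedding <=> $1     -- open-ended intent: cosine distance
LIMIT 10;
```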
Category summaries vs RAPTOR
Filtering doesn't just tailor a selection of vehicles for the customer. It's also how I'm shrinking the data pool before it reaches the LLM. The bet is that fewer rows in retrieval mean lower latency and token cost once the inference pipeline is running.
A popular choice in RAG for preprocessing data is RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval). It builds a tree of summaries where the top nodes are less specific than their children. As you descend, the summaries grow more specific, until you reach a cluster of closely related records. RAPTOR is most useful for complex, multi-layered documents where the answer isn't on one page but requires connecting the dots across an entire corpus, as in legal research or scientific literature. For our use case it isn't a good fit. Customer intent is almost always expressed along a small set of filterable dimensions the business already cares about: body_type (sedan, suv), fuel_type (hybrid, electric, gas), and price_bucket (budget, mid, premium, luxury).
So instead of clustering, each unique combination of (body_type, fuel_type, make, price_bucket) becomes one row in a category_summaries table. The groupings are deterministic because SQL produces the same groups from the same input every time. No hierarchy, just a flat list of categories. Maybe 200 of them for a large dealer, each representing one slice of the inventory. Each row carries a precomputed summary string, aggregate stats in JSONB, and a fixed-length embedding of the summary text. The embedding is indexed so semantic lookups stay fast as the category count grows, even as new inventory arrives.
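A sketch of that table and the grouping query, with illustrative bucket thresholds and the same stand-in embedding dimension as above:

```sql
CREATE TABLE category_summaries (
    body_type    text NOT NULL,
    fuel_type    text NOT NULL,
    make         text NOT NULL,
    price_bucket text NOT NULL,
    summary      text,           -- LLM-written description (embedded below)
    stats        jsonb,          -- aggregates: count, price range, ...
    embedding    vector(384),    -- fixed-length embedding of summary text
    stale        boolean NOT NULL DEFAULT true,  -- freshness flag, covered below
    PRIMARY KEY (body_type, fuel_type, make, price_bucket)
);

-- Keep semantic lookups fast as the category count grows.
CREATE INDEX category_summaries_embedding_idx
    ON category_summaries USING hnsw (embedding vector_cosine_ops);

-- Deterministic grouping: the same inventory always yields the same rows.
-- Bucket boundaries are illustrative. Existing rows are left alone here;
-- refreshing their stats and summaries is the background worker's job.
INSERT INTO category_summaries (body_type, fuel_type, make, price_bucket, stats)
SELECT body_type, fuel_type, make,
       CASE WHEN price < 15000 THEN 'budget'
            WHEN price < 30000 THEN 'mid'
            WHEN price < 60000 THEN 'premium'
            ELSE 'luxury' END AS price_bucket,
       jsonb_build_object('count', count(*),
                          'min_price', min(price),
                          'max_price', max(price),
                          'avg_mileage', round(avg(mileage)))
FROM listings
GROUP BY 1, 2, 3, 4
ON CONFLICT (body_type, fuel_type, make, price_bucket) DO NOTHING;
```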
This is what keeps the RAG layer maintainable as the inventory evolves. When a dealer onboards a new fuel_type the system has never seen, such as hydrogen, the category space simply grows by one row the first time a hydrogen listing arrives. No retraining, no re-clustering, no migration. RAPTOR would need the entire tree rebuilt to place the new category correctly; a flat deterministic grouping just appends.
Keeping the index fresh
Once a car is sold, we don't want to recommend it anymore. Since the summary is just precomputed context that the LLM will read later, we can safely overwrite that record whenever the underlying data changes.
The key architectural move is splitting the freshness problem into two stages: detect immediately, regenerate lazily. Every category summary carries a stale boolean column. When a car is sold, repriced, or newly listed, the system computes its category key and flips the matching row's flag in a single-column UPDATE. That write is cheap because it's a single-row boolean flip with no joins and no LLM in the path. If the car's price moved it to a different bucket, both the old and new categories get flagged. The expensive work of re-summarizing the category and re-embedding it is deferred to a background worker that drains the flagged rows on its own schedule.
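The detect stage, sketched with illustrative category values; in practice the application computes the category key from the car that changed:

```sql
-- Detect immediately: one single-row write, no joins, no LLM.
UPDATE category_summaries
SET stale = true
WHERE body_type = 'suv' AND fuel_type = 'hybrid'
  AND make = 'Toyota' AND price_bucket = 'mid';

-- A reprice that crossed a bucket boundary flags both categories.
UPDATE category_summaries
SET stale = true
WHERE body_type = 'suv' AND fuel_type = 'hybrid' AND make = 'Toyota'
  AND price_bucket IN ('mid', 'premium');    -- old and new buckets
```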
The background worker reads the full list of cars in each flagged category and calls an LLM once to generate the descriptive text that gets embedded. This is the one place the preprocessing layer uses an LLM, and it's a deliberate place. LLM calls here happen once per inventory change, not once per customer query. The cost is tied to how often your cars change, which is something you control, rather than to how many people are searching, which is something you don't. The summary is written once and read thousands of times. You pay for it on the write side and amortize the cost across every query that retrieves the category.
There's a scaling ceiling on the "full list" approach. When a category contains more cars than fit in an LLM's context window, the summarizer will need to batch the listings and summarize in rounds, then combine the intermediate summaries.
The worker uses Postgres' built-in FOR UPDATE SKIP LOCKED lock mode as a work queue primitive. Multiple workers can query the stale rows in parallel. The first to touch a row locks it, and the others skip past to grab different rows. No Redis, no message broker, no scheduler, no coordination service. Scaling up means running more worker processes against the same database. And because reads from the main recommendation path ignore rows where stale = true, users are never served a category summary that's mid-regeneration or reflects sold inventory.
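The claim-and-release cycle, sketched as the SQL a worker runs; the re-summarization itself happens in application code between the two statements:

```sql
BEGIN;

-- Claim one stale category. Rows locked by other workers are skipped,
-- so parallel workers never collide on the same category.
SELECT body_type, fuel_type, make, price_bucket
FROM category_summaries
WHERE stale
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- ...application code gathers the category's cars, calls the LLM for a
-- fresh summary, and embeds it...

UPDATE category_summaries
SET summary = $1, stats = $2, embedding = $3, stale = false
WHERE body_type = $4 AND fuel_type = $5
  AND make = $6 AND price_bucket = $7;

COMMIT;

-- The recommendation path only ever reads settled rows.
SELECT summary, stats
FROM category_summaries
WHERE NOT stale
ORDER BY embedding <=> $1       -- query embedding, computed per request
LIMIT 5;
```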
The reason this matters beyond engineering ergonomics goes back to the opening. If you can't see what your system is doing, you can't reason about its cost. A message broker would add moving parts whose contribution to cost and latency is hard to attribute. Postgres doing double duty as both storage and queue keeps the entire data path on one system the operator can introspect with a single SQL client.
Four decisions hold this layer up: SQL where the schema is known and embeddings where the questions are open-ended, deterministic category summaries instead of recursive clustering, a background summarizer whose LLM cost is amortized across every read, and Postgres doing double duty as both storage and queue. Each is a choice you can see, question, and price. The curtain is still there for the audience, but the magician is no longer on the wrong side of it.