How Carey turns your content into a chatbot
Under the hood, Carey uses a technique called retrieval-augmented generation (RAG). Here is the full pipeline, step by step. No PhD required.
You give us your knowledge
FAQs, product docs, pricing pages, policies, transcripts, anything text-shaped. We accept pasted text, file uploads, and live website crawls. This is the raw material your bot will learn from.
We slice it into chunks
Large documents are split into bite-sized passages of a few hundred words each, with a little overlap between neighbors so a sentence that straddles a boundary is not lost. Smaller chunks make retrieval more precise: the bot can quote the exact paragraph that answers a question, not a whole 40-page PDF.
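For the curious, here is a minimal sketch of overlapping chunking. It splits on whitespace and counts words. The function name chunk_words and the default sizes (300 words, 50 shared) are illustrative assumptions, not Carey's actual settings.

```python
def chunk_words(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into passages of up to `chunk_size` words.

    Each passage repeats the last `overlap` words of the previous one.
    Sizes are illustrative defaults, not production values.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window slides each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Tiny demo: 10 words, windows of 4 words that share 1 word with the neighbor.
demo = chunk_words(" ".join(f"w{i}" for i in range(10)), chunk_size=4, overlap=1)
```

In the demo, each chunk starts with the word that ended the previous one, which is the "little overlap between neighbors" described above.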
Each chunk becomes a vector (the embedding step)
This is the heart of the magic. An embedding model reads a chunk of text and turns it into a long list of numbers, typically 3,072 numbers per chunk. That list is called a vector.
The vector is a coordinate in a high-dimensional space where meaning is the geometry. Texts that mean similar things end up close together. "How do I cancel my subscription?" and "I want to end my plan" land near each other, even though they share almost no words. That is the breakthrough.
The model learned this geometry from billions of sentences. It is not matching keywords; it is matching concepts.
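To see why keyword matching struggles here, the sketch below measures plain word overlap between the two example questions. The helper names word_set and jaccard are made up for illustration. It shows the keyword baseline only, not an embedding model.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercase the text and return its distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Shared words divided by total distinct words (0.0 if both are empty)."""
    wa, wb = word_set(a), word_set(b)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


q1 = "How do I cancel my subscription?"
q2 = "I want to end my plan"

shared = word_set(q1) & word_set(q2)  # only "i" and "my" appear in both
score = jaccard(q1, q2)               # 2 shared words out of 10 distinct words
```

The only shared words are filler ("i", "my"), so a keyword matcher has nothing meaningful to go on. An embedding model places the two questions close together anyway.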
Vectors get stored in a vector database
Every chunk's vector is saved alongside the original text, indexed for fast nearest-neighbor search. We use pgvector, a Postgres extension, with an HNSW index. HNSW is an approximate nearest-neighbor index: it trades a tiny amount of exactness for speed, so it can find the closest vectors among millions in milliseconds.
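As a rough sketch, the storage step could look like the SQL below, wrapped in a small Python helper so the pieces are easy to read. The table name chunks, its columns, and the helper pgvector_statements are hypothetical and are not Carey's real schema. The %s placeholder assumes a driver such as psycopg. pgvector documents an upper limit on how many dimensions an HNSW index can cover for each column type, so the column type would need to be checked against the embedding size.

```python
def pgvector_statements(dims: int, top_k: int = 5) -> dict[str, str]:
    """Return illustrative SQL for a pgvector-backed chunk store.

    The schema is hypothetical; only the pgvector syntax is real.
    """
    if dims <= 0 or top_k <= 0:
        raise ValueError("dims and top_k must be positive")

    return {
        # Enable the pgvector extension in this database.
        "extension": "CREATE EXTENSION IF NOT EXISTS vector;",
        # Keep the original text next to its embedding.
        "table": (
            "CREATE TABLE chunks ("
            "id bigserial PRIMARY KEY, "
            "content text NOT NULL, "
            f"embedding vector({dims}) NOT NULL);"
        ),
        # HNSW index using cosine distance.
        "index": "CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);",
        # <=> is pgvector's cosine-distance operator; smaller means closer.
        "search": (
            "SELECT content FROM chunks "
            f"ORDER BY embedding <=> %s::vector LIMIT {top_k};"
        ),
    }


statements = pgvector_statements(dims=3)
```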
A visitor asks a question
When someone types into your chat widget, we embed their question using the same model, turning it into a vector too.
Then we ask the database: "which stored chunks are closest to this question's vector?" We measure closeness with cosine similarity, which is the cosine of the angle between two vectors. A small angle gives a score near 1, and that means similar meaning. The top handful of matches become our context.
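Here is a small sketch of that lookup in plain Python: cosine similarity plus a brute-force top-k scan. The three-number vectors and the sample passages are invented for illustration. Real embeddings are far longer, and a real database would use an index rather than scanning every row.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 3):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up toy vectors standing in for real embeddings.
store = [
    ("To cancel, open Settings and choose Cancel plan.", [0.9, 0.1, 0.0]),
    ("We ship to the US and Canada.", [0.0, 0.2, 0.9]),
    ("Refunds are available on request.", [0.6, 0.7, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of "How do I cancel?"

matches = top_k(query_vec, store, k=2)
```

With these toy numbers, the cancellation passage ranks first and the shipping passage is left out of the top two.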
The LLM answers, grounded in your content
We hand the retrieved chunks to a large language model along with the user's question and instructions like "answer using only this context, and say you don't know if it isn't there."
The LLM is great at writing fluent, on-tone answers. But left alone, it can hallucinate. By having it read your retrieved passages first and instructing it to stick to them, we keep it far more honest. It cites your facts in your voice.
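The grounding step comes down to assembling a prompt. The sketch below shows one plausible way to do it. The function build_prompt, the exact wording, and the numbered-passage layout are assumptions for illustration, not the prompt Carey actually sends.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine instructions, retrieved passages, and the user's question."""
    # Number each passage so an answer can be traced back to its source.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How do I cancel?",
    ["To cancel, open Settings and choose Cancel plan.",
     "Refunds are available on request."],
)
```

The resulting string is what would be sent to the language model. Numbering the passages also makes it easy to show afterward which ones the answer drew on.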
Why this works so well
- Fresh knowledge without retraining. Update a doc, re-embed it. The bot knows the new answer in seconds. No model fine-tuning needed.
- Cheap and fast. Embeddings cost a fraction of a generation call, and vector search is sub-second even at scale.
- Auditable. Because every answer is grounded in retrieved chunks, you can see exactly which passages the bot was given when it said what it did.
- Multilingual by accident. Modern multilingual embedding models map "refund policy" and "política de reembolso" to nearby vectors, so a Spanish question can match an English doc.
Ready to try it?
Paste a doc, get a chatbot. Takes about two minutes.
