A technical build log for the chatbot on this site — how it works, why I built it, what I learned, and the feedback loop that makes it better over time.
I lead Business and Industry Copilots at Microsoft. Shipping AI products is what I do. Shipping one on my own site — grounded in my own writing, with a feedback loop I control — felt like the most honest portfolio piece I could produce. Four goals drove the build.
Visitors ask natural questions about my background. A resume can't do that. LinkedIn can't either. A grounded chatbot can — if it's retrieving from material I actually wrote.
I've published 120+ essays on Medium about AI product craft. Most visitors never click through. A bot that cites my posts in-conversation, with a link, surfaces the work where readers already are.
I sell AI products to enterprises. Shipping one myself — end-to-end, on a $0/month budget — forces every design decision to be defensible and gives me ground truth I can speak from.
Every visitor question is logged. I review them weekly, answer the gaps in a curated FAQ, and redeploy. The bot gets measurably better at the questions it gets asked. That loop is the product.
Two runtime paths and one publishing path. The runtime answers questions; the publishing path is how I add new knowledge to the corpus. Keeping them separate lets each stay simple.
On the runtime path, a visitor's question hits /api/chat, a Netlify Function embeds
it via Hugging Face, finds the most relevant chunks in vectors.json, passes them to
Groq's hosted Llama 3.3 70B, and streams the answer back as Server-Sent Events. The publishing path is
deliberately disconnected from runtime — I edit markdown locally, run the indexer, and push. Netlify redeploys
and the new knowledge is live. No database. No admin CMS. Just files, git, and two Netlify Functions.
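In code, the handler is roughly this shape (a sketch; the helper names are illustrative stand-ins for the real modules, each described in the sections below):

```ts
// Shape of the /api/chat function, simplified. Helpers are hypothetical
// stand-ins for the modules covered in the rest of this post.
declare function embed(text: string): Promise<number[]>;                // bge-m3 via Hugging Face
declare function retrieve(v: number[], topK: number): unknown[];        // cosine over vectors.json
declare function generate(q: string, sources: unknown[]): Promise<ReadableStream>; // Groq, streaming

export default async (req: Request): Promise<Response> => {
  const { question } = await req.json();
  const queryVector = await embed(question);        // 1,024-dim vector
  const sources = retrieve(queryVector, 5);         // trust-tier-boosted top 5
  const stream = await generate(question, sources); // Llama 3.3 70B, token by token
  return new Response(stream, {
    headers: { "Content-Type": "text/event-stream" }, // Server-Sent Events
  });
};
```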
Retrieval-Augmented Generation is a simple idea with a lot of implementation surface. Here's what actually runs every time someone asks the bot a question, step by step, with the tradeoffs I made at each layer.
When a visitor asks "What is Sam's view on AGI?", the first step is turning that question into a vector
— a list of 1,024 floating-point numbers that represents the question's meaning in high-dimensional space. I use
bge-m3, an Apache-2.0 embedding model hosted on Hugging Face Inference (free tier).
It's state-of-the-art on retrieval benchmarks and multilingual by default. I picked it for license
clarity first and benchmark strength second.
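Calling it is a single POST. A minimal sketch against the classic Inference API shape; the exact endpoint and response format can vary by API version, so treat this as the pattern rather than the production code:

```ts
// Embed a question with bge-m3 on Hugging Face Inference (free tier).
// Endpoint and response shape follow the classic feature-extraction API
// and may differ from what actually runs on the site.
const HF_URL = "https://api-inference.huggingface.co/models/BAAI/bge-m3";

async function embedQuestion(question: string): Promise<number[]> {
  const res = await fetch(HF_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.HF_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ inputs: question }),
  });
  if (!res.ok) throw new Error(`HF embedding failed: ${res.status}`);
  // One 1,024-dimensional vector per input.
  const [vector] = (await res.json()) as number[][];
  return vector;
}
```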
Every piece of content I've ingested — resume, FAQs, Medium essays, press features, project pages — was embedded
the same way during a one-time offline indexing pass. The result is a single 36 MB file,
vectors.json, containing 1,130 chunks with their vectors and metadata. The chatbot
function reads this file once per cold start and keeps it in memory.
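Each entry looks roughly like this (a simplified sketch of the real shape; field names are illustrative):

```ts
// Assumed shape of one vectors.json entry.
interface Chunk {
  id: string;
  text: string;       // ~900-character chunk of the source markdown
  vector: number[];   // 1,024 floats from bge-m3
  source_url: string; // canonical link for citations
  trust_tier: 1 | 2;  // 1 = resume/FAQ, 2 = essays/press/projects
  entities: string[]; // people, companies, roles, years, topics
}

// Loaded once per cold start; module scope keeps it cached across
// warm invocations of the same function instance.
import { readFileSync } from "node:fs";
const chunks: Chunk[] = JSON.parse(
  readFileSync("public/vectors.json", "utf8"),
);
```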
pgvector on Postgres, Qdrant, Weaviate, Pinecone — any of these would work. I picked the simplest viable
option. At ~1,100 chunks, loading everything into memory and running cosine similarity in JavaScript takes
under 50 ms per query. There's no database to provision, no schema migration, no latency from a separate
service. Git diffs on vectors.json make every corpus change reviewable. The
ceiling is around 5,000 chunks, after which I'd switch to Neon + pgvector — a three-day migration
with the code abstraction I already built. Until then, a JSON file is the right answer.
The function computes cosine similarity between the question's vector and every chunk's vector, then takes the
top 5. Chunks tagged trust_tier: 1 — my resume and curated FAQs — get a 7% score
boost over trust_tier: 2 chunks (Medium posts, press articles, project pages).
That way, direct factual questions like "where does Sam work?" surface the resume first, while
philosophical questions like "what does Sam think about AGI?" surface the right Medium essay because
nothing in the resume covers it.
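The scoring fits in a few lines. A simplified sketch, reusing the Chunk shape from the sketch above:

```ts
// Exhaustive scoring over all ~1,100 chunks: cosine similarity times a
// 1.07 multiplier for trust_tier 1. The real code may differ in detail.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topChunks(query: number[], chunks: Chunk[], k = 5): Chunk[] {
  return chunks
    .map((chunk) => ({
      chunk,
      score: cosine(query, chunk.vector) * (chunk.trust_tier === 1 ? 1.07 : 1.0),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((s) => s.chunk);
}
```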
The top 5 retrieved chunks become the SOURCES block inside a carefully tuned system prompt. The prompt has three parts: (1) voice exemplars — three paragraphs from my own Medium posts that show rhythm and vocabulary, (2) strict grounding rules ("answer ONLY from the sources; if they don't cover it, say so"), and (3) a third-person-only directive, since visitors are reading about me and the bot shouldn't speak as me.
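Assembled, it looks roughly like this (the exemplar text and exact wording here are placeholders, not the production prompt):

```ts
// Illustrative assembly of the three-part system prompt.
function buildSystemPrompt(sources: Chunk[]): string {
  const sourceBlock = sources
    .map((c, i) => `[${i + 1}] (${c.source_url})\n${c.text}`)
    .join("\n\n");

  return [
    "VOICE EXEMPLARS:\n<three paragraphs from Sam's Medium posts>",
    "RULES: Answer ONLY from the SOURCES below. If they don't cover the question, say so.",
    "Always refer to Sam in the third person.",
    `SOURCES:\n${sourceBlock}`,
  ].join("\n\n");
}
```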
Llama 3.3 70B runs on Groq's free tier at ~300 tokens per second. Answers stream back to the browser token-by-token via Server-Sent Events so readers see the reply arrive live rather than waiting for the full response.
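Groq's endpoint is OpenAI-compatible, so the call is a plain fetch whose streaming body gets forwarded to the browser. A sketch; the model id and error handling are simplified:

```ts
// Streaming chat completion from Groq. The response body is already an
// SSE stream of "data:" lines, one token delta at a time.
async function streamAnswer(system: string, question: string) {
  const res = await fetch("https://api.groq.com/openai/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama-3.3-70b-versatile", // Groq's Llama 3.3 70B model id
      stream: true,
      messages: [
        { role: "system", content: system },
        { role: "user", content: question },
      ],
    }),
  });
  return res.body; // ReadableStream, forwarded to the visitor's browser
}
```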
Every reply gets a confidence score between 0 and 100%. It's a composite of two signals: retrieval similarity (how close was the best chunk?) and model self-report (the model emits a hidden 1–5 confidence tag that the server filters out of the visible stream). Thresholds classify the answer as High, Medium, or Low confidence — the UI shows a color-coded dot plus the percentage.
> **High:** "Sam serves as a Senior Product Manager at Microsoft, focused on Business and Industry Copilots…"
>
> **Medium:** "Sam is skeptical of AGI as a goal. Instead, he proposes a Leader-Agent Model: a leader agent querying a board of subject-matter expert agents for counsel…"
>
> **Low:** "That's a good question, and honestly I don't have a confident answer from what Sam has written on this so far. I've logged the question and Sam will see it in his review queue…"
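One plausible way to blend the two signals (the weights and thresholds here are illustrative, not the tuned values):

```ts
// Composite confidence: retrieval similarity plus the model's hidden
// 1–5 self-report, mapped to a 0–100% score and a display tier.
function confidence(bestSimilarity: number, selfReport: number): number {
  const retrieval = bestSimilarity;   // cosine score, roughly 0..1
  const model = (selfReport - 1) / 4; // 1–5 tag mapped to 0..1
  return Math.round((0.6 * retrieval + 0.4 * model) * 100);
}

const tier = (s: number) => (s >= 70 ? "High" : s >= 45 ? "Medium" : "Low");
```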
Beyond the vector match, each chunk carries metadata tags extracted during indexing: people, companies, roles, years, topics. When a question mentions a specific entity — say, "Nuance" or "Watson" — the retrieval can pre-filter to chunks tagged with that entity before scoring by vector similarity. This is a simplified version of Microsoft Research's GraphRAG approach: the benefit of entity-aware filtering without the cost of building a full knowledge graph.
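The pre-filter is a few lines over the same Chunk shape. A sketch with deliberately naive matching:

```ts
// Entity-aware pre-filter: if the question names a known entity, score
// only the chunks tagged with it; otherwise fall back to the full corpus.
function candidateChunks(question: string, chunks: Chunk[]): Chunk[] {
  const q = question.toLowerCase();
  const filtered = chunks.filter((c) =>
    c.entities.some((e) => q.includes(e.toLowerCase())),
  );
  return filtered.length > 0 ? filtered : chunks;
}
```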
Microsoft's GraphRAG builds a knowledge graph using LLM-generated entity triples and community summaries. It's genuinely better for multi-hop reasoning queries. It also costs $20–80 per reindex for a corpus this size — an LLM call per chunk, twice. I picked the 90%-of-the-value / 5%-of-the-cost version: entity tags per chunk, no graph database, no community summaries. If the bot ever gets confused on multi-hop questions (which hasn't happened yet), the clean upgrade path is already scoped.
Everything the bot knows comes from one folder of markdown files: 1,130 chunks across five source types, split by a hand-written chunker at ~900-character boundaries.
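The chunker's core logic, simplified: split on ## headings, then pack paragraphs into chunks of roughly 900 characters:

```ts
// Heading-aware chunking pass, simplified from the real chunker.
function chunkMarkdown(md: string, maxLen = 900): string[] {
  const chunks: string[] = [];
  for (const section of md.split(/\n(?=## )/)) {
    let current = "";
    for (const para of section.split(/\n\n+/)) {
      if (current && current.length + para.length + 2 > maxLen) {
        chunks.push(current); // chunk is full; start a new one
        current = para;
      } else {
        current = current ? `${current}\n\n${para}` : para;
      }
    }
    if (current) chunks.push(current);
  }
  return chunks;
}
```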
Three scripts handle all ingestion. Each writes markdown with YAML frontmatter into content/,
then npm run build-index re-embeds everything and writes a fresh
vectors.json. Git commit, push, Netlify deploys.
Medium lets you export your writing as an HTML archive. A Python script walks all 120 files, pulls the
`<a class="p-canonical">` anchor for the real article URL, strips boilerplate
(auto-summary footers, "8 min read", "Press enter to view image in full size"), and emits one clean markdown
file per post.
For press mentions, project pages, and third-party articles: I paste URLs into a queue file, run
npm run ingest-url, and the script fetches each one, extracts article prose using
Mozilla's Readability (the same engine Firefox Reader Mode uses), converts to markdown with Turndown, and
writes a dated file with the canonical URL preserved. Processed URLs rotate to a log file so they're never
re-ingested.
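The core of that script, assuming the jsdom, Readability, and Turndown stack named above (the frontmatter fields are simplified):

```ts
// Fetch a URL, extract article prose with Readability, convert to
// markdown with Turndown, and emit a file body with YAML frontmatter.
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

async function ingestUrl(url: string): Promise<string> {
  const html = await (await fetch(url)).text();
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  if (!article) throw new Error(`Readability found no article at ${url}`);

  const markdown = new TurndownService().turndown(article.content);
  return `---\ntitle: "${article.title}"\ncanonical_url: ${url}\n---\n\n${markdown}`;
}
```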
The FAQ file is the one source I write entirely by hand, driven by what visitors have actually asked. The telemetry clustering tool tells me which questions I need to answer; I write the answer in my voice; I rebuild the index. See the next section.
Every question ever asked is logged to a Netlify Blobs store — question text, the answer, the confidence score, the retrieved chunk ids, latency, timestamp. Weekly, I review them and decide which ones deserve a curated answer. That loop is what makes the bot better over time.
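Logging a turn is one setJSON call against a named Blobs store (the store and field names here are illustrative):

```ts
// Write one Q&A turn to Netlify Blobs. getStore/setJSON are the real
// @netlify/blobs API; the schema is a sketch of what gets logged.
import { getStore } from "@netlify/blobs";

async function logQuestion(entry: {
  question: string;
  answer: string;
  confidence: number;
  chunkIds: string[];
  latencyMs: number;
}) {
  const store = getStore("telemetry");
  const key = `${Date.now()}-${crypto.randomUUID()}`;
  await store.setJSON(key, { ...entry, ts: new Date().toISOString() });
}
```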
Raw telemetry is noisy. Readers ask the same thing in five different ways: "Is Sam available for consulting?", "Should I hire Sam?", "Does Sam do private work?" — these should collapse into one cluster of demand, not three rows to review individually.
A clustering script embeds every logged question using the same bge-m3 model used for the corpus, then
greedy-clusters at cosine similarity ≥ 0.75. Each cluster gets ranked by size × (1 −
avg_confidence) — so a 4-question cluster at 30% average confidence ranks higher than a one-off at
10%. The tool prints a terminal report with the top gaps and suggested FAQ answers.
```
$ npm run cluster-telemetry
8 clusters from 12 questions (threshold=0.75)
1 clusters have priority ≥ 1.5 — those are the strongest FAQ candidates
[1] ◆ MED 3 asks · 46% avg · priority 1.62
    → What is Sam's view on AGI?
[2] ◆ MED 2 asks · 49% avg · priority 1.02
    → Where did Sam first work / go to college?
[3] ◆ low 2 asks · 50% avg · priority 0.99
    → Does Sam do private consulting?
```
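The clustering pass itself is short. A sketch that reuses the cosine helper from the retrieval section; the priority math matches the report above (3 asks at 46% average confidence gives 3 × 0.54 = 1.62):

```ts
// Greedy clustering: each question joins the first existing cluster whose
// seed question is within the similarity threshold, else starts its own.
// Simplified from the real tool.
interface LoggedQuestion { text: string; confidence: number; vector: number[] }

function clusterQuestions(questions: LoggedQuestion[], threshold = 0.75) {
  const clusters: LoggedQuestion[][] = [];
  for (const q of questions) {
    const home = clusters.find((c) => cosine(c[0].vector, q.vector) >= threshold);
    if (home) home.push(q);
    else clusters.push([q]);
  }
  return clusters
    .map((members) => {
      const avg = members.reduce((s, q) => s + q.confidence, 0) / members.length; // 0–100
      return { members, priority: members.length * (1 - avg / 100) };
    })
    .sort((a, b) => b.priority - a.priority);
}
```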
No IP logging. No cookies beyond an anonymous 7-day session id for correlating a single visitor's questions
across a conversation. No visitor email capture on the low-confidence fallback — the bot simply tells them
the question was logged. The admin endpoint that surfaces telemetry is Bearer-token gated; all bulk-delete
operations require explicit ?confirm=yes.
Eight named components. Every one is either open-source, free-tier, or both. Total operating cost at this traffic: $0/month.
| Role | Component | License |
|---|---|---|
| Static site + Functions + Blobs | Netlify | Proprietary, free tier |
| LLM inference (answering) | Groq — Llama 3.3 70B | Llama Community License |
| LLM inference (ingestion SLM) | Groq — Llama 3.1 8B | Llama Community License |
| Embeddings | Hugging Face Inference — bge-m3 | Apache 2.0 model |
| Vector store | public/vectors.json, cosine in JS | — |
| Telemetry store | Netlify Blobs | Proprietary, free tier |
| Runtime framework | Node 20 + TypeScript + native fetch | OSS |
| Admin auth | Single ADMIN_TOKEN + timingSafeEqual | — |
The ingestion + telemetry tooling is as much of the project as the chat function itself. Everything below
runs via `npm run <name>`:
| Script | What it does |
|---|---|
build-index | Reads every markdown file under content/, chunks on ## headings with a ~900-char paragraph splitter, embeds via bge-m3, extracts entity tags via Llama 3.1 8B, writes public/vectors.json. |
ingest-url | Paste URLs into _urls.txt, fetch each, extract article prose via Mozilla's Readability, convert to markdown, write a dated file with canonical URL preserved. |
cluster-telemetry | Pull questions from Blobs, embed each, greedy-cluster by cosine similarity, rank clusters by priority, print a terminal report with the top FAQ candidates. |
pull-telemetry | Stream the full question log to stdout, with optional status/date filters. Feeds into jq or any text tool. |
clear-telemetry | Bulk-delete logged questions (all, by status, by date, or by id). Preview-first, confirm-required for bulk operations. |
Every major component had a higher-ceiling alternative I rejected on purpose. Each decision has an escape hatch if the tradeoff ever stops making sense.
Rejected: Postgres + pgvector on Neon, Qdrant, Pinecone. Why: at <5,000 chunks the JSON-in-memory approach is faster, simpler, and git-diffable. Every corpus change is a readable PR. Escape hatch: the retrieval abstraction means swapping to pgvector is a three-day refactor I've already scoped.
Rejected: LlamaIndex.TS, LangChain.js. Why: my pipeline is fixed. I retrieve, rerank-lite, generate, score. I don't need a framework's flexibility for multi-corpus workflows I'll never have. Debug trace is one file, not seven abstraction layers. Escape hatch: dropping in LlamaIndex later is one day's work — the retrieval interface is clean.
Rejected: Together.ai paid fallback, Ollama local. Why: Groq's free tier handles realistic portfolio-site traffic. When it rate-limits, visitors get the confidence-scored fallback message — better than exposing a half-broken paid second vendor. Escape hatch: adding Together as a fallback is a one-file change.
Rejected: Auth.js magic-link email flow, Supabase Auth. Why: I'm the only admin. Magic-link
via email would add two whole services to save 15 lines of token comparison. The token lives in Netlify's
env-var UI, gets passed as a Bearer header or ?token= query param, and
timingSafeEqual handles the compare. Escape hatch: Auth.js ports cleanly
if I ever have more than one admin.
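The whole auth layer, roughly: hash both tokens to a fixed length so timingSafeEqual never throws on a length mismatch. A sketch of the pattern, not the verbatim source:

```ts
// Timing-safe admin check against the ADMIN_TOKEN env var. sha256 gives
// both buffers the same length, which timingSafeEqual requires.
import { createHash, timingSafeEqual } from "node:crypto";

function isAdmin(presented: string | null): boolean {
  const expected = process.env.ADMIN_TOKEN;
  if (!presented || !expected) return false;
  const h = (s: string) => createHash("sha256").update(s).digest();
  return timingSafeEqual(h(presented), h(expected));
}
```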
Rejected: Brave / Tavily / Exa search as a tool the bot can call mid-conversation. Why: that abandons the deterministic, vetted-corpus trust model. The bot would answer about topics it's never been vetted on, citing third-party pages that might not be accurate. That's a different product entirely — a retrieval-augmented agent, not a reliable bio bot. If I ever want that capability, it lives in its own tool, not polluted into this one.
Everything I haven't done yet, and honest reasons why.
Ask it about my career, my AI product work, my writing, anything from my background. Every question you ask helps me make it better.
Open sambobo.com and try the bot →