Build Log · April 2026

Building a RAG chatbot that sounds like me.

A technical build log for the chatbot on this site — how it works, why I built it, what I learned, and the feedback loop that makes it better over time.

1,130 retrieval chunks
120+ Medium essays ingested
8 components in the stack
$0/mo operating cost

Why build this?

I lead Business and Industry Copilots at Microsoft. Shipping AI products is what I do. Shipping one on my own site — grounded in my own writing, with a feedback loop I control — felt like the most honest portfolio piece I could produce. Four goals drove the build.

1. Answer real questions

Visitors ask natural questions about my background. A resume can't do that. LinkedIn can't either. A grounded chatbot can — if it's retrieving from material I actually wrote.

2. Surface my writing

I've published 120+ essays on Medium about AI product craft. Most visitors never click through. A bot that cites my posts in-conversation, with a link, surfaces the work where readers already are.

3. Eat my own dogfood

I sell AI products to enterprises. Shipping one myself — end-to-end, on a $0/month budget — forces every design decision to be defensible and gives me ground truth I can speak from.

4. Build a learning loop

Every visitor question is logged. I review them weekly, answer the gaps in a curated FAQ, and redeploy. The bot gets measurably better at the questions it gets asked. That loop is the product.

System architecture

Two runtime paths and one publishing path. The runtime answers questions; the publishing path is how I add new knowledge to the corpus. Keeping them separate lets each stay simple.

[System architecture diagram. Runtime path (per visitor question): browser chatbot-ui.js → Netlify Function /api/chat → bge-m3 embeddings on HF Inference → public/vectors.json (1,130 chunks, cosine in JS) → Groq Llama 3.3 → SSE stream back, with telemetry written to Netlify Blobs and a Bearer-token /api/admin/questions endpoint. Publishing path (on content change): content/ (resume, FAQs, Medium, press, projects) → npm run build-index (chunk, embed, entity-tag) → git commit + push → Netlify redeploys the live site with new citations. Weekly review feeds new FAQ entries back into content/.]

On the runtime path, a visitor's question hits /api/chat, a Netlify Function embeds it via Hugging Face, finds the most relevant chunks in vectors.json, passes them to Groq's hosted Llama 3.3 70B, and streams the answer back as Server-Sent Events. The publishing path is deliberately disconnected from runtime — I edit markdown locally, run the indexer, and push. Netlify redeploys and the new knowledge is live. No database. No admin CMS. Just files, git, and two Netlify Functions.

How RAG actually works here

Retrieval-Augmented Generation is a simple idea with a lot of implementation surface. Here's what actually runs every time someone asks the bot a question, step by step, with the tradeoffs I made at each layer.

1. Guardrail · PII / off-topic filter
2. Embed · bge-m3, 1024-dim
3. Retrieve · cosine top-5
4. Generate · Llama 3.3 70B
5. Score · confidence composite
6. Stream + log · SSE + telemetry

Embedding — turning text into coordinates

When a visitor asks "What is Sam's view on AGI?", the first step is turning that question into a vector — a list of 1,024 floating-point numbers that represents the question's meaning in high-dimensional space. I use bge-m3, an Apache-2.0 embedding model hosted on Hugging Face Inference (free tier). It's state-of-the-art on retrieval benchmarks and multilingual by default. The model was selected for license clarity first and benchmark strength second.
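Here's a minimal sketch of that embedding call, assuming the Hugging Face Inference feature-extraction endpoint for BAAI/bge-m3. The endpoint, response shape, and error handling are assumptions for illustration, not the site's actual code.

```ts
// embed.ts: turn a question into a 1,024-dim vector via Hugging Face Inference.
// Endpoint and response shape are assumptions; adjust for the actual deployment.
const HF_URL = "https://api-inference.huggingface.co/models/BAAI/bge-m3";

export async function embed(text: string): Promise<number[]> {
  const res = await fetch(HF_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.HF_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ inputs: text }),
  });
  if (!res.ok) throw new Error(`HF Inference failed: ${res.status}`);

  // Feature extraction may return number[] or number[][]; normalise to one flat vector.
  const data: unknown = await res.json();
  const vector = Array.isArray((data as number[][])[0])
    ? (data as number[][])[0]
    : (data as number[]);
  return vector; // expected length: 1,024 for bge-m3
}
```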

Every piece of content I've ingested — resume, FAQs, Medium essays, press features, project pages — was embedded the same way during a one-time offline indexing pass. The result is a single 36 MB file, vectors.json, containing 1,130 chunks with their vectors and metadata. The chatbot function reads this file once per cold start and keeps it in memory.
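For illustration, the chunk records and the once-per-cold-start load might look roughly like this; the field names are assumptions inferred from the description, not the real schema.

```ts
// vectors.ts: load the prebuilt index once per cold start and keep it in memory.
import { readFile } from "node:fs/promises";

// Assumed record shape: one entry per chunk, with its embedding and metadata.
export interface Chunk {
  id: string;
  text: string;
  source: string;      // e.g. "resume", "medium", "faq"
  url?: string;        // canonical link used for citations
  trust_tier: 1 | 2;   // 1 = resume/FAQ, 2 = essays/press/projects
  entities: string[];  // tags extracted at index time
  vector: number[];    // 1,024-dim bge-m3 embedding
}

let cache: Chunk[] | null = null;

export async function loadChunks(path = "public/vectors.json"): Promise<Chunk[]> {
  if (!cache) {
    cache = JSON.parse(await readFile(path, "utf8")) as Chunk[];
  }
  return cache; // ~1,130 entries, well under the ~5,000-chunk ceiling
}
```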

Why a JSON file instead of a vector database?

pgvector on Postgres, Qdrant, Weaviate, Pinecone — any of these would work. I picked the simplest viable option. At ~1,100 chunks, loading everything into memory and running cosine similarity in JavaScript takes under 50 ms per query. There's no database to provision, no schema migration, no latency from a separate service. Git diffs on vectors.json make every corpus change reviewable. The ceiling is around 5,000 chunks / 20 MB, after which I'd switch to Neon + pgvector — a three-day migration with the code abstraction I already built. Until then, a JSON file is the right answer.

Retrieval — cosine similarity, with a trust boost

The function computes cosine similarity between the question's vector and every chunk's vector, then takes the top 5. Chunks tagged trust_tier: 1 — my resume and curated FAQs — get a 7% score boost over trust_tier: 2 chunks (Medium posts, press articles, project pages). That way, direct factual questions like "where does Sam work?" surface the resume first, while philosophical questions like "what does Sam think about AGI?" surface the right Medium essay because nothing in the resume covers it.
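A sketch of that scoring step, reusing the assumed Chunk shape above. The 7% boost is applied as a plain multiplier on tier-1 chunks; the real code may weight it differently.

```ts
// retrieve.ts: cosine similarity over every chunk, with a trust-tier boost.
import type { Chunk } from "./vectors";

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

export function retrieve(query: number[], chunks: Chunk[], k = 5) {
  return chunks
    .map((chunk) => {
      const similarity = cosine(query, chunk.vector);
      // Resume and curated FAQ chunks (tier 1) get a 7% edge over tier 2.
      const score = chunk.trust_tier === 1 ? similarity * 1.07 : similarity;
      return { chunk, similarity, score };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```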

Generation — grounded answer, streamed live

The top 5 retrieved chunks become the SOURCES block inside a carefully tuned system prompt. The prompt has three parts: (1) voice exemplars — three paragraphs from my own Medium posts that show rhythm and vocabulary, (2) strict grounding rules ("answer ONLY from the sources; if they don't cover it, say so"), and (3) a third-person-only directive, since visitors are reading about me, not as me.

Llama 3.3 70B runs on Groq's free tier at ~300 tokens per second. Answers stream back to the browser token-by-token via Server-Sent Events so readers see the reply arrive live rather than waiting for the full response.
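Groq's API is OpenAI-compatible, so the streaming call looks roughly like the sketch below; the model id and parsing details are assumptions, and the real handler re-emits these tokens to the browser as its own SSE stream.

```ts
// generate.ts: stream a grounded answer from Groq's hosted Llama 3.3 70B.
// Model id and payload follow Groq's OpenAI-compatible API; treat both as assumptions.
export async function* generateAnswer(systemPrompt: string, question: string) {
  const res = await fetch("https://api.groq.com/openai/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama-3.3-70b-versatile",
      stream: true,
      messages: [
        { role: "system", content: systemPrompt }, // voice exemplars + grounding rules + SOURCES
        { role: "user", content: question },
      ],
    }),
  });
  if (!res.ok || !res.body) throw new Error(`Groq request failed: ${res.status}`);

  const decoder = new TextDecoder();
  let buffer = "";
  for await (const part of res.body as unknown as AsyncIterable<Uint8Array>) {
    buffer += decoder.decode(part, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep the incomplete tail for the next read
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const payload = line.slice(6).trim();
      if (payload === "[DONE]") return;
      const token = JSON.parse(payload).choices?.[0]?.delta?.content;
      if (token) yield token as string;
    }
  }
}
```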

Confidence scoring — telling readers how sure the bot is

Every reply gets a confidence score between 0 and 100%. It's a composite of two signals: retrieval similarity (how close was the best chunk?) and model self-report (the model emits a hidden 1–5 confidence tag that the server filters out of the visible stream). Thresholds classify the answer as High, Medium, or Low confidence — the UI shows a color-coded dot plus the percentage.
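An illustrative composite with made-up weights and band thresholds (the production formula isn't published here):

```ts
// confidence.ts: blend retrieval similarity with the model's hidden self-report.
// The weights and band thresholds below are illustrative assumptions.
export type Band = "High" | "Medium" | "Low";

export function confidence(bestSimilarity: number, selfReport: number): { score: number; band: Band } {
  // bestSimilarity: cosine score of the top chunk, roughly 0 to 1.
  // selfReport: hidden 1-5 tag the model emits, stripped from the visible stream.
  const score = Math.round(100 * (0.6 * bestSimilarity + 0.4 * ((selfReport - 1) / 4)));
  const band: Band = score >= 70 ? "High" : score >= 40 ? "Medium" : "Low";
  return { score, band };
}
```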

Live demo

What does Sam do at Microsoft?
84% confident · sources: Resume — Microsoft

Sam serves as a Senior Product Manager at Microsoft, focused on Business and Industry Copilots…

What is Sam's view on AGI?

Sam is skeptical of AGI as a goal. Instead, he proposes a Leader-Agent Model: a leader agent querying a board of subject-matter expert agents for counsel…

What is the capital of France?
13% confident · fallback triggered

That's a good question, and honestly I don't have a confident answer from what Sam has written on this so far. I've logged the question and Sam will see it in his review queue…

GraphRAG Path 4 — entity-aware retrieval

Beyond the vector match, each chunk carries metadata tags extracted during indexing: people, companies, roles, years, topics. When a question mentions a specific entity — say, "Nuance" or "Watson" — the retrieval can pre-filter to chunks tagged with that entity before scoring by vector similarity. This is a simplified version of Microsoft Research's GraphRAG approach: the benefit of entity-aware filtering without the cost of building a full knowledge graph.
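A sketch of that pre-filter, again reusing the assumed Chunk shape; the entity matching here is naive substring comparison, purely for illustration.

```ts
// entity-filter.ts: narrow the candidate set before vector scoring.
import type { Chunk } from "./vectors";

export function entityFilter(question: string, chunks: Chunk[]): Chunk[] {
  const q = question.toLowerCase();
  // Keep chunks tagged with any entity mentioned verbatim in the question.
  const matched = chunks.filter((c) =>
    c.entities.some((e) => q.includes(e.toLowerCase()))
  );
  // Fall back to the full corpus when no entity matches,
  // so ordinary questions still go through plain vector retrieval.
  return matched.length > 0 ? matched : chunks;
}
```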

Why not full GraphRAG with community summaries?

Microsoft's GraphRAG builds a knowledge graph using LLM-generated entity triples and community summaries. It's genuinely better for multi-hop reasoning queries. It also costs $20–80 per reindex for a corpus this size — an LLM call per chunk, twice. I picked the 90%-of-the-value / 5%-of-the-cost version: entity tags per chunk, no graph database, no community summaries. If the bot ever gets confused on multi-hop questions (which hasn't happened yet), the clean upgrade path is already scoped.

What's in the corpus

Everything the bot knows comes from one folder of markdown files: 1,130 chunks across five source types, split on ## headings by a hand-written chunker at roughly 900-character boundaries (a sketch of the chunker follows the list below).

Medium essays — 120 posts, ~1,000 chunks. Trust tier 2. The source of my voice.
External — press features (Sarasota Magazine), project pages (GrIT, LabOS, Voice Similarity), ~30 chunks. Trust tier 2.
Resume — structured by role, ~12 chunks. Trust tier 1 (7% retrieval boost).
FAQs — curated from visitor telemetry. Trust tier 1. Grows every week.
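A rough sketch of that chunker: split on ## headings, then pack paragraphs up to about 900 characters. The real splitter in build-index may handle edge cases differently.

```ts
// chunker.ts: split markdown on "##" headings, then pack paragraphs into ~900-char chunks.
export function chunkMarkdown(markdown: string, maxChars = 900): string[] {
  const chunks: string[] = [];

  // Split on second-level headings first so sections never bleed together.
  for (const section of markdown.split(/\n(?=## )/)) {
    let current = "";
    for (const paragraph of section.split(/\n{2,}/)) {
      // Start a new chunk when adding this paragraph would cross the limit.
      if (current && current.length + paragraph.length + 2 > maxChars) {
        chunks.push(current.trim());
        current = "";
      }
      current += (current ? "\n\n" : "") + paragraph;
    }
    if (current.trim()) chunks.push(current.trim());
  }
  return chunks;
}
```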

How I ingest new content

Three scripts handle all ingestion. Each writes markdown with YAML frontmatter into content/, then npm run build-index re-embeds everything and writes a fresh vectors.json. Git commit, push, Netlify deploys.

Medium essays — bulk HTML → markdown

Medium lets you export your writing as an HTML archive. A Python script walks all 120 files, pulls the <a class="p-canonical"> anchor for the real article URL, strips boilerplate (auto-summary footers, "8 min read", "Press enter to view image in full size"), and emits one clean markdown file per post.

External URLs — Readability + Turndown

For press mentions, project pages, and third-party articles: I paste URLs into a queue file, run npm run ingest-url, and the script fetches each one, extracts article prose using Mozilla's Readability (the same engine Firefox Reader Mode uses), converts to markdown with Turndown, and writes a dated file with the canonical URL preserved. Processed URLs rotate to a log file so they never re-ingest.
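The core of that script is three library calls; here is a minimal sketch using jsdom, @mozilla/readability, and turndown, with the queue handling and file naming omitted.

```ts
// ingest-url.ts: fetch a page, extract the article body, convert to markdown.
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

export async function ingestUrl(url: string): Promise<{ title: string; markdown: string }> {
  const html = await (await fetch(url)).text();

  // Readability works on a DOM, so parse the HTML with jsdom first.
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  if (!article) throw new Error(`Readability could not extract an article from ${url}`);

  // Turndown converts the extracted HTML fragment to markdown.
  const markdown = new TurndownService().turndown(article.content);
  return { title: article.title || url, markdown };
}
```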

FAQs — authored from telemetry

The FAQ file is the one corpus I write entirely by hand, driven by what visitors have actually asked. The telemetry clustering tool tells me which questions I need to answer; I write the answer in my voice; I rebuild the index. See the next section.

Telemetry and the feedback loop

Every question ever asked is logged to a Netlify Blobs store — question text, the answer, the confidence score, the retrieved chunk ids, latency, timestamp. Weekly, I review them and decide which ones deserve a curated answer. That loop is what makes the bot better over time.

The loop, step by step

[Telemetry feedback loop diagram]
1. Visitor asks any question
2. Bot answers and logs it to Blobs
3. Cluster weekly into a priority ranking
4. Pick the top 3 gap clusters
5. Write an FAQ entry in my voice
6. Rebuild the index and git push
7. Netlify deploys; the new chunk is live
8. The next visitor gets a confident answer

Semantic clustering of questions

Raw telemetry is noisy. Readers ask the same thing in five different ways: "Is Sam available for consulting?", "Should I hire Sam?", "Does Sam do private work?" — these should collapse into one cluster of demand, not three rows to review individually.

A clustering script embeds every logged question using the same bge-m3 model used for the corpus, then greedy-clusters at cosine similarity ≥ 0.75. Each cluster gets ranked by size × (1 − avg_confidence) — so a 4-question cluster at 30% average confidence ranks higher than a one-off at 10%. The tool prints a terminal report with the top gaps and suggested FAQ answers.
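A sketch of the greedy pass and the ranking; the 0.75 threshold and the size × (1 − avg_confidence) formula come straight from the description above, while the record shape and names are assumptions.

```ts
// cluster.ts: greedy clustering of logged questions by embedding similarity.
interface LoggedQuestion {
  text: string;
  confidence: number; // 0 to 1, as recorded in telemetry
  vector: number[];   // bge-m3 embedding of the question
}

interface Cluster {
  representative: LoggedQuestion;
  members: LoggedQuestion[];
  priority: number;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

export function clusterQuestions(questions: LoggedQuestion[], threshold = 0.75): Cluster[] {
  const clusters: Cluster[] = [];

  // Greedy pass: each question joins the first cluster whose representative
  // is similar enough, otherwise it starts a new cluster.
  for (const q of questions) {
    const home = clusters.find((c) => cosine(c.representative.vector, q.vector) >= threshold);
    if (home) home.members.push(q);
    else clusters.push({ representative: q, members: [q], priority: 0 });
  }

  // Rank: size × (1 − average confidence), so big low-confidence gaps float to the top.
  for (const c of clusters) {
    const avg = c.members.reduce((sum, m) => sum + m.confidence, 0) / c.members.length;
    c.priority = c.members.length * (1 - avg);
  }
  return clusters.sort((a, b) => b.priority - a.priority);
}
```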

$ npm run cluster-telemetry

8 clusters from 12 questions  (threshold=0.75)
1 clusters have priority ≥ 1.5 — those are the strongest FAQ candidates

[1] ◆ MED   3 asks · 46% avg · priority 1.62
    → What is Sam's view on AGI?
[2] ◆ MED   2 asks · 49% avg · priority 1.02
    → Where did Sam first work / go to college?
[3] ◆ low   2 asks · 50% avg · priority 0.99
    → Does Sam do private consulting?

Privacy posture

No IP logging. No cookies beyond an anonymous 7-day session id for correlating a single visitor's questions across a conversation. No visitor email capture on the low-confidence fallback — the bot simply tells them the question was logged. The admin endpoint that surfaces telemetry is Bearer-token gated; all bulk-delete operations require explicit ?confirm=yes.

The stack

Eight named components. Every one is either open-source, free-tier, or both. Total operating cost at this traffic: $0/month.

Role · Component · License
Static site + Functions + Blobs · Netlify · Proprietary, free tier
LLM inference (answering) · Groq — Llama 3.3 70B · Llama Community License
LLM inference (ingestion SLM) · Groq — Llama 3.1 8B · Llama Community License
Embeddings · Hugging Face Inference — bge-m3 · Apache 2.0 model
Vector store · public/vectors.json, cosine in JS
Telemetry store · Netlify Blobs · Proprietary, free tier
Runtime framework · Node 20 + TypeScript + native fetch · OSS
Admin auth · Single ADMIN_TOKEN + timingSafeEqual

Tooling I wrote

The ingestion + telemetry tooling is as much of the project as the chat function itself. Everything below runs via npm run <name>:

Script · What it does
build-index · Reads every markdown file under content/, chunks on ## headings with a ~900-char paragraph splitter, embeds via bge-m3, extracts entity tags via Llama 3.1 8B, writes public/vectors.json.
ingest-url · Paste URLs into _urls.txt, fetch each, extract article prose via Mozilla's Readability, convert to markdown, write a dated file with the canonical URL preserved.
cluster-telemetry · Pull questions from Blobs, embed each, greedy-cluster by cosine similarity, rank clusters by priority, print a terminal report with the top FAQ candidates.
pull-telemetry · Stream the full question log to stdout, with optional status/date filters. Feeds into jq or any text tool.
clear-telemetry · Bulk-delete logged questions (all, by status, by date, or by id). Preview-first, confirm-required for bulk operations.

Design tradeoffs I made explicitly

Every one of these had a higher-ceiling alternative I rejected on purpose. Each decision has an escape hatch if the tradeoff ever stops making sense.

No database. JSON file instead.

Rejected: Postgres + pgvector on Neon, Qdrant, Pinecone. Why: at <5,000 chunks the JSON-in-memory approach is faster, simpler, and git-diffable. Every corpus change is a readable PR. Escape hatch: the retrieval abstraction means swapping to pgvector is a three-day refactor I've already scoped.

No RAG framework. ~300 LOC of my own code.

Rejected: LlamaIndex.TS, LangChain.js. Why: my pipeline is fixed. I retrieve, rerank-lite, generate, score. I don't need a framework's flexibility for multi-corpus workflows I'll never have. Debug trace is one file, not seven abstraction layers. Escape hatch: dropping in LlamaIndex later is one day's work — the retrieval interface is clean.

No fallback LLM provider. Groq or the graceful fallback message.

Rejected: Together.ai paid fallback, Ollama local. Why: Groq's free tier handles realistic portfolio-site traffic. When it rate-limits, visitors get the confidence-scored fallback message — better than exposing a half-broken paid second vendor. Escape hatch: adding Together as a fallback is a one-file change.

No auth library for admin. Single env-var token.

Rejected: Auth.js magic-link email flow, Supabase Auth. Why: I'm the only admin. Magic-link via email would add two whole services to save 15 lines of token comparison. The token lives in Netlify's env-var UI, gets passed as a Bearer header or ?token= query param, and timingSafeEqual handles the compare. Escape hatch: Auth.js ports cleanly if I ever have more than one admin.
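For reference, a sketch of that comparison. timingSafeEqual throws on buffers of different lengths, so the sketch checks lengths first; the real handler may differ in detail.

```ts
// admin-auth.ts: constant-time check of the admin token from a Bearer header
// or ?token= query parameter.
import { timingSafeEqual } from "node:crypto";

export function isAuthorized(presented: string | null): boolean {
  const expected = process.env.ADMIN_TOKEN;
  if (!presented || !expected) return false;

  const a = Buffer.from(presented);
  const b = Buffer.from(expected);
  // timingSafeEqual requires equal-length buffers, so reject mismatched lengths up front.
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}
```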

No live web search at query time.

Rejected: Brave / Tavily / Exa search as a tool the bot can call mid-conversation. Why: that abandons the deterministic, vetted-corpus trust model. The bot would answer about topics it's never been vetted on, citing third-party pages that might not be accurate. That's a different product entirely — a retrieval-augmented agent, not a reliable bio bot. If I ever want that capability, it belongs in its own tool, not bolted onto this one.

Limitations and what's next

Everything I haven't done yet, and honest reasons why.

Known limitations

Roadmap

Try the bot

Ask it about my career, my AI product work, my writing, anything from my background. Every question you ask helps me make it better.

Open sambobo.com and try the bot →