A technical build log for the chatbot on this site — how it works, why I built it, what I learned, and the feedback loop that makes it better over time.
I lead Business and Industry Copilots at Microsoft. Shipping AI products is what I do. Shipping one on my own site — grounded in my own writing, with a feedback loop I control — felt like the most honest portfolio piece I could produce. Four goals drove the build.
Visitors ask natural questions about my background. A resume can't do that. LinkedIn can't either. A grounded chatbot can — if it's retrieving from material I actually wrote.
I've published 120+ essays on Medium about AI product craft. Most visitors never click through. A bot that cites my posts in-conversation, with a link, surfaces the work where readers already are.
I sell AI products to enterprises. Shipping one myself — end-to-end, on a $0/month budget — forces every design decision to be defensible and gives me ground truth I can speak from.
Every visitor question is logged. I review them weekly, answer the gaps in a curated FAQ, and redeploy. The bot gets measurably better at the questions it gets asked. That loop is the product.
Two runtime paths and one publishing path. The runtime answers questions; the publishing path is how I add new knowledge to the corpus. Keeping them separate lets each stay simple.
On the runtime path, a visitor's question hits /api/chat, a Netlify Function embeds
it via Hugging Face, finds the most relevant chunks in vectors.json, passes them to
Groq's hosted Llama 3.3 70B, and streams the answer back as Server-Sent Events. The publishing path is
deliberately disconnected from runtime — I edit markdown locally, run the indexer, and push. Netlify redeploys
and the new knowledge is live. No database. No admin CMS. Just files, git, and two Netlify Functions.
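In code, the handler is roughly this shape (a sketch; the helper names are illustrative stand-ins for the real modules, each described in the sections below):

```ts
// Shape of the /api/chat function, simplified. Helpers are hypothetical
// stand-ins for the modules covered in the rest of this post.
declare function embed(text: string): Promise<number[]>;                // bge-m3 via Hugging Face
declare function retrieve(v: number[], topK: number): unknown[];        // cosine over vectors.json
declare function generate(q: string, sources: unknown[]): Promise<ReadableStream>; // Groq, streaming

export default async (req: Request): Promise<Response> => {
  const { question } = await req.json();
  const queryVector = await embed(question);        // 1,024-dim vector
  const sources = retrieve(queryVector, 5);         // trust-tier-boosted top 5
  const stream = await generate(question, sources); // Llama 3.3 70B, token by token
  return new Response(stream, {
    headers: { "Content-Type": "text/event-stream" }, // Server-Sent Events
  });
};
```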
Retrieval-Augmented Generation is a simple idea with a lot of implementation surface. Here's what actually runs every time someone asks the bot a question, step by step, with the tradeoffs I made at each layer.
When a visitor asks "What is Sam's view on AGI?", the first step is turning that question into a vector
— a list of 1,024 floating-point numbers that represents the question's meaning in high-dimensional space. I use
bge-m3, an Apache-2.0 embedding model hosted on Hugging Face Inference (free tier).
It's state-of-the-art on retrieval benchmarks and multilingual by default. I picked it for license
clarity first and benchmark strength second.
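Calling it is a single POST. A minimal sketch against the classic Inference API shape; the exact endpoint and response format can vary by API version, so treat this as the pattern rather than the production code:

```ts
// Embed a question with bge-m3 on Hugging Face Inference (free tier).
// Endpoint and response shape follow the classic feature-extraction API
// and may differ from what actually runs on the site.
const HF_URL = "https://api-inference.huggingface.co/models/BAAI/bge-m3";

async function embedQuestion(question: string): Promise<number[]> {
  const res = await fetch(HF_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.HF_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ inputs: question }),
  });
  if (!res.ok) throw new Error(`HF embedding failed: ${res.status}`);
  // One 1,024-dimensional vector per input.
  const [vector] = (await res.json()) as number[][];
  return vector;
}
```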
Every piece of content I've ingested — resume, FAQs, Medium essays, press features, project pages — was embedded
the same way during a one-time offline indexing pass. The result is a single 36 MB file,
vectors.json, containing 1,130 chunks with their vectors and metadata. The chatbot
function reads this file once per cold start and keeps it in memory.
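Each entry looks roughly like this (a simplified sketch of the real shape; field names are illustrative):

```ts
// Assumed shape of one vectors.json entry.
interface Chunk {
  id: string;
  text: string;       // ~900-character chunk of the source markdown
  vector: number[];   // 1,024 floats from bge-m3
  source_url: string; // canonical link for citations
  trust_tier: 1 | 2;  // 1 = resume/FAQ, 2 = essays/press/projects
  entities: string[]; // people, companies, roles, years, topics
}

// Loaded once per cold start; module scope keeps it cached across
// warm invocations of the same function instance.
import { readFileSync } from "node:fs";
const chunks: Chunk[] = JSON.parse(
  readFileSync("public/vectors.json", "utf8"),
);
```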
pgvector on Postgres, Qdrant, Weaviate, Pinecone — any of these would work. I picked the simplest viable
option. At ~1,100 chunks, loading everything into memory and running cosine similarity in JavaScript takes
under 50 ms per query. There's no database to provision, no schema migration, no latency from a separate
service. Git diffs on vectors.json make every corpus change reviewable. The
ceiling is around 5,000 chunks, after which I'd switch to Neon + pgvector — a three-day migration
with the code abstraction I already built. Until then, a JSON file is the right answer.
The function computes cosine similarity between the question's vector and every chunk's vector, then takes the
top 5. Chunks tagged trust_tier: 1 — my resume and curated FAQs — get a 7% score
boost over trust_tier: 2 chunks (Medium posts, press articles, project pages).
That way, direct factual questions like "where does Sam work?" surface the resume first, while
philosophical questions like "what does Sam think about AGI?" surface the right Medium essay because
nothing in the resume covers it.
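The scoring fits in a few lines. A simplified sketch, reusing the Chunk shape from the sketch above:

```ts
// Exhaustive scoring over all ~1,100 chunks: cosine similarity times a
// 1.07 multiplier for trust_tier 1. The real code may differ in detail.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topChunks(query: number[], chunks: Chunk[], k = 5): Chunk[] {
  return chunks
    .map((chunk) => ({
      chunk,
      score: cosine(query, chunk.vector) * (chunk.trust_tier === 1 ? 1.07 : 1.0),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((s) => s.chunk);
}
```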
The top 5 retrieved chunks become the SOURCES block inside a carefully tuned system prompt. The prompt has three parts: (1) voice exemplars — three paragraphs from my own Medium posts that show rhythm and vocabulary, (2) strict grounding rules ("answer ONLY from the sources; if they don't cover it, say so"), and (3) a third-person-only directive, since visitors are reading about me and the bot shouldn't speak as me.
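Assembled, it looks roughly like this (the exemplar text and exact wording here are placeholders, not the production prompt):

```ts
// Illustrative assembly of the three-part system prompt.
function buildSystemPrompt(sources: Chunk[]): string {
  const sourceBlock = sources
    .map((c, i) => `[${i + 1}] (${c.source_url})\n${c.text}`)
    .join("\n\n");

  return [
    "VOICE EXEMPLARS:\n<three paragraphs from Sam's Medium posts>",
    "RULES: Answer ONLY from the SOURCES below. If they don't cover the question, say so.",
    "Always refer to Sam in the third person.",
    `SOURCES:\n${sourceBlock}`,
  ].join("\n\n");
}
```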
Llama 3.3 70B runs on Groq's free tier at ~300 tokens per second. Answers stream back to the browser token-by-token via Server-Sent Events so readers see the reply arrive live rather than waiting for the full response.
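Groq's endpoint is OpenAI-compatible, so the call is a plain fetch whose streaming body gets forwarded to the browser. A sketch; the model id and error handling are simplified:

```ts
// Streaming chat completion from Groq. The response body is already an
// SSE stream of "data:" lines, one token delta at a time.
async function streamAnswer(system: string, question: string) {
  const res = await fetch("https://api.groq.com/openai/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GROQ_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama-3.3-70b-versatile", // Groq's Llama 3.3 70B model id
      stream: true,
      messages: [
        { role: "system", content: system },
        { role: "user", content: question },
      ],
    }),
  });
  return res.body; // ReadableStream, forwarded to the visitor's browser
}
```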
Every reply gets a confidence score between 0 and 100%. It's a composite of two signals: retrieval similarity (how close was the best chunk?) and model self-report (the model emits a hidden 1–5 confidence tag that the server filters out of the visible stream). Thresholds classify the answer as High, Medium, or Low confidence — the UI shows a color-coded dot plus the percentage.
> **High:** "Sam serves as a Senior Product Manager at Microsoft, focused on Business and Industry Copilots…"
>
> **Medium:** "Sam is skeptical of AGI as a goal. Instead, he proposes a Leader-Agent Model: a leader agent querying a board of subject-matter expert agents for counsel…"
>
> **Low:** "That's a good question, and honestly I don't have a confident answer from what Sam has written on this so far. I've logged the question and Sam will see it in his review queue…"
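One plausible way to blend the two signals (the weights and thresholds here are illustrative, not the tuned values):

```ts
// Composite confidence: retrieval similarity plus the model's hidden
// 1–5 self-report, mapped to a 0–100% score and a display tier.
function confidence(bestSimilarity: number, selfReport: number): number {
  const retrieval = bestSimilarity;   // cosine score, roughly 0..1
  const model = (selfReport - 1) / 4; // 1–5 tag mapped to 0..1
  return Math.round((0.6 * retrieval + 0.4 * model) * 100);
}

const tier = (s: number) => (s >= 70 ? "High" : s >= 45 ? "Medium" : "Low");
```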
Beyond the vector match, each chunk carries metadata tags extracted during indexing: people, companies, roles, years, topics. When a question mentions a specific entity — say, "Nuance" or "Watson" — the retrieval can pre-filter to chunks tagged with that entity before scoring by vector similarity. This is a simplified version of Microsoft Research's GraphRAG approach: the benefit of entity-aware filtering without the cost of building a full knowledge graph.
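The pre-filter is a few lines over the same Chunk shape. A sketch with deliberately naive matching:

```ts
// Entity-aware pre-filter: if the question names a known entity, score
// only the chunks tagged with it; otherwise fall back to the full corpus.
function candidateChunks(question: string, chunks: Chunk[]): Chunk[] {
  const q = question.toLowerCase();
  const filtered = chunks.filter((c) =>
    c.entities.some((e) => q.includes(e.toLowerCase())),
  );
  return filtered.length > 0 ? filtered : chunks;
}
```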
Microsoft's GraphRAG builds a knowledge graph using LLM-generated entity triples and community summaries. It's genuinely better for multi-hop reasoning queries. It also costs $20–80 per reindex for a corpus this size — an LLM call per chunk, twice. I picked the 90%-of-the-value / 5%-of-the-cost version: entity tags per chunk, no graph database, no community summaries. If the bot ever gets confused on multi-hop questions (which hasn't happened yet), the clean upgrade path is already scoped.
Everything the bot knows comes from one folder of markdown files: 1,130 chunks across five source types, split by a hand-written chunker at ~900-character boundaries.
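The chunker's core logic, simplified: split on ## headings, then pack paragraphs into chunks of roughly 900 characters:

```ts
// Heading-aware chunking pass, simplified from the real chunker.
function chunkMarkdown(md: string, maxLen = 900): string[] {
  const chunks: string[] = [];
  for (const section of md.split(/\n(?=## )/)) {
    let current = "";
    for (const para of section.split(/\n\n+/)) {
      if (current && current.length + para.length + 2 > maxLen) {
        chunks.push(current); // chunk is full; start a new one
        current = para;
      } else {
        current = current ? `${current}\n\n${para}` : para;
      }
    }
    if (current) chunks.push(current);
  }
  return chunks;
}
```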
Three scripts handle all ingestion. Each writes markdown with YAML frontmatter into content/,
then npm run build-index re-embeds everything and writes a fresh
vectors.json. Git commit, push, Netlify deploys.
Medium lets you export your writing as an HTML archive. A Python script walks all 120 files, pulls the
`<a class="p-canonical">` anchor for the real article URL, strips boilerplate
(auto-summary footers, "8 min read", "Press enter to view image in full size"), and emits one clean markdown
file per post.
For press mentions, project pages, and third-party articles: I paste URLs into a queue file, run
npm run ingest-url, and the script fetches each one, extracts article prose using
Mozilla's Readability (the same engine Firefox Reader Mode uses), converts to markdown with Turndown, and
writes a dated file with the canonical URL preserved. Processed URLs rotate to a log file so they're never
re-ingested.
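The core of that script, assuming the jsdom, Readability, and Turndown stack named above (the frontmatter fields are simplified):

```ts
// Fetch a URL, extract article prose with Readability, convert to
// markdown with Turndown, and emit a file body with YAML frontmatter.
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import TurndownService from "turndown";

async function ingestUrl(url: string): Promise<string> {
  const html = await (await fetch(url)).text();
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  if (!article) throw new Error(`Readability found no article at ${url}`);

  const markdown = new TurndownService().turndown(article.content);
  return `---\ntitle: "${article.title}"\ncanonical_url: ${url}\n---\n\n${markdown}`;
}
```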
The FAQ file is the one source I write entirely by hand, driven by what visitors have actually asked. The telemetry clustering tool tells me which questions I need to answer; I write the answer in my voice; I rebuild the index. See the next section.
Every question ever asked is logged to a Netlify Blobs store — question text, the answer, the confidence score, the retrieved chunk ids, latency, timestamp. Weekly, I review them and decide which ones deserve a curated answer. That loop is what makes the bot better over time.
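Logging a turn is one setJSON call against a named Blobs store (the store and field names here are illustrative):

```ts
// Write one Q&A turn to Netlify Blobs. getStore/setJSON are the real
// @netlify/blobs API; the schema is a sketch of what gets logged.
import { getStore } from "@netlify/blobs";

async function logQuestion(entry: {
  question: string;
  answer: string;
  confidence: number;
  chunkIds: string[];
  latencyMs: number;
}) {
  const store = getStore("telemetry");
  const key = `${Date.now()}-${crypto.randomUUID()}`;
  await store.setJSON(key, { ...entry, ts: new Date().toISOString() });
}
```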
Raw telemetry is noisy. Readers ask the same thing in five different ways: "Is Sam available for consulting?", "Should I hire Sam?", "Does Sam do private work?" — these should collapse into one cluster of demand, not three rows to review individually.
A clustering script embeds every logged question using the same bge-m3 model used for the corpus, then
greedy-clusters at cosine similarity ≥ 0.75. Each cluster gets ranked by size × (1 −
avg_confidence) — so a 4-question cluster at 30% average confidence ranks higher than a one-off at
10%. The tool prints a terminal report with the top gaps and suggested FAQ answers.
```
$ npm run cluster-telemetry
8 clusters from 12 questions (threshold=0.75)
1 clusters have priority ≥ 1.5 — those are the strongest FAQ candidates
[1] ◆ MED 3 asks · 46% avg · priority 1.62
    → What is Sam's view on AGI?
[2] ◆ MED 2 asks · 49% avg · priority 1.02
    → Where did Sam first work / go to college?
[3] ◆ low 2 asks · 50% avg · priority 0.99
    → Does Sam do private consulting?
```
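The clustering pass itself is short. A sketch that reuses the cosine helper from the retrieval section; the priority math matches the report above (3 asks at 46% average confidence gives 3 × 0.54 = 1.62):

```ts
// Greedy clustering: each question joins the first existing cluster whose
// seed question is within the similarity threshold, else starts its own.
// Simplified from the real tool.
interface LoggedQuestion { text: string; confidence: number; vector: number[] }

function clusterQuestions(questions: LoggedQuestion[], threshold = 0.75) {
  const clusters: LoggedQuestion[][] = [];
  for (const q of questions) {
    const home = clusters.find((c) => cosine(c[0].vector, q.vector) >= threshold);
    if (home) home.push(q);
    else clusters.push([q]);
  }
  return clusters
    .map((members) => {
      const avg = members.reduce((s, q) => s + q.confidence, 0) / members.length; // 0–100
      return { members, priority: members.length * (1 - avg / 100) };
    })
    .sort((a, b) => b.priority - a.priority);
}
```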
No IP logging. No cookies beyond an anonymous 7-day session id for correlating a single visitor's questions
across a conversation. No visitor email capture on the low-confidence fallback — the bot simply tells them
the question was logged. The admin endpoint that surfaces telemetry is Bearer-token gated; all bulk-delete
operations require explicit ?confirm=yes.
Eight named components. Every one is either open-source, free-tier, or both. Total operating cost at this traffic: $0/month.
| Role | Component | License |
|---|---|---|
| Static site + Functions + Blobs | Netlify | Proprietary, free tier |
| LLM inference (answering) | Groq — Llama 3.3 70B | Llama Community License |
| LLM inference (ingestion SLM) | Groq — Llama 3.1 8B | Llama Community License |
| Embeddings | Hugging Face Inference — bge-m3 | Apache 2.0 model |
| Vector store | public/vectors.json, cosine in JS | — |
| Telemetry store | Netlify Blobs | Proprietary, free tier |
| Runtime framework | Node 20 + TypeScript + native fetch | OSS |
| Admin auth | Single ADMIN_TOKEN + timingSafeEqual | — |
The ingestion + telemetry tooling is as much of the project as the chat function itself. Everything below
runs via `npm run <name>`:
| Script | What it does |
|---|---|
build-index | Reads every markdown file under content/, chunks on ## headings with a ~900-char paragraph splitter, embeds via bge-m3, extracts entity tags via Llama 3.1 8B, writes public/vectors.json. |
ingest-url | Paste URLs into _urls.txt, fetch each, extract article prose via Mozilla's Readability, convert to markdown, write a dated file with canonical URL preserved. |
cluster-telemetry | Pull questions from Blobs, embed each, greedy-cluster by cosine similarity, rank clusters by priority, print a terminal report with the top FAQ candidates. |
pull-telemetry | Stream the full question log to stdout, with optional status/date filters. Feeds into jq or any text tool. |
clear-telemetry | Bulk-delete logged questions (all, by status, by date, or by id). Preview-first, confirm-required for bulk operations. |
Every major component had a higher-ceiling alternative I rejected on purpose. Each decision has an escape hatch if the tradeoff ever stops making sense.
Rejected: Postgres + pgvector on Neon, Qdrant, Pinecone. Why: at <5,000 chunks the JSON-in-memory approach is faster, simpler, and git-diffable. Every corpus change is a readable PR. Escape hatch: the retrieval abstraction means swapping to pgvector is a three-day refactor I've already scoped.
Rejected: LlamaIndex.TS, LangChain.js. Why: my pipeline is fixed. I retrieve, rerank-lite, generate, score. I don't need a framework's flexibility for multi-corpus workflows I'll never have. Debug trace is one file, not seven abstraction layers. Escape hatch: dropping in LlamaIndex later is one day's work — the retrieval interface is clean.
Rejected: Together.ai paid fallback, Ollama local. Why: Groq's free tier handles realistic portfolio-site traffic. When it rate-limits, visitors get the confidence-scored fallback message — better than exposing a half-broken paid second vendor. Escape hatch: adding Together as a fallback is a one-file change.
Rejected: Auth.js magic-link email flow, Supabase Auth. Why: I'm the only admin. Magic-link
via email would add two whole services to save 15 lines of token comparison. The token lives in Netlify's
env-var UI, gets passed as a Bearer header or ?token= query param, and
timingSafeEqual handles the compare. Escape hatch: Auth.js ports cleanly
if I ever have more than one admin.
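The whole auth layer, roughly: hash both tokens to a fixed length so timingSafeEqual never throws on a length mismatch. A sketch of the pattern, not the verbatim source:

```ts
// Timing-safe admin check against the ADMIN_TOKEN env var. sha256 gives
// both buffers the same length, which timingSafeEqual requires.
import { createHash, timingSafeEqual } from "node:crypto";

function isAdmin(presented: string | null): boolean {
  const expected = process.env.ADMIN_TOKEN;
  if (!presented || !expected) return false;
  const h = (s: string) => createHash("sha256").update(s).digest();
  return timingSafeEqual(h(presented), h(expected));
}
```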
Rejected: Brave / Tavily / Exa search as a tool the bot can call mid-conversation. Why: that abandons the deterministic, vetted-corpus trust model. The bot would answer about topics it's never been vetted on, citing third-party pages that might not be accurate. That's a different product entirely — a retrieval-augmented agent, not a reliable bio bot. If I ever want that capability, it lives in its own tool, not polluted into this one.
Everything I haven't done yet, and honest reasons why.
Ask it about my career, my AI product work, my writing, anything from my background. Every question you ask helps me make it better.
Open sambobo.com and try the bot →