Concepts

hev ask has three parts: a heading-chunk index with real anchors, a bounded agentic loop that navigates a knowledge graph, and the knowledge graph itself, a distilled, source-grounded “shadow site” built offline that the loop reads from. This page explains each part and how they fit together.

       BUILD TIME (CLI / Skill)                RUNTIME (edge)
  ╔═════════════════════════════════╗    ╔════════════════════════════════════╗░
  ║  ask kg build                   ║    ║  /api/ask   (prerender: false)     ║░
  ║                                 ║    ║                                    ║░
  ║  glob src/content/docs/**       ║    ║  ┌─── keyword mode · no key ────┐  ║░
  ║   → chunk by heading            ║    ║  │ prefilter chunks + glossary  │  ║░
  ║   → sha256 content hash         ║    ║  └──────────────────────────────┘  ║░
  ║   → Opus 4.8 builds the graph   ║    ║  ┌──── agentic loop · Haiku ────┐  ║░
  ║   → write .hev-ask/kg.json      ║    ║  │ system: kg.context (cached)  │  ║░
  ║                                 ║    ║  │ tool: open_section ≤ 4 times │  ║░
  ╚═════════════════════════════════╝    ║  │ then stream answer, no tools │  ║░
   ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░    ║  │ grounded in /page#anchor     │  ║░
                                         ║  └──────────────────────────────┘  ║░
     .hev-ask/kg.json  (committed)       ╚════════════════════════════════════╝░
           │                              ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
           ▼
     virtual:hev-ask/kg  ──── bundled into /api/ask ────▶

The split that matters: the knowledge graph is built offline with a strong model and committed to git; the index and search loop run on demand at the edge. No durable state lives in the running site.

Chunks and anchors

hev ask does not index pages — it indexes sections. Each document is split on its headings (up to a configurable depth, default ## and ###). Content before the first heading becomes the intro chunk whose URL is the page itself.

Each chunk carries the section’s heading, its cleaned prose, and a URL of the form basePath + slug + #anchor. The anchor is generated with github-slugger — the same slugger Astro’s renderer uses — so the link lands on a heading that actually exists in the rendered HTML.

Both code paths chunk through the same function: the runtime index feeds it getCollection entries, and the offline build feeds it disk-parsed files. One source of truth for slugs means the anchors agree.
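The chunking described above can be sketched as a single function that both code paths could share. This is a minimal sketch, not hev ask's implementation: `Chunk`, `makeSlugger`, and `chunkByHeading` are hypothetical names, and the slugger is a simplified ASCII stand-in for github-slugger (it lowercases, strips punctuation, hyphenates whitespace, and suffixes repeats).

```typescript
// Hypothetical shape; the real hev ask chunk type may differ.
interface Chunk {
  heading: string; // "" for the intro chunk
  text: string;    // prose under the heading
  url: string;     // basePath + slug, plus #anchor for headed sections
}

// Simplified stand-in for github-slugger (ASCII only, assumption for this sketch).
function makeSlugger(): (heading: string) => string {
  const seen = new Map<string, number>();
  return (heading) => {
    const base = heading
      .toLowerCase()
      .trim()
      .replace(/[^a-z0-9\s_-]/g, "")
      .replace(/\s/g, "-");
    const count = seen.get(base) ?? 0;
    seen.set(base, count + 1);
    return count === 0 ? base : `${base}-${count}`;
  };
}

function chunkByHeading(markdown: string, basePath: string, slug: string, maxDepth = 3): Chunk[] {
  const slugify = makeSlugger();
  const pageUrl = basePath + slug;
  const chunks: Chunk[] = [];
  let current: Chunk = { heading: "", text: "", url: pageUrl }; // intro chunk
  let inFence = false;

  for (const line of markdown.split("\n")) {
    // Ignore heading-like lines inside fenced code blocks.
    if (/^\s*(`{3,}|~{3,})/.test(line)) inFence = !inFence;
    const match = inFence ? null : /^(#{1,6})\s+(.+)$/.exec(line);
    if (match === null) {
      current.text += line + "\n";
      continue;
    }
    const depth = match[1].length;
    const heading = match[2].trim();
    // Every heading consumes a slug so duplicate counters stay in step with the page.
    const anchor = slugify(heading);
    if (depth >= 2 && depth <= maxDepth) {
      chunks.push(current);
      current = { heading, text: "", url: `${pageUrl}#${anchor}` };
    } else {
      current.text += line + "\n";
    }
  }
  chunks.push(current);

  // Trim text and drop an empty intro chunk (pages that open with a heading).
  return chunks
    .map((c) => ({ ...c, text: c.text.trim() }))
    .filter((c) => c.heading !== "" || c.text !== "");
}
```

Feeding the same function from both the runtime index and the offline build is what keeps the anchors identical; swapping the stand-in for the real slugger would not change the structure.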

Keyword search and the glossary

When the reader types a single term, the server runs a dependency-free prefilter over the chunks:

1. Expand the query. Each term picks up its glossary aliases (k8s → kubernetes) and the tokens of any matched glossary term. The expansion is additive and capped.
2. Score by token overlap between the expanded query and each chunk, widened by the knowledge graph. When a kg.json with nodes is present, a match also counts against that section’s distilled summary, its terms, and its verbatim facts, so sections the graph treats as central to a term rank above an incidental mention buried in body prose.
3. Cap per document (default 2 chunks per doc) so one long page can’t monopolize the results, then keep the top-scoring pool.
4. Excerpt around the first matched term for the snippet.

This is the instant path. It needs no API key and no embeddings — just the chunk index and the committed graph. With no graph it degrades to plain token overlap over the raw chunk text, so keyword search always works.
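The four steps can be sketched for the graph-less fallback, where scoring is plain token overlap over the raw chunk text. The names (`expandQuery`, `prefilter`, `excerptAround`), the glossary shape, the expansion cap of 8, and the pool size of 10 are assumptions made for illustration; only the per-document default of 2 comes from the text above.

```typescript
// Hypothetical shapes; hev ask's internals may differ.
interface Chunk { doc: string; url: string; text: string }
type Glossary = Record<string, string[]>; // term -> aliases

const tokenize = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Step 1: additive, capped expansion through glossary terms and aliases.
function expandQuery(query: string, glossary: Glossary, cap = 8): string[] {
  const original = new Set(tokenize(query));
  const terms = new Set(original);
  for (const [term, aliases] of Object.entries(glossary)) {
    const names = [term, ...aliases].map((n) => n.toLowerCase());
    if (names.some((n) => original.has(n))) {
      for (const n of names) for (const t of tokenize(n)) terms.add(t);
    }
  }
  return Array.from(terms).slice(0, cap);
}

// Step 4: a window of text around the earliest matched term.
function excerptAround(text: string, terms: string[], radius = 40): string {
  const lower = text.toLowerCase();
  const hits = terms.map((t) => lower.indexOf(t)).filter((i) => i >= 0);
  if (hits.length === 0) return text.slice(0, radius * 2);
  const first = Math.min(...hits);
  return text.slice(Math.max(0, first - radius), first + radius);
}

function prefilter(query: string, chunks: Chunk[], glossary: Glossary, perDoc = 2, pool = 10) {
  const terms = expandQuery(query, glossary);

  // Step 2: score by how many expanded terms appear in the chunk.
  const scored = chunks
    .map((chunk) => {
      const tokens = new Set(tokenize(chunk.text));
      return { chunk, score: terms.filter((t) => tokens.has(t)).length };
    })
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score);

  // Step 3: cap per document, then keep the top pool.
  const used = new Map<string, number>();
  const results: { url: string; score: number; excerpt: string }[] = [];
  for (const { chunk, score } of scored) {
    const n = used.get(chunk.doc) ?? 0;
    if (n >= perDoc) continue;
    used.set(chunk.doc, n + 1);
    results.push({ url: chunk.url, score, excerpt: excerptAround(chunk.text, terms) });
    if (results.length === pool) break;
  }
  return results;
}
```

A graph-aware version would add each section's summary, terms, and facts to the text being scored; the rest of the pipeline would stay the same.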

Asking is the default

The overlay is ask-first. A single word is treated as a keyword lookup and answered instantly from the index. The moment the query grows past one word — the reader types a space — the overlay stops the keyword type-ahead and switches to ask mode: pressing Enter sends the question to the agentic loop. On open it also shows a few suggested questions (baked into the knowledge graph at build time, so they cost nothing at runtime) to make asking the obvious move.

None of this is forced. A reader can flip the overlay to keyword-only (persisted in localStorage); then a space just searches for a phrase and the model is never called. See the overlay reference for the exact interaction.
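The mode switch can be sketched as a small pure function plus a persisted preference. This is an illustration under stated assumptions: `resolveMode`, the `KeyValueStore` interface, and the storage key are hypothetical, and the overlay reference remains the authority on the real interaction.

```typescript
type OverlayMode = "keyword" | "ask";

// Decide what the overlay does with the current input.
function resolveMode(input: string, keywordOnly: boolean): OverlayMode {
  if (keywordOnly) return "keyword"; // reader opted out; the model is never called
  // Any whitespace after the first word (even a trailing space) switches to ask mode.
  return /\s/.test(input.replace(/^\s+/, "")) ? "ask" : "keyword";
}

// Minimal subset of the Web Storage API, so the sketch also runs outside a browser.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const STORAGE_KEY = "hev-ask:keyword-only"; // hypothetical key name

function loadKeywordOnly(store: KeyValueStore): boolean {
  return store.getItem(STORAGE_KEY) === "1";
}

function saveKeywordOnly(store: KeyValueStore, value: boolean): void {
  store.setItem(STORAGE_KEY, value ? "1" : "0");
}
```

In a browser, `window.localStorage` satisfies `KeyValueStore`, so the preference survives reloads.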

The agentic search loop

When the reader asks (by pressing Enter on a multi-word query or clicking a suggested question) and an API key is present, the query goes to a bounded tool-use loop that ends by streaming a grounded answer. The loop runs in two phases.

Phase 1 — gather. The model is given a map of every section (id + summary) up front, plus one tool:

open_section({ id }) — opens a section to read its distilled summary, its verbatim facts (flags, code, identifiers), and — for reference sections — its source text. The model opens the sections it needs; each open is streamed to the overlay as a faint activity line. It may only cite sections it opened.

The model decides when it has enough context. It can open sections for up to maxIterations rounds (default 4); once it stops opening sections, or the limit is reached, the loop moves on to the answer.

Phase 2 — answer. The accumulated sources are sent to the overlay (so it can validate links), then the model is called once more with no tools — so it can only write prose — and its answer is streamed token-by-token. The system prompt instructs it to ground every claim in the retrieved sections, link to them inline using their exact url, and say plainly when the docs don’t cover the question. Dropping the tools on the final turn is what guarantees the model answers instead of searching again.
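The control flow of the two phases can be sketched with the model abstracted behind an interface. This is a simplified, synchronous illustration: `Section`, `Model`, `gather`, `answer`, and `ask` are hypothetical names, each round opens at most one section, and the real loop calls a model API and streams its output.

```typescript
interface Section { id: string; url: string; summary: string; facts: string[] }

// A gather step either asks to open a section or signals readiness to answer.
type GatherStep = { open: string } | { done: true };

interface Model {
  gather(question: string, opened: Section[]): GatherStep; // open_section available
  answer(question: string, sources: Section[]): string;    // no tools: prose only
}

function ask(question: string, sections: Section[], model: Model, maxIterations = 4) {
  const byId = new Map(sections.map((s) => [s.id, s] as const));
  const opened: Section[] = [];

  // Phase 1: gather. The loop is bounded even if the model never stops opening.
  for (let round = 0; round < maxIterations; round++) {
    const step = model.gather(question, opened);
    if ("done" in step) break;
    const section = byId.get(step.open);
    if (section !== undefined && opened.indexOf(section) === -1) opened.push(section);
  }

  // Phase 2: answer. Only opened sections are passed on, so only they can be cited.
  const sources = opened.map((s) => s.url);
  const text = model.answer(question, opened);
  return { sources, text };
}
```

Because the answer call receives no tool, the model in this sketch has no way to extend phase 1, which mirrors why the final turn always produces prose.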

The system prompt is cached

The knowledge graph’s map and section summaries are injected into the system prompt with a cache_control marker, so after the first round each later round reads that prefix from the prompt cache instead of processing it again. The answer turn changes the tool set (it has none), so it can’t reuse the search rounds’ cache, but it’s the last call anyway. The loop model defaults to Claude Haiku 4.5 and is configurable.
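As an illustration, the two kinds of request can be built as plain objects in the shape of the Anthropic Messages API. The helper names, the tool description, and the max_tokens value are assumptions for this sketch; the model id is passed in rather than hard-coded.

```typescript
interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}
interface Message { role: "user" | "assistant"; content: string }

const OPEN_SECTION = {
  name: "open_section",
  description: "Open a section by id to read its summary, facts, and source.", // hypothetical wording
  input_schema: {
    type: "object",
    properties: { id: { type: "string" } },
    required: ["id"],
  },
};

function systemBlocks(instructions: string, kgMap: string): TextBlock[] {
  return [
    { type: "text", text: instructions },
    // Cache breakpoint: the prefix up to and including this block can be cached.
    { type: "text", text: kgMap, cache_control: { type: "ephemeral" } },
  ];
}

// Gather rounds: same tools and system prompt every round, so the prefix can hit the cache.
function gatherRequest(model: string, system: TextBlock[], messages: Message[]) {
  return { model, max_tokens: 1024, system, tools: [OPEN_SECTION], messages };
}

// Answer turn: no tools, streamed. The changed tool set means a different cache prefix.
function answerRequest(model: string, system: TextBlock[], messages: Message[]) {
  return { model, max_tokens: 1024, stream: true, system, messages };
}
```

Placing the marker on the last system block is a choice made for this sketch; it keeps the short instructions and the large graph map in one cached prefix.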

The knowledge graph

The knowledge graph (kg.json) is built offline and committed to your repo as a reviewable artifact — the agent’s view of your docs. It holds:

- nodes: one distilled section per heading, with a summary the agent reasons from, the verbatim facts (flags, code, identifiers) it quotes exactly, and a source link back to the real page. Reference sections are marked source-primary so the agent reads their source text rather than a paraphrase.
- overview: a deterministic map of every section, injected up front.
- context and glossary: a compact orientation and the terms (with aliases) that widen keyword search.
- suggestions: a handful of natural questions the docs can answer, shown in the overlay on open.

Only the summary, glossary, context, and suggestions are model-authored; the node structure, facts, overview, and content hash are derived deterministically in code. So the model only ever supplies the distillation — which is exactly the part worth doing in a Claude Code skill.
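The split between model-authored and code-derived parts can be sketched as a deterministic assembler. The field names below are hypothetical (the knowledge graph reference defines the actual file format), and fact extraction is reduced to inline code spans, which is a simplification of "flags, code, identifiers".

```typescript
// Hypothetical field names; see the knowledge graph reference for the real format.
interface KgNode {
  id: string;
  summary: string;        // model-authored
  facts: string[];        // verbatim, extracted in code
  source: string;         // URL of the real section
  sourcePrimary: boolean; // reference sections: read the source text, not the paraphrase
}

// Everything the model supplies, and nothing else.
interface Distillation {
  summaries: Record<string, string>; // section id -> summary
  context: string;
  glossary: Record<string, string[]>;
  suggestions: string[];
}

interface KnowledgeGraph {
  hash: string;
  nodes: KgNode[];
  overview: string;
  context: string;
  glossary: Record<string, string[]>;
  suggestions: string[];
}

interface SectionInput { id: string; url: string; text: string; reference: boolean }

// Simplification: treat inline code spans as the verbatim facts.
function extractFacts(text: string): string[] {
  return (text.match(/`[^`]+`/g) ?? []).map((m) => m.slice(1, -1));
}

function assemble(sections: SectionInput[], d: Distillation, hash: string): KnowledgeGraph {
  const nodes: KgNode[] = sections.map((s) => ({
    id: s.id,
    summary: d.summaries[s.id] ?? "",
    facts: extractFacts(s.text),
    source: s.url,
    sourcePrimary: s.reference,
  }));
  // The overview is derived from the nodes, so it is deterministic given the summaries.
  const overview = nodes.map((n) => `${n.id}: ${n.summary}`).join("\n");
  return { hash, nodes, overview, context: d.context, glossary: d.glossary, suggestions: d.suggestions };
}
```

Keeping the assembler free of model calls means the same distillation always produces the same kg.json, which is what makes the file reviewable in a diff.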

Two ways to build it

The model step can run two ways, and both feed the same deterministic assembler:

- Claude Code skill (recommended). A skill walks Claude through reading the corpus and writing the distillation, then assembles kg.json locally. It runs inside your existing Claude Code subscription: no ANTHROPIC_API_KEY, no per-build token spend on your own key, and it fits the editor workflow most authors are already in.
- ask kg build (fallback). A single call to Claude Opus 4.8 (the default model), authenticated with ANTHROPIC_API_KEY, does the same distillation unattended. This is the right choice for CI or for anyone not using Claude Code.

Either way the build is hash-gated: it hashes the concatenated chunk text and, if a committed kg.json with nodes already matches that hash, it does no model work at all. The JSON is reviewed in pull requests like any other change. See the knowledge graph reference and the CLI reference for the file format and the build commands.
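The hash gate can be sketched with Node's built-in crypto module. The function names, the `ExistingKg` shape, and the newline separator used when concatenating chunk text are assumptions; the source only states that the build hashes the concatenated chunk text with SHA-256 and skips model work on a match.

```typescript
import { createHash } from "node:crypto";

// Hypothetical minimal view of a committed kg.json.
interface ExistingKg { hash: string; nodes: unknown[] }

// SHA-256 over the concatenated chunk text (separator is an assumption).
function contentHash(chunkTexts: string[]): string {
  return createHash("sha256").update(chunkTexts.join("\n")).digest("hex");
}

// True when the model step has to run; false when the committed graph is current.
function needsRebuild(chunkTexts: string[], existing: ExistingKg | null): boolean {
  if (existing === null || existing.nodes.length === 0) return true; // missing or node-less graph
  return existing.hash !== contentHash(chunkTexts);
}
```

Because the gate depends only on chunk text, edits that do not change any section's text leave the hash unchanged and trigger no model work.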

Degradation, by design

hev ask is built to keep working as pieces drop away:

- No key at runtime → keyword mode only. The overlay still searches.
- No key at build → the committed kg.json is kept; the build warns but never fails for lack of a key.
- No kg.json, or an older node-less one → the agentic loop falls back to keyword-style retrieval; keyword ranking falls back to raw token overlap; and the overlay simply shows no suggested questions. Everything still works.
- Stale kg.json → the runtime logs a one-line warning when the live index hash differs from the graph’s hash, but still serves.
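The runtime cases above can be summarized as one function that maps what is available to what the site offers. This is a sketch with hypothetical names (`Kg`, `Capabilities`, `resolveCapabilities`) and a made-up warning string; it models only the runtime rows, not the build-time one.

```typescript
// Hypothetical minimal view of the bundled graph.
interface Kg { hash: string; nodes: unknown[]; suggestions: string[] }

interface Capabilities {
  ask: boolean;                               // agentic loop available
  retrieval: "graph" | "keyword";             // what the loop reads from
  ranking: "graph-widened" | "token-overlap"; // how keyword search scores
  suggestions: string[];
  warnings: string[];
}

function resolveCapabilities(hasApiKey: boolean, kg: Kg | null, liveHash: string): Capabilities {
  // A missing graph and an older node-less graph degrade the same way.
  const graph = kg !== null && kg.nodes.length > 0 ? kg : null;
  const warnings: string[] = [];
  if (graph !== null && graph.hash !== liveHash) {
    warnings.push("kg.json is stale: live index hash differs from the graph hash");
  }
  return {
    ask: hasApiKey,
    retrieval: graph !== null ? "graph" : "keyword",
    ranking: graph !== null ? "graph-widened" : "token-overlap",
    suggestions: graph !== null ? graph.suggestions : [],
    warnings, // a stale graph warns but is still served
  };
}
```

Every branch returns a usable configuration, which is the point of the design: a missing piece narrows what the overlay offers without turning into an error.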

For the boundaries of what it can do, read Limits; for what you’re choosing by adopting it, read Tradeoffs.