Why RAG is the difference between a homework helper and a homework hazard
Ask ChatGPT to help a Year 7 student with enzymes in the digestive system and you'll probably get a decent answer. But look carefully and you'll notice it's using terms like "alimentary canal" when the KS3 curriculum says "digestive tract." It might introduce pepsinogen, which is accurate but not on the syllabus until A-level. It'll describe the pH of the stomach as "around 2" when the revision guide your child actually uses says "between 1 and 2." None of these are wrong, exactly. They're just wrong enough to be confusing for a 12-year-old who's trying to match what the AI says to what their teacher said in class.
This is the fundamental problem with using a general-purpose AI model for education. The model was trained on the entire internet. It doesn't know which curriculum your child follows, which year group they're in, or which terminology their teacher prefers. It answers from the average of everything it's ever seen, and that average is heavily weighted towards American English, university-level content, and whatever happened to be most common on the web.
Athena fixes this with a technique called RAG (Retrieval-Augmented Generation). Before answering any question, the AI searches a library of hand-authored curriculum content that I've written specifically for UK KS3 students. It uses those documents as its primary source of truth. If the curriculum content says "cell membrane," Athena says "cell membrane." Not "plasma membrane," not "phospholipid bilayer," not whatever Wikipedia's first paragraph happens to use.
This post walks through how I built that system: the three-layer knowledge architecture, the content pipeline, the authoring standards, and why I chose a managed RAG service over running my own vector database.
The Three-Layer Knowledge System
Athena doesn't just have RAG switched on or off. It uses a three-layer priority system, and every response is tagged with which layer provided the answer. This is encoded directly in the system prompt that the AI receives before every conversation:
Layer 1: Curriculum Content (highest priority). When Athena has curriculum content available for a topic, it must use the definitions, formulae, and facts from that content exactly as written. No paraphrasing in a way that changes the meaning. No improvising. If my content file says the mitochondria are "where energy is released from food through aerobic respiration," that's what Athena says. Not "the powerhouse of the cell," which is a meme, not a KS3 definition.
Layer 2: Web Search (trusted domains only). When the curriculum library doesn't cover a topic and web search is enabled, Gemini can query Google, but only results from a whitelist of trusted educational domains get used: BBC Bitesize, Khan Academy, Oak National Academy, Seneca Learning, Britannica, and government education sites. Results from Reddit, Quora, blogs, and social media are ignored. When Athena uses web-sourced information, it says something natural like "I looked this up to make sure I got it right for you" rather than dumping raw URLs on a 12-year-old.
Layer 3: General Knowledge (last resort). When neither the curriculum library nor web search contributed to the answer, Athena flags it honestly: "I don't have my notes on this to hand. Double-check with your teacher." It will never invent specific numbers, dates, or formulae when unsure. The instruction is explicit: it is far better to say you're not certain than to state something incorrect.
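The Layer 2 domain check is simple enough to sketch in Python. The helper names and the exact domain strings below are illustrative (Athena's real whitelist lives in its own configuration), but the logic — accept a result only if its host is a trusted domain or a subdomain of one — is the same:

```python
from urllib.parse import urlparse

# Illustrative whitelist -- the real list lives in Athena's config.
TRUSTED_DOMAINS = {
    "bbc.co.uk",            # BBC Bitesize
    "khanacademy.org",      # Khan Academy
    "thenational.academy",  # Oak National Academy
    "senecalearning.com",   # Seneca Learning
    "britannica.com",       # Britannica
    "gov.uk",               # government education sites
}

def is_trusted(url: str) -> bool:
    """True if the URL's host is a trusted domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)

def filter_results(urls: list[str]) -> list[str]:
    """Keep only whitelisted search results; Reddit, Quora, blogs fall away."""
    return [u for u in urls if is_trusted(u)]
```

The subdomain check matters: www.bbc.co.uk should pass, but a lookalike like notbbc.co.uk should not, which is why the comparison is against "." + domain rather than a bare substring match.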
After each response, the server analyses the grounding metadata from the Gemini API to determine which layer actually sourced the answer. This analysis distinguishes between RAG chunks (Layer 1), web search results (Layer 2), and ungrounded model knowledge (Layer 3) by inspecting the response's GroundingMetadata, checking for retrieval chunks, web URIs, and confidence scores. The result gets sent back to the frontend alongside the response, so I could theoretically display a confidence indicator to the student (I haven't yet, but the data is there).
The key design decision here is that the layers are a priority order, not a blend. If curriculum content contributed to any part of the answer, the entire response is tagged as RAG-grounded. Layer 3 is only used when both retrieval sources came back empty. This prevents the model from quietly mixing grounded facts with hallucinated details in the same response.
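The priority rule itself fits in a few lines. This is a minimal Python illustration of the layer-tagging logic described above, not Athena's actual server code (which inspects Gemini's GroundingMetadata; the function name here is hypothetical):

```python
def classify_layer(rag_chunks: int, web_uris: int) -> int:
    """Tag a response with its source layer.

    A priority order, not a blend: any curriculum chunk makes the whole
    response Layer 1; trusted web sources alone make it Layer 2; Layer 3
    only when both retrieval sources came back empty.
    """
    if rag_chunks > 0:
        return 1  # curriculum content contributed
    if web_uris > 0:
        return 2  # trusted web search contributed
    return 3      # ungrounded general knowledge
```

Note that a response with three curriculum chunks and two web URIs is still Layer 1: the whole response takes the tag of the highest-priority source that contributed anything.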
The Content Pipeline
The RAG system lives in a separate directory called athena-rag/, deliberately decoupled from the main Next.js application. Content authoring and AI serving have different lifecycles. I don't want to redeploy the app every time I add a new topic to the curriculum library.
The pipeline is four numbered Python scripts, designed to run in sequence:
Step 1: Create stores. 01_create_stores.py creates one Gemini File Search Store per subject. I currently have stores for Biology, Maths, English, Science, Geography, History, French, Arabic, Art, Computing, and Business, though only Biology, Maths, English, and Science have content uploaded so far. The store names get saved to a store_registry.json file that both the pipeline and the Next.js app reference.
Step 2: Upload content. 02_upload_content.py walks the content/ directory, finds all markdown files, parses their filenames into metadata, and uploads them to the appropriate store. The filename convention does the heavy lifting here. A file named y7_cells.md automatically gets tagged with year_group=7, topic=cells, difficulty=core, and subject=science. The script uploads the file with structured metadata so the AI can filter retrieval by year group and topic at query time.
Step 3: Test queries. 03_test_query.py runs a battery of test questions against each store, covering both raw retrieval tests (no system prompt, just "does the store return relevant content?") and Socratic tests (with Athena's full system prompt, simulating real usage). This catches problems before they reach the student: if a Biology question about cells doesn't retrieve the cells content file, I know something's wrong with the upload or the metadata.
Step 4: Manage stores. 04_manage_stores.py handles listing, inspecting, and (with confirmation) deleting stores. Useful for cleanup and debugging.
The whole pipeline runs in under five minutes for the current content library. Adding a new topic is straightforward: write a markdown file, drop it in the right content/<subject>/ folder, run the upload script, run the test script, done.
Content Authoring Standards
This is where the real work lives. The RAG system is only as good as the content it retrieves, and "just upload the textbook" doesn't work. Textbooks are written for students to read. RAG content is written for an AI tutor to reference while having a Socratic conversation. The tone, structure, and level of detail need to be different.
Every content file follows a mandatory four-section template:
Key Knowledge. The core facts, definitions, and concepts the student needs. Written at Year 7 reading level (ages 11-12), with no jargon that isn't immediately defined. This is Athena's primary reference. When a student asks "what does the nucleus do?", the AI should find the answer here, in exactly the language appropriate for their age group.
Here's a real excerpt from y7_cells.md:
The nucleus controls everything the cell does. It contains genetic material called DNA, which carries the instructions for making proteins and running the cell. Think of it as the cell's control centre.
Notice the deliberate simplicity. No mention of chromatin, no "eukaryotic" versus "prokaryotic." That's not because those things don't exist. It's because they're not on the KS3 syllabus and would confuse a Year 7 student who's encountering cells for the first time. Any teacher reading this will recognise the instinct: you wouldn't introduce those terms in September of Year 7 either.
Common Misconceptions. Mistakes students frequently make, so Athena can watch for them and address them proactively. Each content file includes at least three. For cells, one of the most persistent is:
Students sometimes think plant cells do not have mitochondria because they have chloroplasts. This is wrong. Plant cells have BOTH mitochondria AND chloroplasts.
When a student says "plant cells use chloroplasts instead of mitochondria," Athena recognises this as a known misconception from its curriculum content and can address it specifically, rather than giving a generic correction.
Key Vocabulary. Subject-specific terms with Year 7-appropriate definitions. The cells file includes ten terms: cell, nucleus, cytoplasm, cell membrane, mitochondria, ribosomes, cell wall, chloroplasts, vacuole, and specialised cell. These serve double duty: they give Athena the correct terminology to use, and they provide definitions it can reference when a student asks "what does that word mean?"
Worked Examples. Step-by-step solutions the AI can draw on when helping a student through a problem. Biology files need at least two; Maths files need at least three (because mathematical procedures benefit from more examples). Each worked example includes the question, the answer, and the reasoning, so Athena can walk a student through the logic rather than just stating the final answer.
Beyond the four sections, there's a quality checklist every file must pass: correct structure, year-appropriate language, UK KS3 curriculum alignment, factual accuracy, minimum content thresholds, consistent formatting, unambiguous maths notation, and an overall tone of "Athena's internal reference material, not a student-facing worksheet."
Why Managed RAG (and Not a Vector Database)
When most developers hear "RAG," they think: chunk the documents, embed them, store them in Pinecone or Weaviate or pgvector, build a retrieval pipeline, tune the chunk size, fiddle with the embedding model, and add a reranker. That's a lot of infrastructure for a personal project.
Gemini's File Search Stores are a managed alternative. I upload my curriculum markdown files, attach metadata, and the store handles chunking, embedding, indexing, and retrieval. At query time, I pass the store name to the Gemini API and it automatically retrieves relevant content before generating a response. No vector database to host. No embedding model to choose. No chunk size to tune.
The trade-offs are real. I don't control the chunking strategy. I can't swap in a different embedding model. I can't rerank results. I'm limited to 10 stores per API key and 5 stores per query. But for an educational app serving one student with a well-structured content library, these constraints don't bite. The content files are carefully authored to be self-contained by topic, so chunking quality matters less than it would with, say, a messy PDF textbook.
The cost is negligible: $0.15 per million tokens for one-time indexing at upload. Storage is free. Query-time retrieval is free. For the current content library, a few dozen markdown files totalling maybe 50,000 words, the indexing cost rounds to zero.
If I were building this for a whole school with hundreds of content files and thousands of students, I'd probably need a self-hosted solution for cost control and customisation. But for a single-user app, managed RAG removes an entire category of infrastructure from my to-do list.
How It All Connects at Query Time
When a student sends a message, the chat API route assembles the retrieval tools based on their current subject:
- The buildRAGTool() function looks up the store name for the student's subject from the environment variables (which reference the store registry).
- If web search is enabled, buildSearchTool() adds a Google Search Retrieval tool with a dynamic threshold of 0.3.
- Both tools are passed to the Gemini model alongside the system prompt and conversation history.
- Gemini retrieves relevant curriculum content from the File Search Store, uses it to generate a Socratic response, and includes grounding metadata in the API response.
- The server calls analyzeGrounding() to inspect the metadata, counting retrieval chunks, averaging confidence scores, and identifying web URIs, then tags the response with its source layer.
The student never sees any of this. They ask a question about fractions and get a Socratic response grounded in the exact curriculum content I authored. If the content library doesn't cover their question, the AI searches trusted web sources. If that fails too, it's honest about it. The three layers degrade gracefully rather than silently hallucinating.
What This Means for Education
The RAG architecture has implications well beyond my child's homework, and this is the part I think matters most to educators:
Curriculum content becomes a first-class input. Teachers could author content files aligned to their scheme of work, using their department's preferred methods and terminology. The AI would teach using exactly those methods. A Maths department that teaches column addition before number lines could encode that preference. A Biology department that uses specific diagrams could reference them. This is "AI teaches what you tell it to teach," not "AI knows best."
Misconception libraries are powerful. The Common Misconceptions sections are possibly the most valuable part of the content files. A generic AI doesn't know that Year 7 students consistently confuse the cell membrane with the cell wall, or that they think plants don't have mitochondria. But an AI with access to a curated misconception library can catch these errors in real time and address them with the specific counter-explanation a teacher would use.
Grounding analysis creates accountability. Because every response is tagged with its source layer, you can audit how often the AI is relying on curriculum content versus falling back to general knowledge. If Layer 3 responses are frequent for a particular topic, that's a signal the content library has a gap that needs filling.
The content is the product. The AI model, the chat interface, the hosting: those are commodity infrastructure. Anyone can set them up. The curriculum content library is the thing that makes an educational AI tool actually useful, and that's the part only teachers can create. Technology people build the pipe. Educators fill it with water.
What's Next
The content library is still sparse. Biology and Maths have solid Year 7 coverage, and Science and English stores are populated, but Geography, History, French, Arabic, Art, Computing, and Business are still empty stores waiting for content. Full Year 7 coverage across all subjects is the next milestone.
The next post in the series covers the negotiation engine: the system that detects when my child is frustrated, offers a structured hint trade, and requires verification that they actually understood the help they received. That's the feature that turns Athena from "a chatbot with a curriculum" into something that genuinely teaches.