The dream of a personal AI tutor for every child is becoming a reality. A new system can take any Wikipedia article (e.g., Aristotle) and instantly generate a rich, multi-faceted learning experience. This approach creates a dynamic educational toolkit from a single source.
The Vision
My approach now uses a single Large Language Model (LLM) with one lightweight adapter (LoRA) trained on a unified dataset. Two capabilities are routed by control tags embedded in the prompt:
- `<STUDY_GUIDE>` for study guides (summaries, Q&A, flashcards, key terms)
- `<CONCEPT_MAP_TIMELINE>` for concept maps and timelines
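As an illustrative sketch of the routing (the tag strings come from this post; the helper names and prompt template are assumptions, not the actual implementation), the controller only needs to append the right control tag to a shared article prefix:

```python
# Minimal sketch of control-tag routing, assuming a single LoRA-adapted model
# behind a generic text-in/text-out callable. Tag strings match the post; the
# template and helper names are illustrative.
STUDY_GUIDE = "<STUDY_GUIDE>"
CONCEPT_MAP_TIMELINE = "<CONCEPT_MAP_TIMELINE>"

def build_prompt(article_text: str, mode_tag: str) -> str:
    # Shared article prefix followed by the mode tag the adapter was trained on.
    return f"{article_text.strip()}\n\n{mode_tag}\n"

def generate_learning_module(article_text: str, generate) -> dict:
    # `generate` is any prompt -> completion function (local model, server, ...).
    return {
        "study_guide": generate(build_prompt(article_text, STUDY_GUIDE)),
        "concept_map_timeline": generate(build_prompt(article_text, CONCEPT_MAP_TIMELINE)),
    }
```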
Mode 1: Study Guide
Generates a compact study guide: summaries, key terms, short Q&A, and flashcards—all from the same source.
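For concreteness, a parsed study-guide output might look roughly like the structure below; the field names are an assumed schema for illustration, not the dataset's actual format.

```python
# Assumed (illustrative) shape for a parsed <STUDY_GUIDE> output.
study_guide = {
    "summary": "One-paragraph overview of the article.",
    "key_terms": [{"term": "...", "definition": "..."}],
    "qa": [{"question": "...", "answer": "..."}],
    "flashcards": [{"front": "...", "back": "..."}],
}
```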
Mode 2: Concept Map & Timeline
Creates concept maps and timelines to visualize connections and sequences within the topic.
The user picks a Wikipedia article, and the system routes the prompt with tags to build a complete learning module. For example, after processing the Wikipedia article on “Anarchism,” the `<CONCEPT_MAP_TIMELINE>` mode produces a concept map connecting key figures to their core ideas and a timeline tracing the evolution of anarchist thought.
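For illustration, the parsed output of that mode might look roughly like the structure below; the schema is an assumption, and the entries are just well-known examples from the topic rather than actual model output.

```python
# Assumed (illustrative) shape for a parsed <CONCEPT_MAP_TIMELINE> output,
# using the "Anarchism" example. Schema and entries are illustrative only.
concept_map_timeline = {
    "concept_map": {
        "nodes": ["Pierre-Joseph Proudhon", "Mutualism"],
        "edges": [
            {"from": "Pierre-Joseph Proudhon", "to": "Mutualism", "relation": "developed"},
        ],
    },
    "timeline": [
        {"year": 1840, "event": "Proudhon publishes 'What Is Property?'"},
    ],
}
```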
The Technical Challenge
This leads to a critical performance question: how can we reduce prompt processing at each step?
A Wikipedia article can be very long. The naive approach would be:
- Feed the full article + `<STUDY_GUIDE>` prompt to the model.
- Feed the full article + `<CONCEPT_MAP_TIMELINE>` prompt to the model.
Processing that much text multiple times is slow and computationally expensive, especially on CPU-only hardware. It is the primary bottleneck to making this system practical.
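To make that cost concrete, here is a rough back-of-envelope sketch; the token count and CPU prefill throughput are assumed numbers, not measurements from this system.

```python
# Rough cost model with assumed numbers (not measurements).
article_tokens = 15_000        # a long Wikipedia article, tokenized
prompt_overhead = 50           # control tag + instructions
modes = 2                      # <STUDY_GUIDE> and <CONCEPT_MAP_TIMELINE>
cpu_prefill_tok_per_s = 100    # illustrative CPU-only prefill throughput

naive_prefill = modes * (article_tokens + prompt_overhead)   # 30,100 tokens
shared_prefill = article_tokens + modes * prompt_overhead    # 15,100 tokens

print(f"naive:  ~{naive_prefill / cpu_prefill_tok_per_s / 60:.1f} min of prefill")
print(f"shared: ~{shared_prefill / cpu_prefill_tok_per_s / 60:.1f} min of prefill")
```

Even with these rough numbers, refeeding the article per mode roughly doubles the prefill work, and prefill dominates on CPU.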
Can we reuse the KV cache across modes?
KV caches represent the model’s internal state for a specific prefix under a specific set of weights. Reusing a cache reliably requires that:
- The same base weights and the same adapters remain active between the cache build and generation.
- Your inference stack supports prefix/prompt caching as a first-class feature (many do, but not all expose it cleanly in every API).
Because a LoRA changes the effective weights, computing a cache without the LoRA and then generating with the LoRA can be inconsistent. Likewise, swapping adapters between cache build and decode generally breaks cache validity. In plain eager `transformers`, starting a fresh `generate()` from an externally built `past_key_values` is also fragile/not well supported.
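One way to make those validity conditions explicit is to key any stored prefix cache on everything that affects the forward pass. The sketch below is a hypothetical bookkeeping helper, not part of any inference library:

```python
import hashlib

def prefix_cache_key(base_model_id: str, adapter_id: str | None, prefix_text: str) -> str:
    """Hypothetical cache key: a KV cache is only reusable when the base
    weights, the active adapter (or lack of one), and the exact prefix match."""
    payload = f"{base_model_id}|{adapter_id or 'no-adapter'}|{prefix_text}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Building the cache without the LoRA and decoding with it would produce
# different keys, so the cached prefix would (correctly) not be reused.
assert prefix_cache_key("base-model", None, "article...") != \
       prefix_cache_key("base-model", "tutor-lora", "article...")
```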
Things to explore:
- Prefix/prompt caching engines: Run a single LoRA and use servers that support prefix caching (e.g., vLLM's prefix caching, llama.cpp's prompt cache). Send the article once as a shared prefix, then issue two requests with `<STUDY_GUIDE>` and `<CONCEPT_MAP_TIMELINE>`. You get reuse without swapping adapters (see the first sketch after this list).
- One-pass, structured output: Ask for both artifacts in one generation (clearly delimited sections) and split afterward (see the second sketch after this list). Cheapest CPU path; one read of the article. Might need further fine-tuning to ensure it outputs both, or I could combine the JSON objects from both datasets into one.
- Outline-and-refine: First produce a dense outline/notes with citations from the article, then feed only the outline when generating each artifact.
- Chunk-and-retrieve: Embed the article once, store it in a vector index, and retrieve relevant chunks per task. Keeps prompts short and avoids refeeding the whole article, but may miss key information.
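As a sketch of the first option, assuming a vLLM build where prefix caching and LoRA can be combined (the model name, adapter path, and file name are placeholders):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder model and adapter; enable_prefix_caching lets the engine reuse
# the KV cache for the shared article prefix across both requests.
llm = LLM(model="base-model-name", enable_prefix_caching=True, enable_lora=True)
lora = LoRARequest("tutor-adapter", 1, "/path/to/lora")  # single LoRA, never swapped

article = open("anarchism.txt").read()
params = SamplingParams(temperature=0.2, max_tokens=1024)

# Same article prefix, different control tags: the prefix is prefilled once
# and reused for the second request.
study_guide = llm.generate([f"{article}\n\n<STUDY_GUIDE>\n"], params, lora_request=lora)
concept_map = llm.generate([f"{article}\n\n<CONCEPT_MAP_TIMELINE>\n"], params, lora_request=lora)
```

And a sketch of the one-pass option: ask for both sections with an explicit delimiter in a single generation, then split afterward. The delimiter string is my own convention, not something the adapter currently emits.

```python
# Hypothetical delimiter; the adapter would likely need further fine-tuning
# (or a merged dataset) to emit both sections reliably in one pass.
SPLIT_MARKER = "=== CONCEPT MAP & TIMELINE ==="

def split_one_pass_output(text: str) -> tuple[str, str]:
    """Split a single generation into (study_guide, concept_map_timeline)."""
    study, _, concept = text.partition(SPLIT_MARKER)
    # If the marker is missing, `concept` comes back empty.
    return study.strip(), concept.strip()
```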
I’m treating KV-cache reuse across modes as an optimization to evaluate, not a guarantee. The approaches above are practical today and can get close to the same user experience.