The dream of a personal AI tutor for every child is becoming a reality. A new system can take any Wikipedia article (e.g., Aristotle) and instantly generate a rich, multi-faceted learning experience. This approach creates a dynamic educational toolkit from a single source.
The Vision
My approach now uses a single Large Language Model (LLM) with one lightweight adapter (LoRA) trained on a unified dataset. Two capabilities are routed by control tags embedded in the prompt:
- `<STUDY_GUIDE>` for study guides (summaries, Q&A, flashcards, key terms)
- `<CONCEPT_MAP_TIMELINE>` for concept maps and timelines
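As an illustrative sketch of the routing (the tag strings come from this post; the helper names and prompt template are assumptions, not the actual implementation), the controller only needs to append the right control tag to a shared article prefix:

```python
# Minimal sketch of control-tag routing, assuming a single LoRA-adapted model
# behind a generic text-in/text-out callable. Tag strings match the post; the
# template and helper names are illustrative.
STUDY_GUIDE = "<STUDY_GUIDE>"
CONCEPT_MAP_TIMELINE = "<CONCEPT_MAP_TIMELINE>"

def build_prompt(article_text: str, mode_tag: str) -> str:
    # Shared article prefix followed by the mode tag the adapter was trained on.
    return f"{article_text.strip()}\n\n{mode_tag}\n"

def generate_learning_module(article_text: str, generate) -> dict:
    # `generate` is any prompt -> completion function (local model, server, ...).
    return {
        "study_guide": generate(build_prompt(article_text, STUDY_GUIDE)),
        "concept_map_timeline": generate(build_prompt(article_text, CONCEPT_MAP_TIMELINE)),
    }
```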
Mode 1: Study Guide
Generates a compact study guide: summaries, key terms, short Q&A, and flashcards—all from the same source.
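For concreteness, a parsed study-guide output might look roughly like the structure below; the field names are an assumed schema for illustration, not the dataset's actual format.

```python
# Assumed (illustrative) shape for a parsed <STUDY_GUIDE> output.
study_guide = {
    "summary": "One-paragraph overview of the article.",
    "key_terms": [{"term": "...", "definition": "..."}],
    "qa": [{"question": "...", "answer": "..."}],
    "flashcards": [{"front": "...", "back": "..."}],
}
```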
Mode 2: Concept Map & Timeline
Creates concept maps and timelines to visualize connections and sequences within the topic.
The user picks a Wikipedia article, and the system routes the prompt with tags to build a complete learning module. For example, after processing the Wikipedia article on “Anarchism,” the `<CONCEPT_MAP_TIMELINE>` mode produces a concept map connecting key figures to their core ideas and a timeline tracing the evolution of anarchist thought.
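For illustration, the parsed output of that mode might look roughly like the structure below; the schema is an assumption, and the entries are just well-known examples from the topic rather than actual model output.

```python
# Assumed (illustrative) shape for a parsed <CONCEPT_MAP_TIMELINE> output,
# using the "Anarchism" example. Schema and entries are illustrative only.
concept_map_timeline = {
    "concept_map": {
        "nodes": ["Pierre-Joseph Proudhon", "Mutualism"],
        "edges": [
            {"from": "Pierre-Joseph Proudhon", "to": "Mutualism", "relation": "developed"},
        ],
    },
    "timeline": [
        {"year": 1840, "event": "Proudhon publishes 'What Is Property?'"},
    ],
}
```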
The Technical Challenge
This leads to a critical performance question: how can we reduce prompt processing at each step?
A Wikipedia article can be very long. The naive approach would be:
- Feed the full article + `<STUDY_GUIDE>` prompt to the model.
- Feed the full article + `<CONCEPT_MAP_TIMELINE>` prompt to the model.
Processing that much text multiple times is slow and computationally expensive, especially on CPU-only hardware. It is the primary bottleneck to making this system practical.
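To make that cost concrete, here is a rough back-of-envelope sketch; the token count and CPU prefill throughput are assumed numbers, not measurements from this system.

```python
# Rough cost model with assumed numbers (not measurements).
article_tokens = 15_000        # a long Wikipedia article, tokenized
prompt_overhead = 50           # control tag + instructions
modes = 2                      # <STUDY_GUIDE> and <CONCEPT_MAP_TIMELINE>
cpu_prefill_tok_per_s = 100    # illustrative CPU-only prefill throughput

naive_prefill = modes * (article_tokens + prompt_overhead)   # 30,100 tokens
shared_prefill = article_tokens + modes * prompt_overhead    # 15,100 tokens

print(f"naive:  ~{naive_prefill / cpu_prefill_tok_per_s / 60:.1f} min of prefill")
print(f"shared: ~{shared_prefill / cpu_prefill_tok_per_s / 60:.1f} min of prefill")
```

Even with these rough numbers, refeeding the article per mode roughly doubles the prefill work, and prefill dominates on CPU.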
Can we reuse the KV cache across modes?
KV caches represent the model’s internal state for a specific prefix under a specific set of weights. Reusing a cache reliably requires that:
- The same base weights and the same adapters remain active between the cache build and generation.
- Your inference stack supports prefix/prompt caching as a first-class feature (many do, but not all expose it cleanly in every API).
Because a LoRA changes the effective weights, computing a cache without the LoRA and then generating with the LoRA can be inconsistent. Likewise, swapping adapters between cache build and decode generally breaks cache validity. In plain eager `transformers`, starting a fresh `generate()` from an externally built `past_key_values` is also fragile/not well supported.
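One way to make those validity conditions explicit is to key any stored prefix cache on everything that affects the forward pass. The sketch below is a hypothetical bookkeeping helper, not part of any inference library:

```python
import hashlib

def prefix_cache_key(base_model_id: str, adapter_id: str | None, prefix_text: str) -> str:
    """Hypothetical cache key: a KV cache is only reusable when the base
    weights, the active adapter (or lack of one), and the exact prefix match."""
    payload = f"{base_model_id}|{adapter_id or 'no-adapter'}|{prefix_text}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Building the cache without the LoRA and decoding with it would produce
# different keys, so the cached prefix would (correctly) not be reused.
assert prefix_cache_key("base-model", None, "article...") != \
       prefix_cache_key("base-model", "tutor-lora", "article...")
```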
Things to explore:
- Prefix/prompt caching engines: Run a single LoRA and use servers that support prefix caching (e.g., vLLM's prefix caching, llama.cpp's prompt cache). Send the article once as a shared prefix, then issue two requests with `<STUDY_GUIDE>` and `<CONCEPT_MAP_TIMELINE>`. You get reuse without swapping adapters (see the first sketch after this list).
- One-pass, structured output: Ask for both artifacts in one generation (clearly delimited sections) and split afterward (see the second sketch after this list). Cheapest CPU path; one read of the article. Might need further fine-tuning to ensure it outputs both, or I could combine the JSON objects from both datasets into one.
- Outline-and-refine: First produce a dense outline/notes with citations from the article, then feed only the outline when generating each artifact.
- Chunk-and-retrieve: Embed the article once, store it in a vector index, and retrieve relevant chunks per task. Keeps prompts short and avoids refeeding the whole article, but may miss key information.
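As a sketch of the first option, assuming a vLLM build where prefix caching and LoRA can be combined (the model name, adapter path, and file name are placeholders):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder model and adapter; enable_prefix_caching lets the engine reuse
# the KV cache for the shared article prefix across both requests.
llm = LLM(model="base-model-name", enable_prefix_caching=True, enable_lora=True)
lora = LoRARequest("tutor-adapter", 1, "/path/to/lora")  # single LoRA, never swapped

article = open("anarchism.txt").read()
params = SamplingParams(temperature=0.2, max_tokens=1024)

# Same article prefix, different control tags: the prefix is prefilled once
# and reused for the second request.
study_guide = llm.generate([f"{article}\n\n<STUDY_GUIDE>\n"], params, lora_request=lora)
concept_map = llm.generate([f"{article}\n\n<CONCEPT_MAP_TIMELINE>\n"], params, lora_request=lora)
```

And a sketch of the one-pass option: ask for both sections with an explicit delimiter in a single generation, then split afterward. The delimiter string is my own convention, not something the adapter currently emits.

```python
# Hypothetical delimiter; the adapter would likely need further fine-tuning
# (or a merged dataset) to emit both sections reliably in one pass.
SPLIT_MARKER = "=== CONCEPT MAP & TIMELINE ==="

def split_one_pass_output(text: str) -> tuple[str, str]:
    """Split a single generation into (study_guide, concept_map_timeline)."""
    study, _, concept = text.partition(SPLIT_MARKER)
    # If the marker is missing, `concept` comes back empty.
    return study.strip(), concept.strip()
```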
I’m treating KV-cache reuse across modes as an optimization to evaluate, not a guarantee. The approaches above are practical today and can get close to the same user experience.