Fine-Tuning Local Models for Education
Cleaned Transcript
So the reason I wanted to fine-tune a small local model was that I eventually want it to be deployed and used by people who don't have powerful computers, which means it needs to be CPU friendly and usable without an internet connection. I chose Qwen 2.5 7B (seven billion parameters) because I kept reading about the capability breakthroughs that tend to appear around the seven-billion-parameter range. I didn't choose Qwen 3 because in my initial testing I kept getting weird errors when trying to save the LoRA or the merged model. I would want to revisit this with Qwen 3 8B, probably the non-thinking variant, because I've noticed that Qwen 3 thinks quite verbosely and takes a long time to respond, which isn't ideal.

The plan was pretty simple: use OpenRouter and other API providers to create a dataset using Wikipedia articles as input. I used the Hugging Face Wikipedia dataset and did category checks on each article so they're related to tech or science, but the set still includes articles on people and institutions; it's not just scientific processes themselves but anything related to them. I didn't want the fine-tune to be limited to high-quality, dense wiki articles, because people might want to create learning materials from other types of articles, such as biographies and historical overviews.

The schema was pretty simple: it's JSON, and the output would be passed to an HTML page or a React component. So I needed the model to know how to write JSON, which means all of the dataset output examples were JSON objects, with the schema included in the prompt along with simple instructions like "take this Wikipedia article and create study guides, concept maps and timelines". I used the free tier on OpenRouter, so models like Kimi, DeepSeek and Qwen 3, and over less than a week I amassed about 10,000 dataset samples for the study guides and 10,000 for the concept maps.

Then I ran a fine-tune on each dataset using RunPod; for one of them I used an RTX 5090. My training scripts are saved with the batch size of 64, the learning rate and the other parameters, and each fine-tune took me about an hour. Once I got the LoRA adapters, I merged them into GGUFs, tested them, and they ran pretty decently.

One issue I did run into came from choosing to train a QLoRA, a quantized LoRA, using the 4-bit quantized version of Qwen 2.5 7B. Because of this I needed a certain setup when doing inference, and that setup wasn't supported by my Mac, which meant it was trying to load the full 7B-parameter model, doing all sorts of things and basically running out of memory. I couldn't get it to run on Windows WSL either. So the only other option is to install Ubuntu or something on there and make sure xFormers and Triton and all the other libraries are working properly so I can load the quantized model. The target machine has 8GB of RAM, so it should fit the four-and-a-half-gig model plus around 4GB for context.

I did train at 16k tokens, because most Wikipedia articles, especially dense ones, can get quite long, around 13 to 15k tokens, and I gave myself a little bit of buffer for the actual output as well. Of course it still works with small articles, but I wanted it ready for the big, juicy articles. I was impressed with the initial results. However, I did envision a system where I could load the model once and then, for each Wikipedia article, process everything with the first adapter, then have the second adapter create the concept maps and timelines, doing the same thing with the already tokenized Wikipedia article.
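As a rough sketch of the dataset-generation step above, this is roughly what one call to OpenRouter's OpenAI-compatible chat endpoint could look like. The model slug, the simplified schema and the prompt wording are illustrative assumptions, not the exact ones used:

```python
import json
import os
import requests

# Simplified stand-in for the study-guide schema; the real schema isn't spelled out in the notes.
SCHEMA = {
    "title": "string",
    "summary": "string",
    "flashcards": [{"front": "string", "back": "string"}],
    "questions": [{"question": "string", "answer": "string"}],
}

PROMPT = (
    "Take this Wikipedia article and create a study guide. "
    "Respond with a single JSON object matching this schema:\n"
    + json.dumps(SCHEMA, indent=2)
    + "\n\nArticle:\n"
)

def generate_example(article_text: str, model: str = "deepseek/deepseek-chat") -> dict:
    """Call OpenRouter's OpenAI-compatible chat endpoint and return one dataset row."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": PROMPT + article_text}],
        },
        timeout=300,
    )
    resp.raise_for_status()
    content = resp.json()["choices"][0]["message"]["content"]
    # json.loads will raise if the model wraps the object in prose, so a real run
    # needs schema validation and a retry path.
    return {"input": article_text, "output": json.loads(content)}
```

In practice each reply would need validating against the schema and retrying, since free-tier models don't always return clean JSON.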
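For the fine-tuning step itself, the notes only mention RunPod, a batch size of 64 and a 16k sequence length, so this is just a minimal QLoRA setup sketch with transformers, peft and bitsandbytes; the rank, alpha, dropout and target modules are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit NF4, which is what makes this a QLoRA run.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",   # assumed base checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters; these hyperparameters are illustrative defaults.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The training loop itself (effective batch size 64, 16k max sequence length)
# would be driven by a standard SFT trainer; details depend on the library version.
```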
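For the "merge the LoRA into a GGUF" step, one common route (and one way around the 4-bit bitsandbytes inference setup that wouldn't run on the Mac) is to merge the adapter into the full-precision base weights with peft and then convert and quantize the merged model with llama.cpp. A minimal sketch, with the paths assumed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-7B-Instruct"       # assumed base checkpoint
ADAPTER = "outputs/study-guide-lora"    # hypothetical adapter path

# Load the base in bf16 and fold the LoRA weights into it.
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()

merged.save_pretrained("merged-study-guide-model")
AutoTokenizer.from_pretrained(BASE).save_pretrained("merged-study-guide-model")
```

From there, llama.cpp's convert_hf_to_gguf.py and llama-quantize (the exact script and binary names vary between versions) produce the roughly 4.5GB Q4 GGUF that plain CPU llama.cpp can run without xFormers, Triton or bitsandbytes.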
But I realized that tokenizing it in advance would save a couple of seconds at most. What I really want to do is see if I can reuse the key-value (KV) context cache, because essentially I'm thinking: why am I processing a long Wikipedia article twice in full? If I could process it once, then run the first adapter, run the second adapter, and merge both outputs before sending them to the user, that would be amazing.

But even if not, I'm embarking on a new experiment to unify both datasets into one. Essentially that means tagging dataset entries by what they are, study guides or concept maps and timelines, and routing behavior based on the tags used during training. For instance, the study guide examples would carry a token or tag for study guides; when I include that token or tag during inference, it should activate the behavior for producing study guides, and the same goes for concept maps.

I also realized all of the dataset examples were at an adult, university level, depending on the complexity of the article. I want this to be used by school children around the world, so it needs to be adaptable: the user could say "write this for Key Stage 2, Key Stage 3 or Key Stage 4", even if it's an article on gradient descent, which is normally quite advanced. I'm taking 5,000 examples from each dataset and creating a Key Stage 3 example and a Key Stage 4 example for each. So in total that should give about 20k examples per dataset, or is that 20k overall? No, I think it's per dataset, so around 40k in total. Basically it's going to be a lot of data. I'm also thinking now that I should sample some for Key Stages 1 and 2 as well, and for college/university. I probably don't need to do it for university, but I might as well, just so the tagging behavior is consistent.

With the new unified dataset I can train a single quantized model without having to load different models. I was thinking of a 50-50 mix. Actually, most of the time I need it to be fully spot on with the study guide stuff, because that includes flashcards, summaries, mnemonics and questions; concept maps are a good companion for a tutor to teach with, but not strictly necessary in that sense. So I'm thinking of weighting concept maps at 40% and study guides at 60%, and that also includes the key stage examples. So: create the unified dataset, train a new model, and try training on Qwen 3 this time. I'll see how that goes and see how it routes behavior with the tags.
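A sketch of how the unified, tag-routed dataset could be assembled, with control tags, key-stage labels and the 60/40 study-guide/concept-map weighting applied at sampling time. The tag format, file names and field names are made up for illustration:

```python
import json
import random

random.seed(0)

def load_jsonl(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def to_example(row: dict) -> dict:
    """Prefix the article with control tags so the model can route behavior on them."""
    prompt = f"<task:{row['task']}><level:{row['level']}>\n{row['article']}"
    return {"prompt": prompt, "response": json.dumps(row["output"], ensure_ascii=False)}

# Hypothetical files; each row is assumed to carry task, level, article and output fields.
study_guides = load_jsonl("study_guides_with_keystages.jsonl")
concept_maps = load_jsonl("concept_maps_with_keystages.jsonl")

# One way to hit the 60/40 weighting: keep all study guides, downsample concept maps.
n_cm = int(len(study_guides) * 40 / 60)
mixed = study_guides + random.sample(concept_maps, min(n_cm, len(concept_maps)))
random.shuffle(mixed)

with open("unified_dataset.jsonl", "w", encoding="utf-8") as f:
    for row in mixed:
        f.write(json.dumps(to_example(row), ensure_ascii=False) + "\n")
```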
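On the KV-cache question, one way this could work with the merged GGUF and llama-cpp-python is to set a prompt cache so the long article prefix is evaluated once and the second request only pays for its own tag suffix and output. This matches the unified single-model plan rather than the two-adapter version, and it is a sketch under assumptions: the model file name and control tags are hypothetical, and it relies on both prompts sharing an identical article prefix:

```python
from llama_cpp import Llama, LlamaRAMCache

# Hypothetical GGUF from the unified fine-tune; 16k context to match training.
llm = Llama(model_path="qwen2.5-7b-edu-unified-q4_k_m.gguf", n_ctx=16384, verbose=False)
llm.set_cache(LlamaRAMCache())  # lets later calls reuse the KV state of a shared prefix

with open("article.txt", encoding="utf-8") as f:
    article = f.read()

# Both prompts start with the same article text, so the second call can skip
# re-processing it and only evaluate the task tag plus its own output.
shared_prefix = f"{article}\n"

study_guide = llm(shared_prefix + "<task:study_guide><level:ks3>\n", max_tokens=2048)
concept_map = llm(shared_prefix + "<task:concept_map><level:ks3>\n", max_tokens=2048)

print(study_guide["choices"][0]["text"])
print(concept_map["choices"][0]["text"])
```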
Summary
Fine-tuned Qwen 2.5 7B for education using Wikipedia articles; now building a unified, tag-routed dataset of study guides and concept maps; considering Qwen 3 8B for the next run.
Tags
Key Points
- Fine-tuned Qwen 2.5 7B for education
- Used Hugging Face Wikipedia dataset with category checks
- Generated ~10k dataset samples each for study guides and concept maps
- Fine-tuned on RunPod with an RTX 5090
- QLoRA (4-bit) training led to inference issues on Mac/WSL
- Considering Qwen 3 8B (non-thinking) model
- Creating unified dataset with tags for behavior routing
Action Items
Decisions
- Will weight the unified dataset 60% study guides / 40% concept maps (revised from the initial 50-50 idea)