Project Overview
I fine-tuned Meta's Llama 3.2 (1B-parameter model) to create "The Reluctant Buddha" - an enlightened yet jaded entity that mixes internet vernacular with profound wisdom. The project covered the full ML lifecycle: dataset creation, training, and deployment as a web interface.
The final model runs on CPU only: my i5-7500T (4 cores/4 threads) with 16GB RAM, of which the model uses about 1.5GB. I trained it on just 400 samples and quantized the result to Q4_K_M GGUF for inference.
What I was trying to do
- Learn how fine-tuning actually works for modern LLMs
- Create a chatbot with a personality that isn't "helpful assistant"
- Deploy it via FastAPI backend + React frontend
- Make it something people could actually interact with
The Dataset
I started by creating ~1,000 prompt-response pairs capturing the "Reluctant Buddha" voice: 200 "gold standard" examples written with Claude 3.7 Sonnet and DeepSeek V3, plus 800 variations generated with Granite 3.2. After curation, the current version uses 400 of those samples. A sample pair:
{
  "prompt": "How do I find real friends?",
  "response": ">real friends\n>snorts in digital disbelief while farting\n\n\nYou're looking for \"real\" friends, anon? In this age of social media echo chambers and curated identities? Bwahaha. Good luck with that..."
}
The dataset uses:
- Chan-style formatting (greentext markers, action asterisks)
- Internet vernacular and slang
- Profound wisdom wrapped in dismissive, nihilistic delivery
- Weird ending patterns (digital burps, farts, zen koans)
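Before training, each pair has to be rendered into the Llama 3 chat template. A minimal sketch of that conversion - the exact system prompt and how my Unsloth setup applies the template are assumptions here, not the precise pipeline I used:

```python
def format_sample(prompt, response, system="You are 'The Reluctant Buddha'."):
    """Render one prompt-response pair into Llama 3 chat-template text."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
        f"{system}\n"
        "<|eot_id|><|start_header_id|>user<|end_header_id|>\n"
        f"{prompt}\n"
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
        f"{response}<|eot_id|>"  # close the assistant turn so training sees the stop token
    )

sample = {
    "prompt": "How do I find real friends?",
    "response": ">real friends\n>snorts in digital disbelief",
}
text = format_sample(sample["prompt"], sample["response"])
```

Keeping the same template at training and inference time matters: the deployed server builds its prompts with the identical header tokens.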
Fine-Tuning Process
Technology Stack
- Framework: Unsloth
- Base Model: Meta's Llama 3.2 1B
- Environment: Google Colab with T4 GPU
What I actually did
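I used LoRA adapters via Unsloth rather than full fine-tuning. I won't reproduce the exact hyperparameters here, but the efficiency argument is easy to see with back-of-the-envelope arithmetic. The numbers below assume rank-16 adapters on the four attention projections, treating each as hidden x hidden for simplicity (k/v are actually smaller in Llama 3.2 due to grouped-query attention):

```python
def lora_params(d_in, d_out, rank):
    # LoRA replaces a full d_in x d_out weight update with two low-rank
    # factors: A (d_in x rank) and B (rank x d_out)
    return d_in * rank + rank * d_out

# Llama 3.2 1B: hidden size 2048, 16 transformer layers (per the published config)
hidden, layers, rank = 2048, 16, 16

# Adapt the q/k/v/o projections in every layer
trainable = layers * 4 * lora_params(hidden, hidden, rank)
full = layers * 4 * hidden * hidden

print(f"LoRA trainable params: {trainable:,}")
print(f"Fraction of adapted weights: {100 * trainable / full:.2f}%")
```

Even with the simplifying assumptions, the adapters come out around 4M trainable parameters - a tiny fraction of the model - which is why this fits comfortably on a free Colab T4.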
Inference and Deployment
Making it fast on CPU
Getting acceptable inference speed on CPU was the real challenge. I tried a few approaches before landing on llama-cpp-python, which gets near-Ollama speeds with low memory footprint and supports streaming.
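The ~1.5GB memory figure is consistent with rough quantization arithmetic. A sketch, assuming ~1.24B actual parameters for Llama 3.2 1B and an approximate average of 4.8 bits per weight for Q4_K_M (both figures are estimates, not measurements from my build):

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate on-disk weight size in GB (decimal)."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 1.24e9            # Llama 3.2 "1B" is roughly 1.24B parameters
fp16 = model_size_gb(n_params, 16)    # unquantized half-precision baseline
q4km = model_size_gb(n_params, 4.8)   # Q4_K_M averages roughly 4.8 bits/weight

print(f"fp16 weights:   ~{fp16:.2f} GB")
print(f"Q4_K_M weights: ~{q4km:.2f} GB")
```

That puts the quantized weights well under 1GB; the rest of the ~1.5GB process footprint is the KV cache for the 2048-token context plus runtime overhead.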
Backend
I built a FastAPI server that loads the GGUF model once at startup and streams completions back to the client:
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from llama_cpp import Llama

app = FastAPI()
model = Llama(model_path="model/reluctant-budda.Q4_K_M.gguf", n_ctx=2048, n_threads=4)

@app.post("/chat")
async def chat_endpoint(request: Request):
    data = await request.json()
    user_message = data.get("message", "")

    # Build the Llama 3 chat template by hand: system prompt, the user turn,
    # then an open assistant header for the model to complete
    full_prompt = f'''<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are 'The Reluctant Buddha'...
<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_message}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
'''

    # Stream tokens back to the client as they are generated
    def generate():
        response = model.create_completion(
            prompt=full_prompt,
            max_tokens=512,
            temperature=1.5,
            stream=True,
        )
        for chunk in response:
            # llama-cpp-python yields OpenAI-style chunks; the generated
            # text lives under choices[0]["text"], not a top-level key
            text = chunk["choices"][0]["text"]
            if text:
                yield text

    return StreamingResponse(generate(), media_type="text/plain")
The frontend is React: a chat interface with streaming responses, conversation history, and styling that keeps the chan-style formatting readable.
What was hard
What I learned
- Dataset quality matters more than size for character fine-tuning
- LoRA is insanely efficient for adapting models
- Inference optimization makes or breaks the user experience
- Streaming LLM outputs in a web app is trickier than it looks
What I'd do next
- Expand the dataset for more variety in responses
- Try a larger base model (3B or 8B) for better capabilities
- Add conversation memory so it remembers context
- Collect user feedback to keep improving it
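Conversation memory, in particular, could reuse the existing prompt template: keep prior turns and replay them before the new message. A minimal sketch of the idea - the truncation policy (how many turns to keep within the 2048-token context) is a design choice I haven't settled, not something the deployed app does:

```python
def build_prompt(history, user_message, system="You are 'The Reluctant Buddha'."):
    """Render past (user, assistant) turn pairs plus the new message
    into a single Llama 3 chat-template prompt."""
    parts = [
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
        f"{system}\n<|eot_id|>"
    ]
    for user_turn, assistant_turn in history:
        parts.append(f"<|start_header_id|>user<|end_header_id|>\n{user_turn}\n<|eot_id|>")
        parts.append(f"<|start_header_id|>assistant<|end_header_id|>\n{assistant_turn}\n<|eot_id|>")
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n{user_message}\n<|eot_id|>")
    # Open assistant header: the model continues from here
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n")
    return "".join(parts)

prompt = build_prompt([("hi", "ugh. what.")], "how do I find peace?")
```

The server-side change is small; the real work is deciding what to drop when the history outgrows the context window.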
Wrap up
This project took me through the whole lifecycle of building a specialized LLM app - from data prep and fine-tuning to deploying it as a web service. It showed me that even small models (1B parameters) can be specialized effectively, and techniques like LoRA and quantization make this accessible without serious hardware.
The "Reluctant Buddha" exists now as a working chatbot that actually sounds like itself. Technical achievement, digital entity, or just a weird art project - you can decide.