Reluctant Buddha: A Fine-Tuned LLM for Shitty Insights
2025-03-08

Fine-tuned Llama 3.2 (1B parameters) on only 400 data samples and quantized it to Q4_K_M GGUF for efficient CPU inference. Runs on CPU only (i5-7500T, 4 cores/4 threads) with 16GB RAM (1.5GB used).


Project Overview#

This project involved fine-tuning Meta’s Llama 3.2 (1B parameter model) to create “The Reluctant Buddha” - an enlightened yet jaded entity with a distinct personality mixing internet vernacular with profound wisdom. The project covered the complete ML lifecycle from dataset creation to model deployment via a web interface.

Project Goals#

  1. Learn the fine-tuning process for modern LLMs
  2. Create a specialized chatbot with a unique personality
  3. Deploy the model via a FastAPI backend connected to a React frontend
  4. Make the model accessible for users to interact with directly

The Dataset#

The project began with creating a specialized dataset of ~1,000 prompt-response pairs that captured the essence of the “Reluctant Buddha” character (the current model was trained on a 400-sample subset). 200 “gold standard” examples were written with Claude 3.7 Sonnet and DeepSeek V3, and 800 expanded variations were generated with Granite 3.2; a sketch of that expansion step appears after the feature list below.

{
 "prompt": "How do I find real friends?",
 "response": ">real friends\n>*snorts in digital disbelief while farting*\n\n\nYou're looking for \"real\" friends, anon? In this age of social media echo chambers and curated identities? Bwahaha. Good luck with that..."
}

The dataset featured:

  • Chan-style formatting (greentext markers, action asterisks)
  • Internet vernacular and slang
  • Profound wisdom wrapped in dismissive, nihilistic delivery
  • Unique ending patterns (digital burps, farts, zen koans)
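
The expansion from gold examples to the larger set is essentially “rewrite this example in the same voice” prompting against a locally hosted model. A minimal sketch of how that can look, assuming Granite 3.2 is served through a local Ollama instance and its Python client; the model tag, system prompt, and helper function are illustrative, not the exact ones used:

import json
import ollama  # assumes a local Ollama server with a Granite 3.2 model pulled

def expand_example(gold, n_variants=4):
    """Ask the local model for paraphrased variants of one gold prompt-response pair."""
    variants = []
    for _ in range(n_variants):
        result = ollama.chat(
            model="granite3.2",  # hypothetical tag; use whatever Granite build is installed
            messages=[
                {"role": "system", "content": "Rewrite the user's example in the same "
                 "chan-flavored, jaded-sage voice. Return JSON with 'prompt' and 'response' keys."},
                {"role": "user", "content": json.dumps(gold)},
            ],
        )
        variants.append(result["message"]["content"])
    return variants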

Fine-Tuning Process#

Technology Stack#

  • Framework: Unsloth for efficient fine-tuning
  • Base Model: Meta’s Llama 3.2 1B parameter model
  • Environment: Google Colab with T4 GPU

Key Technical Steps#

  1. Data Preparation:

    • Transformed the JSON dataset into the format expected by the Llama 3.1/3.2 chat template
    • Standardized conversations into the proper role/content format
    • Applied chat templating using get_chat_template
  2. Model Configuration:

    • Applied LoRA (Low-Rank Adaptation) with r=16 to efficiently tune the model
    • Targeted key projection matrices (q_proj, k_proj, v_proj, etc.)
    • Used gradient checkpointing to optimize memory usage
  3. Training Process:

    • Utilized SFTTrainer from TRL library
    • Applied response-only training to focus loss on assistant outputs
    • Trained for 120 steps (multiple epochs over the dataset)
    • Used 8-bit optimizers for memory efficiency
  4. Model Export:

    • Saved model in multiple formats (PyTorch, GGUF)
    • Quantized to Q4_K_M for efficient CPU inference (a condensed code sketch of these four steps follows this list)
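
Condensed into code, the four steps above map onto the standard Unsloth + TRL pipeline. This is a minimal sketch following Unsloth's Colab notebook conventions (FastLanguageModel, get_chat_template, train_on_responses_only, save_pretrained_gguf); the dataset path and some hyperparameters are illustrative rather than the exact values used, and argument names may shift between Unsloth versions:

from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template, train_on_responses_only
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# 1. Load the 1B base model in 4-bit and attach LoRA adapters (r=16) to the projection matrices
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)

# 2. Apply the Llama 3.1 chat template and render each role/content conversation to a training string
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
dataset = load_dataset("json", data_files="reluctant_buddha.jsonl", split="train")  # illustrative path
dataset = dataset.map(
    lambda ex: {"text": tokenizer.apply_chat_template(ex["conversations"], tokenize=False)},
)

# 3. Train with TRL's SFTTrainer, masking the loss so only assistant responses count
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=120,
        learning_rate=2e-4,
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
trainer.train()

# 4. Export a Q4_K_M GGUF for llama.cpp-based CPU inference
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")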

Inference and Deployment#

CPU Inference Optimization#

A key challenge was achieving acceptable inference speed in a CPU-only environment. After experimenting with various approaches, I settled on llama-cpp-python, which provided:

  • Near-Ollama speeds on CPU
  • Low memory footprint
  • Streaming capabilities for UX
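
Before putting a web server in front of it, the quantized GGUF can be sanity-checked directly with llama-cpp-python. A minimal sketch; the prompt is illustrative, and the full Llama 3 chat template is applied properly in the server code below:

from llama_cpp import Llama

# Load the quantized GGUF; n_threads matches the i5-7500T's 4 cores
llm = Llama(model_path="model/reluctant-budda.Q4_K_M.gguf", n_ctx=2048, n_threads=4)

# Stream tokens to stdout as they are generated
for chunk in llm.create_completion("How do I find real friends?", max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)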

Backend Development#

Created a FastAPI server to:

  • Load the quantized model efficiently
  • Stream responses token-by-token
  • Handle system prompts to maintain character consistency
  • Process incoming requests asynchronously

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from llama_cpp import Llama

app = FastAPI()
model = Llama(model_path="model/reluctant-budda.Q4_K_M.gguf", n_ctx=2048, n_threads=4)

SYSTEM_PROMPT = "You are 'The Reluctant Buddha'..."

@app.post("/chat")
async def chat_endpoint(request: Request):
    data = await request.json()
    user_message = data.get("message", "")

    # Build the Llama 3 prompt: system persona + user turn, then hand off to the assistant
    full_prompt = (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{SYSTEM_PROMPT}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

    # Stream response token by token
    def generate():
        response = model.create_completion(
            prompt=full_prompt,
            max_tokens=512,
            temperature=1.5,
            stream=True,
        )
        for chunk in response:
            # llama-cpp-python streams OpenAI-style chunks; the text lives under choices[0]
            yield chunk["choices"][0]["text"]

    return StreamingResponse(generate(), media_type="text/plain")
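
A quick way to exercise the streaming endpoint without a frontend is a small command-line client. A sketch using requests, assuming the server runs locally on port 8000:

import requests

# Stream the plain-text response chunk by chunk as the server emits it
with requests.post(
    "http://localhost:8000/chat",
    json={"message": "How do I find real friends?"},
    stream=True,
) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)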

Frontend Integration#

Developed a React frontend that:

  • Provides a chat-like interface
  • Handles streaming responses with proper formatting
  • Maintains conversation history
  • Styles the output to enhance readability of chan-style formatting

Challenges and Solutions#

  1. Training Efficiency

    • Challenge: Limited GPU resources on free Colab
    • Solution: Used Unsloth’s optimizations and 4-bit quantization
  2. Model Output Control

    • Challenge: Ensuring responses maintained character’s style without being too short/long
    • Solution: Careful system prompt engineering and appropriate temperature/sampling parameters
  3. Inference Speed

    • Challenge: Slow inference with PyTorch on CPU
    • Solution: Switched to llama-cpp-python with GGUF quantization
  4. Deployment

    • Challenge: Serving LLMs efficiently with token streaming
    • Solution: FastAPI with streaming responses and asynchronous request handling

Results and Learnings#

The project resulted in a fully functional chatbot that successfully captures the unique voice of the “Reluctant Buddha” character. Key learnings included:

  1. The importance of dataset quality and consistency for character fine-tuning
  2. How to effectively use LoRA for efficient adaptation of large models
  3. The critical impact of inference optimization for user experience
  4. Techniques for streaming LLM outputs in a web application

Future Improvements#

Potential next steps include:

  • Expanding the dataset for more diverse responses
  • Fine-tuning on a larger base model (3B or 8B) for improved capabilities
  • Adding memory to maintain conversation context
  • Implementing user feedback collection for continual improvement

Conclusion#

This project demonstrated the complete lifecycle of creating a specialized LLM application - from data preparation and fine-tuning to deployment as a web service. It showcases how even smaller models (1B parameters) can be effectively specialized for particular use cases, and how modern techniques like LoRA and quantization make this process accessible even with limited resources.

The “Reluctant Buddha” now exists as both a technical achievement and a unique digital entity, ready to dispense its particular brand of wisdom to anyone seeking enlightenment (or just a good laugh).