Reluctant Buddha: A Fine-Tuned LLM for Shitty Insights
2025-03-08

Fine-tuned Llama 3.2 (1B parameters) on only 400 data samples and quantized it to Q4_K_M GGUF for efficient CPU inference. Runs on CPU only (i5-7500T, 4 cores/4 threads) with 16GB RAM (1.5GB used).


Project Overview#

This project involved fine-tuning Meta’s Llama 3.2 (1B parameter model) to create “The Reluctant Buddha” - an enlightened yet jaded entity with a distinct personality mixing internet vernacular with profound wisdom. The project covered the complete ML lifecycle from dataset creation to model deployment via a web interface.

Project Goals#

  1. Learn the fine-tuning process for modern LLMs
  2. Create a specialized chatbot with a unique personality
  3. Deploy the model via a FastAPI backend connected to a React frontend
  4. Make the model accessible for users to interact with directly

The Dataset#

The project began with creating a specialized dataset of ~1,000 prompt-response pairs that captured the essence of the “Reluctant Buddha” character (the current model was trained on a 400-sample subset). 200 “gold standard” examples were written with Claude 3.7 Sonnet and DeepSeek V3, and 800 expanded variations were generated with Granite 3.2; a sketch of that expansion step appears after the feature list below.

{
 "prompt": "How do I find real friends?",
 "response": ">real friends\n>*snorts in digital disbelief while farting*\n\n\nYou're looking for \"real\" friends, anon? In this age of social media echo chambers and curated identities? Bwahaha. Good luck with that..."
}

The dataset featured:

  • Chan-style formatting (greentext markers, action asterisks)
  • Internet vernacular and slang
  • Profound wisdom wrapped in dismissive, nihilistic delivery
  • Unique ending patterns (digital burps, farts, zen koans)
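
The expansion from gold examples to the larger set is essentially “rewrite this example in the same voice” prompting against a locally hosted model. A minimal sketch of how that can look, assuming Granite 3.2 is served through a local Ollama instance and its Python client; the model tag, system prompt, and helper function are illustrative, not the exact ones used:

import json
import ollama  # assumes a local Ollama server with a Granite 3.2 model pulled

def expand_example(gold, n_variants=4):
    """Ask the local model for paraphrased variants of one gold prompt-response pair."""
    variants = []
    for _ in range(n_variants):
        result = ollama.chat(
            model="granite3.2",  # hypothetical tag; use whatever Granite build is installed
            messages=[
                {"role": "system", "content": "Rewrite the user's example in the same "
                 "chan-flavored, jaded-sage voice. Return JSON with 'prompt' and 'response' keys."},
                {"role": "user", "content": json.dumps(gold)},
            ],
        )
        variants.append(result["message"]["content"])
    return variants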

Fine-Tuning Process#

Technology Stack#

  • Framework: Unsloth for efficient fine-tuning
  • Base Model: Meta’s Llama 3.2 1B parameter model
  • Environment: Google Colab with T4 GPU

Key Technical Steps#

  1. Data Preparation:

    • Transformed the JSON dataset into the format expected by the Llama 3.1/3.2 chat template
    • Standardized conversations into the proper role/content format
    • Applied chat templating using get_chat_template
  2. Model Configuration:

    • Applied LoRA (Low-Rank Adaptation) with r=16 to efficiently tune the model
    • Targeted key projection matrices (q_proj, k_proj, v_proj, etc.)
    • Used gradient checkpointing to optimize memory usage
  3. Training Process:

    • Utilized SFTTrainer from TRL library
    • Applied response-only training to focus loss on assistant outputs
    • Trained for 120 steps (multiple epochs over the dataset)
    • Used 8-bit optimizers for memory efficiency
  4. Model Export:

    • Saved model in multiple formats (PyTorch, GGUF)
    • Quantized to Q4_K_M for efficient CPU inference (a condensed code sketch of these four steps follows this list)
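
Condensed into code, the four steps above map onto the standard Unsloth + TRL pipeline. This is a minimal sketch following Unsloth's Colab notebook conventions (FastLanguageModel, get_chat_template, train_on_responses_only, save_pretrained_gguf); the dataset path and some hyperparameters are illustrative rather than the exact values used, and argument names may shift between Unsloth versions:

from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template, train_on_responses_only
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# 1. Load the 1B base model in 4-bit and attach LoRA adapters (r=16) to the projection matrices
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)

# 2. Apply the Llama 3.1 chat template and render each role/content conversation to a training string
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
dataset = load_dataset("json", data_files="reluctant_buddha.jsonl", split="train")  # illustrative path
dataset = dataset.map(
    lambda ex: {"text": tokenizer.apply_chat_template(ex["conversations"], tokenize=False)},
)

# 3. Train with TRL's SFTTrainer, masking the loss so only assistant responses count
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=120,
        learning_rate=2e-4,
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
trainer.train()

# 4. Export a Q4_K_M GGUF for llama.cpp-based CPU inference
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")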

Inference and Deployment#

CPU Inference Optimization#

A key challenge was achieving acceptable inference speed in a CPU-only environment. After experimenting with various approaches, I settled on llama-cpp-python, which provided:

  • Near-Ollama speeds on CPU
  • Low memory footprint
  • Streaming capabilities for UX
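
Before putting a web server in front of it, the quantized GGUF can be sanity-checked directly with llama-cpp-python. A minimal sketch; the prompt is illustrative, and the full Llama 3 chat template is applied properly in the server code below:

from llama_cpp import Llama

# Load the quantized GGUF; n_threads matches the i5-7500T's 4 cores
llm = Llama(model_path="model/reluctant-budda.Q4_K_M.gguf", n_ctx=2048, n_threads=4)

# Stream tokens to stdout as they are generated
for chunk in llm.create_completion("How do I find real friends?", max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)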

Backend Development#

Created a FastAPI server to:

  • Load the quantized model efficiently
  • Stream responses token-by-token
  • Handle system prompts to maintain character consistency
  • Process incoming requests asynchronously

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from llama_cpp import Llama

app = FastAPI()
model = Llama(model_path="model/reluctant-budda.Q4_K_M.gguf", n_ctx=2048, n_threads=4)

SYSTEM_PROMPT = "You are 'The Reluctant Buddha'..."

@app.post("/chat")
async def chat_endpoint(request: Request):
    data = await request.json()
    user_message = data.get("message", "")

    # Build the Llama 3 prompt: system persona + user turn, then hand off to the assistant
    full_prompt = (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{SYSTEM_PROMPT}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

    # Stream response token by token
    def generate():
        response = model.create_completion(
            prompt=full_prompt,
            max_tokens=512,
            temperature=1.5,
            stream=True,
        )
        for chunk in response:
            # llama-cpp-python streams OpenAI-style chunks; the text lives under choices[0]
            yield chunk["choices"][0]["text"]

    return StreamingResponse(generate(), media_type="text/plain")
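
A quick way to exercise the streaming endpoint without a frontend is a small command-line client. A sketch using requests, assuming the server runs locally on port 8000:

import requests

# Stream the plain-text response chunk by chunk as the server emits it
with requests.post(
    "http://localhost:8000/chat",
    json={"message": "How do I find real friends?"},
    stream=True,
) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)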

Frontend Integration#

Developed a React frontend that:

  • Provides a chat-like interface
  • Handles streaming responses with proper formatting
  • Maintains conversation history
  • Styles the output to enhance readability of chan-style formatting

Challenges and Solutions#

  1. Training Efficiency

    • Challenge: Limited GPU resources on free Colab
    • Solution: Used Unsloth’s optimizations and 4-bit quantization
  2. Model Output Control

    • Challenge: Ensuring responses maintained character’s style without being too short/long
    • Solution: Careful system prompt engineering and appropriate temperature/sampling parameters
  3. Inference Speed

    • Challenge: Slow inference with PyTorch on CPU
    • Solution: Switched to llama-cpp-python with GGUF quantization
  4. Deployment

    • Challenge: Serving LLMs efficiently with token streaming
    • Solution: FastAPI with streaming responses and asynchronous request handling

Results and Learnings#

The project resulted in a fully functional chatbot that successfully captures the unique voice of the “Reluctant Buddha” character. Key learnings included:

  1. The importance of dataset quality and consistency for character fine-tuning
  2. How to effectively use LoRA for efficient adaptation of large models
  3. The critical impact of inference optimization for user experience
  4. Techniques for streaming LLM outputs in a web application

Future Improvements#

Potential next steps include:

  • Expanding the dataset for more diverse responses
  • Fine-tuning on a larger base model (3B or 8B) for improved capabilities
  • Adding memory to maintain conversation context
  • Implementing user feedback collection for continual improvement

Conclusion#

This project demonstrated the complete lifecycle of creating a specialized LLM application - from data preparation and fine-tuning to deployment as a web service. It showcases how even smaller models (1B parameters) can be effectively specialized for particular use cases, and how modern techniques like LoRA and quantization make this process accessible even with limited resources.

The “Reluctant Buddha” now exists as both a technical achievement and a unique digital entity, ready to dispense its particular brand of wisdom to anyone seeking enlightenment (or just a good laugh).