Audio-to-Text App Fine-Tuning

Learning Generated in 9.87s (Audio: 78.68s, Transcription: 1.30s, LLM: 8.57s)

Cleaned Transcript

Built a simple app that takes audio notes and transcribes them using either the Whisper model or the Parakeet model. It then feeds the raw transcription into a local LLM, Llama 3.2 with 3 billion parameters, with a fine-tuned LoRA (low-rank adapter) applied on top, which makes the model very capable of cleaning up transcriptions and doing some processing on the note. Did this by creating a synthetic dataset based on a handful of real transcripts: generated new example transcripts, then cleaned those with a golden state-of-the-art model using Shoots (I think I used Kimi K2), and then fine-tuned the model using Unsloth. Training took around four hours over 40,000 examples at batch size 16.
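The cleanup step described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the system prompt wording and the `build_cleanup_messages` helper are assumptions, standing in for however the fine-tuned Llama 3.2 3B model is actually prompted.

```python
# Hypothetical prompt-building step for the transcript-cleanup LLM.
# The system prompt text is an assumption for illustration.
SYSTEM_PROMPT = (
    "You clean up raw speech-to-text transcripts: fix mis-recognized "
    "words, punctuation, and casing without changing the meaning."
)

def build_cleanup_messages(raw_transcript: str) -> list[dict]:
    """Wrap a raw transcript in a chat format for the fine-tuned model."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": raw_transcript},
    ]

messages = build_cleanup_messages("um so built a simple app that takes audio notes")
```

The resulting `messages` list would then be passed to the local model (e.g. via whatever chat template the LoRA was trained against) to produce the cleaned note.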

Summary

Fine-tuned a local LLM on audio transcription data using synthetic datasets and golden state-of-the-art models, achieving good cleaning capabilities.

Tags

#fine-tuning #llm #audio-transcription #dataset-generation

Key Points

  • Used Whisper or Parakeet for initial transcription
  • Local LLM: Llama 3.2 with 3B parameters
  • Fine-tuned LoRA adapter on top of LLM
  • Created synthetic dataset from real transcripts
  • Used Shoots/Kimi K2 as the golden model
  • 40k examples at batch size 16, 4 hours total
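A quick back-of-envelope check on the training numbers in the last point, assuming a single pass over the data (the note does not say how many epochs were run):

```python
# Sanity-check the reported training run: 40k examples, batch size 16, ~4 hours.
examples = 40_000
batch_size = 16
train_hours = 4

steps = examples // batch_size                  # optimizer steps for one pass
seconds_per_step = train_hours * 3600 / steps   # average wall time per step
print(steps, seconds_per_step)                  # 2500 5.76
```

At 2,500 steps in four hours, that works out to roughly 5.8 seconds per step, a plausible pace for a 3B model with a LoRA adapter on a single consumer GPU.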

Decisions

  • Use LoRA adapter for cleaning transcriptions
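The synthetic-dataset step behind this decision can be sketched as pairing each generated raw transcript with a teacher-cleaned target. Here `golden_clean` is a trivial local stand-in so the sketch runs; it is not the actual golden model (Kimi K2) or its API.

```python
import json

def golden_clean(raw: str) -> str:
    """Stand-in for the golden teacher model; just capitalizes and
    terminates the text so this sketch is self-contained."""
    return raw.strip().capitalize() + "."

def build_pairs(raw_transcripts: list[str]) -> list[dict]:
    """Pair each synthetic raw transcript with a teacher-cleaned target."""
    return [
        {"prompt": raw, "completion": golden_clean(raw)}
        for raw in raw_transcripts
    ]

pairs = build_pairs(["built a simple app that takes audio notes"])
jsonl = "\n".join(json.dumps(p) for p in pairs)  # one JSON object per line
```

A JSONL file of such (prompt, completion) pairs is a common input format for LoRA fine-tuning tools like Unsloth, though the note does not specify the exact schema used.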

Entities

Whisper (PRODUCT) Parakeet (PRODUCT) Llama 3.2 (PRODUCT) LoRA (PRODUCT) Shoots (PRODUCT) Kimi K2 (PRODUCT) Unsloth (PRODUCT)

Time References

four hours over 40,000 examples at batch size 16 → (DURATION)