Audio-to-Text App Fine-Tuning

Learning Generated in 9.87s (Audio: 78.68s, Transcription: 1.30s, LLM: 8.57s)

Cleaned Transcript

Built a simple app that takes audio notes and transcribes them using either the Whisper model or the Parakeet model. It then feeds the raw transcription into a local LLM, Llama 3.2 with 3 billion parameters, with a fine-tuned LoRA (low-rank adapter) applied on top, which makes the model very capable of cleaning up transcriptions and doing some processing on the note. Did this by creating a synthetic dataset based on a handful of real transcripts: generated new example transcripts, then cleaned those with a golden state-of-the-art model using Shoots (I think I used Kimi K2), and then fine-tuned the model using Unsloth. Training took around four hours over 40,000 examples at batch size 16.
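The cleanup step described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the system prompt wording and the `build_cleanup_messages` helper are assumptions, standing in for however the fine-tuned Llama 3.2 3B model is actually prompted.

```python
# Hypothetical prompt-building step for the transcript-cleanup LLM.
# The system prompt text is an assumption for illustration.
SYSTEM_PROMPT = (
    "You clean up raw speech-to-text transcripts: fix mis-recognized "
    "words, punctuation, and casing without changing the meaning."
)

def build_cleanup_messages(raw_transcript: str) -> list[dict]:
    """Wrap a raw transcript in a chat format for the fine-tuned model."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": raw_transcript},
    ]

messages = build_cleanup_messages("um so built a simple app that takes audio notes")
```

The resulting `messages` list would then be passed to the local model (e.g. via whatever chat template the LoRA was trained against) to produce the cleaned note.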

Summary

Fine-tuned a local LLM on audio transcription data using synthetic datasets and golden state-of-the-art models, achieving good cleaning capabilities.

Tags

#fine-tuning #llm #audio-transcription #dataset-generation

Key Points

  • Used Whisper or Parakeet for initial transcription
  • Local LLM: Llama 3.2 with 3B parameters
  • Fine-tuned LoRA adapter on top of LLM
  • Created synthetic dataset from real transcripts
  • Used Shoots/Kimi K2 as the golden model
  • 40k examples at batch size 16, 4 hours total
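A quick back-of-envelope check on the training numbers in the last point, assuming a single pass over the data (the note does not say how many epochs were run):

```python
# Sanity-check the reported training run: 40k examples, batch size 16, ~4 hours.
examples = 40_000
batch_size = 16
train_hours = 4

steps = examples // batch_size                  # optimizer steps for one pass
seconds_per_step = train_hours * 3600 / steps   # average wall time per step
print(steps, seconds_per_step)                  # 2500 5.76
```

At 2,500 steps in four hours, that works out to roughly 5.8 seconds per step, a plausible pace for a 3B model with a LoRA adapter on a single consumer GPU.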

Decisions

  • Use LoRA adapter for cleaning transcriptions
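The synthetic-dataset step behind this decision can be sketched as pairing each generated raw transcript with a teacher-cleaned target. Here `golden_clean` is a trivial local stand-in so the sketch runs; it is not the actual golden model (Kimi K2) or its API.

```python
import json

def golden_clean(raw: str) -> str:
    """Stand-in for the golden teacher model; just capitalizes and
    terminates the text so this sketch is self-contained."""
    return raw.strip().capitalize() + "."

def build_pairs(raw_transcripts: list[str]) -> list[dict]:
    """Pair each synthetic raw transcript with a teacher-cleaned target."""
    return [
        {"prompt": raw, "completion": golden_clean(raw)}
        for raw in raw_transcripts
    ]

pairs = build_pairs(["built a simple app that takes audio notes"])
jsonl = "\n".join(json.dumps(p) for p in pairs)  # one JSON object per line
```

A JSONL file of such (prompt, completion) pairs is a common input format for LoRA fine-tuning tools like Unsloth, though the note does not specify the exact schema used.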

Entities

Whisper (PRODUCT) Parakeet (PRODUCT) Llama 3.2 (PRODUCT) LoRA (PRODUCT) Shoots (PRODUCT) Kimi K2 (PRODUCT) Unsloth (PRODUCT)

Time References

four hours over 40,000 examples at batch size 16 → (DURATION)