AI Auto Transcription System

Introduction

About two years ago, right as the LLM craze was really popping off, I stumbled on Whisper, a tool released by OpenAI that lets you transcribe audio (voice-to-text). I actually uncovered it while looking for a way to dictate some of my personal notes in Neovim (vimwiki). I found nerd-dictation [https://github.com/ideasman42/nerd-dictation], which was super amazing, but it didn’t work on my MacBook and died on me as Linux began the march away from X11 to Wayland. Nerd-dictation required some of the security…“openness” that X11 afforded. I switched to Wayland to get something working on my Pop!_OS machine, and when I did that, I lost my ability to use nerd-dictation.

That was around the time I heard about whisper.cpp on a TWiT podcast. It’s a fantastic project by Georgi Gerganov, and you can find information on it here: https://github.com/ggerganov/whisper.cpp. As an aside, Georgi Gerganov is the same guy behind GGUF, and a leader in AI tooling and open source. He ported the OpenAI Whisper code to C++ and even added Core ML support so it can take advantage of the Neural Engine on Apple Silicon. His implementation is the fastest I’ve seen so far (especially when using those ML cores on the M-series chips). When I started playing around with it, I realized it could do much more than dictation, which gave me the idea of transcribing the podcasts I listen to.

Why?

As an avid listener of the TWiT (This Week in Tech) network for the better part of 20 years, I often find myself trying to remember what someone said, how something was phrased, or “what was it that they said about privacy regulation?” Obviously, it’s pretty tough to remember the episode (or even the right show), and it’s impractical to re-listen to thousands of hours of audio. So I had the idea of transcribing the audio and video so that it all becomes easily searchable text.

I want to say I was on paternity leave when I wrote the first, very kludgy version of this tool. It consisted of a couple of poorly written Python scripts, some of which I copied and pasted from Stack Overflow and various forums on the internet. It was glued together with bash scripts and cron jobs…and as you can imagine it was incredibly fragile. Every time I rebooted the machine I had to go relaunch things. I tried to automate this, but it was so finicky that I often needed to SSH into the machine and do it manually.

I could blame it on sleep deprivation but my lack of software development skill was probably the biggest culprit. It was literally hacked together in the truest sense of the word.

v2.0

Fast forward to now. The AI hype cycle is at its peak, and billions of VC dollars have been invested in LLMs and other AI tech.

And there I was, letting my AI tool languish on my server, not taking advantage of the progress of the last few years.

There were a few reasons I decided to dust this off:

  1. I wanted to reorganize my server rack and virtualize my TrueNAS server (my old gaming machine). To do that, I needed to free up the PCI slot on my Proxmox server that my old GPU was sitting in. That was the GPU I was using to do the transcripts. Due to a limitation of the motherboard on my Proxmox server, I can only pass through that one PCI slot, and I need it for my HBA card to give me enough SATA connections for my TrueNAS hard drives. Ultimately, this would let me spin down that server and save a few bucks on electricity (or, let’s be honest, use it for something else and not save electricity).
  2. ROCm finally doesn’t suck butt, and I figured I could run the transcription process on my desktop now using my Radeon 6900 XT. It would let me use a much larger (more accurate) model and would be faster than my MacBook Air or the 1070 Ti in the server. That card, by the way, I bought secondhand from my brother-in-law specifically to do the transcription work, since ROCm was so painful to use back then. I literally spent $175 on an NVIDIA card to avoid the pain that was ROCm.
  3. The aforementioned fragility of my old setup meant it needed more manual intervention than I had time to provide, so it just languished, broken, for a few months.
  4. Part of the old system required me to manually go in, “identify” the speakers, and then find/replace “SPEAKER_01” with “Leo Laporte”, etc. I knew that an LLM should be able to do this.
  5. Claude Code could, theoretically, fill in the programming skill gap that plagued my previous attempt.

Technical Solution

Core Components

flowchart TD
    A[RSS Feeds] --> B[Audio Downloader]
    B --> C[Audio Storage]
    C --> D1[Transcription<br/>Whisper/WhisperX]
    C --> D2[Diarization<br/>pyannote.audio]
    D1 --> E[Merge Segments]
    D2 --> E
    E --> F[LLM Speaker ID]
    F --> G[Export Formats<br/>TXT/JSON/SRT]
    G --> H[File Management]

Key Technologies

  • Transcription Engine: OpenAI Whisper
  • Speaker Diarization: WhisperX + pyannote.audio
  • LLM Integration: Optional GPT/Claude for speaker identification
  • Infrastructure: Python 3.10+, AMD ROCm/NVIDIA CUDA support
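
To make the pipeline above a bit more concrete, here is a minimal sketch of one way those pieces chain together, following the whisperX README rather than the exact VoxDex code. The model name, token placeholder, and speaker bounds are illustrative assumptions:

# Minimal sketch of the transcribe -> align -> diarize chain (per the whisperX README).
# Model name, token, and speaker bounds are illustrative, not the VoxDex defaults.
import whisperx

device = "cuda"                          # or "cpu" if no supported GPU is available
hf_token = "<your Hugging Face token>"   # needed to pull the gated pyannote models
audio = whisperx.load_audio("episode.mp3")

# 1. Transcribe with a Whisper model
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Align word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize with pyannote.audio and attach SPEAKER_xx labels
# (newer whisperX versions expose this as whisperx.diarize.DiarizationPipeline)
diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=6)
result = whisperx.assign_word_speakers(diarize_segments, result)

# result["segments"] now carries text, timestamps, and speaker labels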

Unique Features

Three-Tier Transcript System

It generates three distinct versions of each transcript to serve different use cases:

1. Raw Transcription Output

Obviously a purely raw output is nice to have, but it’s not the most “usable”.

Welcome to the Intelligent Machines podcast. Today we're diving deep into the world of artificial intelligence and machine learning. I'm excited to have our guest here to discuss the latest developments in neural networks and their applications in real-world scenarios. The field has been advancing at an incredible pace, especially with the emergence of large language models and their impact on various industries.

2. Speaker-Diarized Version

The diarized version (the one where the speakers are differentiated) is even better.

SPEAKER_01: Welcome to the Intelligent Machines podcast. Today we're diving deep into the world of artificial intelligence and machine learning.

SPEAKER_01: I'm excited to have our guest here to discuss the latest developments in neural networks and their applications in real-world scenarios.

SPEAKER_02: Thanks for having me on the show.

SPEAKER_01: The field has been advancing at an incredible pace, especially with the emergence of large language models and their impact on various industries.

SPEAKER_02: Absolutely, and what's particularly fascinating is how these models are being integrated into everyday applications.

3. LLM-Enhanced with Real Speaker Names

This is the real pièce de résistance. Like I said before, I used to have to go into the diarized version (like you see above) and use context clues to figure out who SPEAKER_01 was. That was a very time-consuming manual task, and honestly I never got around to doing it for the thousands of episodes I transcribed and diarized.

But this new version uses an LLM to identify the speakers and make the changes. This part is trickier than it seems because the diarization isn’t the most accurate. It can get confused and think there are more speakers than there are because of ads or overlapping conversation (which happens all the time on podcasts and in real life). Pyannote can struggle to distinguish speakers in those cases. But if you prompt an LLM well enough…

Leo Laporte: Welcome to the Intelligent Machines podcast. Today we're diving deep into the world of artificial intelligence and machine learning.

Leo Laporte: I'm excited to have our guest here to discuss the latest developments in neural networks and their applications in real-world scenarios.

Dr. Sarah Chen: Thanks for having me on the show.

Leo Laporte: The field has been advancing at an incredible pace, especially with the emergence of large language models and their impact on various industries.

Dr. Sarah Chen: Absolutely, and what's particularly fascinating is how these models are being integrated into everyday applications.
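
Mechanically, that last step boils down to a structured find/replace: the LLM hands back a mapping from diarization labels to real names, and the transcript gets rewritten with it. A rough sketch of that idea (the hard-coded mapping and function name are illustrative, not the actual VoxDex code):

# Apply an LLM-produced speaker mapping to a diarized transcript.
# The mapping below is hard-coded for illustration; in practice it comes back from the LLM.
speaker_map = {
    "SPEAKER_01": "Leo Laporte",
    "SPEAKER_02": "Dr. Sarah Chen",
}

def relabel_transcript(diarized_text: str, speaker_map: dict[str, str]) -> str:
    lines = []
    for line in diarized_text.splitlines():
        speaker, sep, text = line.partition(":")
        # Swap known labels for names; leave unknown labels and non-dialogue lines alone
        name = speaker_map.get(speaker.strip(), speaker)
        lines.append(f"{name}{sep}{text}" if sep else line)
    return "\n".join(lines)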

Automation Capabilities

RSS Feed Batch Processing

VoxDex can process multiple podcast feeds automatically by configuring RSS sources:

# config.yaml
rss_feeds:
  - name: "TWiT This Week in Tech"
    url: "https://feeds.twit.tv/twit.xml"
    enabled: true
  - name: "AI Podcast"
    url: "https://feeds.example.com/ai-podcast.xml"
    enabled: true
    
processing:
  max_episodes_per_run: 5
  check_interval_hours: 6
# RSS processing example
def process_rss_feeds(config):
    for feed in config['rss_feeds']:
        if feed['enabled']:
            episodes = fetch_new_episodes(feed['url'])
            for episode in episodes[:config['processing']['max_episodes_per_run']]:
                download_and_process(episode)

Configurable File Retention

Automatic cleanup based on age and storage limits:

# File retention configuration
retention:
  audio_files:
    keep_days: 30
    max_size_gb: 100
  transcripts:
    keep_days: 365
    backup_to_cloud: true
  temp_files:
    cleanup_immediately: true
# Cleanup implementation
def cleanup_old_files():
    # Remove audio files older than 30 days
    for file in get_audio_files():
        if file.age_days > config.retention.audio_files.keep_days:
            file.delete()
    
    # Archive old transcripts to cloud storage
    for transcript in get_old_transcripts():
        if config.retention.transcripts.backup_to_cloud:
            upload_to_cloud(transcript)

Multiple Output Formats

Generate TXT, JSON, and SRT simultaneously:

# Export to multiple formats
import json

def format_time(seconds):
    # Render seconds as an SRT timestamp (HH:MM:SS,mmm); assumes start/end are seconds
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def export_transcript(segments, episode_metadata):
    base_filename = f"{episode_metadata['show']}_{episode_metadata['date']}"
    
    # Plain text format
    with open(f"{base_filename}.txt", "w") as f:
        for segment in segments:
            f.write(f"{segment['speaker']}: {segment['text']}\n\n")
    
    # JSON format with metadata
    json_output = {
        "metadata": episode_metadata,
        "segments": [
            {
                "start_time": seg['start'],
                "end_time": seg['end'],
                "speaker": seg['speaker'],
                "text": seg['text']
            } for seg in segments
        ]
    }
    
    with open(f"{base_filename}.json", "w") as f:
        json.dump(json_output, f, indent=2)
    
    # SRT subtitle format
    with open(f"{base_filename}.srt", "w") as f:
        for i, segment in enumerate(segments, 1):
            f.write(f"{i}\n")
            f.write(f"{format_time(segment['start'])} --> {format_time(segment['end'])}\n")
            f.write(f"{segment['speaker']}: {segment['text']}\n\n")

Sample Outputs

TXT Format (intelligent_machines_2024-10-05.txt):

Leo Laporte: Welcome to the Intelligent Machines podcast. Today we're diving deep into the world of artificial intelligence and machine learning.

Leo Laporte: I'm excited to have our guest here to discuss the latest developments in neural networks and their applications in real-world scenarios.

Dr. Sarah Chen: Thanks for having me on the show.

Leo Laporte: The field has been advancing at an incredible pace, especially with the emergence of large language models and their impact on various industries.

Dr. Sarah Chen: Absolutely, and what's particularly fascinating is how these models are being integrated into everyday applications.

JSON Format (intelligent_machines_2024-10-05.json):

{
  "metadata": {
    "show": "Intelligent Machines",
    "date": "2024-10-05",
    "episode_title": "Neural Networks in Practice",
    "duration": "3600",
    "file_size": "156MB"
  },
  "segments": [
    {
      "start_time": "00:00:00",
      "end_time": "00:00:08",
      "speaker": "Leo Laporte",
      "text": "Welcome to the Intelligent Machines podcast. Today we're diving deep into the world of artificial intelligence and machine learning."
    },
    {
      "start_time": "00:00:08",
      "end_time": "00:00:16",
      "speaker": "Leo Laporte", 
      "text": "I'm excited to have our guest here to discuss the latest developments in neural networks and their applications in real-world scenarios."
    },
    {
      "start_time": "00:00:16",
      "end_time": "00:00:18",
      "speaker": "Dr. Sarah Chen",
      "text": "Thanks for having me on the show."
    },
    {
      "start_time": "00:00:18",
      "end_time": "00:00:28",
      "speaker": "Leo Laporte",
      "text": "The field has been advancing at an incredible pace, especially with the emergence of large language models and their impact on various industries."
    },
    {
      "start_time": "00:00:28",
      "end_time": "00:00:35",
      "speaker": "Dr. Sarah Chen",
      "text": "Absolutely, and what's particularly fascinating is how these models are being integrated into everyday applications."
    }
  ]
}

SRT Format (intelligent_machines_2024-10-05.srt):

1
00:00:00,000 --> 00:00:08,000
Leo Laporte: Welcome to the Intelligent Machines podcast. Today we're diving deep into the world of artificial intelligence and machine learning.

2
00:00:08,000 --> 00:00:16,000
Leo Laporte: I'm excited to have our guest here to discuss the latest developments in neural networks and their applications in real-world scenarios.

3
00:00:16,000 --> 00:00:18,000
Dr. Sarah Chen: Thanks for having me on the show.

4
00:00:18,000 --> 00:00:28,000
Leo Laporte: The field has been advancing at an incredible pace, especially with the emergence of large language models and their impact on various industries.

5
00:00:28,000 --> 00:00:35,000
Dr. Sarah Chen: Absolutely, and what's particularly fascinating is how these models are being integrated into everyday applications.

Technical Challenges & Solutions

  • Speaker diarization accuracy:

The pyannote project has made some great strides and is impressive software. There is a huge, active community on Hugging Face (huggingface.co) supporting it. That said, it still struggles a bit at identifying unique speakers when

  1. you don’t tell it exactly how many voices it should expect (and that’s not as straightforward as it seems because you have ads, they play clips of audio, etc.)
  2. people talk over each other, which happens all the time during normal conversation and especially on conversational podcasts, where latency and your typical Zoom lag come into play.

It’s good enough, though, and when you combine it with an LLM you can pretty accurately identify who is saying what.

One thing I had to prompt the LLM with was letting it know that the same person could be both SPEAKER_01 and SPEAKER_05 because of the imprecise nature of pyannote’s diarization. Initially, the LLM would only assign a name to one of those, which makes sense if you are an LLM and assume each “SPEAKER_nn” label is unique. So I had to tell it that different “SPEAKER_nn” labels could be the same person and to use further context clues.
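
For illustration, here is a stripped-down sketch of that kind of prompt and the API call. The wording, model name, and helper are assumptions, not the exact prompt VoxDex ships with:

# Hypothetical speaker-identification prompt; not the exact VoxDex prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """You are identifying speakers in a podcast transcript.
The labels (SPEAKER_00, SPEAKER_01, ...) come from automatic diarization and are
imprecise: the SAME person may appear under SEVERAL different labels, and ads or
played clips can introduce labels that belong to no host or guest.
Using context clues (introductions, names mentioned, show metadata), return a JSON
object mapping every label to a real name, or to "Unknown" if you cannot tell.

Show: {show}
Transcript excerpt:
{excerpt}
"""

def identify_speakers(show: str, excerpt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5",  # illustrative; this is the "bigger model" route described below
        messages=[{"role": "user", "content": PROMPT.format(show=show, excerpt=excerpt)}],
    )
    return response.choices[0].message.content  # expected to contain the JSON mapping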

  • GPU optimization:

I have another post about getting ROCm working here. It was pretty seamless to get the whisperX code working locally with my GPU. I haven’t done much in terms of optimizing; I switched to the largest English-only model my GPU can handle, and there is probably room for more tuning. For now, I don’t run it often enough, and only on a handful of podcasts, so it isn’t even going to move the needle on my electric bill.
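
One nice property worth noting: PyTorch’s ROCm builds expose the Radeon card through the same “cuda” device string that NVIDIA cards use, so a single device check covers both setups. A minimal sketch (the compute-type choice is a rough default, not a tuned setting):

# Quick check that PyTorch can see the GPU. ROCm builds of PyTorch expose the
# Radeon card under the same "cuda" device name as NVIDIA cards.
import torch

if torch.cuda.is_available():
    device = "cuda"
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    device = "cpu"
    print("No GPU visible to PyTorch; falling back to CPU")

# float16 on the GPU, int8 on CPU is a reasonable starting point, not a tuned choice
compute_type = "float16" if device == "cuda" else "int8"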

  • LLM integration challenges:

I wasn’t having much luck with the smaller (cheaper) LLMs like GPT-5 nano. I could probably spend a little more time tuning my prompts or tweaking the sampling code to make them work better for me, but I decided to just throw the bigger model at it and call it a day. It’s not too expensive at this point. I need to work out the math, but it’s probably not much more than a small coffee at Starbucks (per month).

  • File management and processing pipeline:

I added some configurable features to prune the downloaded podcasts. One of the issues I ran into with my old clunky setup was running out of disk space because I downloaded, and never deleted, ALL of the episodes on my limited VM filesystem.

# Retention configuration for managing storage
retention:
  # Audio files (raw downloaded episodes)
  audio_files:
    keep_days: 7          # Delete after 1 week
    max_size_gb: 50       # Clean oldest when storage exceeds limit
    
  # Transcript files (.txt, .json, .srt)
  transcripts:
    keep_days: 365        # Keep transcripts for 1 year
    backup_before_delete: true
    backup_location: "/path/to/backup"
    
  # Temporary processing files
  temp_files:
    cleanup_immediately: true
    keep_on_error: true   # Keep temp files if processing fails
    
  # Failed processing attempts
  failed_downloads:
    retry_after_days: 3
    max_retries: 3
    delete_after_days: 30
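
The age-based pruning shown earlier handles keep_days; the max_size_gb cap needs an oldest-first pass as well. A rough sketch of how that could look (the directory layout, extension, and function name are illustrative):

# Sketch of enforcing the max_size_gb cap: delete the oldest audio files first
# until total usage drops under the limit. Paths and extension are illustrative.
from pathlib import Path

def enforce_size_cap(audio_dir: str, max_size_gb: float) -> None:
    files = sorted(Path(audio_dir).glob("*.mp3"), key=lambda p: p.stat().st_mtime)
    total_bytes = sum(p.stat().st_size for p in files)
    limit_bytes = max_size_gb * 1024 ** 3

    for path in files:                      # oldest first
        if total_bytes <= limit_bytes:
            break
        total_bytes -= path.stat().st_size
        path.unlink()                       # remove the episode audio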

Next Steps

So what’s next for this project? My plan is to run it for a month and then check the results and performance. I am also planning to look into a better way to index this for search or other more meaningful use. Having raw text files is great because there are so many simple tools and utilities that can help dig through them (fzf, rg, Telescope in Neovim), but there is value in putting it all into a database or Elasticsearch. There are some text analysis tools that could be fun to throw at it too (e.g. sentiment analysis, theme extraction, etc.).
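
As a lightweight middle ground between grepping text files and standing up Elasticsearch, something like SQLite’s built-in FTS5 full-text index might be enough. A sketch, assuming the JSON layout shown above (the database name and directory are made up):

# Sketch: load finished transcripts into a SQLite FTS5 index for fast search.
# Database name, directory, and schema are assumptions, not part of VoxDex today.
import json
import sqlite3
from pathlib import Path

conn = sqlite3.connect("transcripts.db")
conn.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS segments "
    "USING fts5(show, date, speaker, text)"
)

for path in Path("transcripts").glob("*.json"):
    doc = json.loads(path.read_text())
    meta = doc["metadata"]
    conn.executemany(
        "INSERT INTO segments (show, date, speaker, text) VALUES (?, ?, ?, ?)",
        [(meta["show"], meta["date"], s["speaker"], s["text"]) for s in doc["segments"]],
    )
conn.commit()

# "what was it that they said about privacy regulation?"
for show, date, speaker, text in conn.execute(
    "SELECT show, date, speaker, text FROM segments WHERE segments MATCH ? LIMIT 5",
    ("privacy regulation",),
):
    print(f"[{show} {date}] {speaker}: {text}")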

Other things I want to do at some point:

  • See if I can get a local LLM to perform as well as GPT-5. It would be pretty cool to do all of this locally. (Though the cost to do this with the OpenAI API is literally pennies.)
  • Create a utility script that can take my OLD diarized podcasts and run them through the LLM enhancement tool. I don’t think I want to spend the time or energy re-processing thousands of old shows. The diarization was “ok”, and I think the LLM enhancement piece would clean it up so it’s usable.
  • Clean up my GitHub repository: https://github.com/wesgould/voxdex
  • Write a getting started guide