
Making My Own Karaoke Videos with AI

Discover how to create custom karaoke videos on demand with Python!

The other day I was at a Karaoke event and was frustrated that they didn’t have the song I wanted. I have an extremely limited range vocally. I can do Tom Waits, Bob Dylan and non-autotuned Kanye and even then, only the slower songs. Essentially, any male vocalist known more for their unique voice than their singing prowess fits my range.

Given those limitations, there’s not a lot of choice in the pop and country oriented karaoke apps that fit my “unique” abilities. Even the reliable YouTube search for “song name artist karaoke version” often falls short.

That particular night I was thinking about the somewhat obscure Dylan song “Every Grain of Sand” – a rare classic from Dylan’s “Christian period” in the late 70s to early 80s.

I found one result on Youtube – a homemade karaoke video that someone recorded in their living room. It didn’t have the right tempo and, worst of all, used the album lyrics with the phrase “I am hanging in the balance of a perfect finished plan.” instead of the earlier, and in my opinion, superior line “I am hanging in the balance of the reality of man.”

While I sat listening to a perfect rendition of “Man, I feel like a woman!” by one of the regulars, I hatched a plan to whip up some python to get some robots to build a karaoke version of any song on demand. Here’s how that went:

Setting up the project and environment

The first step is of course to create a little workshop on your machine to work out of. I created a new folder in my venerable ~/code/ directory named custom-karaoke and got to work.

Since we’re using python, let’s first set up an environment for our code. We’re going to use venv to create a virtual environment that contains our dependencies and keeps everything managed locally.

mkdir custom-karaoke
cd custom-karaoke
python -m venv venv
source venv/bin/activate   # macOS/Linux
# venv\Scripts\activate    # Windows
touch main.py

You will need to have ImageMagick installed on your system. If ImageMagick is installed in the standard location, you shouldn’t need to tell moviepy where it is. If you do need to specify the location, you can call the change_settings function from moviepy.config. I use the following code to ensure the script works on both my MacBook and my Windows PC:

from moviepy.config import change_settings

import platform

if platform.system() == "Darwin":
    imagemagick_path = "/opt/homebrew/bin/magick"
elif platform.system() == "Windows":
    imagemagick_path = "C:/Program Files/ImageMagick-7.1.1-Q16-HDRI/magick.exe"
else:
    raise NotImplementedError("Unsupported operating system")

change_settings({"IMAGEMAGICK_BINARY": imagemagick_path})

Now that we’ve got the rudiments of a python project up, let’s write some code! First, we’ll create a running python script with our main.py. Your IDE might complain about us importing things we’re not using yet, but we will get to those soon enough.

import argparse
import os

import demucs.api
import torch
import whisper
from moviepy.editor import *
from moviepy.video.tools.subtitles import SubtitlesClip

from moviepy.config import change_settings

from whisper.utils import get_writer
import platform


if platform.system() == "Darwin":
    imagemagick_path = "/opt/homebrew/bin/magick"
elif platform.system() == "Windows":
    imagemagick_path = "C:/Program Files/ImageMagick-7.1.1-Q16-HDRI/magick.exe"
else:
    raise NotImplementedError("Unsupported operating system")

change_settings({"IMAGEMAGICK_BINARY": imagemagick_path})


def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Create a karaoke video from a video file.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )

    parser.add_argument(
        "video_path", help="Path to the video file.")

    return parser.parse_args()

def main():
    args = parse_arguments()
    video_path = args.video_path.replace("\\", "/")
    print(f"\nProcessing {video_path}.")

if __name__ == "__main__":
    main()

And just like that we have a running program – it’s not particularly useful, but it’s a start. You can run this file like so:

python ./main.py ../path-to/video-file.mp4

With any luck this will print out Processing ../path-to/video-file.mp4 and exit.

As you may have guessed, the next step is… actually processing the video file. Before we can do that we need to figure out what it means to process the file. We are going to want to:

  1. Extract the audio of the file
  2. Split that audio file into a vocals track and a music track
  3. Transcribe the contents of the vocals track into a subtitle file
  4. Create a video that contains the original footage overlaid with the lyrics as they’re being sung, plus the audio of the music track sans vocals

Sounds simple! Of course, we’re not going to have to invent ways to do each of these steps; we’ll be relying on some innovative and powerful libraries published by some of the smartest coders in the world.
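
Before we dive in, here’s a rough sketch of how those steps will hang together in code. The function names match what we’ll build over the next few sections, and the finished script wraps them all in a single create function:

audio_path = video_to_mp3(video_path)                 # 1. extract the audio
vocals_path, music_path = separate_stems(audio_path)  # 2. split vocals from music
subtitle_path = transcribe(vocals_path)               # 3. transcribe vocals to an .srt
# 4. overlay the dimmed original video with the lyrics and the vocal-free music track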

Extract the audio of the file

For this step, we’re going to use the moviepy library – which is mostly a python wrapper over the essential tool ffmpeg. Here’s a function to create an mp3 audio file from our video file:

def video_to_mp3(video_path: str):
    """Converts a video file to an mp3 file."""
    print(f"Converting video to mp3 -> {video_path}")
    audio_path = video_path.replace(".mp4", ".mp3")
    if os.path.exists(audio_path):
        return audio_path

    audio = AudioFileClip(video_path)
    audio.write_audiofile(audio_path, logger="bar")
    print(f"Audio saved to: {audio_path}")
    return audio_path

This function takes a video file path and, using AudioFileClip (which is provided by moviepy), creates an audio file saved next to our original file with the extension .mp3, then returns the path to the mp3 file.

You’ll notice a pattern at the beginning of the function that we will be repeating later. Sometimes things go wrong, and it’s frustrating to have to start all over and run all of these conversions again. With this check we can look for the mp3 file and skip creating it on subsequent runs. (More on this later.)

In order to run this code we need to install the moviepy library. Within the project directory, run:

pip install moviepy

Then you can call the function from your main function:


def main():
    args = parse_arguments()
    video_path = args.video_path.replace("\\", "/")
    print(f"\nProcessing {video_path}.")

    audio_path = video_to_mp3(video_path)
    print(f"Created audio file: {audio_path}.")

Now when you run the script (python ./main.py ../path-to/video-file.mp4) you will see the second message and should find the mp3 file sitting next to your original video file. You can run the script as often as you want; it will return the same file unless you delete it.

Split the audio file into a vocals track and a music track

The hardest part of this section is remembering how to spell separate. We’re using the demucs library to split the audio track into four separate stems: vocals, drums, bass and the ambiguously named “other.” We will use the vocals stem to create a transcript and combine the remaining stems into a single music track.

Install demucs – I’ve found it necessary to install directly from GitHub, like so:

pip install -U git+https://github.com/facebookresearch/demucs#egg=demucs

Now create a function in main.py to split the stems.

def separate_stems(audio_file_path: str) -> tuple[str, str]:
    """Separates vocals and music from an audio file."""

    if not os.path.exists("./stems"):
        os.makedirs("./stems")

    audio_filename = audio_file_path.split("/")[-1]

    # Short-circuit: if we've already separated this file, reuse the existing stems
    if os.path.exists(f"./stems/vocals_{audio_filename}"):
        return f"./stems/vocals_{audio_filename}", f"./stems/music_{audio_filename}"

    separator = demucs.api.Separator(progress=True, jobs=4)

    _, separated = separator.separate_audio_file(audio_file_path)

    # Save each individual stem (vocals, drums, bass, other)
    for stem, source in separated.items():
        demucs.api.save_audio(
            source, f"./stems/{stem}_{audio_filename}", samplerate=separator.samplerate)

    # Sum everything except the vocals into a single backing "music" track
    demucs.api.save_audio(
        separated["other"] + separated["drums"] + separated["bass"], f"./stems/music_{audio_filename}", samplerate=separator.samplerate)

    return f"./stems/vocals_{audio_filename}", f"./stems/music_{audio_filename}"
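
If you want to see this working before we get to the final assembly, you can call it from main the same way we called video_to_mp3. This is just a temporary sketch – the create function we write later will end up owning these calls:

def main():
    args = parse_arguments()
    video_path = args.video_path.replace("\\", "/")
    print(f"\nProcessing {video_path}.")

    audio_path = video_to_mp3(video_path)
    print(f"Created audio file: {audio_path}.")

    # Temporary: split the audio and print where the stems landed
    vocals_path, music_path = separate_stems(audio_path)
    print(f"Created stems: {vocals_path} and {music_path}.")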

Transcribe the vocal track into a subtitle file

Now that we have a clean vocal track we can create a subtitle file (.srt) by using Whisper, OpenAI’s audio transcription tool. There are a number of Whisper related projects that add useful functionality, speed up transcription or offer additional features.


For this example, we’ll stick with the original by installing it like so:

pip install openai-whisper

Take note of the openai- bit – more than once I’ve accidentally installed whisper, which is a completely unrelated package.

Whisper will run on the CPU – slowly. If you have an Nvidia GPU you can run whisper at a much swifter clip by installing torch and using the CUDA libraries. To set that up visit the PyTorch installation page and copy the command it provides you based on your setup. Mine looks like this:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
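
After installing, a quick one-liner will confirm whether PyTorch can actually see your GPU – it prints True when CUDA is available:

python -c "import torch; print(torch.cuda.is_available())"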

Now we just need a function to create the transcription of our vocal track.

def transcribe(audiofile_path: str, num_passes: int = 1) -> str:
    """
    Converts an MP3 file to a transcript using Whisper

    Args:
        audiofile_path (str): The file path of the MP3 file to be processed.
        num_passes (int): Number of transcription passes to perform.
    Returns:
        str: The path to the SRT file containing the transcript from the last pass.
    """
    try:

        subtitle_path = os.path.join("./subtitles", os.path.splitext(
            os.path.basename(audiofile_path))[0] + '.srt')

        if os.path.exists(subtitle_path):
            return subtitle_path

        if not os.path.exists("./subtitles"):
            os.makedirs("./subtitles")

        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        model = whisper.load_model("large-v2").to(device)

        last_result = None
        for i in range(num_passes):
            print(f"Transcription pass {i + 1} of {num_passes}...")
            current_result = model.transcribe(
                audiofile_path, verbose=True, language="en", word_timestamps=True)
            last_result = current_result

        if last_result is None:
            raise ValueError("No transcription results obtained.")

        srt_writer = get_writer("srt", "./subtitles")
        srt_writer(last_result, audiofile_path, highlight_words=True)

        return subtitle_path

    except Exception as e:
        print(f"Error converting MP3 to transcript: {e}")
        return ""

This function loads a whisper model and runs the transcribe method on our audio file. There are several models available. I’ve had the best luck with large-v2. Oddly enough, large-v3 gave me consistently worse results. The smaller models need less memory to run. If you are limited on memory you can try medium.en, which is sufficiently robust to get the transcription mostly right. We use Whisper’s get_writer helper to produce an SRT writer that converts our results into a subtitle file (.srt).
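
If you do need to drop down to a smaller model, it’s a one-line change inside transcribe – for example:

# Inside transcribe(): trade some accuracy for a smaller memory footprint
model = whisper.load_model("medium.en").to(device)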

This brings us to the real point of our ‘short-circuit’ code included in each of these functions. After your first run you’ll be able to assess how well the transcription went. There are often at least a few mistakes that can be manually edited. After you fix the mistakes you can run the script again and it will start with your edited subtitles and not re-do all the work we’ve done so far.

Did you notice the word_timestamps argument that we passed to the transcribe function? That is particularly important in our use case: it will allow us to highlight the specific word being sung at that moment.

Here’s an example showing off the word timestamps feature
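
With highlight_words=True, the SRT writer emits one entry per word, with the word currently being sung wrapped in underline tags – roughly like this (the timings below are hand-typed for illustration, not output from a real run):

42
00:01:31,000 --> 00:01:31,450
I am hanging in the <u>balance</u> of the reality of man

43
00:01:31,450 --> 00:01:31,900
I am hanging in the balance <u>of</u> the reality of man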

Another helpful feature is the num_passes argument. By default we’re going to run the transcription just once. However, especially when running a smaller model, you can get better results by letting the model warm up for a cycle or two. If you find the subtitle quality to be poor but don’t have the memory for a larger model you can run the smaller models a couple times and see if you get better results.
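
As a usage sketch (the file path here is just a placeholder), a three-pass run over the separated vocals would look like this:

subtitle_path = transcribe("./stems/vocals_my-song.mp3", num_passes=3)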

Putting It All Together

Now to mash all of these bits back together into a usable karaoke video. First let’s update our main function to call the new create function that we will be writing below:

def main():
    args = parse_arguments()
    video_path = args.video_path.replace("\\", "/")

    print("Creating karaoke video...")

    create(video_path)

Now we can pull together all of the functions we created above and produce a usable karaoke video.

def create(video_path: str):
    """Creates a karaoke video from the separated audio files and the original video file.

    Args:
        video_path (str): The path to the original video file.

    Returns:
        str: The filename of the created karaoke video.
    """

    audio_path = video_to_mp3(video_path)

    vocals_path, music_path = separate_stems(audio_path)

    # Backing track at full volume; the original vocals will sit underneath at 5%
    music = AudioFileClip(music_path).set_fps(44100)
    vocals_audio = AudioFileClip(vocals_path).volumex(0.05).set_fps(44100)

    combined_audio = CompositeAudioClip([music, vocals_audio])

    background_video = VideoFileClip(video_path, target_resolution=(720, 1280)).set_fps(
        30).set_duration(combined_audio.duration)

    # Dim the original footage so the lyrics stand out on top of it
    dimmed_background_video = background_video.fl_image(
        lambda image: (image * 0.3).astype("uint8"))

    subtitles_file = transcribe(vocals_path, 1)

    def generator(txt):
        """Generates the subtitles for the karaoke video.

        Args:
            txt (str): The text to be added to the subtitles.

        Returns:
            TextClip: The subtitle text clip.
        """

        return TextClip(
            txt,
            font="Impact Condensed",
            fontsize=36,
            color="#FFFFFF",
            stroke_color="#000000",
            stroke_width=0.5,
            size=(1200, None),
            method='pango'
        )

    subtitles = SubtitlesClip(subtitles_file, generator)

    result = CompositeVideoClip([
        dimmed_background_video,
        subtitles.set_position(('center', 'center'), relative=True)
    ]).set_audio(combined_audio)

    filename = f"karaoke_{os.path.basename(video_path)}"
    if not os.path.exists("./output"):
        os.makedirs("./output")
    result.write_videofile(f"./output/{filename}", fps=30, threads=4)

    return filename

This function starts off with the video path we passed in and then creates the mp3 audio file and separates the stems using the functions we created above. The next step is optional, but I like when the original vocals are just barely audible in the background. It’s helpful for a guy with little to no rhythm like myself. To do this we create a new CompositeAudioClip using our music track and the vocals track turned down to 5% volume.

We then create a copy of the original video with the screen dimmed. Note that because we cached our outputs in this script, you can swap out the original video for another one after the first run, as long as it has the same name. I’ve used this trick to replace a video that has pristine audio but a static or otherwise dull visual with a more visually interesting one. In the first karaoke video I tried, Every Grain of Sand, the only source material for an intelligible version of Dylan’s lyrics that I could find had a static background of the album cover. I replaced the background video in the final version with a live version that was more visually interesting but, as is typical with late Dylan performances, mostly mumbling along to the melody.

At this point we create our subtitle file and then define a generator function that renders, for each subtitle entry, the text of the lyric being sung at that moment.

Because we used the word_timestamps argument in our transcribe call and selected the pango method, the specific word being sung will be wrapped in a <u>tag</u>, which underlines it. If you want to decorate the word differently you can replace those tags in the generator with anything that pango can render. You can explore those options in the pango documentation – but for our purposes, you’re better off checking out these examples.
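
As a minimal sketch of that idea – assuming you’d rather color the active word than underline it – you could rewrite the tags at the top of the generator before handing the text to TextClip:

    def generator(txt):
        # Hypothetical tweak: color the highlighted word gold instead of underlining it
        txt = txt.replace("<u>", '<span foreground="#FFD700">').replace("</u>", "</span>")
        return TextClip(
            txt,
            font="Impact Condensed",
            fontsize=36,
            color="#FFFFFF",
            stroke_color="#000000",
            stroke_width=0.5,
            size=(1200, None),
            method='pango'
        )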

The TextClip that the generator returns can be customized with your choice of font, font size, color and so on. The size argument takes a tuple of the (width, height) of the text. We want to restrict the width to a bit less than our output width, which we set with target_resolution when creating background_video above. The height we leave as None so that it will be sized dynamically depending on the length of the text.

Finally, we stack our dimmed background video and the subtitles clip, along with our combined audio track, into our final karaoke-style video and save it to disk.

You can find the complete code at this GitHub repository. Feel free to fork the project and add your own features.

By setting up this custom karaoke solution, I’ve ensured that I’ll always have the perfect song ready to match my vocal range. Whether you’re also limited in your singing capabilities or just enjoy a custom karaoke experience, I hope you find this guide helpful. Happy singing!

Helpful Tips

If you’re looking to download source material from the internet, yt-dlp is a great tool that can extract and save videos locally from a wide variety of sources like YouTube, Twitter and more.

When I use yt-dlp I typically add the flag -f b, which grabs the best single-file format that already has both video and audio – usually an mp4. I hate dealing with .webm and .mkv files and the mp4 is good enough for our purposes.

For this project I created a local folder named inputs to store my source videos. It’s listed in the .gitignore file to keep those binary files out of source control.

cd inputs
yt-dlp https://video.site/watch?v=ABC123 -f b

You can choose any font on your system, but some fonts act a little wonky when being formatted with pango.

Start with a short song! It’s much better to find out that you have a bug after a few seconds of processing than after several minutes.
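
If the song you have in mind is long, one option is to cut a short test clip with moviepy before running the full pipeline – the file names here are just placeholders:

from moviepy.editor import VideoFileClip

# Cut a 30-second test clip so any bugs show up after seconds, not minutes
VideoFileClip("inputs/full-song.mp4").subclip(0, 30).write_videofile("inputs/test-clip.mp4")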

Try updating the font to something unique. Try messing with the relative volumes of the tracks. Try changing the SRT to display the song title before the lyrics. This is a great project to use to improve your python skills!
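
For that last idea, one simple approach is to hand-edit the generated .srt and add a title entry ahead of the first lyric. The timings below are made up, so make sure the title clears out before the first line is actually sung – moviepy’s subtitle parser keys off the timestamps, so you shouldn’t need to renumber the existing entries:

1
00:00:00,000 --> 00:00:06,000
Every Grain of Sand – Bob Dylan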

Add a requirements.txt with the following content:

moviepy
demucs @ git+https://github.com/facebookresearch/demucs@e976d93ecc3865e5757426930257e200846a520a
openai-whisper

With this you can install all of your 3rd party packages in one command. Don’t forget to activate your virtual environment before you do!
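
That single command, run from the project directory with the virtual environment active, is:

pip install -r requirements.txt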
