The other day I was at a Karaoke event and was frustrated that they didn’t have the song I wanted. I have an extremely limited range vocally. I can do Tom Waits, Bob Dylan and non-autotuned Kanye and even then, only the slower songs. Essentially, any male vocalist known more for their unique voice than their singing prowess fits my range.
Given those limitations, there’s not a lot of choice in the pop and country oriented karaoke apps that fit my “unique” abilities. Even the reliable YouTube search for “song name artist karaoke version” often falls short.
That particular night I was thinking about the somewhat obscure Dylan song “Every Grain of Sand” – a rare classic from Dylan’s “Christian period” in the late 70s to early 80s.
I found one result on YouTube – a homemade karaoke video that someone recorded in their living room. It didn’t have the right tempo and, worst of all, used the album lyrics with the phrase “I am hanging in the balance of a perfect finished plan” instead of the earlier and, in my opinion, superior line “I am hanging in the balance of the reality of man.”
While I sat listening to a perfect rendition of “Man, I feel like a woman!” by one of the regulars, I hatched a plan to whip up some python to get some robots to build a karaoke version of any song on demand. Here’s how that went:
Setting up the project and environment
The first step is of course to create a little workshop on your machine to work out of. I created a new folder in my venerable ~/code/ directory named custom-karaoke and got to work.
Since we’re using python, let’s first create an environment to contain all our dependencies and code. We’re going to use venv to create a virtual environment and keep everything managed locally.
mkdir custom-karaoke
cd custom-karaoke
python -m venv venv
# Windows
./venv/Scripts/activate
# macOS / Linux
source venv/bin/activate
touch main.py
You will need to have ImageMagick installed on your system. If ImageMagick is installed in the standard location, you shouldn’t need to tell moviepy where it is. If you do need to specify the location, you can call the change_settings function from moviepy.config. I use the following code to ensure the script works on my MacBook and my Windows PC:
from moviepy.config import change_settings
import platform

if platform.system() == "Darwin":
    imagemagick_path = "/opt/homebrew/bin/magick"
elif platform.system() == "Windows":
    imagemagick_path = "C:/Program Files/ImageMagick-7.1.1-Q16-HDRI/magick.exe"
else:
    raise NotImplementedError("Unsupported operating system")

change_settings({"IMAGEMAGICK_BINARY": imagemagick_path})
Now that we’ve got the rudiments of a python project up, let’s write some code! First, we’ll create a running python script with our main.py. Your IDE might complain about us importing things we’re not using yet, but we will get to those soon enough.
import argparse
import os
import platform

import demucs.api
import torch
import whisper
from moviepy.editor import *
from moviepy.video.tools.subtitles import SubtitlesClip
from moviepy.config import change_settings
from whisper.utils import get_writer

if platform.system() == "Darwin":
    imagemagick_path = "/opt/homebrew/bin/magick"
elif platform.system() == "Windows":
    imagemagick_path = "C:/Program Files/ImageMagick-7.1.1-Q16-HDRI/magick.exe"
else:
    raise NotImplementedError("Unsupported operating system")

change_settings({"IMAGEMAGICK_BINARY": imagemagick_path})


def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Create a karaoke video from a video file.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "video_path", help="Path to the video file.")
    return parser.parse_args()


def main():
    args = parse_arguments()
    video_path = args.video_path.replace("\\", "/")
    print(f"\nProcessing {video_path}.")


if __name__ == "__main__":
    main()
And just like that we have a running program – it’s not particularly useful, but it’s a start. You can run this file like so:
python ./main.py ../path-to/video-file.mp4
With any luck this will print out Processing ../path-to/video-file.mp4 and exit.
As you may have guessed, the next step is… actually processing the video file. Before we can do that we need to figure out what it means to process the file. We are going to want to:
- Extract the audio of the file
- Split that audio file into a vocals track and a music track
- Transcribe the contents of the vocals track into a subtitle file
- Create a video that contains the video of the original overlaid with the lyrics as they’re being sung and the audio of the music track sans vocals
Sounds simple! Of course, we’re not going to have to invent ways to do each of these steps; we’re going to be relying on some innovative and powerful libraries published by some of the smartest coders in the world.
Extract the Audio of the file
For this step, we’re going to use the moviepy library – which is mostly a python wrapper over the essential tool ffmpeg. Here’s a function to create an mp3 audio file from our video file:
def video_to_mp3(video_path: str):
    """Converts a video file to an mp3 file."""
    print(f"Converting video to mp3 -> {video_path}")
    audio_path = video_path.replace(".mp4", ".mp3")
    if os.path.exists(audio_path):
        return audio_path
    audio = AudioFileClip(video_path)
    audio.write_audiofile(audio_path, logger="bar")
    print(f"Audio saved to: {audio_path}")
    return audio_path
This function takes a video file path and, using AudioFileClip (which is provided by moviepy), creates an audio file saved next to our original file with the extension .mp3, then returns the path to the mp3 file.
You’ll notice a pattern at the beginning of the function that we will be repeating later. Sometimes things will go wrong and it’s frustrating having to start all over and run all these conversions again. With this check we can look for the mp3 file and skip creating it in subsequent runs. (More on this later)
In order to run this code we need to install the moviepy library. Within the project directory, run:
pip install moviepy
Then you can call the function from your main function:
def main():
    args = parse_arguments()
    video_path = args.video_path.replace("\\", "/")
    print(f"\nProcessing {video_path}.")
    audio_path = video_to_mp3(video_path)
    print(f"Created audio file: {audio_path}.")
Now when you run the script (python ./main.py ../path-to/video-file.mp4) you will see the second message and should find the mp3 file sitting next to your original video file. You can run the script as many times as you want; it will return the same file unless you delete it.
Split the audio file into a vocals track and a music track
The hardest part of this section is remembering how to spell separate. We’re using the demucs library to split the audio track into four separate stems: vocals, drums, bass and the ambiguously named “other.” We will use the vocals stem to create a transcript and combine the remaining stems into a single music track.
Install demucs – I’ve found it necessary to install directly from github like so:
pip install -U git+https://github.com/facebookresearch/demucs#egg=demucs
Now create a function in main.py to split the stems.
def separate_stems(audio_file_path: str) -> tuple[str, str]:
    """Separates vocals and music from an audio file."""
    if not os.path.exists("./stems"):
        os.makedirs("./stems")
    audio_filename = audio_file_path.split("/")[-1]
    # Short-circuit: if we've already separated this file, reuse the existing stems.
    if os.path.exists(f"./stems/vocals_{audio_filename}"):
        return f"./stems/vocals_{audio_filename}", f"./stems/music_{audio_filename}"
    separator = demucs.api.Separator(progress=True, jobs=4)
    _, separated = separator.separate_audio_file(audio_file_path)
    for stem, source in separated.items():
        demucs.api.save_audio(
            source, f"./stems/{stem}_{audio_filename}", samplerate=separator.samplerate)
    # Sum the non-vocal stems back together into a single music track.
    demucs.api.save_audio(
        separated["other"] + separated["drums"] + separated["bass"],
        f"./stems/music_{audio_filename}", samplerate=separator.samplerate)
    return f"./stems/vocals_{audio_filename}", f"./stems/music_{audio_filename}"
Transcribe the vocal track into a subtitle file
Now that we have a clean vocal track we can create a subtitle file (.srt) by using Whisper, OpenAI’s audio transcription tool. There are a number of Whisper related projects that add useful functionality, speed up transcription or offer additional features.
Whisper Alternatives
- Whisper.cpp – Port of Whisper in C++.
- WhisperX – Adds fast automatic speech recognition with word-level timestamps and speaker diarization.
- faster-whisper – Faster reimplementation of Whisper using CTranslate2.
- Whisper JAX – JAX implementation of Whisper for up to 70x speed-up on TPU.
- whisper-timestamped – Adds word-level timestamps and confidence scores.
- whisper-openvino – Whisper running on OpenVINO.
- whisper.tflite – Whisper running on TensorFlow Lite.
- Whisper variants – Various Whisper variants on Hugging Face.
- Whisper-AT – Whisper that can recognize non-speech audio events in addition to speech.
For this example, we’ll stick with the original by installing it like so:
pip install openai-whisper
Take note of the openai- bit; more than once I’ve accidentally installed whisper, which is a completely unrelated package.
Whisper will run on the CPU – slowly. If you have an Nvidia GPU you can run whisper at a much swifter clip by installing torch and using the CUDA libraries. To set that up visit the PyTorch installation page and copy the command it provides you based on your setup. Mine looks like this:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
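If you want to confirm that PyTorch can actually see your GPU before kicking off a long transcription, this one-liner prints True when CUDA is available:
python -c "import torch; print(torch.cuda.is_available())"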
Now we just need a function to transcribe our vocal track.
def transcribe(audiofile_path: str, num_passes: int = 1) -> str:
    """
    Converts an MP3 file to a transcript using Whisper

    Args:
        audiofile_path (str): The file path of the MP3 file to be processed.
        num_passes (int): Number of transcription passes to perform.

    Returns:
        str: The path to the SRT file containing the transcript from the last pass.
    """
    try:
        subtitle_path = os.path.join("./subtitles", os.path.splitext(
            os.path.basename(audiofile_path))[0] + '.srt')
        # Short-circuit: if a (possibly hand-edited) subtitle file already exists, reuse it.
        if os.path.exists(subtitle_path):
            return subtitle_path
        if not os.path.exists("./subtitles"):
            os.makedirs("./subtitles")
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        model = whisper.load_model("large-v2").to(device)
        last_result = None
        for i in range(num_passes):
            print(f"Transcription pass {i + 1} of {num_passes}...")
            current_result = model.transcribe(
                audiofile_path, verbose=True, language="en", word_timestamps=True)
            last_result = current_result
        if last_result is None:
            raise ValueError("No transcription results obtained.")
        srt_writer = get_writer("srt", "./subtitles")
        srt_writer(last_result, audiofile_path, highlight_words=True)
        return subtitle_path
    except Exception as e:
        print(f"Error converting MP3 to transcript: {e}")
        return ""
This function loads a whisper model and runs the transcribe method on our audio file. There are several models available. I’ve had the best luck with large-v2. Oddly enough, large-v3 gave me consistently worse results. The smaller models need less memory to run. If you are limited on memory you can try medium.en, which is sufficiently robust to get the transcription mostly right. We’re using the srt_writer function from Whisper to convert our results into a subtitle file (srt).
This brings us to the real point of our ‘short-circuit’ code included in each of these functions. After your first run you’ll be able to assess how well the transcription went. There are often at least a few mistakes that can be manually edited. After you fix the mistakes you can run the script again and it will start with your edited subtitles and not re-do all the work we’ve done so far.
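To give you an idea of what you’ll be editing, entries in an SRT generated with highlight_words enabled look roughly like this (the timestamps and lyric below are just illustrative); each line is repeated with the underline advancing one word at a time:
12
00:00:31,200 --> 00:00:31,640
<u>I</u> am hanging in the balance

13
00:00:31,640 --> 00:00:32,100
I <u>am</u> hanging in the balance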
Did you notice the argument word_timestamps that we passed to the transcribe function? That is particularly important in our use case. It allows us to highlight the specific word being sung at any given moment.
Another helpful feature is the num_passes argument. By default we’re going to run the transcription just once. However, especially when running a smaller model, you can get better results by letting the model warm up for a cycle or two. If you find the subtitle quality to be poor but don’t have the memory for a larger model you can run the smaller models a couple times and see if you get better results.
Putting It All Together
Now to mash all of these bits back together into a usable karaoke video. First let’s update our main function to call the new create function that we will be writing below:
def main():
    args = parse_arguments()
    video_path = args.video_path.replace("\\", "/")
    print("Creating karaoke video...")
    create(video_path)
Now we can combine all of the functions we created above and produce a usable karaoke video.
def create(video_path: str):
    """Creates a karaoke video from the separated audio files and the original video file.

    Args:
        video_path (str): The path to the original video file.

    Returns:
        str: The filename of the created karaoke video.
    """
    audio_path = video_to_mp3(video_path)
    vocals_path, music_path = separate_stems(audio_path)

    # Mix the instrumental track with the original vocals turned down to 5% volume.
    music = AudioFileClip(music_path).set_fps(44100)
    vocals_audio = AudioFileClip(vocals_path).volumex(0.05).set_fps(44100)
    combined_audio = CompositeAudioClip([music, vocals_audio])

    # Dim the original video so the lyrics stand out against it.
    background_video = VideoFileClip(video_path, target_resolution=(720, 1280)).set_fps(
        30).set_duration(combined_audio.duration)
    dimmed_background_video = background_video.fl_image(
        lambda image: (image * 0.3).astype("uint8"))

    subtitles_file = transcribe(vocals_path, 1)

    def generator(txt):
        """Generates the subtitles for the karaoke video.

        Args:
            txt (str): The text to be added to the subtitles.

        Returns:
            TextClip: The subtitle text clip.
        """
        return TextClip(
            txt,
            font="Impact Condensed",
            fontsize=36,
            color="#FFFFFF",
            stroke_color="#000000",
            stroke_width=0.5,
            size=(1200, None),
            method='pango'
        )

    subtitles = SubtitlesClip(subtitles_file, generator)

    # Stack the dimmed video and the subtitles, then attach the mixed audio.
    result = CompositeVideoClip([
        dimmed_background_video,
        subtitles.set_position(('center', 'center'), relative=True)
    ]).set_audio(combined_audio)

    filename = f"karaoke_{os.path.basename(video_path)}"
    if not os.path.exists("./output"):
        os.makedirs("./output")
    result.write_videofile(f"./output/{filename}", fps=30, threads=4)
    return filename
This function starts off with the video path we passed in and then creates the mp3 audio file and separates the stems using the functions we created above. The next step is optional, but I like when the original vocals are just barely audible in the background. It’s helpful for a guy with little to no rhythm like myself. To do this we create a new CompositeAudioClip using our music track and the vocals track turned down to 5% volume.
We then create a copy of the original video with the screen dimmed. Note that because we cached our outputs in this script, you can swap out the original video with another after the first run, as long as they have the same name. I’ve used this trick when the source with pristine audio has a static or otherwise dull visual: after the first run I swap in a more visually interesting video. In the first karaoke video I tried, Every Grain of Sand, the only source material for an intelligible version of Dylan’s lyrics that I could find had a static background of the album cover. I replaced the background video in the final version with a live performance that was more visually interesting but, as is typical of late Dylan performances, mostly mumbling along to the melody.
At this point we create our subtitle file and then define a generator function that runs for each subtitle entry, rendering the text of the lyric being sung at that moment.
Because we used the argument word_timestamps in our transcribe call and we selected the method pango, the specific word will be wrapped in a <u>tag</u> which will underline it. If you want to decorate the word differently you can replace those tags in the generator with anything that pango can render. You can explore those options in the pango documentation – but for our purposes, you’re better off checking out these examples.
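If, for example, you’d rather highlight the current word in color than underline it, a variant of the generator might look like this sketch (the gold color and bold weight are just illustrative pango attributes):
def generator(txt):
    """Like the generator above, but swaps Whisper's <u> tags for a gold, bold highlight."""
    highlighted = txt.replace(
        "<u>", '<span foreground="#FFD700" weight="bold">').replace("</u>", "</span>")
    return TextClip(
        highlighted,
        font="Impact Condensed",
        fontsize=36,
        color="#FFFFFF",
        stroke_color="#000000",
        stroke_width=0.5,
        size=(1200, None),
        method='pango'
    )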
The TextClip that the generator returns can be customized with your choice of font, font size, color and so on. The size argument requires a tuple of the (width, height) of the text. We want to restrict the width to a bit smaller than our output, which we set in the background_video above. The height we leave as None so that it will be dynamically sized depending on the length of the text.
Finally, we stack our background video and subtitles video along with our music track into our final karaoke style video and save it to disk.
You can find the complete code at this GitHub repository. Feel free to fork the project and add your own features.
By setting up this custom karaoke solution, I’ve ensured that I’ll always have the perfect song ready to match my vocal range. Whether you’re also limited in your singing capabilities or just enjoy a custom karaoke experience, I hope you find this guide helpful. Happy singing!
Helpful Tips
If you’re looking to download source material from the internet, yt-dlp is a great tool that can extract and save videos locally from a wide variety of sources like YouTube, Twitter and more.
When I use yt-dlp I typically add the flag -f b, which will default to the highest-resolution mp4. I hate dealing with .webm and .mkv files, and the mp4 is good enough for our purposes.
For this project I created a local folder named inputs to store my source videos. It’s listed in the .gitignore file to keep those binary files out of source control.
cd inputs
yt-dlp https://video.site/watch?v=ABC123 -f b
You can choose any font on your system, but some fonts act a little wonky when being formatted with pango.
Start with a short song! It’s much better to find out that you have a bug after a few seconds of processing than after several minutes.
Try updating the font to something unique. Try messing with the relative volumes of the tracks. Try changing the SRT to display the song title before the lyrics, as sketched below. This is a great project for improving your python skills!
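For that last suggestion, here’s a rough sketch of one way to do it: prepend a title entry to the subtitle file before it is handed to SubtitlesClip. The helper name and the five-second duration are my own choices; make sure the card ends before the first sung lyric.
def add_title_card(subtitle_path: str, title: str):
    """Rough sketch: prepend a five-second title entry to the SRT file."""
    title_entry = f"0\n00:00:00,000 --> 00:00:05,000\n{title}\n\n"
    with open(subtitle_path, "r", encoding="utf-8") as f:
        original = f.read()
    with open(subtitle_path, "w", encoding="utf-8") as f:
        f.write(title_entry + original)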
Add a requirements.txt with the following content:
moviepy
demucs @ git+https://github.com/facebookresearch/demucs@e976d93ecc3865e5757426930257e200846a520a
openai-whisper
With this you can install all of your third-party packages in one command. Don’t forget to activate your virtual environment before you do!
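That one command is the usual:
pip install -r requirements.txt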