HTML5 Audio Karaoke - a JavaScript audio text aligner

By John Dyer on June 1, 2012 (updated March 5, 2020)

What it Does

Based on some amazing work by my friend Weston Ruter, I’ve put together a little library that mashes together:

some text (usually some HTML)
an audio source reading that text (usually an mp3)
a timing file (in this case, generated by CMU Sphinx)

The result is that when you press “play” the words are highlighted as they are read, and you can click on words to navigate through the audio. The magic comes from data produced by the CMU Sphinx library (based on Weston’s work) which creates the word timing information.
I put together two demo versions: one of Martin Luther King, Jr.’s “I Have a Dream” speech, and another of the English Bible using the English Standard Version, which has a great API. Unfortunately, the MLK speech didn’t align very well, so that demo is mostly useful as an example of how dependent the process is on a good alignment.

(Note: right now it’s Chrome/Safari/IE9 only, since it requires MP3 playback.)

How it Works

Although I wanted to use a “standard” format like WebVTT, I also wanted the filesize to be compact since my intended project involved large datasets of 48 hours or more of audio (i.e. the Bible). So here’s the basic JSON format:

{"words":[
 ["in",0.03,0.18],
 ["the",0.18,0.28],
 ["beginning",0.28,0.88],
 ["god",0.88,1.35],
 ["created",1.35,1.93]
]}

Basically, it’s just an array of words, each with a start and end time in seconds. The array-of-arrays format is quite a bit smaller than JSON objects with named keys, and it doesn’t require any parsing the way WebVTT does (although that might change later). It would take quite a bit of time to produce something like this by hand, but Weston used the CMU Sphinx library to generate this data, and it has probably been about 90% accurate for the entire ESV Bible.
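As a rough illustration of how a player might consume this data, the sketch below finds the word being spoken at a given playback time using a binary search. The helper name findWordIndexAt is hypothetical and is not taken from the library; it only assumes the [word, start, end] layout shown above, with entries sorted by start time.

```javascript
// Sample timing data in the compact [word, start, end] format.
const timing = {
  words: [
    ["in", 0.03, 0.18],
    ["the", 0.18, 0.28],
    ["beginning", 0.28, 0.88],
    ["god", 0.88, 1.35],
    ["created", 1.35, 1.93],
  ],
};

// Hypothetical helper (not part of the library): returns the index of the
// word whose [start, end) interval contains `time`, or -1 if no word is
// being spoken at that moment.
// Assumes entries are sorted by start time and do not overlap.
function findWordIndexAt(words, time) {
  let lo = 0;
  let hi = words.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [, start, end] = words[mid];
    if (time < start) {
      hi = mid - 1; // target is earlier in the array
    } else if (time >= end) {
      lo = mid + 1; // target is later in the array
    } else {
      return mid; // start <= time < end
    }
  }
  return -1;
}
```

In a browser, something like this could be called from the audio element’s timeupdate event with audio.currentTime to decide which word to highlight, and clicking a word could set audio.currentTime to that word’s start value.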
Once all the data is loaded, the AudioAligner class searches through a DOM node for the words in the array, skipping over classes or tags you define, and then links those words to the audio player.
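To give a feel for the matching step, here is a minimal sketch that pairs timing entries with the words in a plain string. It is not the AudioAligner implementation: the function names normalize and alignWords, the lookahead parameter, and the string-based tokenizing are all assumptions made for illustration, and the real class walks DOM nodes rather than a string.

```javascript
// Hypothetical normalizer: lowercase and strip everything except ASCII
// letters, digits, and apostrophes, so "God," compares equal to "god".
// (English-only; accented characters would be stripped too.)
function normalize(token) {
  return token.toLowerCase().replace(/[^a-z0-9']/g, "");
}

// Hypothetical matcher: pairs each timing entry with a token index in `text`.
// `lookahead` limits how far ahead to scan before giving up on a word, so a
// single misrecognized word does not derail the rest of the alignment.
// Assumes the timing words are already lowercase, as in the sample data.
function alignWords(text, words, lookahead = 5) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const matches = [];
  let cursor = 0; // index of the next unmatched token
  for (const [word, start, end] of words) {
    const limit = Math.min(tokens.length, cursor + lookahead);
    for (let i = cursor; i < limit; i++) {
      if (normalize(tokens[i]) === word) {
        matches.push({ tokenIndex: i, token: tokens[i], start, end });
        cursor = i + 1;
        break;
      }
    }
    // If nothing matched, this timing entry is skipped and cursor is unchanged.
  }
  return matches;
}

// The leading "1" stands in for a verse number that has no timing entry.
const sample = alignWords("1 In the beginning, God created", [
  ["in", 0.03, 0.18],
  ["the", 0.18, 0.28],
  ["beginning", 0.28, 0.88],
  ["god", 0.88, 1.35],
  ["created", 1.35, 1.93],
]);
```

Each entry in the result carries the token’s position plus its start and end times, which is the information needed to wrap that word in an element and link it to the audio player.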

Demo

Again, the demo I put together uses the API provided by the creators of the English Standard Version (ESV) of the Bible. The API lets developers request the text and the MP3, which are then mashed up with the timing files generated with CMU Sphinx.
HTML5 Karaoke Demo
If anyone’s interested in the library, please let me know in the comments and I’ll post it to GitHub.

18 thoughts on “HTML5 Audio Karaoke – a JavaScript audio text aligner”

Ryan says:

June 2, 2012 at 4:23 am

I’m definitely interested and I’m looking to participate to help on some Bible web/app projects
Ryan says:

June 2, 2012 at 4:24 am

btw, this is awesome 🙂
Winston Fassett says:

June 5, 2012 at 7:43 am

Very cool! Yesterday I was reading up on web audio and ran across an experiment by the author of jPlayer that had some similarities, but it was doing manual audio syncing. I can’t speak to the underlying code, but the demo was fun to fiddle with, particularly using the text to navigate or as a soundboard, and the visualization bit was also nice.
http://happyworm.com/blog/2010/12/05/drumbeat-demo-html5-audio-text-sync/
Jay says:

June 22, 2012 at 2:27 pm

Yes! Please post the code to GitHub. I can see this being very useful for playing hymns + words – a hymn karaoke, sort of. Do you have any idea if CMU Sphinx works with other languages?
Thanks.
Alan McCann says:

June 25, 2012 at 2:41 pm

Hi:
Very cool. Would you mind sharing on github?
Thanks
Alan
Mark Boas says:

June 26, 2012 at 10:45 am

Am I interested? AM I? This is amazing – I’m all over it. In fact I wanted to do something myself using CMU Sphinx. Please do put it on GitHub – great work and thanks!
RiaanP says:

September 12, 2012 at 4:50 am

I totally hear you on the timing side of things.. phew, we blew a massive amount of money last year on R&D to build this very tool in Flex.. we basically tried to use Flex to analyse the audio graph and “cleverly” plot the words as it heard it in a fashion where you could then “make minor adjustments” to the plotted words on the audio graph.. needless to say, it is a VERY hard thing to get right and we eventually canned it after trying out existing hardware accelerated timing apps. We did end up using it for some client work, but it was so frustrating to work with. You can see it in action here: http://www.readright.co.za/stories/2009/11/jasper-an-outing-to-the-aquarium-read-along/
So, note to all, this line is gospel: The magic comes from data produced by the CMU Sphinx library (based on Weston’s work) which creates the word timing information.
Roy says:

October 30, 2012 at 3:24 am

Would love to play with this! Did you get a chance to post it to github?
Thanks!
Jayant Rimza says:

October 8, 2014 at 12:58 am

I am very much interested in the library.
Gary kuper says:

July 30, 2015 at 12:02 pm

I am very interested in your library and would love to work with the source code for a project of mine. Will you be making the source code available to the public? I look forward to hearing from you and learning more about the possibilities of this tool. Thank you. Sincerely, Gary.
Jose Eduardo says:

January 8, 2016 at 5:44 am

Hello, I’m very interested in your project. Can you make it available to me? I would like to use it in a project of mine.
Amaechi Desmond says:

January 9, 2016 at 11:58 am

I am very interested in your library. How do I get it from GitHub?
Vick says:

March 18, 2016 at 6:24 am

Amen brother, please send me the link on github with the library…thanks a bunch, Vick
Raúl says:

May 13, 2016 at 4:15 am

Please, brother, send me the GitHub link. It’s incredible.
Pooja Donekal says:

June 4, 2016 at 5:59 pm

This is very helpful for one of the projects I am working on. Can you please provide me with the link?
trying says:

November 30, 2016 at 6:01 pm

Here’s the Github link – https://github.com/johndyer/audiosync
Gregory Werking says:

November 30, 2016 at 6:12 pm

I would like to develop a website where users can upload their audio prayers to be listened to by later users as an online group prayer session. The text and audio of the prayers scroll across the screen as multiple international users enunciate the words of the prayer at the same time, based on the cadence provided by the karaoke API. I need to develop an algorithm that will automatically determine whether an upload is non-malicious, accurate, and safe for a holy website. What are your thoughts?
Greg
Fropt says:

December 2, 2016 at 9:18 pm

Hi, I’m interested in your solution. I teach music and I’m looking for tools that help me out with getting better results.
Could you share your code please?
Thanks!
