This made me think about a music-only application of hyper audio and captioning - synchronizing music notation with a recorded performance. How would that differ from synchronizing lyrics with music, or a transcript with a movie?
Notation sometimes contains a complete map of the performance. Everything is written out in some way, including the intro, solos, and all the little bits. This style of notation gets out of sync with a performance easily. It only really fits performances read from that score, or scores transcribed from that performance. Even then, automatically matching the written part to the recording would be tricky, because the score counts time in beats and measures while the recording runs in seconds, and the tempo that relates the two rarely stays fixed.
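To make the tempo problem concrete, here is a minimal sketch of one way to relate score positions to recording times. It assumes somebody has already marked a few anchor points by hand (say, by tapping along at the start of each section) and simply interpolates between them. The function name `beat_to_seconds` and the anchor values are made up for illustration, and this is not real audio-to-score alignment.

```python
from bisect import bisect_right


def beat_to_seconds(beat, anchors):
    """Map a score position in beats to a recording time in seconds.

    `anchors` is a list of at least two (beat, seconds) pairs sorted by
    beat, e.g. positions marked by hand. Between anchors the tempo is
    assumed to be steady, which is only an approximation.
    """
    beats = [b for b, _ in anchors]
    if len(anchors) < 2:
        raise ValueError("need at least two anchor points")
    if not beats[0] <= beat <= beats[-1]:
        raise ValueError("beat lies outside the anchored range")
    # Find the anchor segment containing this beat.
    i = min(bisect_right(beats, beat) - 1, len(anchors) - 2)
    (b0, t0), (b1, t1) = anchors[i], anchors[i + 1]
    # Linear interpolation inside the segment.
    return t0 + (beat - b0) * (t1 - t0) / (b1 - b0)


# Hypothetical anchors: the band slows down in the second half.
anchors = [(0, 1.5), (16, 9.5), (32, 19.5)]
print(beat_to_seconds(8, anchors))
print(beat_to_seconds(24, anchors))
```

The more the tempo wanders, the more anchors you need, which is one reason a complete score is harder to keep in sync than a handful of fragments.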
Notation is often deliberately incomplete. It describes certain highlights: here's the main melody, here's the order of the parts, here's the guitar solo. Synchronizing this kind of notation would mean attaching fragments at several separate points in the recording.
But then there's the issue that notation is not an Internet standard. As far as the Internet as a whole is concerned, relatively open approaches like MusicXML and the LilyPond format are just as opaque as a bitmap of a scan of a handwritten score.
And as for WebVTT, the stuff it synchronizes with the media is text. It's not meant for arbitrary binary objects, as far as I know.
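That said, MusicXML and LilyPond source are themselves text, so one possible workaround is to carry notation fragments as the text payload of WebVTT cues. Below is a rough sketch of that idea, not an established convention: the JSON payload layout, the `label` and `lilypond` keys, and the function names are all my own assumptions. A player would still need its own code to render the fragment, since nothing in a browser understands LilyPond.

```python
import json


def format_timestamp(seconds):
    """Format seconds as a WebVTT timestamp, hh:mm:ss.ttt."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def notation_cues_to_webvtt(cues):
    """Build WebVTT text from (start, end, label, fragment) tuples.

    Each cue payload is one line of JSON (a made-up layout) holding a
    notation fragment as plain text.
    """
    lines = ["WEBVTT", ""]
    for start, end, label, fragment in cues:
        payload = json.dumps({"label": label, "lilypond": fragment})
        # WebVTT cue text must not contain the arrow used in timing lines.
        if "-->" in payload:
            raise ValueError("cue payload must not contain '-->'")
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(payload)
        lines.append("")  # a blank line ends the cue
    return "\n".join(lines)


# Hypothetical fragments: a main melody and the start of a guitar solo.
cues = [
    (1.5, 9.5, "main melody", "c'4 d' e' f'"),
    (62.0, 70.0, "guitar solo", "e'8 g' a' b' a'2"),
]
print(notation_cues_to_webvtt(cues))
```

As I understand it, a file like this would be loaded as a track of kind "metadata" rather than as captions, so the payload reaches a script instead of being drawn on screen.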
Whatever the obstacles, it's plainly useful to sync written music with recorded performances. You might be annotating the performance to make a point to musicians. You might be illustrating the notation to make it easier to sight-read. You might be enabling web search for recordings of a melody.
This idea seemed fairly bland when I sat down to write this. Not so much at this point.