I am afraid I do not think this is correct.
What I am thinking is that if Kentico rendered an HTML tag (audio/video), we would not run into this issue. Right now Kentico renders an object tag:
<object codetype="CMSInlineControl" height="45" style="display: none" type="media" width="300"><param name="ext" value=".mp3" /><param name="url" value="~/getattachment/Sharing-Private/Audio-Video-for-M2E/Dry-Cleaner-Tragic.mp3.aspx?lang=en-US" /></object>
Why not render an audio tag instead? That way we would not have to train any front-end clients to enter their own audio tag.
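To make the idea concrete, here is a rough sketch of the transformation I have in mind. It is not Kentico code and does not use any Kentico API; it is a standalone Python illustration that reads the `ext` and `url` params from the object markup above and emits an audio (or video) tag. The names `InlineMediaParser`, `object_to_media_tag`, `AUDIO_EXTENSIONS` and the `app_root` parameter are all made up for this example, and the assumption that `~/` maps to the site root would need checking against the actual site setup.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical list of extensions treated as audio; anything else falls back to <video>.
AUDIO_EXTENSIONS = {".mp3", ".ogg", ".wav"}


class InlineMediaParser(HTMLParser):
    """Collects <param> name/value pairs inside a CMSInlineControl <object> tag."""

    def __init__(self):
        super().__init__()
        self.in_media_object = False
        self.params = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "object" and attrs.get("codetype") == "CMSInlineControl":
            self.in_media_object = True
        elif tag == "param" and self.in_media_object:
            self.params[attrs.get("name")] = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "object":
            self.in_media_object = False


def object_to_media_tag(markup, app_root=""):
    """Return an <audio>/<video> tag for the object markup, or the markup unchanged."""
    parser = InlineMediaParser()
    parser.feed(markup)
    parser.close()

    url = parser.params.get("url")
    if url is None:
        return markup  # not an inline media object we recognise

    # Assumption: "~/" is an app-relative path that resolves against app_root.
    if url.startswith("~/"):
        url = app_root + url[1:]

    ext = (parser.params.get("ext") or "").lower()
    tag = "audio" if ext in AUDIO_EXTENSIONS else "video"
    return f'<{tag} controls src="{escape(url, quote=True)}"></{tag}>'


markup = (
    '<object codetype="CMSInlineControl" height="45" style="display: none" '
    'type="media" width="300"><param name="ext" value=".mp3" />'
    '<param name="url" value="~/getattachment/Sharing-Private/Audio-Video-for-M2E/'
    'Dry-Cleaner-Tragic.mp3.aspx?lang=en-US" /></object>'
)
print(object_to_media_tag(markup))
```

If something like this ran at render time on the server side, editors could keep inserting media the way they do now and the browser would still receive a plain audio tag.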