Jin Daily AI Trivia: Did GPT-5 Train on Pirated Drama Subtitles?
Some users have noticed a strange quirk with GPT-5’s Speech-to-Text feature: if you hit transcribe but stay silent for a few seconds, it sometimes outputs the line “The subtitle is provided by Amara.org community.”
So, what’s going on here?
Originally, OpenAI used their open-source Whisper model for transcription. Later, they upgraded to the much better (and impressively multilingual) gpt-4o-transcribe when launching the GPT-4o multimodal model.
Now with GPT-5, it seems they’ve rolled out a brand-new speech-to-text model, but apparently without fully sanitizing the training data or checking edge cases. The result? When the model gets a few seconds of silence, it spits out that familiar subtitle watermark.
Why that line specifically? Well, many Chinese pirated dramas (complete with SRT subtitles) source their captions from the free Amara.org community. Those videos often display that exact line in the first few seconds, usually over silence. So the AI has essentially learned: “no audio at the start” = Amara.org credits.
This kind of glitch could’ve been caught with proper output checks before release… but maybe OpenAI was too busy firefighting routing issues and rushing GPT-5 out the door.
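For the curious, here is a rough idea of what such an output check might look like. This is only a sketch under my own assumptions, not anything OpenAI actually runs: the function names, the blocklist, and the 0.01 loudness threshold are all made up for illustration, and audio is assumed to arrive as a list of float samples between -1.0 and 1.0. The idea is simple: if the audio is basically silent and the transcript contains a known subtitle-credit phrase, throw the transcript away.

```python
import math
import re

# Hypothetical blocklist. Only this one phrase comes from the post;
# a real check would likely need many more entries.
KNOWN_WATERMARKS = [
    "the subtitle is provided by amara.org community",
]


def normalize(text: str) -> str:
    """Lowercase, drop punctuation (except dots), and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9. ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip(" .")


def rms(samples: list[float]) -> float:
    """Root-mean-square loudness of the samples (0.0 for empty input)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silent(samples: list[float], threshold: float = 0.01) -> bool:
    """Treat audio as silent when its RMS is below an assumed threshold."""
    return rms(samples) < threshold


def check_transcript(samples: list[float], transcript: str) -> str:
    """Return an empty string if silent audio produced a known watermark,
    otherwise return the transcript unchanged."""
    has_watermark = any(w in normalize(transcript) for w in KNOWN_WATERMARKS)
    if has_watermark and is_silent(samples):
        return ""
    return transcript


if __name__ == "__main__":
    silence = [0.0] * 16000  # one second of silence at an assumed 16 kHz
    line = "The subtitle is provided by Amara.org community."
    print(repr(check_transcript(silence, line)))
```

Running the same check over a batch of silent and near-silent clips before launch would be one cheap way to surface this kind of hallucinated credit line. Note that the transcript is left alone when the audio is not silent, since someone could genuinely be reading that sentence aloud.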
Hope you learned something new today! See ya!
