this post was submitted on 11 Jun 2025
792 points (99.0% liked)

People Twitter

People tweeting stuff. We allow tweets from anyone.

RULES:

  1. Mark NSFW content.
  2. No doxxing people.
  3. Must be a pic of the tweet or similar. No direct links to the tweet.
  4. No bullying or international politics.
  5. Be excellent to each other.
  6. Provide an archived link to the tweet (or similar) being shown if it's a major figure or a politician. Archive.is is the best way.

founded 3 years ago
[–] Knock_Knock_Lemmy_In@lemmy.world 2 points 1 year ago (1 children)

I'm highlighting that speech to text and context awareness are different skills.

YouTube is unlikely to waste loads of compute power on subtitles that don't need it just to capture the occasional edge case.

[–] lime@feddit.nu 2 points 1 year ago (2 children)

i mean, it's a one-time-per-video thing. they already do tons of processing on every upload.

[–] Knock_Knock_Lemmy_In@lemmy.world 3 points 1 year ago (1 children)

So if you can reduce compute there then you save money.

There is no technical difficulty. It's a business decision.

[–] lime@feddit.nu 2 points 1 year ago (1 children)

right now they're dynamically generating subtitles every time. that's way more compute.

[–] aow@sh.itjust.works 1 points 1 year ago (1 children)

For real? That's incredibly dumb/expensive compared to one subtitle roll. Can you share where you saw that?

[–] lime@feddit.nu 1 points 1 year ago* (last edited 1 year ago)

well, i have no evidence of this. however, looking at the way auto-generated subtitles are served on youtube right now: they are sent word by word from the server, pick up filler words like "uh", and sometimes pause for several seconds in the middle of sentences. they're also not sent over a websocket, which means they take multiple requests over the course of a video. more requests means the server works harder, because it can't just stream the text like it does the video. the only reason they'd do that other than incompetence (which would surely have been corrected by now, since it's been like this for years) is if the web backend has to wait for the next word to be generated.

i would love to actually know what's going on if anyone has any insight.
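
To put rough numbers on the trade-off argued above, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption made for illustration: the `Video` class, the function names, and the word and view counts are hypothetical, and nothing here reflects how YouTube actually generates or serves captions. It only compares "transcribe once per upload" with "transcribe on every view", and word-by-word requests with a single subtitle file.

```python
from dataclasses import dataclass


@dataclass
class Video:
    words: int  # words spoken in the video (hypothetical figure)
    views: int  # number of times the video is watched (hypothetical figure)


def transcribed_words_once_per_upload(video: Video) -> int:
    """Words transcribed if subtitles are generated once and then cached."""
    return video.words


def transcribed_words_per_view(video: Video) -> int:
    """Words transcribed if subtitles are regenerated on every view."""
    return video.words * video.views


def requests_per_view(video: Video, words_per_request: int) -> int:
    """Requests one viewer makes when captions arrive in fixed-size chunks."""
    # Ceiling division: a final partial chunk still needs its own request.
    return -(-video.words // words_per_request)


# Hypothetical example: a 1500-word video watched 10,000 times.
video = Video(words=1500, views=10_000)

print(transcribed_words_once_per_upload(video))  # 1500
print(transcribed_words_per_view(video))         # 15000000
print(requests_per_view(video, 1))               # 1500 (one word per request)
print(requests_per_view(video, video.words))     # 1 (single subtitle file)
```

Under these made-up numbers, regenerating per view costs as many times more transcription work as the video has views, which is the gap being argued about in this thread. Whether the server really regenerates anything per view is exactly the open question.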

[–] starchylemming@lemmy.world 2 points 1 year ago

it would be an improvement. that's not what we are doing anymore

new tech is there to make everyone more miserable