It walks through the video analysis process and terminology step by step. You need to say "timecode" instead of "timestamp" so the LLM leans toward AV usage rather than programming usage (weird, right?).
Basically you need two to three passes to get a proper transcription:
1. Just listen and identify the voices, and give each one some kind of ID.
2. Maybe watch the video and check whether the people have name tags on the table in front of them, or whether there's an on-screen overlay (news or interviews).
3. Last pass: go through the transcription and map the voice IDs to actual names.
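
If you end up doing the last pass yourself instead of asking the LLM, here is a rough Python sketch of the mapping step. Everything in it is made up for illustration: `Segment`, `NameSighting`, `map_voices_to_names` and `label_transcript` are hypothetical names, and it assumes pass 1 gave you voice-tagged segments with timecodes in seconds and pass 2 gave you a list of (timecode, name) sightings. It also assumes a name tag or overlay is usually visible while that person is talking, which won't always hold (cutaways, wide shots), so treat the result as a first guess.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """One stretch of speech from pass 1 (hypothetical structure)."""
    start: float   # timecode in seconds
    end: float     # timecode in seconds
    voice_id: str  # anonymous ID such as "V1"
    text: str


@dataclass
class NameSighting:
    """A name tag or overlay seen in pass 2 (hypothetical structure)."""
    timecode: float  # seconds
    name: str


def format_timecode(seconds: float) -> str:
    """Render seconds as HH:MM:SS (no frame component in this sketch)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def map_voices_to_names(segments, sightings):
    """Pass 3: each sighting votes for the voice speaking at that timecode."""
    votes = defaultdict(Counter)
    for sighting in sightings:
        for seg in segments:
            if seg.start <= sighting.timecode < seg.end:
                votes[seg.voice_id][sighting.name] += 1
    # Keep the most frequently seen name per voice ID.
    return {vid: names.most_common(1)[0][0] for vid, names in votes.items()}


def label_transcript(segments, mapping):
    """Swap voice IDs for names; IDs without a match stay as they are."""
    return [
        f"[{format_timecode(seg.start)}] "
        f"{mapping.get(seg.voice_id, seg.voice_id)}: {seg.text}"
        for seg in segments
    ]
```

Voices that never coincide with a sighting keep their ID, so you can see which ones still need a manual look.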