A fast model detects explicit words → a large model refines only those segments. Returns word-level timestamps for the result.
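The two-pass flow above can be sketched as follows. This is a minimal illustration, not the actual implementation: the `Word` type, the `Transcriber` callable signature (a function taking a start and end time in seconds and returning words with absolute timestamps), the `pad` parameter, and the midpoint rule for replacing words are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio


# Hypothetical interface: transcribe the audio between two times (seconds)
# and return words with absolute timestamps.
Transcriber = Callable[[float, float], list[Word]]


def flag_segments(words: list[Word], explicit: set[str], pad: float = 0.25):
    """Return merged (start, end) windows around words found in `explicit`."""
    segments: list[tuple[float, float]] = []
    for w in words:
        if w.text.lower().strip(".,!?") not in explicit:
            continue
        start, end = max(0.0, w.start - pad), w.end + pad
        if segments and start <= segments[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments


def two_pass(duration: float, fast: Transcriber, large: Transcriber,
             explicit: set[str], pad: float = 0.25) -> list[Word]:
    """Run the fast model on everything, the large model only on flagged windows."""
    words = fast(0.0, duration)
    for start, end in flag_segments(words, explicit, pad):
        refined = large(start, end)
        # Drop fast-model words whose midpoint lies in the window,
        # then splice in the large model's words for that window.
        kept = [w for w in words
                if not (start <= (w.start + w.end) / 2 <= end)]
        words = sorted(kept + refined, key=lambda w: w.start)
    return words
```

Because the large model is only called on the padded windows, its cost scales with the number of flagged words rather than with the length of the audio. The padding is there so the large model hears a little context around each flagged word; a suitable value would depend on the models actually used.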