I saw this on a BBC science documentary (which I can’t find online), but I did find a link to the underlying science.
Because sound is a simpler signal than an image, the brain processes it faster, as measured by sprinters' reaction times to a starter pistol versus a flashing light.
That means when you watch someone speak, the sound reaches your consciousness before the image does, so the brain must spool (that is, delay) the sound to keep it in sync with the image.
Because the brain must hold that sound in memory for a moment, this buffering is one reason you can tune in to someone speaking just after they say your name.