The AVSpeech dataset is a large-scale collection of video clips of single speakers talking with no interfering background sounds. The dataset is built from public instructional YouTube videos (talks, lectures, how-tos), from which we automatically extracted short 3-10 second clips in which the only visible face in the video and the only audible voice in the soundtrack belong to a single speaking person. Below is a small sample of 10,000 clips from the dataset.
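The selection criteria above can be sketched as a simple filter over candidate clips. This is only an illustrative sketch, not the actual extraction pipeline: the metadata fields (`duration_s`, `num_faces`, `clean_speech`) and the helper `is_avspeech_clip` are hypothetical names assumed here for clarity.

```python
# Hypothetical sketch of the clip-selection criteria described above;
# field names and thresholds are illustrative, not the authors' pipeline.
from dataclasses import dataclass

@dataclass
class ClipCandidate:
    video_id: str
    duration_s: float   # clip length in seconds
    num_faces: int      # visible faces detected across the clip
    clean_speech: bool  # soundtrack contains only the speaker's voice

def is_avspeech_clip(c: ClipCandidate) -> bool:
    """Keep 3-10 second clips with exactly one visible, cleanly audible speaker."""
    return 3.0 <= c.duration_s <= 10.0 and c.num_faces == 1 and c.clean_speech

candidates = [
    ClipCandidate("abc123", 6.2, 1, True),   # kept
    ClipCandidate("def456", 2.1, 1, True),   # rejected: too short
    ClipCandidate("ghi789", 8.0, 2, True),   # rejected: second face visible
]
kept = [c.video_id for c in candidates if is_avspeech_clip(c)]
print(kept)  # → ['abc123']
```

In the real pipeline such fields would come from face detection and audio analysis run over the source videos; here they are supplied directly for illustration.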
You can learn more about the dataset construction in our SIGGRAPH 2018 paper.
