The AVSpeech dataset is a large-scale collection of video clips of single speakers talking with no interfering background sounds. The dataset is built from public instructional YouTube videos (talks, lectures, how-tos), from which we automatically extracted short 3-10 second clips in which the only visible face in the video and the only audible voice in the soundtrack belong to a single speaking person. Below is a small sample of 10,000 clips from the dataset.
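The selection criteria above can be sketched as a simple filter over candidate clips. This is only an illustrative sketch, not the actual extraction pipeline: the metadata fields (`duration_s`, `num_faces`, `clean_speech`) and the helper `is_avspeech_clip` are hypothetical names assumed here for clarity.

```python
# Hypothetical sketch of the clip-selection criteria described above;
# field names and thresholds are illustrative, not the authors' pipeline.
from dataclasses import dataclass

@dataclass
class ClipCandidate:
    video_id: str
    duration_s: float   # clip length in seconds
    num_faces: int      # visible faces detected across the clip
    clean_speech: bool  # soundtrack contains only the speaker's voice

def is_avspeech_clip(c: ClipCandidate) -> bool:
    """Keep 3-10 second clips with exactly one visible, cleanly audible speaker."""
    return 3.0 <= c.duration_s <= 10.0 and c.num_faces == 1 and c.clean_speech

candidates = [
    ClipCandidate("abc123", 6.2, 1, True),   # kept
    ClipCandidate("def456", 2.1, 1, True),   # rejected: too short
    ClipCandidate("ghi789", 8.0, 2, True),   # rejected: second face visible
]
kept = [c.video_id for c in candidates if is_avspeech_clip(c)]
print(kept)  # → ['abc123']
```

In the real pipeline such fields would come from face detection and audio analysis run over the source videos; here they are supplied directly for illustration.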
You can learn more about the dataset construction in our SIGGRAPH 2018 paper.
