The dataset

We provide two CSV files for download:

  • train.csv (128MB) - A set of video segment annotations from 270k videos.
  • test.csv (9MB) - A set of video segment annotations from a separate set of 22k videos.

The format of the CSV files is as follows:

        YouTube ID, start segment, end segment, X coordinate, Y coordinate

The X and Y coordinates mark the center of the speaker's face in the frame at the beginning of the segment. They are normalized with respect to frame size, so (0.0, 0.0) corresponds to the top-left corner and (1.0, 1.0) to the bottom-right corner.
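As a rough illustration of the row format, the sketch below parses annotation rows and converts the normalized face center to pixel coordinates. It assumes the files have no header row and that segment boundaries are numeric values (the unit, presumably seconds, is not stated here). The names `SegmentAnnotation`, `parse_annotations`, and `face_center_pixels`, as well as the sample row, are hypothetical and made up for this example.

```python
import csv
import io
from typing import NamedTuple


class SegmentAnnotation(NamedTuple):
    youtube_id: str
    start: float  # start of the segment (unit assumed, not stated in the text)
    end: float    # end of the segment
    x: float      # normalized face-center X, 0.0 = left edge, 1.0 = right edge
    y: float      # normalized face-center Y, 0.0 = top edge, 1.0 = bottom edge


def parse_annotations(lines):
    """Yield one SegmentAnnotation per non-empty CSV row (no header assumed)."""
    for row in csv.reader(lines):
        if not row:
            continue
        youtube_id, start, end, x, y = (field.strip() for field in row)
        yield SegmentAnnotation(
            youtube_id, float(start), float(end), float(x), float(y)
        )


def face_center_pixels(ann, frame_width, frame_height):
    """Scale the normalized coordinates by the frame size."""
    return ann.x * frame_width, ann.y * frame_height


# Made-up row, only to show the expected shape of the data.
sample = io.StringIO("abc123XYZ_0,10.0,13.5,0.5,0.25\n")
annotations = list(parse_annotations(sample))
center = face_center_pixels(annotations[0], 1280, 720)
```

To read the real file, an open file handle can be passed in place of `sample`, for example `open("train.csv", newline="")`.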

The train and test sets have disjoint speakers.

If you plan to use this dataset, please cite our paper.

  title={Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation},
  author={Ephrat, A. and Mosseri, I. and Lang, O. and Dekel, T. and Wilson, K. and Hassidim, A. and Freeman, W. T. and Rubinstein, M.},
  journal={arXiv preprint arXiv:1804.03619},

Separation of speech and noise

To create the speech + noise mixtures, we used samples from AudioSet as the non-speech sounds. These sounds were drawn at random from videos that were not given the "Speech" label. Different sets of videos were used for the train and test sets.
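The selection described above could be sketched roughly as follows. This is not the authors' pipeline: the function names, the label dictionary, the fixed seed, and in particular the `noise_gain` parameter and the plain sample-wise addition are assumptions made for illustration, since the text does not specify how the signals were combined.

```python
import random


def pick_noise_pools(clip_labels, n_train, n_test, seed=0):
    """Split non-speech videos into disjoint train and test pools.

    clip_labels maps a video ID to its set of label names
    (a hypothetical in-memory stand-in for AudioSet metadata).
    """
    # Keep only videos that were not given the "Speech" label.
    candidates = sorted(
        vid for vid, labels in clip_labels.items() if "Speech" not in labels
    )
    if n_train + n_test > len(candidates):
        raise ValueError("not enough non-speech videos for the requested pools")
    rng = random.Random(seed)
    rng.shuffle(candidates)
    # Non-overlapping slices guarantee the two pools share no video.
    return candidates[:n_train], candidates[n_train:n_train + n_test]


def mix(speech, noise, noise_gain=1.0):
    """Add noise to speech sample by sample (gain value is an assumption)."""
    return [s + noise_gain * n for s, n in zip(speech, noise)]


clip_labels = {
    "v1": {"Music"},
    "v2": {"Speech", "Music"},
    "v3": {"Dog"},
    "v4": {"Rain"},
    "v5": {"Speech"},
}
train_pool, test_pool = pick_noise_pools(clip_labels, n_train=2, n_test=1)
mixture = mix([0.5, -0.25], [0.25, 0.25], noise_gain=0.5)
```

Slicing a single shuffled list is one simple way to keep the train and test noise videos separate; any other partition that avoids overlap would serve the same purpose.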

For discussions about the dataset, how to use and access it, etc., please check out our Google Group: avspeech-users.


This data is licensed by Google LLC under a Creative Commons Attribution 4.0 International License.

