We provide two CSV files for download, one for the train set and one for the test set.
The format of the CSV files is as follows:
YouTube ID, start segment, end segment, X coordinate, Y coordinate
The X and Y coordinates mark the center point of the speaker's face in the frame at the beginning of the segment. They are normalized with respect to frame size, so that (0.0, 0.0) corresponds to the top left and (1.0, 1.0) corresponds to the bottom right.
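The row format above can be read with a few lines of Python. The following is a minimal sketch, not an official loader. It assumes the files have no header row and that the start and end fields parse as plain numbers. The names Segment, read_segments and face_center_pixels, the sample row, and the 1280x720 frame size are all hypothetical and chosen only for illustration.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple, Tuple


class Segment(NamedTuple):
    """One CSV row: YouTube ID, start segment, end segment, X, Y."""
    youtube_id: str
    start: float
    end: float
    x: float  # normalized, 0.0 = left edge, 1.0 = right edge
    y: float  # normalized, 0.0 = top edge, 1.0 = bottom edge


def read_segments(lines: Iterable[str]) -> Iterator[Segment]:
    """Parse rows in the five-column format (assumes no header row)."""
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        youtube_id, start, end, x, y = (field.strip() for field in row)
        yield Segment(youtube_id, float(start), float(end), float(x), float(y))


def face_center_pixels(segment: Segment, frame_width: int,
                       frame_height: int) -> Tuple[int, int]:
    """Convert the normalized face center to pixel coordinates.

    The result is clamped so that a normalized value of 1.0 still maps
    to a valid pixel index.
    """
    px = min(round(segment.x * frame_width), frame_width - 1)
    py = min(round(segment.y * frame_height), frame_height - 1)
    return px, py


# Hypothetical sample row; the ID and values are made up for illustration.
sample = io.StringIO("abc123XYZ_0,10.0,13.5,0.5,0.25\n")
segments = list(read_segments(sample))
print(face_center_pixels(segments[0], 1280, 720))  # prints (640, 180)
```

To read a downloaded file instead of the in-memory sample, pass an open file object to read_segments. The frame size is not part of the CSV, so it has to come from the video itself.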
The train and test sets have disjoint speakers.
@article{ephrat2018looking,
  title={Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation},
  author={Ephrat, A. and Mosseri, I. and Lang, O. and Dekel, T. and Wilson, K. and Hassidim, A. and Freeman, W. T. and Rubinstein, M.},
  journal={arXiv preprint arXiv:1804.03619},
  year={2018}
}
For discussions about the dataset, how to use and access it, etc., please check out our Google Group: avspeech-users.
This data is licensed by Google LLC under a Creative Commons Attribution 4.0 International License.