Published On Apr 28, 2023
Original paper:
https://arxiv.org/pdf/2107.05604.pdf
Summary:
Uses a Transformer encoder-decoder architecture
Uses multi-headed attention
Uses target phonemes as extra (auxiliary) supervision
Predicts discrete units as the model output
Experiments include Spanish-to-English translation
Published in 2022
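To make the "multi-headed attention" point above concrete, here is a minimal NumPy sketch of scaled dot-product multi-head self-attention; this is a generic illustration, not the paper's actual fairseq implementation, and all names (weight matrices `Wq`/`Wk`/`Wv`/`Wo`, head count) are assumed for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Self-attention over x of shape (seq_len, d_model); weights are (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project to queries/keys/values and split into heads: (num_heads, seq_len, d_head).
    q = (x @ Wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention per head: (num_heads, seq_len, seq_len).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    # Weighted sum of values, merge heads back, apply output projection.
    out = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

rng = np.random.default_rng(0)
seq_len, d_model, heads = 5, 8, 2
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads=heads)
print(y.shape)  # (5, 8)
```

In the actual S2UT model, attention layers like this sit inside the Transformer encoder-decoder, and the decoder predicts discrete units (cluster indices of self-supervised speech representations) rather than spectrogram frames.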
Other resources:
/ advancing-direct-speech-to-speech-modeling...
What is self supervised learning?
https://arxiv.org/pdf/2304.12210.pdf
Code:
How to use model for training and evaluation: https://github.com/facebookresearch/f...
Code for the actual S2UT model: https://github.com/facebookresearch/f... (search for s2ut_transformer)