Co-speech gesture generation is crucial for automatic digital avatar
ani...
This survey paper provides a comprehensive overview of the recent
advanc...
Sequential recommender systems have achieved state-of-the-art recommenda...
Zero-shot text-to-speech aims at synthesizing voices with unseen speech
...
Cross-lingual timbre and style generalizable text-to-speech (TTS) aims t...
Scaling text-to-speech to a large and wild dataset has been proven to be...
We are interested in a novel task, namely low-resource text-to-talking
a...
Diffusion models have demonstrated impressive performance in text-to-ima...
Large diffusion models have been successful in text-to-audio (T2A) synth...
Direct speech-to-speech translation (S2ST) has gradually become popular ...
Direct speech-to-speech translation (S2ST) aims to convert speech from o...
Stutter removal is an essential scenario in the field of speech editing....
Improving text representation has attracted much attention to achieve
ex...
Human motion generation aims to produce plausible human motion sequences...
Generating talking person portraits with arbitrary speech audio is a cru...
Large language models (LLMs) have exhibited remarkable capabilities acro...
Generative models have enabled the creation of content that is
indisti...
Generally speaking, the model training for recommender systems can be ba...
Listening to long video/audio recordings from video conferencing and onl...
ICASSP2023 General Meeting Understanding and Generation Challenge (MUG)
...
It is a well-known challenge to learn an unbiased ranker with biased
fee...
The gap between the randomly initialized item ID embedding and the
well-...
We see widespread adoption of slate recommender systems, where an ordere...
In deep learning, transferring information from a pretrained network to ...
Generating photo-realistic video portraits with arbitrary speech audio is...
Large-scale multimodal generative modeling has created milestones in
tex...
Robotic grasping is a fundamental ability for a robot to interact with t...
Natural language interfaces (NLIs) enable users to flexibly specify
anal...
Dance-driven music generation aims to generate musical pieces conditione...
The performance of a camera network monitoring a set of targets depends
...
Out-of-distribution (OOD) detection is an important task to ensure the
r...
Video to sound generation aims to generate realistic and natural sound g...
Designing safety-critical control for robotic manipulators is challengin...
While deep generative models have empowered music generation, it remains...
In this paper, we introduce a new task, spoken video grounding (SVG), wh...
In this paper, we introduce DA^2, the first large-scale dual-arm
dexteri...
Denoising diffusion probabilistic models (DDPMs) have recently achieved
...
Polyphone disambiguation aims to capture accurate pronunciation knowledg...
Embedding & MLP has become a paradigm for modern large-scale recommend...
Direct speech-to-speech translation (S2ST) systems leverage recent progr...
Style transfer for out-of-domain (OOD) speech synthesis aims to generate...
This paper follows cognitive studies to investigate a graph representati...
The recent progress in non-autoregressive text-to-speech (NAR-TTS) has m...
Collision avoidance is a widely investigated topic in robotic applicatio...
Better-supervised models might have better performance. In this paper, w...
We are interested in a novel task, singing voice beautifying (SVB). Give...
Non-autoregressive text to speech (NAR-TTS) models have attracted much
a...
Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quali...
Generative models are now capable of synthesizing images, speech, and
...
Expressive text-to-speech (TTS) has become a hot research topic recently...