Comparative Experiment:
1) Text-Only: We select Llama 3-8B-chinese-chat as the standard base model, serving as the performance reference. In this baseline, only the spoken dialogue text is provided as input, and the output consists of the response text and its corresponding caption.
2) ParalinGPT: The core idea of ParalinGPT is to sequentially infer the emotion labels of the preceding dialogue history and the emotion label of the reply, and then generate the reply text conditioned on this information.
[PAPER]
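This serialized prediction chain can be sketched as three chained LLM calls; in the sketch below, `generate` is a hypothetical stand-in for the actual model call, with canned outputs so the control flow can be traced, and the prompt wording is illustrative rather than ParalinGPT's actual templates.

```python
def generate(prompt):
    # Hypothetical LLM call: returns canned outputs keyed on the prompt,
    # purely so the three-step chain below can be followed end to end.
    if "emotions of the dialog history" in prompt:
        return "happy, neutral"
    if "emotion of the response" in prompt:
        return "happy"
    return "That's wonderful news, congratulations!"

def paralingpt_style_reply(history_turns):
    history = "\n".join(history_turns)
    # Step 1: infer emotion labels for the turns in the dialogue history.
    hist_emotions = generate(
        f"Predict the emotions of the dialog history:\n{history}"
    )
    # Step 2: infer the emotion the upcoming reply should carry.
    resp_emotion = generate(
        f"History:\n{history}\nHistory emotions: {hist_emotions}\n"
        "Predict the emotion of the response:"
    )
    # Step 3: generate the reply conditioned on both predictions.
    reply = generate(
        f"History:\n{history}\nHistory emotions: {hist_emotions}\n"
        f"Response emotion: {resp_emotion}\nWrite the response:"
    )
    return hist_emotions, resp_emotion, reply
```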
3) Spoken-LLM: Spoken-LLM integrates the widely used open-source Llama 2-Chat with emotion2vec. The model predicts speaker style attributes alongside the response text, enabling downstream expressive TTS models to generate natural and diverse spoken responses.
[PAPER]
4) PerceptiveAgent: PerceptiveAgent introduces speech captions to assist response-text prediction; however, these captions contain only basic acoustic information such as pitch and energy.
[PAPER]
5) OSUM-EChat: OSUM-EChat is an end-to-end empathetic spoken dialogue system that enhances empathy through an understanding-driven training framework. It employs a three-stage strategy that extends the capabilities of large speech understanding models to spoken dialogue tasks and adopts a linguistic–paralinguistic dual-thinking mechanism to integrate acoustic and linguistic cues for empathetic response generation. To comprehensively assess its reasoning ability, we evaluate both the CoT version, which explicitly incorporates intermediate paralinguistic reasoning steps, and the non-CoT version, which generates responses directly without such reasoning.
[PAPER]
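The distinction between the two evaluated variants amounts to whether the prompt elicits intermediate paralinguistic reasoning before the reply; a minimal sketch, with illustrative wording rather than OSUM-EChat's actual templates:

```python
def build_prompt(dialog_text, use_cot=True):
    """Build the two evaluation prompt variants (hypothetical wording,
    not the actual OSUM-EChat templates)."""
    base = f"Dialogue:\n{dialog_text}\n"
    if use_cot:
        # CoT variant: elicit intermediate paralinguistic reasoning
        # (emotion, tone) before the empathetic response.
        return base + ("First describe the speaker's emotional and "
                       "paralinguistic state, then write an empathetic "
                       "response.")
    # Non-CoT variant: generate the response directly.
    return base + "Write an empathetic response."
```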
6) LLaMA-Omni2: LLaMA-Omni2 is an LLM-based real-time spoken dialogue system designed for high-quality speech interaction. Built upon the Qwen2.5 series, it integrates a speech encoder and an autoregressive streaming speech decoder, enabling seamless speech understanding and generation within a unified framework.
[PAPER]
7) Qwen2-Audio: Qwen2-Audio is an open-source, large-scale audio–language foundation model that jointly encodes acoustic and linguistic information. It was explicitly optimized for speech emotion recognition through multi-task pre-training followed by supervised fine-tuning on affect-annotated corpora, endowing it with robust paralinguistic understanding capabilities.
[PAPER]
To generate audio for the comparative experiments, we combined the same ground-truth text with each system's emotional caption or emotion label and synthesized speech with CosyVoice. This ensures consistent input conditions across systems, making it straightforward to evaluate differences in emotional expression between models.
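The shared-input protocol can be sketched as pairing the identical ground-truth response text with each system's predicted style conditioning before synthesis; the function name below is illustrative, and the CosyVoice call itself is left out so only the input-assembly step is shown.

```python
def build_tts_inputs(ground_truth_text, model_styles):
    """Pair the same ground-truth response text with each system's
    predicted emotional caption or discrete emotion label, producing
    the per-system conditioning inputs fed to the TTS model (CosyVoice
    in our setup; the synthesis call is omitted here)."""
    return {
        name: {
            "text": ground_truth_text,  # identical across all systems
            "instruction": style,       # caption or emotion label
        }
        for name, style in model_styles.items()
    }

# Example: every system synthesizes the same sentence, differing only
# in the style conditioning it predicted (values are illustrative).
styles = {
    "ParalinGPT": "happy",
    "PerceptiveAgent": "high pitch, high energy",
    "Ours": "a cheerful voice with fast tempo and rising intonation",
}
tts_inputs = build_tts_inputs("I can't believe we won!", styles)
```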