Comparative Experiment:
1) Text-Only: We select Llama 3-8B-chinese-chat as the standard base model, serving as the performance reference. In this baseline, only the spoken dialogue text is provided as input, and the output consists of the response text and its corresponding caption.
2) ParalinGPT: The core idea of ParalinGPT is to sequentially infer the emotion labels of the preceding dialogue history and the emotion label of the reply, and then generate the reply text conditioned on this information.
[PAPER]
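This serialized prediction chain can be sketched as three chained LLM calls; in the sketch below, `generate` is a hypothetical stand-in for the actual model call, with canned outputs so the control flow can be traced, and the prompt wording is illustrative rather than ParalinGPT's actual templates.

```python
def generate(prompt):
    # Hypothetical LLM call: returns canned outputs keyed on the prompt,
    # purely so the three-step chain below can be followed end to end.
    if "emotions of the dialog history" in prompt:
        return "happy, neutral"
    if "emotion of the response" in prompt:
        return "happy"
    return "That's wonderful news, congratulations!"

def paralingpt_style_reply(history_turns):
    history = "\n".join(history_turns)
    # Step 1: infer emotion labels for the turns in the dialogue history.
    hist_emotions = generate(
        f"Predict the emotions of the dialog history:\n{history}"
    )
    # Step 2: infer the emotion the upcoming reply should carry.
    resp_emotion = generate(
        f"History:\n{history}\nHistory emotions: {hist_emotions}\n"
        "Predict the emotion of the response:"
    )
    # Step 3: generate the reply conditioned on both predictions.
    reply = generate(
        f"History:\n{history}\nHistory emotions: {hist_emotions}\n"
        f"Response emotion: {resp_emotion}\nWrite the response:"
    )
    return hist_emotions, resp_emotion, reply
```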
3) Spoken-LLM: Spoken-LLM integrates the widely used open-source Llama 2-Chat with emotion2vec. The model predicts speaker style attributes alongside the response text, enabling downstream expressive TTS models to generate natural and diverse spoken responses.
[PAPER]
4) PerceptiveAgent: PerceptiveAgent introduces speech captions to assist response-text prediction; however, these captions contain only basic acoustic information such as pitch and energy.
[PAPER]
5) OSUM-EChat: OSUM-EChat is an end-to-end empathetic spoken dialogue system that enhances empathy through an understanding-driven training framework. It employs a three-stage strategy that extends the capabilities of large speech understanding models to spoken dialogue tasks and adopts a linguistic–paralinguistic dual-thinking mechanism to integrate acoustic and linguistic cues for empathetic response generation. To comprehensively assess its reasoning ability, we evaluate both the CoT version, which explicitly incorporates intermediate paralinguistic reasoning steps, and the non-CoT version, which generates responses directly without such reasoning.
[PAPER]
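The distinction between the two evaluated variants amounts to whether the prompt elicits intermediate paralinguistic reasoning before the reply; a minimal sketch, with illustrative wording rather than OSUM-EChat's actual templates:

```python
def build_prompt(dialog_text, use_cot=True):
    """Build the two evaluation prompt variants (hypothetical wording,
    not the actual OSUM-EChat templates)."""
    base = f"Dialogue:\n{dialog_text}\n"
    if use_cot:
        # CoT variant: elicit intermediate paralinguistic reasoning
        # (emotion, tone) before the empathetic response.
        return base + ("First describe the speaker's emotional and "
                       "paralinguistic state, then write an empathetic "
                       "response.")
    # Non-CoT variant: generate the response directly.
    return base + "Write an empathetic response."
```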
6) LLaMA-Omni2: LLaMA-Omni2 is an LLM-based real-time spoken dialogue system designed for high-quality speech interaction. Built upon the Qwen2.5 series, it integrates a speech encoder and an autoregressive streaming speech decoder, enabling seamless speech understanding and generation within a unified framework.
[PAPER]
7) Qwen2-Audio: Qwen2-Audio is an open-source, large-scale audio–language foundation model that jointly encodes acoustic and linguistic information. It was explicitly optimized for speech emotion recognition through multi-task pre-training followed by supervised fine-tuning on affect-annotated corpora, endowing it with robust paralinguistic understanding capabilities.
[PAPER]
To generate audio for the comparative experiments, we combined the same ground-truth text with each system's emotional caption or emotion label and synthesized speech with CosyVoice. This ensures consistent input conditions across systems, making it straightforward to evaluate differences in emotional expression between models.
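The shared-input protocol can be sketched as pairing the identical ground-truth response text with each system's predicted style conditioning before synthesis; the function name below is illustrative, and the CosyVoice call itself is left out so only the input-assembly step is shown.

```python
def build_tts_inputs(ground_truth_text, model_styles):
    """Pair the same ground-truth response text with each system's
    predicted emotional caption or discrete emotion label, producing
    the per-system conditioning inputs fed to the TTS model (CosyVoice
    in our setup; the synthesis call is omitted here)."""
    return {
        name: {
            "text": ground_truth_text,  # identical across all systems
            "instruction": style,       # caption or emotion label
        }
        for name, style in model_styles.items()
    }

# Example: every system synthesizes the same sentence, differing only
# in the style conditioning it predicted (values are illustrative).
styles = {
    "ParalinGPT": "happy",
    "PerceptiveAgent": "high pitch, high energy",
    "Ours": "a cheerful voice with fast tempo and rising intonation",
}
tts_inputs = build_tts_inputs("I can't believe we won!", styles)
```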